Analysis of Survey Data (Wiley Series in Survey Methodology)

Analysis of Survey Data

Edited by

R. L. CHAMBERS and C. J. SKINNER
University of Southampton, UK


WILEY SERIES IN SURVEY METHODOLOGY

Established in part by WALTER A. SHEWHART AND SAMUEL S. WILKS

Editors: Robert M. Groves, Graham Kalton, J. N. K. Rao, Norbert Schwarz, Christopher Skinner

A complete list of the titles in this series appears at the end of this volume.


Copyright © 2003 John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England

Telephone (+44) 1243 779777

Email (for orders and customer service enquiries): [email protected]
Visit our Home Page on www.wileyeurope.com or www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to [email protected], or faxed to (+44) 1243 770620.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Other Wiley Editorial Offices

John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA

Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA

Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany

John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia

John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809

John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada M9W 1L1

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Cataloging-in-Publication Data

Analysis of survey data / edited by R.L. Chambers and C.J. Skinner.
p. cm. – (Wiley series in survey methodology)
Includes bibliographical references and indexes.
ISBN 0-471-89987-9 (acid-free paper)
1. Mathematical statistics–Methodology. I. Chambers, R. L. (Ray L.) II. Skinner, C. J.
III. Series.
QA276 .A485 2003
001.4022–dc21
2002033132

British Library Cataloguing in Publication Data

A catalogue record for this book is available from the British Library

ISBN 0 471 89987 9

Typeset in 10/12 pt Times by Kolam Information Services, Pvt. Ltd, Pondicherry, India
Printed and bound in Great Britain by Biddles Ltd, Guildford, Surrey.
This book is printed on acid-free paper responsibly manufactured from sustainable forestry in which at least two trees are planted for each one used for paper production.


To T. M. F. Smith


Contents

Preface xv
List of Contributors xviii

Chapter 1 Introduction 1
R. L. Chambers and C. J. Skinner
1.1. The analysis of survey data 1
1.2. Framework, terminology and specification of parameters 2
1.3. Statistical inference 6
1.4. Relation to Skinner, Holt and Smith (1989) 8
1.5. Outline of this book 9

PART A APPROACHES TO INFERENCE 11

Chapter 2 Introduction to Part A 13
R. L. Chambers
2.1. Introduction 13
2.2. Full information likelihood 14
2.3. Sample likelihood 20
2.4. Pseudo-likelihood 22
2.5. Pseudo-likelihood applied to analytic inference 23
2.6. Bayesian inference for sample surveys 26
2.7. Application of the likelihood principle in descriptive inference 27

Chapter 3 Design-based and Model-based Methods for Estimating Model Parameters 29
David A. Binder and Georgia R. Roberts
3.1. Choice of methods 29
3.2. Design-based and model-based linear estimators 31
3.2.1. Parameters of interest 32
3.2.2. Linear estimators 32
3.2.3. Properties of b̂ and b 33
3.3. Design-based and total variances of linear estimators 34
3.3.1. Design-based and total variance of b̂ 34
3.3.2. Design-based mean squared error of b̂ and its model expectation 36
3.4. More complex estimators 37
3.4.1. Taylor linearisation of non-linear statistics 37
3.4.2. Ratio estimation 37
3.4.3. Non-linear statistics – explicitly defined statistics 39
3.4.4. Non-linear statistics – defined implicitly by score statistics 40
3.4.5. Total variance matrix of b̂ for non-negligible sampling fractions 42
3.5. Conditional model-based properties 42
3.5.1. Conditional model-based properties of b̂ 42
3.5.2. Conditional model-based expectations 43
3.5.3. Conditional model-based variance for b̂ and the use of estimating functions 43
3.6. Properties of methods when the assumed model is invalid 45
3.6.1. Critical model assumptions 45
3.6.2. Model-based properties of b̂ 45
3.6.3. Model-based properties of b 46
3.6.4. Summary 47
3.7. Conclusion 48

Chapter 4 The Bayesian Approach to Sample Survey Inference 49
Roderick J. Little
4.1. Introduction 49
4.2. Modeling the selection mechanism 52

Chapter 5 Interpreting a Sample as Evidence about a Finite Population 59
Richard Royall
5.1. Introduction 59
5.2. The evidence in a sample from a finite population 62
5.2.1. Evidence about a probability 62
5.2.2. Evidence about a population proportion 62
5.2.3. The likelihood function for a population proportion or total 63
5.2.4. The probability of misleading evidence 65
5.2.5. Evidence about the average count in a finite population 66
5.2.6. Evidence about a population mean under a regression model 69


5.3. Defining the likelihood function for a finite population 70

PART B CATEGORICAL RESPONSE DATA 73

Chapter 6 Introduction to Part B 75
C. J. Skinner
6.1. Introduction 75
6.2. Analysis of tabular data 76
6.2.1. One-way classification 76
6.2.2. Multi-way classifications and log-linear models 77
6.2.3. Logistic models for domain proportions 80
6.3. Analysis of unit-level data 81
6.3.1. Logistic regression 81
6.3.2. Some issues in weighting 83

Chapter 7 Analysis of Categorical Response Data from Complex Surveys: an Appraisal and Update 85
J. N. K. Rao and D. R. Thomas
7.1. Introduction 85
7.2. Fitting and testing log-linear models 86
7.2.1. Distribution of the Pearson and likelihood ratio statistics 86
7.2.2. Rao–Scott procedures 88
7.2.3. Wald tests of model fit and their variants 91
7.2.4. Tests based on the Bonferroni inequality 92
7.2.5. Fay's jackknifed tests 94
7.3. Finite sample studies 97
7.3.1. Independence tests under cluster sampling 97
7.3.2. Simulation results 98
7.3.3. Discussion and final recommendations 100
7.4. Analysis of domain proportions 101
7.5. Logistic regression with a binary response variable 104
7.6. Some extensions and applications 106
7.6.1. Classification errors 106
7.6.2. Biostatistical applications 107
7.6.3. Multiple response tables 108

Chapter 8 Fitting Logistic Regression Models in Case–Control Studies with Complex Sampling 109
Alastair Scott and Chris Wild
8.1. Introduction 109


8.2. Simple case–control studies 111
8.3. Case–control studies with complex sampling 113
8.4. Efficiency 115
8.5. Robustness 117
8.6. Other approaches 120

PART C CONTINUOUS AND GENERAL RESPONSE DATA 123

Chapter 9 Introduction to Part C 125
R. L. Chambers
9.1. The design-based approach 125
9.2. The sample distribution approach 127
9.3. When to weight? 130

Chapter 10 Graphical Displays of Complex Survey Data through Kernel Smoothing 133
D. R. Bellhouse, C. M. Goia, and J. E. Stafford
10.1. Introduction 133
10.2. Basic methodology for histograms and smoothed binned data 134
10.3. Smoothed histograms from the Ontario Health Survey 138
10.4. Bias adjustment techniques 141
10.5. Local polynomial regression 146
10.6. Regression examples from the Ontario Health Survey 147

Chapter 11 Nonparametric Regression with Complex Survey Data 151
R. L. Chambers, A. H. Dorfman and M. Yu. Sverchkov
11.1. Introduction 151
11.2. Setting the scene 152
11.2.1. A key assumption 152
11.2.2. What are the data? 153
11.2.3. Informative sampling and ignorable sample designs 154
11.3. Reexpressing the regression function 155
11.3.1. Incorporating a covariate 156
11.3.2. Incorporating population information 157
11.4. Design-adjusted smoothing 158
11.4.1. Plug-in methods based on sample data only 158
11.4.2. Examples 159


11.4.3. Plug-in methods which use population information 160
11.4.4. The estimating equation approach 162
11.4.5. The bias calibration approach 163
11.5. Simulation results 163
11.6. To weight or not to weight? (With apologies to Smith, 1988) 170
11.7. Discussion 173

Chapter 12 Fitting Generalized Linear Models under Informative Sampling 175
Danny Pfeffermann and M. Yu. Sverchkov
12.1. Introduction 175
12.2. Population and sample distributions 177
12.2.1. Parametric distributions of sample data 177
12.2.2. Distinction between the sample and the randomization distributions 178
12.3. Inference under informative probability sampling 179
12.3.1. Estimating equations with application to the GLM 179
12.3.2. Estimation of E_s(w_t | x_t) 182
12.3.3. Testing the informativeness of the sampling process 183
12.4. Variance estimation 185
12.5. Simulation results 188
12.5.1. Generation of population and sample selection 188
12.5.2. Computations and results 189
12.6. Summary and extensions 194

PART D LONGITUDINAL DATA 197

Chapter 13 Introduction to Part D 199
C. J. Skinner
13.1. Introduction 199
13.2. Continuous response data 200
13.3. Discrete response data 202

Chapter 14 Random Effects Models for Longitudinal Survey Data 205
C. J. Skinner and D. J. Holmes
14.1. Introduction 205


14.2. A covariance structure approach 207
14.3. A multilevel modelling approach 210
14.4. An application: earnings of male employees in Great Britain 213
14.5. Concluding remarks 218

Chapter 15 Event History Analysis and Longitudinal Surveys 221
J. F. Lawless
15.1. Introduction 221
15.2. Event history models 222
15.3. General observational issues 225
15.4. Analytic inference from longitudinal survey data 230
15.5. Duration or survival analysis 232
15.5.1. Non-parametric marginal survivor function estimation 232
15.5.2. Parametric models 234
15.5.3. Semi-parametric methods 235
15.6. Analysis of event occurrences 236
15.6.1. Analysis of recurrent events 236
15.6.2. Multiple event types 238
15.7. Analysis of multi-state data 239
15.8. Illustration 239
15.9. Concluding remarks 241

Chapter 16 Applying Heterogeneous Transition Models in Labour Economics: the Role of Youth Training in Labour Market Transitions 245
Fabrizia Mealli and Stephen Pudney
16.1. Introduction 245
16.2. YTS and the LCS dataset 246
16.3. A correlated random effects transition model 249
16.3.1. Heterogeneity 251
16.3.2. The initial state 252
16.3.3. The transition model 253
16.3.4. YTS spells 254
16.3.5. Bunching of college durations 255
16.3.6. Simulated maximum likelihood 255
16.4. Estimation results 255
16.4.1. Model selection 255
16.4.2. The heterogeneity distribution 256
16.4.3. Duration dependence 257
16.4.4. Simulation strategy 259
16.4.5. The effects of unobserved heterogeneity 263


16.5. Simulations of the effects of YTS 265
16.5.1. The effects of YTS participation 266
16.5.2. Simulating a world without YTS 267
16.6. Concluding remarks 270

PART E INCOMPLETE DATA 275

Chapter 17 Introduction to Part E 277
R. L. Chambers
17.1. Introduction 277
17.2. An example 279
17.3. Survey inference under nonresponse 280
17.4. A model-based approach to estimation under two-phase sampling 282
17.5. Combining survey data and aggregate data in analysis 286

Chapter 18 Bayesian Methods for Unit and Item Nonresponse 289
Roderick J. Little
18.1. Introduction and modeling framework 289
18.2. Adjustment-cell models for unit nonresponse 291
18.2.1. Weighting methods 291
18.2.2. Random effects models for the weight strata 294
18.3. Item nonresponse 297
18.3.1. Introduction 297
18.3.2. MI based on the predictive distribution of the missing variables 297
18.4. Non-ignorable missing data 303
18.5. Conclusion 306

Chapter 19 Estimation for Multiple Phase Samples 307
Wayne A. Fuller
19.1. Introduction 307
19.2. Regression estimation 308
19.2.1. Introduction 308
19.2.2. Regression estimation for two-phase samples 309
19.2.3. Three-phase samples 310
19.2.4. Variance estimation 312
19.2.5. Alternative representations 313
19.3. Regression estimation with imputation 315


19.4. Imputation procedures 318
19.5. Example 319

Chapter 20 Analysis Combining Survey and Geographically Aggregated Data 323
D. G. Steel, M. Tranmer and D. Holt
20.1. Introduction and overview 323
20.2. Aggregate and survey data availability 326
20.3. Bias and variance of variance component estimators based on aggregate and survey data 328
20.4. Simulation studies 334
20.5. Using auxiliary variables to reduce aggregation effects 338
20.5.1. Adjusted aggregate regression 339
20.6. Conclusions 343

References 345
T. M. F. Smith: Publications up to 2002 361
Author Index 367
Subject Index 373


Preface

The book is dedicated to T. M. F. (Fred) Smith, and marks his 'official' retirement from the University of Southampton in 1999. Fred's deep influence on the ideas presented in this book is witnessed by the many references to his work. His publications up to 2002 are listed at the back of the book. Fred's most important early contributions to survey sampling were made from the late 1960s into the 1970s, in collaboration with Alastair Scott, a colleague of Fred, when he was a lecturer at the London School of Economics between 1960 and 1968. Their joint papers explored the foundations and advanced understanding of the role of models in inference from sample survey data, a key element of survey analysis. Fred's review of the foundations of survey sampling in Smith (1976), read to the Royal Statistical Society, was a landmark paper.

Fred moved to a lectureship position in the Department of Mathematics at the University of Southampton in 1968, was promoted to Professor in 1976 and has stayed there until his recent retirement. The 1970s saw the arrival of Tim Holt in the University's Department of Social Statistics and the beginning of a new collaboration. Fred and Tim's paper on poststratification (Holt and Smith, 1979) is particularly widely cited for its discussion of the role of conditional inference in survey sampling. Fred and Tim were awarded two grants for research on the analysis of survey data between 1977 and 1985, and the grants supported visits to Southampton by a number of authors in this book, including Alastair Scott, Jon Rao, Wayne Fuller and Danny Pfeffermann. The research undertaken was disseminated at a conference in Southampton in 1986 and via the book edited by Skinner, Holt and Smith (1989).

Fred has clearly influenced the development of survey statistics through his own publications, listed here, and by facilitating other collaborations, such as the work of Alastair Scott and Jon Rao on tests with survey data, started on their visit to Southampton in the 1970s. From our University of Southampton perspective, however, Fred's support of colleagues, visitors and students has been equally important. He has always shown tremendous warmth and encouragement towards his research students and to other colleagues and visitors undertaking research in survey sampling. He has also always promoted interactions and cooperation, whether in the early 1990s through regular informal discussion sessions on survey sampling in his room or, more recently, with the increase in numbers interested, through regular participation in Friday lunchtime workshops on sample survey methods. Fred is well known as an excellent and inspiring teacher at both undergraduate and postgraduate levels and his own research seminars and lectures have always been eagerly attended, not only for their subtle insights, but also for their self-deprecating humour. We look forward to many more years of his involvement.

Fred's positive approach and his interested support of others range far beyond his interaction with colleagues at Southampton. He has been a strong supporter of his graduate students and through conferences and meetings has interacted widely with others. Fred has a totally open approach and while he loves to argue and debate a point he is never defensive and always open to persuasion if the case can be made. His positive commitment and openness was reflected in his term as President of the Royal Statistical Society, which he carried out with great distinction.

This book originates from a conference on 'Analysis of Survey Data' held in honour of Fred in Southampton in August 1999. All the chapters, with the exception of the introductions, were presented as papers at that conference.

Both the conference and the book were conceived of as a follow-up to Skinner, Holt and Smith (1989) (referred to henceforth as 'SHS'). That book addressed a number of statistical issues arising in the application of methods of statistical analysis to sample survey data. This book considers a somewhat wider set of statistical issues and updates the discussion in the light of more recent research in this field. The relation between these two books is described further in Chapter 1 (see Section 1.4).

The book is aimed at a statistical audience interested in methods of analysing sample survey data. The development builds upon two statistical traditions: first, the tradition of modelling methods, such as regression modelling, used in all areas of statistics to analyse data; and, second, the tradition of survey sampling, used for sampling design and estimation in surveys. It is assumed that readers will already have some familiarity with both these traditions. An understanding of regression modelling methods, to the level of Weisberg (1985), is assumed in many chapters. Familiarity with other modelling methods would be helpful for other chapters, for example categorical data analysis (Agresti, 1990) for Part B, generalized linear models (McCullagh and Nelder, 1989) in Parts B and C, and survival analysis (Lawless, 2002; Cox and Oakes, 1984) for Part D. As to survey sampling, it is assumed that readers will be familiar with standard sampling designs and related estimation methods, as described in Sarndal, Swensson and Wretman (1992), for example. Some awareness of sources of non-sampling error, such as nonresponse and measurement error (Lessler and Kalsbeek, 1992), will also be relevant in places, for example in Part E.

As in SHS, the aim is to discuss and develop the statistical principles and theory underlying methods, rather than to provide a step-by-step guide on how to apply methods. Nevertheless, we hope the book will have uses for researchers only interested in analysing survey data in practice.

Finally, we should like to acknowledge support in the preparation of this book. First, we thank the Economic and Social Research Council for support for the conference in 1999. Second, our thanks are due to Anne Owens, Jane Schofield, Kathy Hooper and Debbie Edwards for support in the organization of the conference and handling manuscripts. Finally, we are very grateful to the chapter authors for responding to our requests and putting up with the delay between the conference and the delivery of the final manuscript to Wiley.

Ray Chambers and Chris Skinner
Southampton, July 2002


Contributors

D. R. Bellhouse
Department of Statistical and Actuarial Sciences
University of Western Ontario
London
Ontario N6A 5B7
Canada

David A. Binder
Methodology Branch
Statistics Canada
120 Parkdale Avenue
Ottawa
Ontario K1A 0T6
Canada

R. L. Chambers
Department of Social Statistics
University of Southampton
Southampton SO17 1BJ
UK

A. H. Dorfman
Office of Survey Methods Research
Bureau of Labor Statistics
2 Massachusetts Ave NE
Washington, DC 20212-0001
USA

Wayne A. Fuller
Statistical Laboratory and Department of Statistics
Iowa State University
Ames, IA 50011
USA

C. M. Goia
Department of Statistical and Actuarial Sciences
University of Western Ontario
London
Ontario N6A 5B7
Canada

D. J. Holmes
Department of Social Statistics
University of Southampton
Southampton SO17 1BJ
UK

D. Holt
Department of Social Statistics
University of Southampton
Southampton SO17 1BJ
UK


J. F. Lawless
Department of Statistics and Actuarial Science
University of Waterloo
200 University Avenue West
Waterloo
Ontario N2L 3G1
Canada

Roderick J. Little
Department of Biostatistics
University of Michigan School of Public Health
1003 M4045 SPH II Washington Heights
Ann Arbor, MI 48109-2029
USA

Fabrizia Mealli
Dipartimento di Statistica
Università di Firenze
Viale Morgagni
50134 Florence
Italy

Danny Pfeffermann
Department of Statistics
Hebrew University
Jerusalem
Israel
and
Department of Social Statistics
University of Southampton
Southampton SO17 1BJ
UK

Stephen Pudney
Department of Economics
University of Leicester
Leicester LE1 7RH
UK

J. N. K. Rao
School of Mathematics and Statistics
Carleton University
Ottawa
Ontario K1S 5B6
Canada

Georgia R. Roberts
Statistics Canada
120 Parkdale Avenue
Ottawa
Ontario K1A 0T6
Canada

Richard Royall
Department of Biostatistics
School of Hygiene and Public Health
Johns Hopkins University
615 N. Wolfe Street
Baltimore, MD 21205
USA

Alastair Scott
Department of Statistics
University of Auckland
Private Bag 92019
Auckland
New Zealand

C. J. Skinner
Department of Social Statistics
University of Southampton
Southampton SO17 1BJ
UK

J. E. Stafford
Department of Public Health Sciences
McMurrich Building
University of Toronto
12 Queen's Park Crescent West
Toronto
Ontario M5S 1A8
Canada


D. G. Steel
School of Mathematics and Applied Statistics
University of Wollongong
NSW 2522
Australia

M. Yu. Sverchkov
Bureau of Labor Statistics
2 Massachusetts Ave NE
Washington, DC 20212-0001
USA

D. R. Thomas
School of Business
Carleton University
Ottawa
Ontario K1S 5B6
Canada

M. Tranmer
Centre for Census and Survey Research
Faculty of Social Sciences and Law
University of Manchester
Manchester M13 9PL
UK

Chris Wild
Department of Statistics
University of Auckland
Private Bag 92019
Auckland
New Zealand


CHAPTER 1

Introduction

R. L. Chambers and C. J. Skinner

1.1. THE ANALYSIS OF SURVEY DATA

Many statistical methods are now used to analyse sample survey data. In particular, a wide range of generalisations of regression analysis, such as generalised linear modelling, event history analysis and multilevel modelling, are frequently applied to survey microdata. These methods are usually formulated in a statistical framework that is not specific to surveys and indeed these methods are often used to analyse other kinds of data. The aim of this book is to consider how statistical methods may be formulated and used appropriately for the analysis of sample survey data. We focus on issues of statistical inference which arise specifically with surveys.

The primary survey-related issues addressed are those related to sampling. The selection of samples in surveys rarely involves just simple random sampling. Instead, more complex sampling schemes are usually employed, involving, for example, stratification and multistage sampling. Moreover, these complex sampling schemes usually reflect complex underlying population structures, for example the geographical hierarchy underlying a national multistage sampling scheme. These features of surveys need to be handled appropriately when statistical methods are applied. In the standard formulations of many statistical methods, it is assumed that the sample data are generated directly from the population model of interest, with no consideration of the sampling scheme. It may be reasonable for the analysis to ignore the sampling scheme in this way, but it may not. Moreover, even if the sampling scheme is ignorable, the stochastic assumptions involved in the standard formulation of the method may not adequately reflect the complex population structures underlying the sampling. For example, standard methods may assume that observations for different individuals are independent, whereas it may be more realistic to allow for correlated observations within clusters. Survey data arising from complex sampling schemes or reflecting associated underlying complex population structures are referred to as complex survey data.
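The cost of ignoring clustering can be illustrated with a small Monte Carlo sketch. This is our own construction, not an example from the book: the population, cluster sizes and variance components are all invented for illustration. With positive intra-cluster correlation ρ, selecting whole clusters inflates the variance of the sample mean relative to simple random sampling of the same number of units, roughly by the design effect 1 + (m − 1)ρ for clusters of size m.

```python
import random
import statistics

random.seed(42)

# Hypothetical population: 500 clusters of 20 units each. A shared cluster
# effect induces intra-cluster correlation rho = 4 / (4 + 1) = 0.8.
N_CLUSTERS, M = 500, 20
clusters = []
for _ in range(N_CLUSTERS):
    effect = random.gauss(0, 2)                      # shared within the cluster
    clusters.append([effect + random.gauss(0, 1) for _ in range(M)])
population = [y for c in clusters for y in c]

def srs_mean(n):
    """Mean of a simple random sample of n units."""
    return statistics.mean(random.sample(population, n))

def cluster_mean(k):
    """Mean of a single-stage cluster sample of k whole clusters."""
    chosen = random.sample(clusters, k)
    return statistics.mean(y for c in chosen for y in c)

# Monte Carlo variance of each estimator at the same total sample size (100).
REPS = 1000
v_srs = statistics.variance([srs_mean(100) for _ in range(REPS)])
v_cluster = statistics.variance([cluster_mean(5) for _ in range(REPS)])
deff = v_cluster / v_srs    # roughly 1 + (M - 1) * rho here
```

A standard procedure that assumes independent observations would report the SRS-sized standard error for the clustered design, understating uncertainty by a factor of about the square root of the design effect.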

While the analysis of complex survey data constitutes the primary focus of this book, other methodological issues in surveys also receive some attention. In particular, there will be some discussion of nonresponse and measurement error, two aspects of surveys which may have important impacts on estimation.

Analytic uses of surveys may be contrasted with descriptive uses. The latter relate to the estimation of summary measures for the population, such as means, proportions and rates. This book is primarily concerned with analytic uses, which relate to inference about the parameters of models used for analysis, for example regression coefficients. For descriptive uses, the targets of inference are taken to be finite population characteristics. Inference about these parameters could in principle be carried out with certainty given a 'perfect' census of the population. In contrast, for analytic uses the targets of inference are usually taken to be parameters of models, which are hypothesised to have generated the values in the surveyed population. Even under a perfect census, it would not be possible to make inference about these parameters with certainty. Inference for descriptive purposes in complex surveys has been the subject of many books in survey sampling (e.g. Cochran, 1977; Sarndal, Swensson and Wretman, 1992; Valliant, Dorfman and Royall, 2000). Several of the chapters in this book will build on that literature when addressing issues of inference for analytic purposes.
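The contrast between the two kinds of target can be made concrete with a short sketch (our own, with an invented model; the slope BETA, the error variance and the population size are illustrative assumptions). The finite population mean of y is a descriptive target: a complete census determines it exactly. The slope of the model hypothesised to have generated the population is an analytic target: even complete censuses of replicate populations drawn from the same model yield different fitted slopes.

```python
import random
import statistics

random.seed(1)
BETA = 2.0    # hypothetical superpopulation slope

def make_population(n=200):
    """Generate one finite population from the assumed linear model."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [BETA * x + random.gauss(0, 3) for x in xs]
    return xs, ys

def ols_slope(xs, ys):
    """Ordinary least squares slope estimate."""
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return num / sum((x - xbar) ** 2 for x in xs)

# Descriptive target: a census gives the finite population mean exactly.
xs, ys = make_population()
finite_pop_mean = statistics.mean(ys)

# Analytic target: each replicate population gives a different census slope,
# scattered around the model parameter BETA, so BETA is never known exactly.
slopes = [ols_slope(*make_population()) for _ in range(5)]
```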

The survey sampling literature, relating to the descriptive uses of surveys, provides one key reference source for this book. The other key source consists of the standard (non-survey) statistical literature on the various methods of analysis, for example regression analysis or categorical data analysis. This literature sets out what we refer to as standard procedures of analysis. These procedures will usually be the ones implemented in the most commonly used general purpose statistical software. For example, in regression analysis, ordinary least squares methods are the standard procedures used to make inference about regression coefficients. For categorical data analysis, maximum likelihood estimation under the assumption of multinomial sampling will often be the standard procedure. These standard methods will typically ignore the complex nature of the survey. The impact of ignoring features of surveys will be considered and ways of building on standard procedures to develop appropriate methods for survey data will be investigated.

After setting out some statistical foundations of survey analysis in Sections 1.2 and 1.3, we outline the contents of the book and its relation to Skinner, Holt and Smith (1989) (referred to henceforth as SHS) in Sections 1.4 and 1.5.

1.2. FRAMEWORK, TERMINOLOGY AND SPECIFICATION OF PARAMETERS

In this section we set out some of the basic framework and terminology and consider the definition of the parameters of interest.

A finite population, U, consists of a set of N units, labelled 1, . . . , N. We write U = {1, . . . , N}. Mostly, it is assumed that U is fixed, but more generally, for example in the context of a longitudinal survey, we may wish to allow the population to change over time. A sample, s, is a subset of U. The survey variables, denoted by the 1 × J vector Y, are variables which are measured in


the survey and which are of interest in the analysis. It is supposed that an aim of the survey is to record the value of Y for each unit in the sample. Many chapters assume that this aim is realised. In practice, it will usually not be possible to record Y without error for all sample units, either because of nonresponse or because of measurement error, and approaches to dealing with these problems are also discussed. The values which Y takes in the finite population are denoted y1, . . . , yN. The process whereby these values are transformed into the data available to the analyst will be called the observation process. It includes both the sampling mechanism and the nonresponse and measurement processes. Some more complex temporal features of the observation process arising in longitudinal surveys are discussed in Chapter 15 by Lawless.

For the descriptive uses of surveys, the definition of parameters is generally straightforward. They consist of specified functions of y1, . . . , yN, for example the vector of finite population means of the survey variables, and are referred to as finite population parameters.

In analytic surveys, parameters are usually defined with respect to a specified model. This model will usually be a superpopulation model, that is a model for the stochastic process which is assumed to have generated the finite population values y1, . . . , yN. Often this model will be parametric or semi-parametric, that is fully or partially characterised by a finite number of parameters, denoted by the vector θ, which defines the parameter vector of interest. Sometimes the model will be non-parametric (see Chapters 10 and 11) and the target of inference may be a regression function or a density function.

In practice, it will often be unreasonable to assume that a specified parametric or semi-parametric model holds exactly. It may therefore be desirable to define the parameter vector in such a way that it is equal to θ if the model holds, but remains interpretable under (mild) misspecification of the model. Some approaches to defining the parameters in this way are discussed in Chapter 3 by Binder and Roberts and in Chapter 8 by Scott and Wild. In particular, one approach is to define a census parameter, θU, which is a finite population parameter and is 'close' to θ according to some metric. This provides a link with the descriptive uses of surveys. There remains the issue of how to define the metric, and Chapters 3 and 8 consider somewhat different approaches.

Let us now consider the nature of possible superpopulation models and their relation to the sampling mechanism. Writing yU as the N × J matrix with rows y1, . . . , yN and f(.) as a generic probability density function or probability mass function, a basic superpopulation model might be expressed as f(yU; θ). Here it is supposed that yU is the realisation of a random matrix, YU, the distribution of which is governed by the parameter vector θ.

It is natural also to express the sampling mechanism probabilistically, especially if the sample is obtained by a conventional probability sampling design. It is convenient to represent the sample by a random vector with the same number of rows as YU. To do this, we define the sample inclusion indicator, it, for t = 1, . . . , N, by

it = 1 if t ∈ s, it = 0 otherwise.
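As a concrete illustration (a sketch not taken from the book; the small population, the sample and the sampling fraction are all invented for this example), the inclusion indicators and the simplest known probability sampling design can be written out as follows:

```python
from itertools import combinations

# Hypothetical small population U = {1, ..., N} and sample s
N = 5
U = list(range(1, N + 1))
s = {2, 4}  # an arbitrary subset of U

# Sample inclusion indicators: i_t = 1 if t is in s, 0 otherwise
i_U = [1 if t in s else 0 for t in U]   # [0, 1, 0, 1, 0]

# Under simple random sampling without replacement of fixed size n,
# the design f(i_U) puts equal probability on each of the C(N, n)
# subsets of size n, and zero probability on every other subset of U.
n = len(s)
all_size_n_samples = list(combinations(U, n))
f_iU = 1.0 / len(all_size_n_samples)    # 1 / C(5, 2) = 0.1
```

Since the design here is fully known, no parameter enters its specification, matching the known probability sampling case discussed in the text.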


The N values i1, . . . , iN form the elements of the N × 1 vector iU. Since iU determines the set s and the set s determines iU, the sample may be represented alternatively by s or iU. We denote the sampling mechanism by f(iU), with iU being a realisation of the random vector IU. Thus, f(iU) specifies the probability of obtaining each of the 2^N possible samples from the population. Under the assumption of a known probability sampling design, f(iU) is known for all possible values of iU and thus no parameter is included in this specification. When the sampling mechanism is unknown, some parametric dependence of f(iU) might be desirable.

We thus have expressions, f(yU; θ) and f(iU), for the distributions of the population values YU and the sample inclusion indicators IU. If we are to proceed to use the sample data to make inference about θ, it is necessary to be able to represent (yU, iU) as the joint outcome of a single process, that is the joint realisation of the random matrix (YU, IU). How can we express the joint distribution of (YU, IU) in terms of the distributions f(yU; θ) and f(iU), which we have considered so far? Is it reasonable to assume independence and write f(yU; θ) f(iU) as the joint distribution? To answer these questions, we need to think more carefully about the sampling mechanism.

At the simplest level, we may ask whether the sampling mechanism depends directly on the realised value yU of YU. One situation where this occurs is in case–control studies, discussed by Scott and Wild in Chapter 8. Here the outcome yt is binary, indicating whether unit t is a case or a control, and the cases and controls define separate strata which are sampled independently. The way in which the population is sampled thus depends directly on yU. In this case, it is natural to indicate this dependence by writing the sampling mechanism as f(iU | YU = yU). We may then write the joint distribution of YU and IU as f(iU | YU = yU) f(yU; θ), where it is necessary not only to specify the model f(yU; θ) for YU but also to 'model' what the sampling design f(iU | YU) would be under alternative outcomes of YU than the observed one yU. Sampling schemes which depend directly on yU in this way are called informative sampling schemes. Sampling schemes for which we may write the joint distribution of YU and IU as f(yU; θ) f(iU) are called noninformative. An alternative but related definition of informative sampling will be used in Section 11.2.3 and in Chapter 12. Sampling is said there to be informative with respect to Y if the 'sample distribution' of Y differs from the population distribution of Y, where the idea of 'sample distribution' is introduced in Section 2.3.
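The effect of an informative scheme can be seen in a small simulation (a sketch, not from the book; the population size, case rate and inclusion probabilities are invented). When the probability of selection depends directly on yt, as in a case–control design, the sample distribution of Y differs from its population distribution:

```python
import random

random.seed(1)

# Hypothetical population with a rare binary outcome (1 = case)
N = 100_000
y = [1 if random.random() < 0.05 else 0 for _ in range(N)]

# Informative sampling: cases are selected with much higher
# probability than controls, as in a case-control design
pi = [0.5 if yt == 1 else 0.01 for yt in y]
sample = [yt for yt, p in zip(y, pi) if random.random() < p]

pop_rate = sum(y) / N                    # close to 0.05
sample_rate = sum(sample) / len(sample)  # far above 0.05
```

The naive sample proportion of cases is badly biased for the population proportion, which is why case–control samples require methods, such as those of Chapter 8, that take account of the design.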

Schemes where sampling is directly dependent upon a survey variable of interest are relatively rare. It is very common, however, for sampling to depend upon some other characteristics of the population, such as strata. These characteristics are used by the sample designer and we refer to them as design variables. The vector of values of the design variables for unit t is denoted zt and the matrix of all population values z1, . . . , zN is denoted zU. Just as the matrix yU is viewed as a realisation of the random matrix YU, we may view zU as a realisation of a random matrix ZU. To emphasise the dependence of the sampling design on zU, we may write the sampling mechanism as f(iU | ZU = zU). If we are to hold ZU fixed at its actual value zU when specifying


the sampling mechanism f(iU | ZU = zU), then we must also hold it fixed when we specify the joint distribution of IU and YU. We write the distribution of YU with ZU fixed at zU as f(yU | ZU = zU; λ) and interpret it as the conditional distribution of YU given ZU = zU. This distribution is indexed by the parameter vector λ, which may differ from θ, since the conditional distribution may differ from the original distribution f(yU; θ). Provided there is no additional direct dependence of sampling on yU, it will usually be reasonable to express the joint distribution of YU and IU (with zU held fixed) as f(iU | ZU = zU) f(yU | ZU = zU; λ), that is to assume that YU and IU are conditionally independent given ZU = zU. In this case, sampling is said to be noninformative conditional on zU.

We see that the need to 'condition' on zU when specifying the model for YU has implications for the definition of the target parameter. Conditioning on zU may often be reasonable. Consider, for illustration, a sample survey of individuals in Great Britain, where sampling in England and Wales is independent of sampling in Scotland, that is these two parts of Great Britain are separate strata. Ignoring other features of the sample selection, we may thus conceive of zt as a binary variable identifying these two strata. Suppose that we wish to conduct a regression analysis with some variables in this survey. The requirement that our model should condition on zU in this context means essentially that we must include zt as a potential covariate (perhaps with interaction terms) in our regression model. For many socio-economic outcome variables it may well be scientifically sensible to include such a covariate, if the distribution of the outcome variable varies between these regions.

In other circumstances it may be less reasonable to condition on zU when defining the distribution of YU of interest. The design variables are chosen to assist in the selection of the sample and their nature may reflect administrative convenience more than scientific relevance to possible data analyses. For example, in Great Britain postal geography is often used for sample selection in surveys of private households involving face-to-face interviews. The design variables defining the postal geography may have little direct relevance to possible scientific analyses of the survey data. The need to condition on the design variables used for sampling involves changing the model for YU from f(yU; θ) to f(yU | ZU = zU; λ) and changing the parameter vector from θ to λ. This implies that the method of sampling is actually driving the specification of the target parameter, which seems inappropriate as a general approach. It seems generally more desirable to define the target parameter first, in the light of the scientific questions of interest, before considering what bearing the sampling scheme may have on making inferences about the target parameter using the survey data.

We thus have two possible broad approaches, which SHS refer to as disaggregated and aggregated analyses. A disaggregated analysis conditions on the values of the design variables in the finite population, with the target parameter λ. In many social surveys these design variables define population subgroups, such as strata and clusters, and the disaggregated analysis essentially disaggregates the analysis by these subgroups, specifying models


which allow for different patterns within and between subgroups. Part C of SHS provides illustrations.

In an aggregated analysis the target parameters are defined in a way that is unrelated to the design variables. For example, one might be interested in a factor analysis of a set of attitude variables in the population. For analytic inference in an aggregated analysis it is necessary to conceive of zU as a realisation of a random matrix ZU with distribution f(zU; φ) indexed by a further parameter vector φ and, at least conceptually, to model the sampling mechanism f(iU | ZU) for different values of ZU than the realised value zU. Provided the sampling is again noninformative conditional on zU, the joint distribution of IU, YU and ZU is given by f(iU | zU) f(yU | zU; λ) f(zU; φ). The target parameter θ characterises the marginal distribution of YU:

f(yU; θ) = ∫ f(yU | zU; λ) f(zU; φ) dzU.

Aggregated analysis may therefore alternatively be referred to as marginal modelling, and the distinction between aggregated and disaggregated analysis is analogous, to a limited extent, to the distinction between population-averaged and subject-specific analysis widely used in biostatistics (Diggle et al., 2002, Ch. 7) when clusters of repeated measurements are made on subjects. In this analogy, the design variables zt consist of indicator variables or random effects for these clusters.

1.3. STATISTICAL INFERENCE

In the previous section we discussed the different kinds of parameters of interest. We now consider alternative approaches to inference about these parameters, referring to inference about finite population parameters as descriptive inference and inference about model parameters, our main focus, as analytic inference.

Descriptive inference is the traditional concern of survey sampling and a basic distinction is between design-based and model-based inference. Under design-based inference the only source of random variation considered is that induced in the vector iU by the sampling mechanism, assumed to be a known probability sampling design. The matrix of finite population values yU is treated as fixed, avoiding the need to specify a model which generates yU. A frequentist approach to inference is adopted. The aim is to find a point estimator θ̂ which is approximately unbiased for θU and has 'good efficiency', both bias and efficiency being defined with respect to the distribution of θ̂ induced by the sampling mechanism. Point estimators are often formed using survey weights, which may incorporate auxiliary population information, perhaps based upon the design variables zt, but are usually not dependent upon the values of the survey variables yt (e.g. Deville and Särndal, 1992). Large-sample arguments are often then used to justify a normal approximation θ̂ ≈ N(θU, V). An estimator V̂ is then sought for V, which enables interval estimation and


testing of hypotheses about θU to be conducted. In the simplest approach, V̂ is sought such that inference statements about θU, based upon the assumptions that

θ̂ ~ N(θU, V) and V is known, remain valid, to a reasonable approximation, if V is replaced by V̂. The design-based approach is the traditional one in many sampling textbooks such as Cochran (1977). Models may be used in this approach to motivate the choice of estimators, in which case the approach may be called model-assisted (Särndal, Swensson and Wretman, 1992).
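These two steps, weighted point estimation followed by normal-approximation interval estimation, can be sketched together (an invented illustration, not from the book; the data, the inclusion probabilities, the simple inverse-probability weights wt = 1/πt, and the asserted variance estimate are all hypothetical):

```python
import math

# Hypothetical sample from a stratified design: two strata sampled
# at rates 1/2 and 1/10, so inclusion probabilities differ by unit.
y_sample  = [10.0, 12.0, 50.0, 55.0]  # observed survey variable
pi_sample = [0.5, 0.5, 0.1, 0.1]      # inclusion probabilities pi_t

# Survey weights: each sampled unit "stands in" for w_t = 1/pi_t
# population units. The weights do not depend on the y_t values.
w = [1.0 / p for p in pi_sample]                         # [2, 2, 10, 10]

# Design-based (Horvitz-Thompson type) estimates of the population
# total and mean of Y
N_hat = sum(w)                                           # 24.0
total_hat = sum(wt * yt for wt, yt in zip(w, y_sample))  # 1094.0
theta_hat = total_hat / N_hat                            # weighted mean

# Normal-approximation inference: theta_hat ~ N(theta_U, V), with V
# replaced by an estimate V_hat (asserted here, since a proper
# design-based variance formula depends on the design).
V_hat = 9.0
half_width = 1.96 * math.sqrt(V_hat)
ci = (theta_hat - half_width, theta_hat + half_width)    # 95% interval

# Wald statistic for H0: theta_U = 40, referred to N(0, 1)
z_stat = (theta_hat - 40.0) / math.sqrt(V_hat)
```

The weighted mean (about 45.6) differs sharply from the unweighted sample mean (31.75) because the under-sampled stratum receives heavy weight; V_hat is a placeholder for a genuine design-based variance estimate.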

The application of the design-based approach to analytic inference is less straightforward, since the parameters are necessarily defined with respect to a model and hence it is not possible to avoid the use of models entirely. A common approach is via the notion of a census parameter, referred to in the previous section. This involves specifying a finite population parameter θU corresponding to the model parameter θ and then considering design-based (descriptive) inference about θU. These issues are discussed further in Chapter 3 by Binder and Roberts and in Chapter 2. As noted in the previous section, a critical issue is the specification of the census parameter.

In classical model-based inference the only source of random variation considered is that from the model which generates yU. The sample, represented by the vector iU, is held fixed, even if it has been generated by a probability sampling design. Model-based inference may, like design-based inference, follow a frequentist approach. For example, a point estimator θ̂ of a given parameter θ might be obtained by maximum likelihood, tests about θ might be based upon a likelihood ratio approach and interval estimation about θ might be based upon a normal approximation θ̂ ≈ N(θ, V), justified by large-sample arguments. More direct likelihood or Bayesian approaches might also be adopted, as discussed in the chapters of Part A.

Classical model-based inference thus ignores the probability distribution induced by the sampling design. Such an approach may be justified, under certain conditions, by considering a statistical framework in which stochastic variation arises from both the generation of yU and the sampling mechanism. Conditions for the ignorability of the sampling design may then be obtained by demonstrating, for example, that the likelihood is free of the design or that the sample outcome iU is an ancillary statistic, depending on the approach to statistical inference (Rubin, 1976; Little and Rubin, 1989). These issues are discussed by Chambers in Chapter 2 from a likelihood perspective in a general framework, which allows for stochastic variation not only from the sampling mechanism and a model, but also from nonresponse.

From a likelihood-based perspective, it is necessary first to consider the nature of the data. Consider a framework for analytic inference, in the notation introduced above, where IU, YU and ZU have a joint distribution given by f(iU | yU, zU) f(yU | zU; θ) f(zU; φ).

Suppose that the rows of yU corresponding to sampled units are collected into the matrix yobs and that the remaining rows form the matrix ymis. Supposing that iU, yobs and zU are observed and that ymis is unobserved, the data consist of (iU, yobs, zU) and the likelihood for (θ, φ) is given by integrating the joint density f(iU | yU, zU) f(yU | zU; θ) f(zU; φ) over the unobserved values ymis.


If sampling is noninformative given zU, so that f(iU | yU, zU) = f(iU | zU), and if this design f(iU | zU) is known and free of (θ, φ), then the term f(iU | zU) may be dropped from the likelihood above since it is only a constant with respect to (θ, φ). Hence, under these conditions sampling is ignorable for likelihood-based inference about (θ, φ), that is we may proceed to make likelihood-based inference treating iU as fixed. Classical model-based inference is thus justified in this case.
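Written out explicitly (a sketch in this section's notation), the ignorability argument is the following factorisation, where the second line uses noninformativeness, f(iU | yU, zU) = f(iU | zU), so that the design term does not depend on ymis and factors out of the integral:

```latex
\begin{aligned}
L(\theta, \phi;\, i_U, y_{\mathrm{obs}}, z_U)
  &\propto \int f(i_U \mid y_U, z_U)\, f(y_U \mid z_U;\, \theta)\,
      f(z_U;\, \phi)\; \mathrm{d}y_{\mathrm{mis}} \\
  &= f(i_U \mid z_U)\, f(z_U;\, \phi)
      \int f(y_U \mid z_U;\, \theta)\; \mathrm{d}y_{\mathrm{mis}} \\
  &= f(i_U \mid z_U)\, f(y_{\mathrm{obs}} \mid z_U;\, \theta)\, f(z_U;\, \phi),
\end{aligned}
```

and the known constant f(iU | zU) can then be dropped, leaving a likelihood that treats iU as fixed.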

Chambers refers in Chapter 2 to information about the sample, iU, the sample observations, yobs, as well as the population values of the design variables, zU, as the 'full information' case. The likelihood based upon this information is called the 'full information likelihood'. In practice, all the population values zU will often not be available to the survey data analyst, as discussed further by Chambers, Dorfman and Sverchkov in Chapter 11 and by Chambers, Dorfman and Wang (1998). The only unit-level information available to the analyst about the sampling design might consist of the inclusion probabilities πt of each sample unit and further information, for example identifiers of strata and primary sampling units, which enables suitable standard errors to be computed for descriptive inference. In these circumstances, a more limited set of inference options will be available. Even this amount of unit-level information about the design may not be available to some users, and SHS describe some simpler methods of adjustment, such as the use of approximate design effects.

1.4. RELATION TO SKINNER, HOLT AND SMITH (1989)

As indicated in the Preface, this work is conceived of as a follow-up to Skinner, Holt and Smith (1989) (referred to as SHS). This book updates and extends SHS, but does not replace it. The discussion is intended to be at a similar level to SHS, focusing on statistical principles and general theory, rather than on the details of how to implement specific methods in practice. Some chapters, most notably Chapter 7 by Rao and Thomas, update SHS by including discussion of developments since 1989 of methods covered in SHS. Many chapters extend the topic coverage of SHS. For example, there was almost no reference to longitudinal surveys in SHS, whereas the whole of Part D of this book is devoted to this topic. There is also little overlap between SHS and many of the other chapters. In particular, SHS focused only on the issue of complex survey data, referred to in Section 1.1, whereas this book makes some reference to additional methodological issues in the analysis of survey data, such as nonresponse and measurement error.

There remains, however, much in SHS not covered here. In particular, only limited coverage is given to investigating the effects of complex designs and population structure on standard procedures or to simple adjustments to


standard procedures to compensate for these effects, two of the major objectives of SHS. One reason we focus less on standard procedures is that appropriate methods for the analysis of complex survey data have increasingly become available in standard software since 1989. The need to discuss the properties of inappropriate standard methods may thus have declined, although we still consider it useful that measures such as misspecification effects (meffs), introduced in SHS (p. 24) to assess properties of standard procedures, have found their way into software, such as Stata (Stata Corporation, 2001). More importantly, we have not attempted to produce a second edition of SHS, that is we have chosen not to duplicate the discussion in SHS. Thus this book is intended to complement SHS, not to supersede it. References will be made to SHS where appropriate, but we attempt to make this book self-contained, especially via the introductory chapters to each part, and it is not essential for the reader to have access to SHS to read this book.

Given its somewhat different topic coverage, the structure of this book differs from that of SHS. That book was organised according to two particular features. First, a distinction was drawn between aggregated and disaggregated analysis (see Section 1.2). This feature is no longer used for organising this book, partly because methods of disaggregated analysis receive only limited attention here, but this distinction will be referred to in places in individual chapters. Second, there was a separation in SHS between the discussion first of point estimation and bias and second of standard errors and tests. Again, this feature is no longer used for organising the chapters in this book. The structure of this book is set out and discussed in the next section.

1.5. OUTLINE OF THIS BOOK

Basic issues regarding the statistical framework and the approach to inference continue to be fundamental to discussions of appropriate methods of survey analysis. These issues are therefore discussed first, in the four chapters of Part A. These chapters adopt a similar finite population framework, contrasting analytic inference about a model parameter with descriptive inference about a finite population parameter. The main difference between the chapters concerns the approach to statistical inference. The different approaches are introduced first in Section 1.3 of this chapter and then by Chambers in Chapter 2, as an introduction to Part A. A basic distinction is between design-based and model-based inference (see Section 1.3). Binder and Roberts compare the two approaches in Chapter 3. The other chapters focus on model-based approaches. Little discusses the Bayesian approach in Chapter 4. Royall discusses an approach to descriptive inference based upon the likelihood in Chapter 5. Chambers focuses on a likelihood-based approach to analytic inference in Chapter 2, as well as discussing the approaches in Chapters 3–5.

Following Part A, the remainder of the book is broadly organised according to the type of survey data. Parts B and C are primarily concerned with the analysis of cross-sectional survey data, with a focus on the analysis of relationships


between variables, and with the separation between the two parts corresponding to the usual distinction between discrete and continuous response variables. Rao and Thomas provide an overview of methods for discrete response data in Chapter 7, including both the analysis of multiway tables and regression models for microdata. Scott and Wild deal with the special case of logistic regression analysis of survey data from case–control studies in Chapter 8. Regression models for continuous responses are discussed in Part C, especially in Chapter 11 by Chambers, Dorfman and Sverchkov and in Chapter 12 by Pfeffermann and Sverchkov. These include methods of non-parametric regression, and non-parametric graphical methods for displaying continuous data are also covered in Chapter 10 by Bellhouse, Goia and Stafford.

The extension to the analysis of longitudinal survey data is considered in Part D. Skinner and Holmes discuss the use of random effects models in Chapter 14 for data on continuous variables recorded at repeated waves of a panel survey. Lawless, and Mealli and Pudney, discuss the use of methods of event history analysis in Chapters 15 and 16.

Finally, Part E is concerned with data structures which are more complex than the largely 'rectangular' data structures considered in previous chapters. Little discusses the treatment of missing data from nonresponse in Chapter 18. Fuller considers the nested data structures arising from multiphase sampling in Chapter 19. Steel, Tranmer and Holt consider analyses which combine survey microdata and geographically aggregated data in Chapter 20.


PART A

Approaches to Inference


CHAPTER 2

Introduction to Part A

R. L. Chambers

2.1. INTRODUCTION

Recent developments have served to blur the distinction between the design-based and model-based approaches to survey sampling inference. As noted in the previous chapter, it is now recognised that the distribution induced by the sampling process has a role to play in model-based inference, particularly where this distribution depends directly on the distribution of the population values of interest, or where there is insufficient information to justify the assumption that the sampling design can be ignored in inference. Similarly, the increased use of model-assisted methods (Särndal, Swensson and Wretman, 1992) has meant that the role of population models is now well established in design-based inference.

The philosophical division between the two approaches has not disappeared completely, however. This becomes clear when one reads the following three chapters that make up Part A of this book. The first, by Binder and Roberts, argues for the use of design-based methods in analytic inference, while the second, by Little, presents the Bayesian approach to model-based analytic and descriptive inference. The third chapter, by Royall, focuses on application of the likelihood principle in model-based descriptive inference.

The purpose of this chapter is to provide a theoretical background for these chapters, and to comment on the arguments put forward in them. In particular, we first summarise current approaches to survey sampling inference by describing three basic, but essentially different, approaches to analytic inference from sample survey data. All three make use of likelihood ideas within a frequentist inferential framework (unlike the development by Royall in Chapter 5), but define or approximate the likelihood in different ways. The first approach, described in the next section, develops the estimating equation and associated variance estimator for what might be called a full information maximum likelihood approach. That is, the likelihood is defined by the probability of observing all the relevant data available to the survey analyst. The second, described in Section 2.3, develops the maximum sample likelihood estimator.

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd


This estimator maximises the likelihood defined by the sample data only, excluding population information. The third approach, described in Section 2.4, is based on the maximum pseudo-likelihood estimator (SHS, section 3.4.4), where the unobservable population-level likelihood of interest is estimated using methods for descriptive inference. In Section 2.5 we then explore the link between the pseudo-likelihood approach and the total variation concept that underlies the approach to analytic inference advocated by Binder and Roberts in Chapter 3. In contrast, in Chapter 4 Little advocates a traditional Bayesian approach to sample survey inference, and we briefly set out his arguments in Section 2.6. Finally, in Section 2.7 we summarise the arguments put forward by Royall for extending likelihood inference to finite population inference.
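To fix ideas before the formal development, the pseudo-likelihood device can be sketched for the simplest case (an invented illustration, not taken from this chapter): under the working model yt ~ N(θ, σ²), the census maximum likelihood estimate of θ solves the population score equation Σ over U of (yt − θ) = 0, and pseudo-likelihood replaces this unobservable population total by its design-weighted sample estimate:

```python
# Hypothetical sample data and inclusion probabilities
y_sample  = [3.0, 4.0, 10.0, 12.0]
pi_sample = [0.2, 0.2, 0.5, 0.5]
w = [1.0 / p for p in pi_sample]   # design weights w_t = 1 / pi_t

# Weighted (pseudo-)score equation: sum over s of w_t * (y_t - theta) = 0.
# Its solution is the weighted sample mean, the maximum
# pseudo-likelihood estimate of theta under this working model.
theta_hat = sum(wt * yt for wt, yt in zip(w, y_sample)) / sum(w)
```

The same device applies to any population-level score equation: estimate each population total by a weighted sample sum and solve the resulting estimating equation.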

2.2. FULL INFORMATION LIKELIHOOD

The development below is based on Breckling et al. (1994). We start by reiterating the important point made in Section 1.3, i.e. application of the likelihood idea to sample survey data first requires one to identify what these data are. Following the notation introduced in Section 1.2, we let Y denote the vector of survey variables of interest, with matrix of population values yU, and let s denote the set of 'labels' identifying the sampled population units, with a subscript of obs denoting the subsample of these units that respond. Each unit in the population, if selected into the sample, can be a respondent or a non-respondent on any particular component of Y. We model this behaviour by a multivariate zero–one response variable R of the same dimension as Y. The population values of R are denoted by the matrix rU. Similarly, we define a sample inclusion indicator I that takes the value one if a unit is selected into the sample and is zero otherwise. The vector containing the population values of I is denoted by iU. Also, following convention, we do not distinguish between a random variable and its realisation unless this is necessary for making a point. In such cases we use upper case to identify the random variable, and lower case to identify its realisation.

If we assume the values in yU can be measured without error, then the survey data are the array of respondents' values yobs in yU, the matrix rs corresponding to the values of R associated with the sampled units, the vector iU and the matrix zU of population values of a multivariate design variable Z. It is assumed that the sample design is at least partially based on the values in zU.

We note in passing that this situation is an ideal one, corresponding to the data that would be available to the survey analyst responsible for both selecting the sample and analysing the data eventually obtained from the responding sample units. In many cases, however, survey data analysts are secondary analysts, in the sense that they are not responsible for the survey design and so, for example, do not have access to the values in zU for both non-responding sample units as well as non-sampled units. In some cases they may not even have access to values in zU for the sampled units. Chambers, Dorfman and Wang (1998) explore likelihood-based inference in this limited data situation.


In what follows we use fU to denote a density defined with respect to a stochastic model for a population of interest. We assume that the target of inference is the parameter θ characterising the marginal population density fU(yU; θ) of the population values of Y, and our aim is to develop a maximum likelihood estimation procedure for θ. In general θ will be a vector. In order to do so we assume that the random variables Y, R, I and Z generating yU, rU, iU and zU are (conceptually) observable over the entire population, representing the density of their joint distribution over the population by fU(yU, rU, iU, zU; φ), where φ is the vector of parameters characterising this joint distribution. It follows that θ either is a component of φ or can be obtained by a one-to-one transformation of components of φ. In either case if we can calculate the maximum likelihood estimate (MLE) for φ we can then calculate the MLE for θ.

This estimation problem can be made simpler by using the following equivalent factorisations of fU(yU, rU, iU, zU; φ) to define φ:

fU(yU, rU, iU, zU) = fU(rU | yU, iU, zU) fU(iU | yU, zU) fU(yU | zU) fU(zU)   (2.1)

= fU(rU | yU, iU, zU) fU(iU | yU, zU) fU(zU | yU) fU(yU)   (2.2)

= fU(iU | yU, rU, zU) fU(rU | yU, zU) fU(yU | zU) fU(zU).   (2.3)

The choice of which factorisation to use depends on our knowledge and assumptions about the sampling method and the population generating process. Two common simplifying assumptions (see Section 1.4) are

Noninformative sampling given zU:  IU ⊥ YU | ZU.

Noninformative nonresponse given zU:  RU ⊥ YU | (IU, ZU).

Here upper case denotes a random variable, ⊥ denotes independence and | denotes conditioning. Under noninformative sampling fU(iU | yU, zU) = fU(iU | zU) in (2.1) and (2.2) and so iU is ancillary as far as inference about θ is concerned. Similarly, under noninformative nonresponse fU(rU | yU, iU, zU) = fU(rU | iU, zU), in which case rU is ancillary for inference about θ. Under

both noninformative sampling and noninformative nonresponse both rU and iU are ancillary and θ is defined by the joint population distribution of just yU and zU. When nonresponse becomes informative (but sampling remains noninformative) we see from (2.3) that our survey data distribution is now the joint distribution of yU, rU and zU, and so φ parameterises this distribution. We reverse the roles of rU and iU in (2.3) when sampling is informative and nonresponse is not. Finally, when both sampling and nonresponse are informative we have no choice but to model the full joint distribution of yU, rU, iU and zU in order to define φ.

The likelihood function for φ is then the function Ls(φ) = fU(yobs, rs, iU, zU; φ). The MLE of φ is the value that maximises this function. In order to calculate this MLE and to construct an associated confidence interval for the value of φ, we note two basic quantities: the score function for φ, i.e. the first derivative with respect to φ of log(Ls(φ)), and the information function for φ, defined as the negative of the first derivative with respect to φ of this score function. The MLE


of φ is the value where the score function is zero, while the inverse of the value of the information function evaluated at this MLE (often referred to as the observed information) is an estimate of its large-sample variance-covariance matrix.

Let g be a real-valued differentiable function of a vector-valued argument x, with ∇g denoting the vector of first-order partial derivatives of g(x) with respect to the components of x, and ∇²g denoting the matrix of second-order partial derivatives of g(x) with respect to the components of x. Then the score function scs(φ) for φ generated by the survey data is the conditional expectation, given these data, of the score for φ generated by the population data. That is,

scs(φ) = E[∇φ log fU(yU, rU, iU, zU; φ) | yobs, rs, iU, zU].   (2.4)

Similarly, the information function infos(φ) for φ generated by the survey data is the conditional expectation, given these data, of the information for φ generated by the population data minus the corresponding conditional variance of the population score. That is,

infos(φ) = E[−∇²φφ log fU(yU, rU, iU, zU; φ) | yobs, rs, iU, zU] − var[∇φ log fU(yU, rU, iU, zU; φ) | yobs, rs, iU, zU].   (2.5)

We illustrate the application of these results using two simple examples. Our notation will be such that population quantities will be subscripted by U, while their sample and non-sample equivalents will be subscripted by s and U − s respectively, and Es and vars will be used to denote expectation and variance conditional on the survey data.

Example 1

Consider the situation where Z is univariate and Y is bivariate, with components Y and X, and where Y, X and Z are independently and identically distributed as N(μ, Σ) over the population of interest. We assume the mean vector μ = (μY, μX, μZ)′ is unknown but the covariance matrix Σ, with elements σYY, σYX, σYZ and so on, is known. Suppose further that there is full response and sampling is noninformative, so we can ignore iU and rU when defining the population likelihood. That is, in this case the survey data consist of the sample values of Y and X and the population values of Z. The population score for μ is easily seen to be

scU(μ) = Σ⁻¹ ∑t∈U ((yt, xt, zt)′ − μ),

where yt is the value of Y for population unit t, with xt and zt defined similarly, and the summation is over the N units making up the population of interest. Applying (2.4), the score for μ defined by the survey data is

scs(μ) = Σ⁻¹ [∑t∈s ((yt, xt, zt)′ − μ) + ∑t∈U−s ((Es(yt), Es(xt), zt)′ − μ)],   (2.6)

where, for a non-sampled unit t, Es(yt) = μY + σYZ σZZ⁻¹ (zt − μZ) and Es(xt) = μX + σXZ σZZ⁻¹ (zt − μZ),

16 INTRODUCTION TO PART A

� ��

�����

���%�* � '� E�� ��� �� %�� "�� " �� "�� A �*� ����" ��" �� " �� F� (2.4)

� ��

�����%�* �'� E � ��� ��� �� %�� "�� " �� "�� A �*� ����" ��" �� " �� F

���� E�� ��� �� %�� "�� " �� "�� A �*� ����" ��" �� " �� F� (2.5)

�"��� � %�� " �( " ��*

� ���� ��( ����(� �(( �(���� ��� ���

��

��

� � �� �

��� %�* � ��(�)�(

� � ��� � �(� � ��

��

��

Page 40: Analysis of Survey Data (Wiley Series in Survey Methodology) · PDF fileEmail (for orders and ... List of Contributors xviii Chapter 1 Introduction 1 R. L. Chambers and C. J. Skinner

using well-known properties of the normal distribution. The MLE for μ is obtained by setting this score to zero and solving for μ, leading to μ̂Z = z̄U, the mean of the N population values of Z, together with

μ̂Y = ȳs + σYZ σZZ⁻¹ (z̄U − z̄s),  μ̂X = x̄s + σXZ σZZ⁻¹ (z̄U − z̄s),

where ȳs, x̄s and z̄s denote the corresponding sample means.

Turning now to the corresponding information for μ, we note that the population information for this parameter is infoU(μ) = N Σ⁻¹. Applying (2.5), the information for μ defined by the survey data is

infos(μ) = N Σ⁻¹ − (N − n) Σ⁻¹ C Σ⁻¹,   (2.7)

where C is the conditional covariance matrix of (Y, X, Z)′ given Z, i.e. the matrix whose leading 2 × 2 block has elements σYY − σYZ σZZ⁻¹ σZY, σYX − σYZ σZZ⁻¹ σZX, σXY − σXZ σZZ⁻¹ σZY and σXX − σXZ σZZ⁻¹ σZX, and whose remaining elements are all zero.

Now suppose that the target of inference is the regression of Y on X in the population. This regression is defined by the parameters α and β of the linear model defined by E(Y | X) = α + βX, where

β = σYX σXX⁻¹,  α = μY − β μX,   (2.8)

and since Σ is assumed known, the only parameter that needs to be estimated is the intercept α. Using the invariance of the maximum likelihood approach we see that the MLE for this parameter is

α̂ = (ȳs − β x̄s) + (σYZ − β σXZ) σZZ⁻¹ (z̄U − z̄s).

The first term in brackets on the right hand side above is the face value MLE for α. That is, it is the estimator we would use if we ignored the information in Z.

From (2.7), the asymptotic variance of α̂ above is then

var(α̂) = c′ infos⁻¹(μ̂) c,  with c = (1, −β, 0)′.


What about if Σ is not known? The direct approach would then be to use (2.4) and (2.5) simultaneously to estimate μ and Σ. However, since the notation required for this is rather complex, we adopt an indirect approach. To start, from (2.8) we can write

β = [EU(varU(X | Z)) + varU(EU(X | Z))]⁻¹ [EU(covU(Y, X | Z)) + covU(EU(X | Z), EU(Y | Z))],

α = EU(EU(Y | Z)) − β EU(EU(X | Z)).

We also observe that, under the joint normality of Y, X and Z, the conditional distribution of (Y, X)′ given Z = z is bivariate normal,

(Y, X)′ | Z = z ~ N((λY + γY z, λX + γX z)′, [δYY δYX; δXY δXX]),   (2.9)

where γY = σYZ σZZ⁻¹, γX = σXZ σZZ⁻¹, λY = μY − γY μZ, λX = μX − γX μZ, δYY = σYY − γY² σZZ, δYX = σYX − γY γX σZZ and δXX = σXX − γX² σZZ. Combining (2.4) and (2.5) with the factorisation fU(yU, xU, zU) = fU(yU, xU | zU) fU(zU), one can see that the MLEs of the parameters in (2.9) are just their face value MLEs based on the sample data vectors ys, xs and zs,

γ̂Y = SsYZ SsZZ⁻¹,  γ̂X = SsXZ SsZZ⁻¹,  λ̂Y = ȳs − γ̂Y z̄s,  λ̂X = x̄s − γ̂X z̄s,

δ̂YY = SsYY − SsYZ² SsZZ⁻¹,  δ̂YX = SsYX − SsYZ SsXZ SsZZ⁻¹,  δ̂XX = SsXX − SsXZ² SsZZ⁻¹,

where SsAB denotes the sample covariance of the variables A and B. Also, it is intuitively obvious (and can be rigorously shown) that the MLEs of μZ and σZZ are the mean z̄U and variance SUZZ of the values in the known population vector zU. Consequently, invoking the invariance properties of the MLE once more, we obtain MLEs for α and β by substituting the MLEs for the parameters of (2.9) as well as the MLEs for the marginal distribution of Z into the identities preceding (2.9). This leads to the estimators

β̂ = (SsYX + γ̂Y γ̂X [SUZZ − SsZZ])(SsXX + γ̂X² [SUZZ − SsZZ])⁻¹,

α̂ = (ȳs + γ̂Y (z̄U − z̄s)) − β̂ (x̄s + γ̂X (z̄U − z̄s)).

These estimators are often referred to as the Pearson-adjusted MLEs for α and β, and have a long history in the statistical literature. See the discussion in SHS, section 6.4.
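To make the adjustment concrete, here is a small numerical sketch (our own toy numbers and function names, not from the book) of the Pearson-adjusted estimator of the mean of Y when the regression coefficient on Z must itself be estimated from the sample:

```python
from statistics import mean

def cov(a, b):
    # Sample covariance with denominator len(a) - 1.
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def pearson_adjusted_mean(y_s, z_s, zbar_U):
    # ybar_s + (S_sYZ / S_sZZ) * (zbar_U - zbar_s): the face value estimate
    # of the mean of Y, adjusted using the known population mean of Z.
    gamma_hat = cov(y_s, z_s) / cov(z_s, z_s)
    return mean(y_s) + gamma_hat * (zbar_U - mean(z_s))

y_s = [2.0, 3.0, 5.0, 6.0]
z_s = [1.0, 2.0, 4.0, 5.0]
est = pearson_adjusted_mean(y_s, z_s, zbar_U=4.0)  # adjusts 4.0 up to 5.0
```

When the population mean of Z equals the sample mean of Z, the adjustment term vanishes and the estimator reduces to the face value sample mean.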


Example 2

This example is not meant to be practical, but it is useful for illustrating the impact of informative sampling on maximum likelihood estimation. We assume that the population distribution of the survey variable Y is one-parameter exponential, with density

fU(y; θ) = θ exp(−θy), y > 0.

The aim is to estimate the population mean μ = E(Y) = θ⁻¹ of this variable. We further assume that the sample itself has been selected in such a way that the sample inclusion probability of a population unit is proportional to its value of Y. The survey data then consist of the vectors ys and πs, where the latter consists of the sample inclusion probabilities πt = n yt (N ȳU)⁻¹ of the sampled population units.

Given πs, it is clear that the value ȳU of the finite population mean of Y is deducible from the sample data. Consequently, these survey data are actually the sample values of Y, the sample values of π and the value of ȳU. Again, applying (2.4), we see that the score for θ is then

scs(θ) = N(θ⁻¹ − ȳU),

so the MLE for μ = θ⁻¹ is just ȳU. Since this score has zero conditional variability given the sample data, the information for θ is the same as the population information for θ, infoU(θ) = N θ⁻², so the estimated variance of θ̂ is N⁻¹ θ̂². Finally, since μ̂ = θ̂⁻¹ = ȳU, the estimated variance of μ̂ is N⁻¹ ȳU².
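A tiny numerical sketch (our own, with made-up numbers) of why the inclusion probabilities are so informative here: with πt = n yt (N ȳU)⁻¹, every sampled pair (yt, πt) can be inverted to reveal the same value ȳU:

```python
n, N = 3, 100
ybar_U = 2.5                                 # population mean of Y
y_s = [1.0, 2.0, 5.0]                        # sampled Y-values
pi_s = [n * y / (N * ybar_U) for y in y_s]   # pi_t proportional to y_t

# Invert pi_t = n*y_t/(N*ybar_U) separately for each sampled unit:
# every unit returns the same finite population mean.
recovered = [n * y / (N * p) for y, p in zip(y_s, pi_s)]
```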

An alternative informative sampling scheme is where the sample is selected using cut-off sampling, so that πt = I(yt > K), for known K. In this case the conditional expectation of Y for a non-sampled unit is E(Y | Y ≤ K) = θ⁻¹ − K exp(−θK)(1 − exp(−θK))⁻¹, so the score for θ becomes

scs(θ) = n θ⁻¹ − ∑t∈s yt + (N − n) K exp(−θK)(1 − exp(−θK))⁻¹.

There is no closed form expression for the MLE in this case, but it is relatively easy to calculate its value using numerical approximation. The corresponding estimating equation for the MLE for μ is

μ̂ = ȳs − n⁻¹ (N − n) K exp(−K/μ̂)(1 − exp(−K/μ̂))⁻¹.

It is easiest to obtain the information for θ by direct differentiation of its score. This leads to

infos(θ) = n θ⁻² + (N − n) K² exp(−θK)(1 − exp(−θK))⁻².

The estimated variance of μ̂ is then θ̂⁻⁴ infos⁻¹(θ̂).
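The score under cut-off sampling is decreasing in θ, so its root is easy to bracket and bisect numerically; a minimal sketch (toy data and function names are our own):

```python
import math

def score(theta, y_s, N, K):
    # sc_s(theta) = n/theta - sum(y_s) + (N - n) K e^{-theta K}/(1 - e^{-theta K})
    n = len(y_s)
    tail = (N - n) * K * math.exp(-theta * K) / (1.0 - math.exp(-theta * K))
    return n / theta - sum(y_s) + tail

def mle_theta(y_s, N, K, lo=1e-8, hi=100.0):
    # The score is decreasing in theta (its negative derivative is the
    # information, which is positive), so bisect on its sign.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if score(mid, y_s, N, K) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

theta_hat = mle_theta([2.0, 3.0, 4.0], N=10, K=1.5)
mu_hat = 1.0 / theta_hat   # estimated population mean
```

The estimate of μ is pulled below the sample mean, reflecting the fact that the non-sampled units all have Y at most K.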


2.3. SAMPLE LIKELIHOOD

Motivated by inferential methods used with size-biased sampling, Krieger and Pfeffermann (1992, 1997) introduced an alternative model-based approach to analytic likelihood-based inference from sample survey data. Their basic idea is to model the distribution of the sample data by modelling the impact of the sampling method on the population distribution of interest. Once this sample distribution is defined, standard likelihood-based methods can be used to obtain an estimate of the parameter of interest. See Chapters 11 and 12 in Part C of this book for applications based on this approach.

In order to describe the fundamental idea behind the sample distribution, we assume full response, supposing the population values of the survey variable Y and the covariate Z correspond to N independent and identically distributed realisations of a bivariate random variable with density fU(y | z; θ) fU(z; α). Here θ and α are unknown parameters. A key assumption is that sample inclusion for any particular population unit is independent of that of any other unit and is determined by the outcome of a zero-one random variable I, whose distribution depends only on the unit's values of Y and Z. It follows that the sample values ys and zs of Y and Z are realisations of random variables that, conditional on the outcome iU of the sampling procedure, are independent and identically distributed with density parameterised by θ and α:

fs(y, z; θ, α) = fU(y, z | I = 1) = Pr(I = 1 | Y = y, Z = z) fU(y | z; θ) fU(z; α) / Pr(I = 1; θ, α).

This leads to a sample likelihood for θ and α,

Ls(θ, α) = ∏t∈s fs(yt, zt; θ, α),

that can be maximised with respect to θ and α to obtain maximum sample likelihood estimates of these parameters. Maximum sample likelihood estimators of other parameters (e.g. the parameter characterising the marginal population distribution of Y) are defined using the invariance properties of the maximum likelihood approach.
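A quick numerical sanity check (our own, not from the book) that the sample density construction yields a proper density: with an exponential population model and Pr(I = 1 | y) proportional to y, the sample density is fs(y; θ) = θ² y exp(−θy), which should integrate to one:

```python
import math

def f_U(y, theta):
    # Exponential population density.
    return theta * math.exp(-theta * y)

def f_s(y, theta):
    # Sample density: Pr(I=1|y) f_U(y) / Pr(I=1), with Pr(I=1|y)
    # proportional to y and Pr(I=1) proportional to E(Y) = 1/theta.
    return y * f_U(y, theta) * theta

# Crude Riemann-sum check that f_s integrates to one (theta = 2).
step = 0.001
total = sum(f_s(i * step, 2.0) * step for i in range(1, 40001))
```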

Example 2 (continued)

We return to the one-parameter exponential population model of Example 2, but now assume that the sample was selected using a probability proportional to Z sampling method, where Z is a positive-valued auxiliary variable correlated with Y. In this case Pr(It = 1 | zt) ∝ zt, so Pr(It = 1) ∝ E(Z) = μZ, say. We further assume that the conditional population distribution of Y given Z can be modelled, and let fU(y | z; θ) denote the resulting conditional population density of Y given Z. The population marginal density of Z is denoted fU(z; α), so μZ = μZ(α). The joint population density of Y and Z is then fU(y | z; θ) fU(z; α), and the logarithm of the sample likelihood for θ and α defined by this approach is


ln(Ls(θ, α)) = ∑t∈s ln(zt) + ∑t∈s ln(fU(yt | zt; θ)) + ∑t∈s ln(fU(zt; α)) − n ln(μZ(α)).

It is easy to see that the value of θ maximising the second term in this sample likelihood is the face value MLE of this parameter (i.e. the estimator that would result if we ignored the method of sample selection and just treated the sample values of Y as independent draws from the conditional population distribution of Y given Z). However, it is also easy to see that the value of α maximising the third and fourth terms is not the face value estimator of α. In fact, it is defined by the estimating equation

∑t∈s ∂ ln(fU(zt; α))/∂α = n μZ(α)⁻¹ ∂μZ(α)/∂α.

Recollect that our aim here is estimation of the marginal population expectation of Y. The maximum sample likelihood estimate of this quantity is then

μ̂Y = ∫∫ y fU(y | z; θ̂) fU(z; α̂) dy dz.

This can be calculated via numerical integration. Now suppose Y = Z. In this case Pr(It = 1 | Yt = yt) ∝ yt, so Pr(It = 1) ∝ E(Yt) = θ⁻¹, and the logarithm of the sample likelihood becomes

ln(Ls(θ)) = 2n ln(θ) − θ ∑t∈s yt + ∑t∈s ln(yt).

The value of θ maximising this expression is θ̂ = 2/ȳs, so the maximum sample likelihood estimator of μ is μ̂ = θ̂⁻¹ = ȳs/2. This can be compared with the full information MLE, which is the known population mean ȳU of Y.
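The θ̂ = 2/ȳs result is easy to verify numerically; a sketch with toy data of our own (the derivative of the log sample likelihood, 2n/θ − ∑ yt, is decreasing, so we bisect on its sign):

```python
import math

def sample_loglik(theta, y_s):
    # ln L_s(theta) = 2n ln(theta) - theta * sum(y_s) + sum(ln y_t):
    # log sample likelihood under size-biased selection from exp(theta).
    n = len(y_s)
    return 2 * n * math.log(theta) - theta * sum(y_s) + sum(math.log(y) for y in y_s)

def argmax_theta(y_s, lo=1e-6, hi=50.0):
    # d/dtheta ln L_s = 2n/theta - sum(y_s) is decreasing in theta.
    n = len(y_s)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 2 * n / mid - sum(y_s) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

y_s = [0.5, 1.0, 1.5, 3.0]        # sample mean 1.5
theta_hat = argmax_theta(y_s)     # approaches 2 / 1.5
```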

Finally we consider cut-off sampling, where Pr(It = 1) = Pr(Yt > K) = exp(−θK). Here

ln(Ls(θ)) = n ln(θ) − θ ∑t∈s (yt − K).

It is easy to see that this function is maximised when θ̂⁻¹ = ȳs − K. Again, this is not the full information MLE, but it is unbiased for μ since, by the memoryless property of the exponential distribution,

EU(μ̂) = EU(ȳs) − K = E(Y | Y > K) − K = (μ + K) − K = μ.


2.4. PSEUDO-LIKELIHOOD

This approach is now widely used, forming as it does the basis for the methods implemented in a number of software packages for the analysis of complex survey data. The basic idea had its origin in Kish and Frankel (1974), with Binder (1983) and Godambe and Thompson (1986) making major contributions. SHS (section 3.4.4) provides an overview of the method.

Essentially, pseudo-likelihood is a descriptive inference approach to likelihood-based analytic inference. Let fU(yU; θ) denote a statistical model for the probability density of the matrix yU corresponding to the N population values of the survey variables of interest. Here θ is an unknown parameter and the aim is to estimate its value from the sample data. Now suppose that yU is observed. The MLE for θ would then be defined as the solution to an estimating equation of the form scU(θ) = 0, where scU(θ) is the score function for θ defined by yU. However, for any value of θ, the value of scU(θ) is also a finite population parameter that can be estimated using standard methods. In particular, let ŝcs(θ) be such an estimator of scU(θ). Then the maximum pseudo-likelihood estimator of θ is the solution to the estimating equation ŝcs(θ) = 0. Note that this estimator is not unique, depending on the method used to estimate scU(θ).

Example 2 (continued)

Continuing with our example, we see that the population score function in this case is

scU(θ) = ∑t∈U (θ⁻¹ − yt),

which is the population total of the variable θ⁻¹ − Y. We can estimate this total using the design-unbiased Horvitz-Thompson estimator

ŝcHT(θ) = ∑t∈s πt⁻¹ (θ⁻¹ − yt).

Here πt is the sample inclusion probability of unit t. Setting this estimator equal to zero and solving for θ, and hence (by inversion) μ, we obtain the Horvitz-Thompson maximum pseudo-likelihood estimator of μ. This is

μ̂HT = (∑t∈s πt⁻¹)⁻¹ ∑t∈s πt⁻¹ yt,

which is the Hajek estimator of the population mean of Y. Under probability proportional to Z sampling this estimator reduces to

μ̂HT = (∑t∈s zt⁻¹)⁻¹ ∑t∈s zt⁻¹ yt,


while for the case of size-biased sampling (Y = Z) it reduces to the harmonic mean of the sample Y-values,

μ̂HT = n (∑t∈s yt⁻¹)⁻¹.

This last expression can be compared with the full information maximum likelihood estimator (the known population mean of Y) and the maximum sample likelihood estimator (half the sample mean of Y) for this case. As an aside we note that where cut-off sampling is used, so population units with Y greater than a known constant K are sampled with probability one with the remaining units having zero probability of sample inclusion, no design-unbiased estimator of scU(θ) can be defined and so no design-based pseudo-likelihood estimator exists.
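Putting the three size-biased-sampling estimators side by side with toy numbers of our own:

```python
# Three estimators of the exponential mean mu under size-biased
# (pi_t proportional to y_t) sampling, as discussed above.
ybar_U = 2.0                 # known population mean: the full information MLE
y_s = [1.0, 2.0, 4.0]        # an illustrative size-biased sample

full_info = ybar_U                                    # population mean of Y
sample_lik = (sum(y_s) / len(y_s)) / 2.0              # half the sample mean
pseudo_lik = len(y_s) / sum(1.0 / y for y in y_s)     # harmonic mean
```

The three answers generally differ; which is preferable depends on what is known (the population mean, the selection mechanism, or only the weights).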

Inference under pseudo-likelihood can be design based or model based. Thus, variance estimation is usually carried out using a combination of a Taylor series linearisation argument and an appropriate method (design based or model based) for estimating the variance of ŝcs(θ) (see Binder, 1983; SHS, section 3.4.4). We write

0 = ŝcs(θ̂) ≈ ŝcs(θN) + (θ̂ − θN) ∇θ ŝcs(θN),

where θN is defined by scU(θN) = 0. Let var(ŝcs(θN)) denote an appropriate variance for the estimation error ŝcs(θN). Then

var(θ̂ − θN) ≈ [∇θ ŝcs(θN)]⁻¹ var(ŝcs(θN)) [∇θ ŝcs(θN)]⁻¹,

which leads to a 'sandwich' variance estimator of the form

v̂(θ̂) = [∇θ ŝcs(θ̂)]⁻¹ v̂(ŝcs(θ̂)) [∇θ ŝcs(θ̂)]⁻¹,   (2.8)

where v̂(ŝcs(θ̂)) is a corresponding estimator of the variance of ŝcs(θ).
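For the exponential-mean example the sandwich recipe takes only a few lines; a sketch of our own, in which the variance of the estimated score is approximated as if the weighted score contributions were independent draws (a with-replacement-style simplification):

```python
y_s = [1.0, 2.0, 4.0]
w = [4.0, 2.0, 1.0]          # weights w_t = 1/pi_t

# Pseudo-likelihood point estimate: solve sum_t w_t (1/theta - y_t) = 0,
# i.e. 1/theta equals the weighted (Hajek-type) mean of y.
mu_hat = sum(wt * yt for wt, yt in zip(w, y_s)) / sum(w)
theta_hat = 1.0 / mu_hat

# Sandwich variance for theta_hat.
n = len(y_s)
u = [wt * (1.0 / theta_hat - yt) for wt, yt in zip(w, y_s)]   # score terms
ubar = sum(u) / n
v_score = n / (n - 1) * sum((ut - ubar) ** 2 for ut in u)     # var of score
slope = -sum(w) / theta_hat ** 2      # derivative of the estimated score
v_theta = v_score / slope ** 2        # [slope]^-1 v_score [slope]^-1
```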

2.5. PSEUDO-LIKELIHOOD APPLIED TO ANALYTIC INFERENCE

From the development in the previous section, one can see that pseudo-likelihood is essentially a method for descriptive, rather than analytic, inference, since the target of inference is the census value of the parameter of interest (sometimes referred to as the census parameter). However, as Binder and Roberts show in Chapter 3, one can develop an analytic perspective that justifies the pseudo-likelihood approach. This perspective takes account of the total variation from both the population generating process and the sample selection process.


In particular, Binder and Roberts explore the link between analytic (i.e. model-based) and design-based inference for a class of linear parameters corresponding to the expected values of population sums (or means) under an appropriate model for the population. From a design-based perspective, design-unbiased or design-consistent estimators of these population sums should then be good estimators of these expectations for large sample sizes, and so should have a role to play in analytic inference. Furthermore, since a solution to a population-level estimating equation can usually be approximated by such a sum, the class of pseudo-likelihood estimators can be represented in this way. Below we show how the total variation theory developed by Binder and Roberts applies to these estimators.

Following these authors, we assume that the sampling method is noninformative given Z. That is, conditional on the values zU of a known population covariate Z, the population distributions of the variables of interest Y and the sample inclusion indicator I are independent. An immediate consequence is that the joint distribution of Y and I given zU is the product of their two corresponding 'marginal' (i.e. conditional on zU) distributions.

To simplify notation, conditioning on the values in iU and zU is denoted by a subscript ξ, and conditioning on the values in yU and zU by a subscript p. The situation where conditioning is only with respect to the values in zU is denoted by a subscript pξ, and we again do not distinguish between a random variable and its realisation. Under noninformative sampling, the pξ-expectation of a function g(yU, iU) of both yU and iU (its total expectation) is

Epξ[g(yU, iU)] = Eξ[Ep(g(yU, iU) | yU, zU) | iU, zU] = Eξ[Ep(g(yU, iU))],

since the random variable inside the square brackets on the right hand side only depends on yU and zU, and so its expectation given zU and its expectation conditional on iU and zU are the same. The corresponding total variance for g(yU, iU) is

varpξ[g(yU, iU)] = varξ[Ep(g(yU, iU) | yU, zU) | iU, zU] + Eξ[varp(g(yU, iU) | yU, zU) | iU, zU]
= varξ[Ep(g(yU, iU))] + Eξ[varp(g(yU, iU))].

Now suppose the population values of Y are mutually independent given Z, with the conditional density of Y given Z over the population parameterised by θ. The aim is to estimate the value of this parameter from the sample data. A Horvitz-Thompson maximum pseudo-likelihood approach is used for this purpose, so θ is estimated by θ̂λ, where

∑t∈s πt⁻¹ sct(θ̂λ) = 0.

Here sct(θ) is the contribution of unit t to the population score function for θ. Let BN denote the population maximum likelihood estimator of θ. Then the design-based consistency of θ̂λ for BN allows us to write

θ̂λ ≈ BN + ∑t∈s πt⁻¹ Ut(BN),


where

Ut(BN) = [−∇θ scU(BN)]⁻¹ sct(BN),

so that the design expectation of θ̂λ is

Ep(θ̂λ) ≈ BN + ∑t∈U Ut(BN) = BN.

The total expectation of θ̂λ is therefore

Epξ(θ̂λ) ≈ Eξ(BN) ≈ θ,

when we apply the same Taylor series linearisation argument to show that the expected value of BN under the assumed population model is θ plus a term of order N⁻¹. The corresponding total variance of θ̂λ is

varpξ(θ̂λ) ≈ varξ(BN) + Eξ[varp(θ̂λ)],  with varξ(BN) ≈ ∑t∈U varξ(Ut(BN)),

where we have used the fact that population units are mutually independent under ξ and that Eξ(Ut(BN)) = 0. Now let v̂p(BN) denote the large-sample limit of the estimator (2.8) when it estimates the design variance of θ̂λ. Then v̂p(BN) is approximately design unbiased for the design variance of the Horvitz-Thompson estimator of the population total of the Ut(BN),

Ep(v̂p(BN)) ≈ varp(∑t∈s πt⁻¹ Ut(BN)),

and so is unbiased for the total variance of θ̂λ to the same degree of approximation,

Epξ(v̂p(BN)) ≈ varpξ(θ̂λ).

Furthermore, v̂p(BN) is also approximately model unbiased for the prediction variance of θ̂λ under ξ. To show this, put λtu = covξ(Ut(BN), Uu(BN)). Then


varξ(θ̂λ − BN) ≈ ∑t∈U ∑u∈U (It πt⁻¹ − 1)(Iu πu⁻¹ − 1) λtu,

and so the standard design-based estimator (2.8) of the design variance of the pseudo-likelihood estimator is also an estimator of its model variance.

So far we have assumed noninformative sampling and a correctly specified model. What if either (or both) of these assumptions are wrong? It is often said that design-based inference remains valid in such a situation because it does not depend on model assumptions. Binder and Roberts justify this claim in Chapter 3, provided one accepts that the expected value Eξ(BN) of the finite population parameter BN under the true model remains the target of inference under such mis-specification. It is interesting to speculate on practical conditions under which this would be the case.

2.6. BAYESIAN INFERENCE FOR SAMPLE SURVEYS

So far, the discussion in this chapter has been based on frequentist arguments. However, the Bayesian method has a strong history in survey sampling theory (Ericson, 1969; Ghosh and Meeden, 1997) and offers an integrated solution to both analytic and descriptive survey sampling problems, since no distinction is made between population quantities (e.g. population sums) and model parameters. In both cases, the inference problem is treated as a prediction problem. The Bayesian approach therefore has considerable theoretical appeal. Unfortunately, its practical application has been somewhat limited to date by the need to specify appropriate priors for unknown parameters, and by the lack of closed form expressions for estimators when one deviates substantially from normal distribution population models. Use of improper noninformative priors is a standard way of getting around the first problem, while modern, computationally intensive techniques like Markov chain Monte Carlo methods now allow the fitting of extremely sophisticated non-normal models to data. Consequently, it is to be expected that Bayesian methods will play an increasingly significant role in survey data analysis.
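A minimal sketch of the 'estimation as prediction' idea, using a toy model of our own (normal population with known variance and a flat prior on the mean, not a model from the book): the posterior for the population total combines the observed sum with a prediction for the non-sampled units:

```python
from statistics import mean

def posterior_total(y_s, N, sigma2):
    # Normal population model with known variance sigma2 and a flat prior
    # on the mean: the posterior for the mean is N(ybar_s, sigma2/n), so
    # the sum of the N - n non-sampled values has a normal posterior
    # predictive distribution. Returns posterior mean and variance of the
    # population total.
    n = len(y_s)
    ybar = mean(y_s)
    post_mean = sum(y_s) + (N - n) * ybar
    # (N-n)*sigma2 is predictive noise for the unobserved units;
    # (N-n)^2*sigma2/n reflects posterior uncertainty about the mean.
    post_var = (N - n) * sigma2 + (N - n) ** 2 * sigma2 / n
    return post_mean, post_var

pm, pv = posterior_total([4.0, 6.0], N=10, sigma2=1.0)   # pm = 50.0, pv = 40.0
```

Note that the same posterior machinery answers both the descriptive question (the total) and the analytic one (the mean parameter), with no change of framework.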

In Chapter 4 Little gives an insight into the power of the Bayesian approach when applied to sample survey data. Here we see again the need to model the joint population distribution of the values yU of the survey variables and the sample inclusion indicator iU when analysing complex survey data, with


analytic inference about the parameters θ and ψ of the joint population distribution of these values then based on their joint posterior distribution. This posterior distribution is defined as the product of the joint prior distribution of these parameters given the values zU of the design variable Z times their likelihood, obtained by integrating the joint population density of yU and iU over the unobserved non-sample values in yU. Descriptive inference about a characteristic Q of the finite population values of Y (e.g. their sum) is then based on the posterior density of Q. This is the expected value, relative to the posterior distribution for θ and ψ, of the conditional density of Q given the population values zU and iU and the sample values ys.

Following Rubin (1976), Little defines the selection process to be ignorable when the marginal posterior distribution of the parameter θ characterising the population distribution of yU obtained from this joint posterior reduces to the usual posterior for θ obtained from the product of the marginal prior and marginal likelihood for this parameter (i.e. ignoring the outcome of the sample selection process). Clearly, a selection process corresponding to simple random sampling does not depend on yU or on any unknown parameters and is therefore ignorable. In contrast, stratified sampling is only ignorable when the population model for yU conditions on the stratum indicators. Similarly, a two-stage sampling procedure is ignorable provided the population model conditions on cluster information, which in this case corresponds to cluster indicators, and allows for within-cluster correlation.
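The stratified-sampling remark can be illustrated by simulation (a sketch of our own, with invented strata and rates): with unequal sampling rates, an analysis that ignores the strata is badly biased, while one that conditions on the strata, here by averaging within them, is not:

```python
import random

random.seed(1)
stratum_a = [random.gauss(0.0, 1.0) for _ in range(900)]   # sampled at 1%
stratum_b = [random.gauss(5.0, 1.0) for _ in range(100)]   # sampled at 50%
pop_mean = (sum(stratum_a) + sum(stratum_b)) / 1000.0

sample_a = random.sample(stratum_a, 9)
sample_b = random.sample(stratum_b, 50)

# Pooled sample mean: a model that ignores the stratum indicators.
naive = (sum(sample_a) + sum(sample_b)) / 59.0

# Stratum means recombined with the known stratum shares: a model that
# conditions on the strata.
stratified = 0.9 * (sum(sample_a) / 9.0) + 0.1 * (sum(sample_b) / 50.0)
```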

2.7. APPLICATION OF THE LIKELIHOOD PRINCIPLE IN DESCRIPTIVE INFERENCE

Section 2.2 above outlined the maximum likelihood method for analytic inference from complex survey data. In Chapter 5, Royall discusses the related problem of applying the likelihood method to descriptive inference. In particular, he develops the form of the likelihood function that is appropriate when the aim is to use the likelihood principle (rather than frequentist arguments) to measure the evidence in the sample data for a particular value for a finite population characteristic.

To illustrate, suppose there is a single survey variable Y, the descriptive parameter of interest is its finite population total T, and a sample of size n is taken from this population with the values of Y observed. Let y_s denote the vector of these sample values. The ratio of the values taken by the likelihood function at one value of a parameter compared with another is the factor by which the prior probability ratio for these values of the parameter is changed by the observed sample data to yield the corresponding posterior probability ratio. Using this argument, Royall defines the value of the likelihood function for T at an arbitrary value q of this parameter as proportional to the value of the conditional density of y_s given T = q. That is, using L_U(q) to denote this likelihood function,

THE LIKELIHOOD PRINCIPLE IN DESCRIPTIVE INFERENCE 27


L_U(q) ∝ f_U(y_s | T = q) = f_U(q | y_s) f_U(y_s) / f_U(q).

Clearly, this definition is easily extended to give the likelihood for any well-defined function of the population values of Y. In general L_U(q) will depend on the values of nuisance parameters associated with the various densities above. That is, L_U(q) = L_U(q; λ), where λ is unknown. Here Royall suggests calculation of a profile likelihood, defined by

L_U^{profile}(q) = max_λ L_U(q; λ).

Under the likelihood-based approach described by Royall in Chapter 5, the concept of a confidence interval is irrelevant. Instead, one can define regions around the value where the likelihood function is maximised that correspond to alternative parameter values whose associated likelihood values are not too different from the maximum. Conversely, values outside this region are then viewed as rather unlikely candidates for being the actual parameter value of interest. For example, Royall suggests that a value for T whose likelihood ratio relative to the maximum likelihood value T̂ is less than 1/8 should be considered one for which the strength of the evidence in favour of T̂ being the correct value is moderately strong. A value whose likelihood ratio relative to T̂ is less than 1/32 is viewed as one where the strength of the evidence in favour of T̂ being the true value is strong.
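To make the 1/8 and 1/32 benchmarks concrete, the following sketch computes a relative likelihood for a population total T under a simple normal working model with known variance. The data, sample size and variance are invented for illustration, and the finite-population prediction term is ignored (negligible for large N); this is a sketch of the idea, not Royall's full treatment.

```python
import math

# Invented sample data under a working N(mu, sigma^2) model; sigma assumed known.
ys = [4.8, 5.1, 5.3, 4.9, 5.2, 5.0, 4.7, 5.4]
n, N, sigma = len(ys), 1000, 0.25
ybar = sum(ys) / n
T_mle = N * ybar                       # maximum likelihood value of the total T

def rel_likelihood(q):
    """Relative likelihood L_U(q) / L_U(T_mle), ignoring the
    prediction-variance term (negligible here for large N)."""
    mu = q / N
    return math.exp(-n * (ybar - mu) ** 2 / (2 * sigma ** 2))

# Totals whose relative likelihood is at least 1/8: the '1/8 likelihood set'.
half_width = N * sigma * math.sqrt(2 * math.log(8) / n)
lo8, hi8 = T_mle - half_width, T_mle + half_width
```

Values of T outside (lo8, hi8) are those against which the evidence is at least moderately strong on this scale; replacing 8 by 32 gives the stronger benchmark.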

Incorporation of sample design information (denoted by the known population matrix z_U) into this approach is conceptually straightforward. One just replaces the various densities defining the likelihood function L_U by conditional densities, where the conditioning is with respect to z_U. Also, although not expressly considered by Royall, the extension of this approach to the case of informative sampling and/or informative nonresponse would require the nature of the informative sampling method and the nonresponse mechanism also to be taken account of explicitly in the modelling process. In general, the conditional density of T given y_s and the marginal density of y_s in the expression for L_U(q) above would be replaced by the conditional density of T given the actual survey data, i.e. y_obs, r_s, i_U and z_U, and the joint marginal density of these data.


CHAPTER 3

Design-based and Model-based Methods for Estimating Model Parameters

David A. Binder and Georgia R. Roberts

3.1. CHOICE OF METHODS

One of the first questions an analyst asks when fitting a model to data that have been collected from a complex survey is whether and how to account for the survey design in the analysis. In fact, there are two questions that the analyst should address. The first is whether and how to use the sampling weights for the point estimates of the unknown parameters; the second is how to estimate the variance of the estimators required for hypothesis testing and for deriving confidence intervals. (We are assuming that the sample size is sufficiently large that the sampling distribution of the parameter estimators is approximately normal.)

There are a number of schools of thought on these questions. The pure model-based approach would demand that if the model being fitted is true, then one should use an optimal model-based estimator. Normally this would result in ignoring the sample design, unless the sample design is an inherent part of the model, such as for a stratified design where the model allows for different parameter values in different strata. The Bayesian approach discussed in Little (Chapter 4) is an example of this model-based perspective.

As an example of the model-based approach, suppose that under the model it is assumed that the sample observations, y_1, …, y_n, are random variables which, given x_1, …, x_n, satisfy

y_t = x_t' β + ε_t,   t = 1, …, n,   (3.1)

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.


where ε_t has mean 0 and variance σ², and is uncorrelated with ε_{t'} for t ≠ t'. Standard statistical theory would imply that the ordinary least squares estimator for β is the best linear unbiased estimator. In particular, standard theory yields the estimator

β̂ = (X_s' X_s)^{-1} X_s' y_s,   (3.2)

with variance

var(β̂) = σ² (X_s' X_s)^{-1},   (3.3)

where X_s and y_s are based on the sample observations.

Example 1. Estimating the mean

We consider the simplest case of model (3.1) where x_t is a scalar equal to one for all t. In this case β, a scalar, is the expected value of the random variables y_t (t = 1, …, n), and β̂ is ȳ_s, the unweighted mean of the observed sample y-values. Here we ignore any sample design information used for obtaining the n units in our sample. Yet, from the design-based perspective, we know that the unweighted mean can be biased for estimating the finite population mean, ȳ_U. Is this contradictory? The answer lies both in what we are assuming to be the parameter of interest and in what we are assuming to be the randomisation mechanism.

In the model-based approach, where we are interested in making inferences about β, we assume that we have a conceptual infinite superpopulation of y-values, each with scalar mean, β, and variance σ². The observations, y_1, …, y_n, are assumed to be independent realisations from this superpopulation. The model-based sample error is ȳ_s − β, which has mean 0 and variance σ²/n. The sample design is assumed to be ignorable as discussed in Rubin (1976), Scott (1977a) and Sugden and Smith (1984).

In the design-based approach, on the other hand, where the parameter of interest is ȳ_U, the finite population mean, we assume that the n observations are a probability sample from the finite population, y_1, …, y_N. There is no reference to a superpopulation. The randomisation mechanism is dictated by the chosen sampling design, which may include unequal probabilities of selection, clustering, stratification, and so on. We denote by I_t the 0–1 random variable indicating whether or not the tth unit is in the sample. In the design-based approach, all inferences are made with respect to the properties of the random variables I_t (t = 1, …, N), since the quantities y_t (t = 1, …, N) are assumed to be fixed, rather than random, as would be assumed in the model-based setting. From the design-based perspective, the finite population mean is regarded as the descriptive parameter to be estimated. Considerations such as asymptotic design-unbiasedness would lead to including the sampling weights in the estimate of ȳ_U. Much of the traditional sampling theory adopts this approach, since it is the finite population mean (or total) that is of primary interest to the survey statistician.
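The contrast between the two perspectives can be seen in a small simulation sketch (all population sizes and values below are invented for illustration): under disproportionate stratified sampling, the unweighted mean is design biased for the finite population mean, while the design-weighted mean is not.

```python
import random
import statistics as st

random.seed(42)

# Invented finite population: two strata with different y-levels.
N1, N2 = 800, 200
pop = [("A", random.gauss(10, 1)) for _ in range(N1)] + \
      [("B", random.gauss(20, 1)) for _ in range(N2)]
ybar_U = st.mean(y for _, y in pop)      # finite population mean

n1, n2 = 20, 80                          # heavy oversampling of small stratum B
w = {"A": N1 / n1, "B": N2 / n2}         # design weights = 1 / inclusion probability

unweighted, weighted = [], []
for _ in range(500):                     # repeated stratified srswor draws
    s = random.sample([y for h, y in pop if h == "A"], n1) + \
        random.sample([y for h, y in pop if h == "B"], n2)
    hs = ["A"] * n1 + ["B"] * n2
    unweighted.append(st.mean(s))
    weighted.append(sum(w[h] * y for h, y in zip(hs, s)) / (N1 + N2))

bias_unw = st.mean(unweighted) - ybar_U  # large: unweighted mean is design biased
bias_w = st.mean(weighted) - ybar_U      # near zero: weighted mean is design unbiased
```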

30 DESIGN-BASED METHODS FOR ESTIMATING MODEL PARAMETERS


At this point, it might appear that the design-based and model-based approaches are irreconcilable. We will show this not to be true. In Section 3.2 we introduce linear parameters and their estimators in the design-based and model-based contexts. We review the biases of these estimators under both the design-based and model-based randomisation mechanisms.

In Section 3.3 we introduce design-based and total variances and we compare the results under the design-based and model-based approaches. The total variance is based on considering the variation due to both the model and the survey design. This comparison leads to some general results on the similarities and differences of the two approaches. We show that the design-based approach often leads to variances that are close to the total variance, even though a model is not used in the design-based approach. These results are similar to those in Molina, Smith and Sugden (2001).

The basic results for linear parameters and linear estimators are extended to more complex cases in Section 3.4. The traditional model-based approach is explored in Section 3.5, where it is assumed that the realisation of the randomisation mechanism used to select the units in the sample may be ignored for making model-based inferences. We also introduce an estimating function framework that leads to a closer relationship between the pure model-based and the pure design-based approaches. In Section 3.6 we study the implications of taking the 'wrong' approach and we discuss why the design-based approach may be more robust for model-based inferences, at least for large samples and small sampling fractions. We summarise our conclusions in Section 3.7.

3.2. DESIGN-BASED AND MODEL-BASED LINEAR ESTIMATORS

We consider the case where there is a superpopulation model which generates the finite population values, y_1, …, y_N. We suppose that, given the realised finite population, a probability sample is selected from it using some, possibly complex, sampling scheme. This sampling scheme may contain stratification, clustering, unequal probabilities of selection, and so on. Our sample can thus be viewed as a second phase of sampling from the original superpopulation.

In our discussions about asymptotics for finite populations, we are assuming that we have a sequence of finite populations of sizes increasing to infinity and that for each finite population, we take a sample such that the sample size also increases to infinity. In the case of multi-stage (or multi-phase) sampling, it is the number of first-stage (or first-phase) units that we are assuming to be increasing to infinity. We do not discuss the regularity conditions for the asymptotic normality of the sampling distributions; instead we simply assume that for large samples, the normal distribution is a reasonable approximation. For more details on asymptotics from finite populations, see, for example, Sen (1988). In this and subsequent sections, we make frequent use of the o-notation to denote various types of convergence to zero. First of all, we use o(1) a.s. to denote a quantity with mean 0 that converges to zero almost surely as the population and sample sizes increase. Next, to distinguish among various types of convergence in probability under different

DESIGN-BASED AND MODEL-BASED LINEAR ESTIMATORS 31


randomisation mechanisms, we use the somewhat unconventional notation o_p(1), o_{pξ}(1) and o_ξ(1) for convergence in probability under the p-, pξ- and ξ-randomisations, respectively. (We define these randomisation mechanisms in subsequent sections.)

We first focus on linear estimators. As we will see in Section 3.4, many of the results for more complex estimators are based on the asymptotic distributions of linear estimators. To avoid complexities of notation, we consider the simplest case of univariate parameters of interest.

3.2.1. Parameters of interest

Under the model, we define μ_t as

μ_t = E_ξ(y_t),   (3.4)

where we use the subscript ξ to denote expectation under the model. (To denote the expectation under the design we will use the subscript p.) The model parameter that we consider is

θ = n^{-1} Σ_{t∈s} μ_t,   (3.5)

which is the expectation of the sample mean, ȳ_s, when the sample design is ignorable. Normally, modellers are interested in θ only when the μ_t are all equal; however, to allow more generality for complex situations, we are not restricting ourselves to that simple case here. Defining μ̄_U = N^{-1} Σ_{t=1}^{N} μ_t, we assume that θ has the property that

θ − μ̄_U = o(1),   (3.6)

that is, that θ converges to μ̄_U. This important assumption is reasonable when the sample design is ignorable. We discuss the estimation of θ for the case of nonignorable sample designs in Section 3.6.

The design-based parameter of interest that is analogous to θ is the finite population mean,

b = ȳ_U = N^{-1} Σ_{t=1}^{N} y_t.   (3.7)

This is commonly considered as the finite population parameter of interest in the design-based framework.

3.2.2. Linear estimators

In general, a model-based linear estimator of θ can be expressed as

θ̂ = Σ_{t∈s} c_t y_t,   (3.8)

where the c_t are determined so that θ̂ has good model-based properties. In particular, we assume that the c_t are chosen so that

E_ξ(θ̂) − θ = o(1),   (3.9)


which, when combined with (3.6), implies that

E_ξ(θ̂) − μ̄_U = o(1).   (3.10)

On the other hand, the usual design-based linear estimator of ȳ_U has the form

b̂ = Σ_{t∈s} d_t y_t,   (3.11)

where the d_t are chosen so that b̂ has good design-based properties. In particular, we assume that

E_p(d_t I_t) = N^{-1} + o(N^{-1}),   (3.12)

so that

E_p(b̂) − ȳ_U = o(1).   (3.13)

3.2.3. Properties of θ̂ and b̂

Let us now consider θ̂ and b̂ from the perspective of estimating b. It follows from (3.13) that

E_p(b̂) − b = o(1).   (3.14)

Also,

E_p(θ̂) = Σ_{t=1}^{N} E_p(c_t I_t) y_t.   (3.15)

We see that θ̂ is not necessarily asymptotically design unbiased for b; the condition for this is that

E_p(θ̂) − b = o(1).   (3.16)

We now show that this condition holds when the model is true. Since, by the strong law of large numbers (under the model),

b = ȳ_U = μ̄_U + o(1) a.s.,   (3.17)

it follows from (3.10) and (3.17) that

E_ξ(θ̂) − b = o(1) a.s.   (3.18)

Again, by the strong law of large numbers (under the model) and (3.18), we have

E_p(θ̂) − b = E_ξ[E_p(θ̂) − b] + o(1)
           = E_p[E_ξ(θ̂) − b] + o(1)
           = o(1),   (3.19)

which means that condition (3.16) holds when the model is true.


We conclude, therefore, that when the data have been generated according to the assumed model, not only does θ̂ have good model-based properties for estimating θ, but θ̂ is also approximately design unbiased for b.

Using similar arguments, it may be shown that E_ξ(b̂) − b = o(1), so that b̂ is approximately model unbiased for the finite population quantity b. Additionally, we can see that the model expectations and the design expectations of both θ̂ and b̂ all converge to the same quantity, μ̄_U.

Example 2

Suppose that under the model all the μ_t are equal to μ, and the y_t have equal variance, σ², and are uncorrelated. The best linear unbiased estimator of μ is θ̂ = ȳ_s, the unweighted sample mean. If our sample design is a stratified random sample from two strata with n_1 units selected from the first stratum (stratum size N_1) and n_2 units selected from the second stratum (stratum size N_2), then the usual design-based estimator for ȳ_U is

b̂ = (N_1 ȳ_{1s} + N_2 ȳ_{2s})/N,   (3.20)

where ȳ_{1s} and ȳ_{2s} are the respective stratum sample means. Now,

E_p(θ̂) = (n_1 ȳ_{1U} + n_2 ȳ_{2U})/n,   (3.21)

where ȳ_{1U} and ȳ_{2U} are the respective stratum means, while

E_p(b̂) = (N_1 ȳ_{1U} + N_2 ȳ_{2U})/N = ȳ_U.   (3.22)

In general, for disproportionate sampling between strata, the design-based mean of θ̂ is not equal to ȳ_U.

We see here that the model-based estimator may be design inconsistent for the finite population parameter of interest. However, when the model holds, E_ξ(ȳ_{1U}) = E_ξ(ȳ_{2U}) = μ, so that E_ξ[E_p(θ̂)] = μ and, by the strong law of large numbers, E_p(θ̂) − μ = o(1). As well, under the model, E_ξ(ȳ_U) = μ, so that by the strong law of large numbers, b = μ + o(1). Thus, E_p(θ̂) − b = o(1), implying that when the model holds, we do achieve asymptotic design unbiasedness for θ̂.

Many modellers do not consider the design-based expectations to be relevant for their analysis; these modellers are more interested in the 'pure' model-based expectations which ignore the randomisation due to the sample design. This situation will be considered in Section 3.5.
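The algebra of Example 2 can be checked directly for an invented set of stratum means and allocations (the numbers below are illustrative, not from the text):

```python
# Check of Example 2's design expectations with invented numbers.
N1, N2 = 900, 100
n1, n2 = 50, 50                             # disproportionate allocation
y1U, y2U = 10.0, 30.0                       # stratum population means
N, n = N1 + N2, n1 + n2

ybar_U = (N1 * y1U + N2 * y2U) / N          # finite population mean b
Ep_theta_hat = (n1 * y1U + n2 * y2U) / n    # design mean of the unweighted mean
Ep_b_hat = (N1 * y1U + N2 * y2U) / N        # design mean of the stratified estimator
```

With these numbers Ep_b_hat recovers ȳ_U exactly, while Ep_theta_hat does not, because the allocation is far from proportional.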

3.3. DESIGN-BASED AND TOTAL VARIANCES OF LINEAR ESTIMATORS

3.3.1. Design-based and total variance of b̂

The observed units could be considered to be selected through two phases of sampling – the first phase yielding the finite population generated by the superpopulation model, and the second phase yielding the sample selected from the finite population. Analysts may then be interested in three types of


randomisation mechanisms. The pure design-based approach conditions on the outcome of the first phase, so that the finite population values are considered fixed constants. The pure model-based approach conditions on the sample design, so that the I_t are considered fixed; this approach will be discussed further in Section 3.5. An alternative to these approaches is to consider both phases of sampling to be random. This case has been examined in the literature; see, for example, Hartley and Sielken (1975) and Molina, Smith and Sugden (2001). We define the total variance to be the variance over the two phases of sampling.

We now turn to examine the variance of b̂. The pure design-based variance is given by

var_p(b̂) = Σ_{t=1}^{N} Σ_{u=1}^{N} cov_p(d_t I_t, d_u I_u) y_t y_u,   (3.23)

where cov_p(d_t I_t, d_u I_u) denotes the design-based covariance of d_t I_t and d_u I_u. Under a wide set of models, it can be assumed that

E_ξ[var_p(b̂)] = O_ξ(n^{-1}),

where O_ξ refers to the probability limit under the model. This assumption is certainly true when the model is based on independent and identically distributed observations with finite variance, but it is also true under weaker conditions. Therefore, assuming the sampling fraction, n/N, to be o(1), and denoting the total variance of b̂ by var_{pξ}(b̂), we have

var_{pξ}(b̂) = E_ξ[var_p(b̂)] + var_ξ[E_p(b̂)] = E_ξ[var_p(b̂)] + var_ξ(ȳ_U) + o(N^{-1}),   (3.24)

where

σ_{tu} = cov_ξ(y_t, y_u)   (3.25)

and

var_ξ(ȳ_U) = N^{-2} Σ_{t=1}^{N} Σ_{u=1}^{N} σ_{tu}.   (3.26)

If we assume that var_ξ(ȳ_U) is O(N^{-1}) then

var_{pξ}(b̂) = E_ξ[var_p(b̂)] + O(N^{-1}).   (3.27)

Note that assuming that var_ξ(ȳ_U) is O(N^{-1}) is weaker than the commonly used assumption that the random variables associated with the finite population units are independent and identically distributed realisations with finite variance from the superpopulation.

When var_ξ(ȳ_U) is O(N^{-1}), we see that var_{pξ}(b̂) and E_ξ[var_p(b̂)] are both asymptotically equal. Thus, a design-based estimator of var_p(b̂), given by v_p(b̂), would be suitable for estimating var_{pξ}(b̂), provided v_p(b̂) is asymptotically model unbiased for E_ξ[var_p(b̂)]. This is the case when v_p(b̂) converges to var_p(b̂). For a more complete discussion of design-based variance estimation, see Wolter (1985).
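The two-phase decomposition can be sketched in closed form for the simplest case of an iid(μ, σ²) model with simple random sampling without replacement, where all the terms are available analytically (the numerical values below are invented):

```python
# Two-phase variance identity for the sample mean under an iid(mu, sigma^2)
# model and simple random sampling without replacement.
sigma2, N, n = 4.0, 10_000, 100
f = n / N                                # sampling fraction

E_xi_var_p = (1 - f) * sigma2 / n        # model expectation of the design variance
var_xi_E_p = sigma2 / N                  # model variance of E_p(ybar_s) = ybar_U
total_var = E_xi_var_p + var_xi_E_p      # variance over both phases: sigma2 / n
```

The second term is O(N^{-1}) and is negligible relative to the first when the sampling fraction is small, which is exactly why the design-based variance is close to the total variance in this case.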

DESIGN-BASED AND TOTAL VARIANCES OF LINEAR ESTIMATORS 35


3.3.2. Design-based mean squared error of θ̂ and its model expectation

Design-based survey statisticians do not normally use θ̂ to estimate b, since θ̂ is often design-inconsistent for b. However, design-based survey statisticians should not discard θ̂ without first considering its other properties, such as its mean squared error. Rather than examining the mean squared error in general, we continue with the situation introduced in Example 2.

Example 3

In Example 2, we had

θ̂ = ȳ_s = (n_1 ȳ_{1s} + n_2 ȳ_{2s})/n,   (3.28)

the unweighted sample mean. Our sample design was a stratified random sample from two strata with n_1 units selected from the first stratum (stratum size N_1) and n_2 units selected from the second stratum (stratum size N_2). Since E_p(θ̂) = (n_1 ȳ_{1U} + n_2 ȳ_{2U})/n, the design-based bias of θ̂ in estimating b is

E_p(θ̂) − b = γ(ȳ_{1U} − ȳ_{2U}),   (3.29)

where

γ = n_1/n − N_1/N.   (3.30)

The design-based variance of b̂, ignoring, for simplicity, the finite population correction factors (by assuming the stratum sample sizes to be small relative to the stratum sizes), is

var_p(b̂) = (N_1² v_{1U}/n_1 + N_2² v_{2U}/n_2)/N²,   (3.31)

where v_{1U} and v_{2U} are the respective stratum population variances. The design-based mean squared error of θ̂ would therefore be smaller than the design-based variance of b̂ when

γ²(ȳ_{1U} − ȳ_{2U})² + (n_1 v_{1U} + n_2 v_{2U})/n² < (N_1² v_{1U}/n_1 + N_2² v_{2U}/n_2)/N².   (3.32)

Condition (3.32) is generally not easy to verify; however, we may consider the model expectation of the design-based mean squared error of θ̂ and of the design-based variance of b̂. Under the assumed model, we have

E_ξ[MSE_p(θ̂)] = γ²σ²(1/N_1 + 1/N_2) + σ²/n,   (3.33)

which is approximately equal to σ²/n when N_1 and N_2 are large compared with n. On the other hand, the model expectation of the design-based variance is given by

E_ξ[var_p(b̂)] = σ²(N_1²/n_1 + N_2²/n_2)/N²,   (3.34)

which is at least σ²/n, with equality only when the allocation is proportional (n_1/N_1 = n_2/N_2).


Thus, when the model is true and when the sampling fractions in the two strata are unequal, the design-based mean squared error of θ̂ is expected to be smaller than the design-based variance of b̂.

Example 3 illustrates that when the model holds, θ̂ may be better than b̂ as an estimator of b, from the point of view of having the smaller design-based mean squared error. However, as pointed out by Hansen, Madow and Tepping (1983), model-based methods can lead to misleading results even when the data seem to support the validity of the model.
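The model expectations in Example 3 can be evaluated numerically for an invented allocation (the values below are illustrative and deliberately disproportionate):

```python
# Compare the model expectation of MSE_p(theta_hat), approximately sigma^2/n,
# with the model expectation of var_p(b_hat) under disproportionate allocation,
# ignoring finite population corrections (invented numbers).
sigma2 = 1.0
N1, N2, n1, n2 = 5_000, 5_000, 90, 10
N, n = N1 + N2, n1 + n2

gamma = n1 / n - N1 / N
mse_theta = gamma**2 * sigma2 * (1 / N1 + 1 / N2) + sigma2 / n  # E_xi[MSE_p(theta_hat)]
var_b = sigma2 * (N1**2 / n1 + N2**2 / n2) / N**2               # E_xi[var_p(b_hat)]
```

Here mse_theta is close to σ²/n, while var_b is noticeably larger because the allocation is far from proportional.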

3.4. MORE COMPLEX ESTIMATORS

3.4.1. Taylor linearisation of non-linear statistics

In Sections 3.2 and 3.3 we restricted our discussion to the case of linear estimators of the parameter of interest. However, many methods of data analysis deal with more complex statistics. In this section we show that the properties of many of these more complex quantities are asymptotically equivalent to the properties of linear estimators, so that the discussions in Sections 3.2 and 3.3 can be extended to the more complex situations. We take a Taylor linearisation approach here.

3.4.2. Ratio estimation

To introduce the basic notions, we start with the example of the estimation of a ratio. The quantities y_t and x_t are assumed to have model expectations μ_y and μ_x, respectively. We first consider the properties of the model-based estimator, ȳ_s/x̄_s, of the ratio parameter, R = μ_y/μ_x. Assuming the validity of a first-order Taylor expansion, such as when the y-values and x-values have finite model variances, we have

ȳ_s/x̄_s − R = μ_x^{-1}(ȳ_s − R x̄_s) + o_ξ(n^{-1/2}).   (3.35)

We now introduce the pseudo-variables, u_t = u(y_t, x_t), given by

u(y_t, x_t) = μ_x^{-1}(y_t − R x_t),   t = 1, …, N.   (3.36)

(While the u_t are defined for all N units in the finite population, only those for the n units in the sample are required in the model-based framework.) The sample mean of the u_t has, asymptotically, the same model expectation and variance as the left hand side of expression (3.35) since

ū_s = n^{-1} Σ_{t∈s} u(y_t, x_t) = μ_x^{-1}(ȳ_s − R x̄_s) = ȳ_s/x̄_s − R + o_ξ(n^{-1/2}),   (3.37)

MORE COMPLEX ESTIMATORS 37


so that the model-based properties of ȳ_s/x̄_s are asymptotically equivalent to those of the linear estimators discussed in Sections 3.2 and 3.3. The design-based expectation of ū_s is

E_p(ū_s) = n^{-1} Σ_{t=1}^{N} π_t u(y_t, x_t),   (3.38)

where π_t is the probability that the tth unit is in the sample. Since E_ξ[u(y_t, x_t)] = 0 for t = 1, …, n, E_p(ū_s), in turn, has model expectation

E_ξ[E_p(ū_s)] = n^{-1} Σ_{t=1}^{N} π_t E_ξ[u(y_t, x_t)] = 0.   (3.39)

Therefore, the pξ-expectation of ȳ_s/x̄_s converges to the parameter of interest, R, when the model holds.

Similarly, ȳ_w/x̄_w, a design-based estimator of the ratio of the finite population means, may be linearised as

ȳ_w/x̄_w − ȳ_U/x̄_U = x̄_U^{-1}(ȳ_w − (ȳ_U/x̄_U) x̄_w) + o_p(n^{-1/2}),   (3.40)

where u_U(y_t, x_t) = x̄_U^{-1}(y_t − (ȳ_U/x̄_U) x_t) is defined analogously to u(y_t, x_t) in (3.36) for t = 1, …, N. Since Σ_{t=1}^{N} u_U(y_t, x_t) = 0, the design-based expectation and variance of ȳ_w/x̄_w are

E_p(ȳ_w/x̄_w) = ȳ_U/x̄_U + o(n^{-1/2})   (3.41)

and

var_p(ȳ_w/x̄_w) = var_p(Σ_{t∈s} d_t u_U(y_t, x_t)) + o(n^{-1}).   (3.42)

We can now derive the total variance of ȳ_w/x̄_w. Assuming that var_ξ(ȳ_U/x̄_U) is O(N^{-1}) and that n/N is negligible, we have

var_{pξ}(ȳ_w/x̄_w) = E_ξ[var_p(ȳ_w/x̄_w)] + o(n^{-1}).   (3.43)

Therefore, the design-based variance of ȳ_w/x̄_w is asymptotically model unbiased for its total variance.
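A small simulation (with an invented superpopulation) illustrates the linearisation for the ratio: the sample mean of the pseudo-variables reproduces the variability of the ratio estimator.

```python
import random
import statistics as st

random.seed(7)
R_true, mu_x, n = 2.0, 2.0, 200          # invented superpopulation constants

def draw_sample():
    xs = [random.uniform(1, 3) for _ in range(n)]         # so that mu_x = 2
    ys = [R_true * x + random.gauss(0, 0.5) for x in xs]
    return xs, ys

ratios, ubars = [], []
for _ in range(2000):
    xs, ys = draw_sample()
    xbar, ybar = st.mean(xs), st.mean(ys)
    ratios.append(ybar / xbar)                     # non-linear statistic
    ubars.append((ybar - R_true * xbar) / mu_x)    # mean of linearised pseudo-variables

var_ratio = st.pvariance(ratios)
var_u = st.pvariance(ubars)    # linearised (delta-method) counterpart
```

Across the replicates the two variances agree closely, as the asymptotic equivalence above predicts.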


3.4.3. Non-linear statistics – explicitly defined statistics

The same linearisation method can be used for a wide class of non-linear statistics. For example, suppose that we have a simple random sample of size n and that our estimator can be expressed as a differentiable function, g, of sample means, ȳ_s = (ȳ_{1s}, …, ȳ_{ms})'. Letting μ_j be the model expectation of y_{tj}, where y_{tj} is the value of the jth variable for the tth unit, we have

g(ȳ_s) − g(μ) = Σ_{j=1}^{m} g_j(μ)(ȳ_{js} − μ_j) + o_ξ(n^{-1/2}),   (3.44)

where μ = (μ_1, …, μ_m)' and g_j(μ) = ∂g(μ)/∂μ_j. We define pseudo-variables u_t, t = 1, …, N, as

u_t = g(μ) + Σ_{j=1}^{m} g_j(μ)(y_{tj} − μ_j),   (3.45)

so that

g(ȳ_s) = ū_s + o_ξ(n^{-1/2}).   (3.46)

This implies that inferences about g(ȳ_s) are asymptotically equivalent to inferences on ū_s. For example,

E_ξ[g(ȳ_s)] = E_ξ(ū_s) + o(n^{-1/2}) = g(μ) + o(n^{-1/2})   (3.47)

and

var_ξ[g(ȳ_s)] = var_ξ(ū_s) + o(n^{-1}),   (3.48)

since E_ξ(ū_s) = g(μ). The design-based analogues of μ and ȳ_s are ȳ_U = (ȳ_{1U}, …, ȳ_{mU})' and ȳ_w = (ȳ_{1w}, …, ȳ_{mw})', respectively, where we denote the finite population means by ȳ_{jU}, for j = 1, …, m, and we let ȳ_{jw} be the design-based estimator of ȳ_{jU}. We have

g(ȳ_w) − g(ȳ_U) = Σ_{j=1}^{m} g_j(ȳ_U)(ȳ_{jw} − ȳ_{jU}) + o_p(n^{-1/2}).   (3.49)

Taking the design-based mean, we have

E_p[g(ȳ_w)] = g(ȳ_U) + o(n^{-1/2}),   (3.50)

and its model expectation is

E_ξ{E_p[g(ȳ_w)]} = E_ξ[g(ȳ_U)] + o(n^{-1/2}) = g(μ) + o(n^{-1/2}).   (3.51)

For the design-based variance, we have

var_p[g(ȳ_w)] = var_p(Σ_{t∈s} d_t u_{Ut}) + o(n^{-1}),   (3.52)

where u_{Ut} is defined analogously to u_t in (3.45), with ȳ_U in place of μ.


For the case where n/N is negligible, and assuming that var_ξ[g(ȳ_U)] is O(N^{-1}), the total variance of g(ȳ_w) is

var_{pξ}[g(ȳ_w)] = E_ξ{var_p[g(ȳ_w)]} + o(n^{-1}),   (3.53)

since, under the model, E_ξ[g(ȳ_U)] = g(μ) + o(1). Therefore, the design-based variance of g(ȳ_w) is asymptotically model unbiased for its total variance.
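The pseudo-variable device for a smooth function g of a sample mean can be checked by simulation; the sketch below takes g = log, with all constants invented:

```python
import math
import random
import statistics as st

random.seed(11)
mu, sd, n = 3.0, 0.6, 150            # invented superpopulation constants

def g(m):
    return math.log(m)               # a smooth function of the sample mean

def gprime(m):
    return 1 / m

gvals, ubars = [], []
for _ in range(3000):
    ys = [random.gauss(mu, sd) for _ in range(n)]
    ybar = st.mean(ys)
    gvals.append(g(ybar))
    # Mean of pseudo-variables u_t = g(mu) + g'(mu)(y_t - mu)
    ubars.append(g(mu) + gprime(mu) * (ybar - mu))

var_g = st.pvariance(gvals)
var_u = st.pvariance(ubars)          # linearised (delta-method) variance
```

The variance of the linearised pseudo-variable mean tracks the variance of g of the sample mean, to the order claimed by the expansion.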

3.4.4. Non-linear statistics � defined implicitly by score statistics

We can now extend our discussion to parameters that are defined implicitly through estimating equations. We suppose that, under the model, the vector-valued y_t are independent with density function given by f_t(y_t; θ). A model-based approach using maximum likelihood estimation, assuming ignorable sampling, yields θ̂ as the solution to the likelihood equations:

Σ_{t∈s} ∂ log f_t(y_t; θ)/∂θ |_{θ=θ̂} = 0.   (3.54)

Taking a first-order Taylor expansion of the left hand side of expression (3.54) around θ̂ = θ and rearranging terms, we obtain

θ̂ − θ = [ −Σ_{t∈s} ∂² log f_t(y_t; θ)/∂θ∂θ' ]^{-1} Σ_{t∈s} ∂ log f_t(y_t; θ)/∂θ + o_ξ(n^{-1/2}).   (3.55)

Since, for most models, we have

−n^{-1} Σ_{t∈s} ∂² log f_t(y_t; θ)/∂θ∂θ' = J(θ) + o_ξ(1),   (3.56)

where J(θ) denotes the average information matrix, it follows that

θ̂ − θ = n^{-1} J(θ)^{-1} Σ_{t∈s} ∂ log f_t(y_t; θ)/∂θ + o_ξ(n^{-1/2}).   (3.57)

Therefore, by defining vector-valued pseudo-variables,

u_t(θ) = J(θ)^{-1} ∂ log f_t(y_t; θ)/∂θ,   (3.58)

we see that the sample mean of these pseudo-variables has the same asymptotic expectation and variance as θ̂ − θ. In particular, we have

ū_s = n^{-1} Σ_{t∈s} u_t(θ) = θ̂ − θ + o_ξ(n^{-1/2}).   (3.59)


This implies that inferences about θ̂ − θ are asymptotically equivalent to inferences on ū_s. For example, we have

E_ξ(θ̂ − θ) = E_ξ(ū_s) + o(n^{-1/2}) = o(n^{-1/2})   (3.60)

and

var_ξ(θ̂ − θ) = var_ξ(ū_s) + o(n^{-1}),   (3.61)

when we assume that E_ξ[u_t(θ)] = 0, which is one of the usual regularity conditions for maximum likelihood estimation.
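For a concrete score-statistic example, take the Poisson model: the score is ∂ log f/∂θ = y/θ − 1, the information is J(θ) = 1/θ, and the pseudo-variables reduce to u_t(θ) = y_t − θ, so that ū_s equals θ̂ − θ exactly. The sketch below (with invented parameter values) checks this:

```python
import math
import random
import statistics as st

random.seed(3)
theta, n = 4.0, 500                  # invented Poisson mean and sample size

def rpois(lam):
    """Poisson draw via Knuth's multiplication method (fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

ys = [rpois(theta) for _ in range(n)]
theta_hat = st.mean(ys)              # solves the Poisson score equation exactly
# Pseudo-variables u_t(theta) = J(theta)^{-1} * score_t = y_t - theta
ubar = st.mean(y - theta for y in ys)
```

Here the equivalence in (3.59) is exact rather than asymptotic, which makes the Poisson case a convenient check of the pseudo-variable construction.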

For the case of design-based inference, following the approach in Binder (1983), our finite population parameter of interest is b, the solution to the likelihood equations (3.54) when all the population units are observed. Our estimate, b̂, of the finite population parameter of interest, b, is the solution to the design-based version of (3.54), given by

Σ_{t∈s} d_t ∂ log f_t(y_t; θ)/∂θ |_{θ=b̂} = 0.   (3.62)

Taking a Taylor expansion of the left hand side of (3.62) around b̂ = b, we obtain

b̂ − b = Σ_{t∈s} d_t u_{Ut}(b) + o_p(n^{-1/2}),   (3.63)

where

u_{Ut}(b) = J_U(b)^{-1} ∂ log f_t(y_t; b)/∂b,   J_U(b) = −N^{-1} Σ_{t=1}^{N} ∂² log f_t(y_t; b)/∂b∂b'.   (3.64)

Since b is the maximum likelihood estimate of θ, defined analogously to (3.54) for the finite population, we have the design expectation of b̂, given by

E_p(b̂) = b + o(n^{-1/2}),   (3.65)

which has model expectation given by

For the design-based variance, we have

We can now derive the total variance matrix of b̂. For the case where n/N is negligible, we have

MORE COMPLEX ESTIMATORS 41

E_ξ(θ̂) = E_ξ(z̄_s) + o(n^{-1/2})    (3.60)

var_ξ(θ̂) = var_ξ(z̄_s) + o(n^{-1})    (3.61)

N^{-1} Σ_{t∈s} w_t u(y_t; b̂) = 0    (3.62)

b̂ = b + Ĵ(b)^{-1} N^{-1} Σ_{t∈s} w_t u(y_t; b) + o_p(n^{-1/2})    (3.63)

Ĵ(b) = −N^{-1} Σ_{t∈s} w_t ∂u(y_t; b)/∂θ′    (3.64)

E_p(b̂) = b + o(n^{-1/2})    (3.65)

E_ξ E_p(b̂) = E_ξ(b) + o(n^{-1/2}) = θ + o(n^{-1/2})    (3.66)

var_p(b̂) = J_U(b)^{-1} var_p[ N^{-1} Σ_{t∈s} w_t u(y_t; b) ] J_U(b)^{-1} + o(n^{-1}),  where J_U(b) = −N^{-1} Σ_{t∈U} ∂u(y_t; b)/∂θ′    (3.67)

var_{pξ}(b̂) = E_ξ[var_p(b̂)] + o(n^{-1})    (3.68)
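To make the design-weighted estimating equation (3.62) concrete, here is a hedged sketch, my own and not from the chapter, of solving Σ_s w_t u(y_t; b̂) = 0 by Newton-Raphson under a hypothetical exponential working model with u(y; b) = 1/b − y. In this special case b̂ also has the closed form Σ_s w_t / Σ_s w_t y_t, which the iteration can be checked against.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(scale=0.5, size=500)   # sample y-values (hypothetical data)
w = rng.uniform(1.0, 3.0, size=500)        # survey weights w_t (hypothetical)

def u(y, b):
    """Score of an exponential(rate b) working model."""
    return 1.0 / b - y

b = 1.0                                    # starting value for Newton-Raphson
for _ in range(50):
    S = np.sum(w * u(y, b))                # left hand side of the weighted estimating equation
    dS = -np.sum(w) / b**2                 # its derivative with respect to b
    b_next = b - S / dS
    if abs(b_next - b) < 1e-12:
        b = b_next
        break
    b = b_next

print(b)                                   # agrees with the closed form sum(w) / sum(w * y)
```

For models without a closed form (logistic regression, say) the same loop applies with vector-valued scores and a matrix derivative.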


since E_ξ(b) = θ + o(n^{-1/2}) and it has been assumed that var_ξ(b) = O(N^{-1}). We see, therefore, that the discussion in Section 3.3 can often be extended to non-linear parameters and estimators. This implies that the p-expectations of both θ̂ and b̂ are often equal to the model parameter of interest and that the design-based variance matrix of b̂ is asymptotically model unbiased for its total variance matrix.

3.4.5. Total variance matrix of b̂ for non-negligible sampling fractions

Much of the discussion with respect to the behaviour of the sampling variances and covariances has assumed that the sampling fractions are negligible. We conclude this section with an examination of the total variance matrix of b̂ when the sampling fractions are not small.

We suppose that b̂ has the property that, conditional on the realised finite population, the asymptotic design-based sampling distribution of √n(b̂ − b) has mean 0 and has variance matrix var_p[√n(b̂ − b)], which is O(1). Now, conditional on the finite population, the asymptotic design-based sampling distribution of √n(b̂ − θ) has mean √n(b − θ) and variance matrix var_p[√n(b̂ − b)]. As a result, the asymptotic distribution of √n(b̂ − θ) has total variance matrix

For most models, var_ξ[√N(b − θ)] is O(1). We can thus see that when the sampling fraction, n/N, is small, the second term in Equation (3.69) may be ignored, and an asymptotically design-unbiased estimator of var_p[√n(b̂ − b)] is asymptotically model unbiased for the total variance. However, when n/N is not negligible (for example, in situations where the sampling fractions are large in at least some strata), var_ξ[√N(b − θ)] must also be estimated in order to estimate the total variance matrix of b̂. Model-based methods must be used to estimate this term, since the design-based variance matrix of the finite population constant, b, is zero. Examples of correcting the estimates of total variance under more complex survey designs are given in Korn and Graubard (1998a).
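A small simulation, my own sketch with hypothetical parameter values, illustrates why the second term matters. For simple random sampling of n = 1000 from populations of N = 2000 generated with θ = 0 and σ² = 1, the two components of the total variance of b̂ about θ, namely E_ξ[var_p(b̂)] and var_ξ(b) = σ²/N, turn out to be the same size.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma2 = 0.0, 1.0
N, n, reps = 2000, 1000, 4000            # sampling fraction n/N = 1/2

var_p_term = np.empty(reps)              # var_p(b_hat) for each realised population (SRS formula)
b_vals = np.empty(reps)                  # the finite population parameter b = population mean
for r in range(reps):
    yU = rng.normal(theta, np.sqrt(sigma2), size=N)
    b_vals[r] = yU.mean()
    var_p_term[r] = (1 - n / N) * yU.var(ddof=1) / n

# Total variance of b_hat about theta = E_xi[var_p(b_hat)] + var_xi(b).
# Here both terms are about sigma2/(2n) = sigma2/N, so dropping var_xi(b) halves the estimate.
print(var_p_term.mean(), b_vals.var())
```

With n/N small the second quantity shrinks relative to the first, which is the case where the pure design-based variance suffices.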

3.5. CONDITIONAL MODEL-BASED PROPERTIES

3.5.1. Conditional model-based properties of ŷ

Many model-based analysts would not consider the p-expectation or the total variance as the relevant quantities of interest for inferences, since they assume that the sample design is ignorable; see Rubin (1976), Scott (1977a), Sugden and Smith (1984) and Pfeffermann (1993) for discussions on ignorable sample designs. Since model-based properties are discussed at length in the standard literature, we focus here, instead, on the properties of ŷ from the


var_{pξ}[√n(b̂ − θ)] = E_ξ{ var_p[√n(b̂ − b)] } + (n/N) var_ξ[√N(b − θ)]    (3.69)


perspective of these model-based analysts. We thus consider the expectations and variances conditional on the sample design; that is, conditional on the random variables I_t (t = 1, . . . , N), and, therefore, implicitly conditional on auxiliary variables associated with the selected units.

3.5.2. Conditional model-based expectations

Considering the I_t to be fixed quantities, so that we are conditioning on the randomisation under the design, we have

For large samples, by the strong law of large numbers (under the design), E_ξ(ŷ) converges to its design expectation. Therefore,

We see that if N^{-1} Σ_{t∈s} w_t μ_t = μ̄ + o(1), as is normally the case, then ŷ is asymptotically conditionally model unbiased for μ̄.

3.5.3. Conditional model-based variance for ŷ and the use of estimating functions

The model-based variance for ŷ is given by

If we make the further assumption that, for large samples, var_ξ(ŷ) converges to its design expectation, that is, var_ξ(ŷ) = E_p[var_ξ(ŷ)] + o(n^{-1}), then we have

Now, comparing (3.27) and (3.73), we see that if var_p(N^{-1} Σ_{t∈s} w_t μ_t) = o(n^{-1}), then the total variance of ŷ, given by var_{pξ}(ŷ − ȳ_U), and the model-based variance of ŷ, given by var_ξ(ŷ), are both asymptotically equal. From (3.27) we have that var_{pξ}(ŷ − ȳ_U) = E_ξ[var_p(ŷ)] + o(n^{-1}). Therefore, if, under the model, the design-based variance of ŷ, given by var_p(ŷ), converges to its model expectation, E_ξ[var_p(ŷ)], then we also have that var_p(ŷ) is asymptotically equal to var_ξ(ŷ) when var_p(N^{-1} Σ_{t∈s} w_t μ_t) = o(n^{-1}).

In general, the condition var_p(N^{-1} Σ_{t∈s} w_t μ_t) = o(n^{-1}) depends on both the model and the design. For example, if μ_t = μ for all t and if we have a fixed sample size design with w_t = N/n, then we have the even stronger condition that var_p(N^{-1} Σ_{t∈s} w_t μ_t) = 0.


E_ξ(ŷ) = N^{-1} Σ_{t∈s} w_t μ_t    (3.70)

E_ξ(ŷ) = E_p[E_ξ(ŷ)] + o(1) = E_ξ[E_p(ŷ)] + o(1) = E_ξ(ȳ_U) + o(1) = μ̄ + o(1)    (3.71)

var_ξ(ŷ) = N^{-2} Σ_{t∈s} Σ_{t′∈s} w_t w_{t′} σ_{tt′}    (3.72)

var_ξ(ŷ) = E_p[var_ξ(ŷ)] + o(n^{-1}) = N^{-2} Σ_{t∈U} Σ_{t′∈U} π_{tt′} w_t w_{t′} σ_{tt′} + o(n^{-1})    (3.73)
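A quick numerical sketch of (3.70), with hypothetical numbers of my own, assuming a common model mean μ and SRS weights w_t = N/n: holding one realised sample fixed and redrawing only the y_t from the model, the conditional model expectation of ŷ is μ(N^{-1} Σ_s w_t) = μ.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, mu = 10000, 500, 5.0
w = np.full(n, N / n)                  # SRS weights: their sum over the sample is exactly N

# Hold the realised sample (the I_t) fixed and redraw only the model outcomes y_t.
reps = 20000
y_hat = np.empty(reps)
for r in range(reps):
    y_s = rng.normal(mu, 2.0, size=n)  # sampled outcomes under the model, common mean mu
    y_hat[r] = (w * y_s).sum() / N     # y_hat = N^{-1} * sum over s of w_t y_t

print(y_hat.mean())                    # close to mu * (w.sum() / N) = mu
```

With weights that do not sum to N, the same loop shows the conditional bias μ(N^{-1} Σ_s w_t − 1) directly.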


An important case where this stronger condition holds regardless of the sample design is where all the μ_t are zero. We examine how to achieve this case by putting the estimation problem into an estimating function framework.

Instead of considering the population values, y_1, . . . , y_N, we consider the quantities

Under the model, the u_t have mean 0 and have the same variance matrix as the y_t, given by

Note that we make u_t explicitly a function of θ since the u_t are not directly observable when the μ_t are unknown. We define

where u(θ) = (u_1(θ), . . . , u_N(θ))′. From our results in (3.27) and (3.73), we have that

because E_ξ[u_t(θ)] = 0 (t = 1, . . . , N). Therefore, an asymptotically model-unbiased estimator of var_ξ[û_d(θ)] is asymptotically model unbiased for var_{pξ}[û_d(θ)].

From (3.23), the design-based variance of û_d(θ) is given by

which has model expectation var_ξ[û_d(θ)] + o(n^{-1}). Assuming that a design-based estimator, v_p[û_d(θ)], converges to var_p[û_d(θ)], and that var_p[û_d(θ)] converges to its model expectation, then the model expectation of v_p[û_d(θ)] would also be asymptotically equal to var_ξ[û_d(θ)].

Generally, there are many choices for obtaining design-based estimators for var_p[û_d(θ)], as reviewed in Wolter (1985), for example. However, to use v_p[û_d(θ)], the μ_t would usually need to be estimated. We denote by v_p[û_d(θ̂)] the design-based estimator of the design-based variance of û_d(θ), evaluated at θ̂. We choose θ̂ such that v_p[û_d(θ̂)] is asymptotically design unbiased for var_p[û_d(θ)] and is, therefore, asymptotically model unbiased for var_ξ[û_d(θ)].

From the discussion above, we see that, in practice, v_p[û_d(θ̂)] provides an estimate of the variance of û_d(θ) for a wide class of inference problems, both in the design-based and in the model-based frameworks. Generally, even when the μ_t are not all zero, the design-based variance of û_d(θ) is asymptotically equal to its model-based counterpart, providing the model is valid. However, this equality does not necessarily hold for design-based and model-based variances of ŷ. Fortunately, as we have seen in Section 3.4, for many situations where we are estimating model parameters, we have all the μ_t equal to zero after linearisation. In these situations, the design-based approach has appropriate model-based properties.


u_t(θ) = y_t − μ_t   (t = 1, . . . , N)    (3.74)

cov_ξ[u_t(θ), u_{t′}(θ)] = σ_{tt′}    (3.75)

û_d(θ) = N^{-1} Σ_{t∈s} w_t u_t(θ)    (3.76)

var_{pξ}[û_d(θ)] = var_ξ[û_d(θ)] + o(n^{-1})    (3.77)

var_p[û_d(θ)] = N^{-2} Σ_{t∈U} Σ_{t′∈U} (π_{tt′} − π_t π_{t′}) w_t w_{t′} u_t(θ) u_{t′}(θ)′    (3.78)
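The estimating-function recipe can be sketched in code for a weighted regression: form the weighted scores u_t at the estimate, estimate their design variance, and sandwich it between inverse-Jacobian factors. This is my own illustration under assumed names and a simplified single-stage, with-replacement design variance, not the chapter's implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
x = np.column_stack([np.ones(n), rng.normal(size=n)])
y = x @ np.array([1.0, 2.0]) + rng.normal(size=n) * (1 + np.abs(x[:, 1]))  # heteroscedastic
w = rng.uniform(1.0, 2.0, size=n)                                          # hypothetical weights

# Weighted least squares: beta_hat solves sum_t w_t x_t (y_t - x_t' beta) = 0.
J = x.T @ (w[:, None] * x)                         # minus the derivative of the score sum
beta_hat = np.linalg.solve(J, x.T @ (w * y))

u = w[:, None] * x * (y - x @ beta_hat)[:, None]   # weighted scores; they sum to ~0 at beta_hat

# With-replacement design-based variance of the score total, then the sandwich J^{-1} V J^{-1}.
d = u - u.mean(axis=0)
V = n / (n - 1) * d.T @ d
sandwich = np.linalg.solve(J, np.linalg.solve(J, V).T)

print(np.sqrt(np.diag(sandwich)))                  # design/model-robust standard errors
```

Because the scores, not the raw y_t, carry the variability, the same estimate is defensible under either inferential framework, which is the point of this section.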


3.6. PROPERTIES OF METHODS WHEN THE ASSUMED MODEL IS INVALID

We now examine the properties of design-based and model-based methods when the assumed model is violated. When the model has been misspecified, there are two important elements to consider. These are whether the point estimate of θ is consistent under the true model, or at least asymptotically model unbiased, and whether the estimated variance of this point estimate is asymptotically model unbiased under the true model.

3.6.1. Critical model assumptions

We made some critical model assumptions in Sections 3.2 and 3.3. First, we assumed that the sample design was ignorable; as well, we assumed a certain structure for the model means and covariances of the observations. When the sampling is not ignorable, it is no longer appropriate to assume that the parameter of interest, μ̄, is the same as the limit of E_ξ(ŷ), since the conditional expected value of y_t, given I_t = 1, under the model, is not necessarily equal to μ_t. However, we still suppose, as we did in the ignorable case (3.6), that ŷ converges to the model expectation of the finite population mean, that is,

3.6.2. Model-based properties of ŷ

We first consider the model-based properties of the estimator ŷ under the true model. We have already seen that ŷ is asymptotically model unbiased for the true μ̄; see (3.71) and (3.79). Also, we have shown that the design-based estimator of the design variance of ŷ provides a reasonable estimator for the total variance of ŷ when the sample is large and the sampling fraction is small. However, we mentioned in Section 3.5 that, since under certain conditions the design-based variance of ŷ is asymptotically equal to its model-based variance, and since the model-based variances of ŷ and û_d(θ) are asymptotically equal, it may be more appropriate to estimate the model variance of ŷ using the estimating equation approach, described in Section 3.5.3, rather than by deriving the model variance explicitly. This would still be true when the assumed model is incorrect, provided the structure for the means of the y_t is correctly specified in the assumed model, even if the sample design is nonignorable. (For nonignorable sample designs we consider the means conditional on which units have been selected in the sample.) If, however, the structure for the means is incorrectly specified, then the design-based variance of ŷ is not, in general, asymptotically equal to its model variance, so that a design-based variance estimate may not be a good estimate of the model variance of ŷ.


E_ξ(ŷ) = μ̄ + o(1),  where μ̄ = N^{-1} Σ_{t∈U} μ_t    (3.79)


3.6.3. Model-based properties of ỹ

Since the model-based estimator, ỹ = Σ_{t∈s} c_t y_t, has been derived using model considerations, such as best linear unbiasedness, we now consider the situations where ŷ would be preferred because some of the assumed properties of ỹ and of its estimated variance are incorrect. In particular, we consider the following three cases in turn:

(i) the sample design is nonignorable;
(ii) the structure for the model means given by the μ_t is incorrect; and
(iii) the structure for the model variances given by the σ_{tt′} is incorrect.

First, we consider the case where the sample design is nonignorable. Under the true model, the expectation of ỹ, conditional on the units selected in the sample, is given by

where c = (c_1, . . . , c_n)′ and μ_1 = E_ξ(y_t | I_t = 1), provided that the c_t depend only on the units included in the sample and not on the y-values. When the sample design is nonignorable, μ_1 ≠ μ in general, so that ỹ may be asymptotically biased under the model.

Example 4

Pfeffermann (1996) gives the following artificial example. Suppose that the finite population consists of zeros and ones where, under the model, Pr(y_t = 1) = θ with 0 < θ < 1. The parameter of interest, μ, is equal to θ. We now assume that for each unit where y_t = 0, the probability of inclusion in the sample is p_0, but for each unit where y_t = 1, the probability of inclusion in the sample is p_1. Therefore, using Bayes' theorem, we have

which is not necessarily equal to θ. Under this nonignorable design, the model bias of the sample mean, as an estimate of θ, is

We see, therefore, that the sample mean, ȳ_s, is model biased and model inconsistent for θ under the true model if p_0 ≠ p_1. (A modeller who knows about the non-ignorability of the inclusion probabilities would use this model information and would not normally use ȳ_s to estimate θ.)
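Example 4 is easy to reproduce by simulation. This is my own sketch with hypothetical values θ = 0.3, p_0 = 0.1 and p_1 = 0.3: the mean of the selected y_t tracks μ_1 from Bayes' theorem rather than θ.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 500_000
theta, p0, p1 = 0.3, 0.1, 0.3              # Pr(y_t = 1) and the two inclusion probabilities

yU = rng.random(N) < theta                 # finite population of zeros and ones
p_inc = np.where(yU, p1, p0)               # nonignorable: inclusion depends on y_t itself
in_sample = rng.random(N) < p_inc

ybar_s = yU[in_sample].mean()
mu1 = theta * p1 / ((1 - theta) * p0 + theta * p1)   # Pr(y_t = 1 | I_t = 1), Bayes' theorem

print(ybar_s, mu1, theta)                  # the sample mean tracks mu1, not theta
```

Weighting each selected unit by the inverse of its inclusion probability removes the bias, which is precisely the design-based remedy.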

We now discuss the second case, where the structure for the assumed model means given by the μ_t is incorrect. The condition for ỹ to be asymptotically unbiased for μ̄ under the true model is


E_ξ(ỹ | i_U) = Σ_{t∈s} c_t E_ξ(y_t | I_t = 1) = μ_1 Σ_{t∈s} c_t    (3.80)

μ_1 = Pr(y_t = 1 | I_t = 1) = θ p_1 / [(1 − θ) p_0 + θ p_1]    (3.81)

μ_1 − θ = θ(1 − θ)(p_1 − p_0) / [(1 − θ) p_0 + θ p_1]    (3.82)


We now give an example where a bias would be introduced when the model means are incorrect.

Example 5

Suppose that there are two observations, y_1 and y_2, and that the modeller assumes that y_1 and y_2 have the same mean and have variance matrix

In this case, the best linear model-unbiased estimator of the common mean is given by 0.8y_1 + 0.2y_2. If, however, the true means, μ_t for t = 1, 2, are different, then the bias of this estimator in estimating μ̄ = (μ_1 + μ_2)/2 is 0.3(μ_1 − μ_2), and it is model inconsistent.
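The arithmetic of Example 5 can be verified directly; the sketch below recomputes the BLUE weights from the assumed variance matrix diag(1, 4) and the resulting bias for hypothetical true means μ_1 = 3 and μ_2 = 1.

```python
import numpy as np

V = np.diag([1.0, 4.0])          # the variance matrix assumed in Example 5
ones = np.ones(2)

c = np.linalg.solve(V, ones)     # BLUE weights for a common mean: proportional to V^{-1} 1
c = c / c.sum()
print(c)                         # [0.8, 0.2]

mu1, mu2 = 3.0, 1.0              # hypothetical true means
bias = c @ np.array([mu1, mu2]) - (mu1 + mu2) / 2
print(bias)                      # 0.3 * (mu1 - mu2) = 0.6
```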

Some model-based methods to protect against biases when the model mean structure is incorrectly specified have been suggested in the literature; see, for example, the application of non-parametric calibration in a model-based setting to estimate finite population characteristics, given in Chambers, Dorfman and Wehrly (1993).

Finally, we consider the third case, where the model variance structure is incorrectly specified. Here, the estimator for the variance of ỹ under the model may be biased and inconsistent. The following is a simple example of this case.

Example 6

Suppose that, under the assumed model, the observations, y_1, . . . , y_n, are independent and identically distributed Poisson random variables with mean θ. The usual estimate for θ is the sample mean, ȳ_s, which has variance θ/n under the Poisson model. Therefore, a commonly used model-based estimator of the variance of ȳ_s is ȳ_s/n. Clearly, this can be biased if the Poisson model is incorrect and the true variance of the y_t is not θ.
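A short simulation sketch of my own, with hypothetical overdispersed data: when the y_t are generated as three times a Poisson count, the variance is roughly three times the mean, so the Poisson-model estimate ȳ_s/n understates the variance of ȳ_s while the robust estimate s²/n does not.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5000
y = 3 * rng.poisson(2.0, size=n)   # overdispersed counts: mean about 6, variance about 18

ybar = y.mean()
model_var = ybar / n               # model-based variance of ybar under the Poisson assumption
robust_var = y.var(ddof=1) / n     # robust alternative: s^2 / n

print(model_var, robust_var)       # robust_var is roughly three times model_var here
```

The robust estimator is exactly the estimating-function variance of Section 3.5.3 specialised to u_t = y_t − θ.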

The statistical literature contains many examples of model-based methods for robust estimation of variances which address this issue: see, for example, McCullagh and Nelder (1983), Royall (1986) and Liang and Zeger (1986). Many of these methods are analogous to the design-based approach that we have been discussing.

3.6.4. Summary

In summary, when our model parameter of interest, θ, is asymptotically equal to μ̄, we find that the model-based estimator, ỹ, may have poor properties when some of the model assumptions are violated. On the other hand, ŷ is asymptotically model unbiased for μ̄, and the design-based estimator of its

Σ_{t∈s} c_t μ_t = μ̄ + o(1)    (3.83)

σ² ( 1  0
     0  4 )    (3.84)


design variance provides a reasonable estimator for the total variance of ŷ, when the sample is large and the sampling fraction is small, even when the model assumptions are incorrect. This provides some rationale for preferring ŷ over ỹ and for using a design-based estimator for the variance of ŷ. If the assumed model is true, then the design-based estimator, ŷ, may have larger model variance than the model-based estimator, ỹ. However, for large sample sizes, this loss of efficiency may not be serious, compared with the gain in robustness. As well, the design-based estimator of the variance of û_d(θ̂) provides an asymptotically model-unbiased estimate of var_ξ[û_d(θ)], provided that the conditional means of the y_t (conditional on having been selected into the sample) are correctly specified. This is the case even when the model variance structure is incorrectly specified.

3.7. CONCLUSION

The choice between model-based and design-based estimation should be based on whether or not the assumed model is true. If the model could have been misspecified, and if a model-based approach has been taken, one needs to consider carefully what is being estimated and whether or not the variance estimators are robust to misspecification. On the other hand, the corresponding design-based estimates tend to give valid inferences, even in some cases when the model is misspecified. However, the use of design-based methods tends to lead to a loss of efficiency when the assumed model is indeed true. For very large samples, this loss may not be too serious, given the extra robustness achieved from the design-based approach. When the sample size is not large or when the sampling fractions are non-negligible, some modification of a pure design-based approach would be necessary.

We have proposed the use of an estimating function framework to obtain variance estimates that can be justified under either a design-based or model-based framework. These variance estimates use design-based methods applied to estimating functions which are obtained through the model mean structure. We believe this approach would have wide appeal for many analysts.




CHAPTER 4

The Bayesian Approach to Sample Survey Inference

Roderick J. Little

4.1. INTRODUCTION

This chapter outlines the Bayesian approach to statistical inference for finite population surveys (Ericson, 1969, 1988; Basu, 1971; Scott, 1977b; Binder, 1982; Rubin, 1983, 1987; Ghosh and Meeden, 1997). Chapter 18 in this volume applies this perspective to the analysis of survey data subject to unit and item nonresponse. In the context of this conference honoring the retirement of Fred Smith, it is a pleasure to note his many important contributions to the model-based approach to survey inference over the years. His landmark paper with Scott was the first to describe random effects models for the analysis of clustered data (Scott and Smith, 1969). His paper with Holt (Holt and Smith, 1979) revealed the difficulties with the randomization-based approach to post-stratification, and his early assessment of the historical development of survey inference favored a model-based approach (Smith, 1976). The difficulties of formulating realistic models for human populations, and more recent work by Smith and his colleagues on effects of model misspecification (for example, see Holt, Smith and Winter, 1980; Pfeffermann and Holmes, 1985), have led Smith to a pragmatic approach that recognizes the strengths of the randomization approach for problems of descriptive inference while favoring models for analytic survey inference (Smith, 1994). My own view is that careful model specification, sensitive to the survey design, can address the concerns with model misspecification, and that Bayesian statistics provide a coherent and unified treatment of descriptive and analytic survey inference. Three main approaches to finite population inference can be distinguished (Little and Rubin, 1983):

1. Design-based inference, where probability statements are based on the distribution of sample selection, with the population quantities treated as fixed.

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.


2. Frequentist superpopulation modeling, where the population values are assumed to be generated from a superpopulation model that conditions on fixed superpopulation parameters, and inferences are based on repeated sampling from this model. For analytic inference about superpopulation parameters, this is standard frequentist model-based parametric inference, with features of the complex sample design reflected in the model as appropriate. For descriptive inference, the emphasis is on prediction of nonsampled values and hence of finite population quantities.

3. Bayesian inference, where the population values are assigned a prior distribution, and the posterior distribution of finite population quantities is computed using Bayes' theorem. In practice, the prior is usually specified by formulating a parametric superpopulation model for the population values, and then assuming a prior distribution for the parameters, as detailed below. When the prior is relatively weak and dispersed compared with the likelihood, as when the sample size is large, this form of Bayesian inference is closely related to frequentist superpopulation modeling, except that a prior for the parameters is added to the model specification, and inferences are based on the posterior distribution rather than on repeated sampling from the superpopulation model.

Attractive features of the Bayesian approach include the following:

- Bayesian methods provide a unified framework for addressing all the problems of survey inference, including inferences for descriptive or analytical estimands, small or large sample sizes, inference under planned, ignorable sample selection methods such as probability sampling, and problems where modeling assumptions play a more central role such as missing data or measurement error. Many survey statisticians favor design-based inference for some problems, such as large-sample descriptive inference for means and totals, and model-based inference for other problems, such as small-area estimation or nonresponse. The adoption of two potentially contradictory modes of inference is confusing, and impedes communication with substantive researchers such as economists who are accustomed to models and have difficulty reconciling model-based and randomization-based inferences. Consider, for example, the issue of weighting in regression models, and in particular the difficulties of explaining to substantive researchers the difference between weights based on the inverse of the sampling probability and weights for nonconstant variance (Konijn, 1962; Brewer and Mellor, 1973; Dumouchel and Duncan, 1983; Smith, 1988; Little, 1991; Pfeffermann, 1993).
- As illustrated in Examples 1 and 2 below, many standard design-based inferences can be derived from the Bayesian perspective, using classical models with noninformative prior distributions. Hence tried and tested methods do not need to be abandoned. In general, Bayesian inferences under a carefully chosen model should have good frequentist properties,



and a method derived under the design-based paradigm does not become any less robust under a Bayesian etiology.
- Bayesian inference allows for prior information about a problem to be incorporated in the analysis in a simple and clear way, via the prior distribution. In surveys, noninformative priors are generally chosen for fixed parameters, reflecting absence of strong prior information; but multilevel models with random parameters also play an important role for problems such as small-area estimation. Informative priors are generally shunned in survey settings, but may play a useful role for some problems. For example, the treatment of outliers in repeated cross-sectional surveys is problematic, since information about the tails of distributions in a single survey is generally very limited. A Bayesian analysis can allow information about the tails of distributions to be pooled from prior surveys and incorporated into the analysis of the current survey via a proper prior distribution, with dispersion added to compensate for differences between current and prior surveys.
- With large samples and relatively dispersed priors, Bayesian inference yields similar estimates and estimates of precision to frequentist superpopulation modeling under the same model, although confidence intervals are replaced by posterior probability intervals and hence have a somewhat different interpretation. Bayes also provides unique solutions to small-sample problems under clearly specified assumptions where there are no exact frequentist answers, and the choice between approximate frequentist solutions is ambiguous. Examples include whether to condition on the margins in tests for independence in 2 × 2 tables (see, for example, Little, 1989), and in the survey context, inference from post-stratified survey data, where there is ambiguity in the frequentist approaches concerning whether or not to condition on the post-stratum counts (Holt and Smith, 1979; Little, 1993).
- Bayesian inference deals with nuisance parameters in a natural and appealing way.
- Bayesian inference satisfies the likelihood principle, unlike frequentist superpopulation modeling or design-based inference.
- Modern computation tools make Bayesian analysis much more practically feasible than in the past (for example, Tanner, 1996).

Perhaps the main barrier to the use of Bayesian methods in surveys is the perceived 'subjectivity' involved in the choice of a model and prior, and the problems stemming from misspecified models. It is true that Bayes lacks built-in 'quality control,' and may be disastrous if the prior or model is poorly specified. Frequentist superpopulation modeling avoids the specification of the prior, but is also vulnerable to misspecification of the model, which is often the more serious issue, particularly when samples are large. The design-based approach, which is based on the known probability distribution of sample selection and avoids overt specification of a model for the survey outcomes, is less vulnerable to gross errors of misspecification. However,



Bayesian models with noninformative priors can be formulated that lead to results equivalent to those of standard large-sample design-based inference, and these results retain their robustness properties regardless of their inferential origin. I believe that with carefully specified models that account for the sample design, Bayesian methods can enjoy good frequentist properties (Rubin, 1984). I also believe that the design-based approach lacks the flexibility to handle surveys with small samples and missing data. For such problems, the Bayesian approach under carefully chosen models yields principled approaches to imputation and smoothing based on clearly specified assumptions.

A common misconception of the Bayesian approach to survey inference is that it does not affirm the importance of random sampling, since the model rather than the selection mechanism is the basis for Bayesian probability statements. However, random selection plays an important role in the Bayesian approach in affording protection against the effects of model misspecification and biased selection. The role is illuminated by considering models that incorporate the inclusion indicators as well as the survey outcomes, as discussed in the next section. In Chapter 18 this formulation is extended to handle missing data, leading to Bayesian methods for unit and item nonresponse.

4.2. MODELING THE SELECTION MECHANISM

I now discuss how a model for the selection mechanism is included within the Bayesian framework. This formulation illuminates the role of random sampling within the Bayesian inferential framework. The following account is based on Gelman et al. (1995). Following the notation in Chapters 1 and 2, let

y_U = (y_tj), y_tj = value of survey variable j for unit t, j = 1, . . . , J; t ∈ U = {1, . . . , N}
Q = Q(y_U) = finite population quantity
i_U = (i_1, . . . , i_N) = sample inclusion indicators; i_t = 1 if unit t is included, 0 otherwise
y_U = (y_inc, y_exc), y_inc = included part of y_U, y_exc = excluded part of y_U
z_U = fully observed covariates and design variables.

We assume for simplicity that the design variables are recorded and available for analysis; if not, then z_U denotes the subset of recorded variables. The expanded Bayesian approach specifies a model for both the survey data y_U and the sample inclusion indicators i_U. The model can be formulated as

p(y_U, i_U | z_U, θ, φ) = p(y_U | z_U, θ) p(i_U | z_U, y_U, φ),

where θ and φ are unknown parameters indexing the distributions of y_U given z_U and of i_U given (z_U, y_U), respectively. The likelihood of (θ, φ) based on the observed data (z_U, y_inc, i_U) is then



L(θ, φ | z_U, y_inc, i_U) ∝ p(y_inc, i_U | z_U, θ, φ) = ∫ p(y_U, i_U | z_U, θ, φ) dy_exc.

In the Bayesian approach the parameters (θ, φ) are assigned a prior distribution p(θ, φ | z_U). Analytical inference about the parameters is based on the posterior distribution:

p(θ, φ | z_U, y_inc, i_U) ∝ p(θ, φ | z_U) L(θ, φ | z_U, y_inc, i_U).

Descriptive inference about Q = Q(y_U) is based on its posterior distribution given the data, which is conveniently derived by first conditioning on, and then integrating over, the parameters:

p(Q(y_U) | z_U, y_inc, i_U) = ∫ p(Q(y_U) | z_U, y_inc, i_U, θ, φ) p(θ, φ | z_U, y_inc, i_U) dθ dφ.

The more usual likelihood does not include the inclusion indicators i_U as part of the model. Specifically, the likelihood ignoring the selection process is based on the model for y_U alone:

L(θ | z_U, y_inc) ∝ p(y_inc | z_U, θ) = ∫ p(y_U | z_U, θ) dy_exc.

The corresponding posterior distributions for θ and Q(y_U) are

p(θ | z_U, y_inc) ∝ p(θ | z_U) L(θ | z_U, y_inc),

p(Q(y_U) | z_U, y_inc) = ∫ p(Q(y_U) | z_U, y_inc, θ) p(θ | z_U, y_inc) dθ.

When the full posterior reduces to this simpler posterior, the selection mechanism is called ignorable for Bayesian inference. Applying Rubin's (1976) theory, two general and simple conditions for ignoring the selection mechanism are:

Selection at random (SAR): p(i_U | z_U, y_U, φ) = p(i_U | z_U, y_inc, φ) for all y_exc.
Bayesian distinctness: p(θ, φ | z_U) = p(θ | z_U) p(φ | z_U).

It is easy to show that these conditions together imply that

p(θ | z_U, y_inc) = p(θ | z_U, y_inc, i_U) and p(Q(y_U) | z_U, y_inc) = p(Q(y_U) | z_U, y_inc, i_U),

so the model for the selection mechanism does not affect inferences about θ and Q(yU).

Probability sample designs are generally both ignorable and known, in the sense that

p(iU | zU, yU, φ) = p(iU | zU, yinc), a known distribution free of unknown parameters, where zU represents known sample design information, such as clustering or stratification information. In fact, many sample designs depend only on zU and not on yinc; an exception is double sampling, where a sample of units is obtained, and inclusion into a second phase of questions is restricted to a


subsample with a design that depends on characteristics measured in the first phase.

Example 1. Simple random sampling

The distribution of the simple random sampling selection mechanism is

p(iU | zU, yU) = 1/C(N, n) for each iU with Σt it = n, and zero otherwise, where C(N, n) denotes the binomial coefficient "N choose n".

This mechanism does not depend on yU or on unknown parameters, and hence is ignorable. A basic model for a single continuous survey outcome with simple random sampling is

[yt | μ, σ²] ~ iid N(μ, σ²), t = 1, ..., N,   (4.1)

where N(a, b) denotes the normal distribution with mean a and variance b. For simplicity assume initially that σ² is known, and assign an improper uniform prior for the mean:

p(μ) ∝ const.   (4.2)

Standard Bayesian calculations (Gelman et al., 1995) then yield the posterior distribution of μ to be N(ȳs, σ²/n), where ȳs is the sample mean. To derive the posterior distribution of the population mean ȳU, note that the posterior distribution of ȳU given the data and μ is normal with mean (n ȳs + (N − n) μ)/N and variance (N − n) σ²/N². Integrating over μ, the posterior distribution of ȳU given the data is normal with mean

(n ȳs + (N − n) ȳs)/N = ȳs

and variance

E[Var(ȳU | data, μ)] + Var[E(ȳU | data, μ)] = (1 − n/N) σ²/n.

Hence a 95% posterior probability interval for ȳU is ȳs ± 1.96 √((1 − n/N) σ²/n), which is identical to the 95% confidence interval from design-based theory for a simple random sample. This correspondence between Bayes' and design-based results also applies asymptotically to nonparametric multinomial models with Dirichlet priors (Scott, 1977b; Binder, 1982).

With σ² unknown, the standard design-based approach estimates σ² by the sample variance s² and assumes large samples. The Bayesian approach yields small-sample t corrections, under normality assumptions. In particular, if the variance is assigned Jeffreys' prior p(σ²) ∝ 1/σ², the posterior distribution of ȳU is Student's t with mean ȳs, scale (1 − n/N) s²/n, and n − 1 degrees of freedom.

The resulting 95% posterior probability interval is ȳs ± t0.975 √((1 − n/N) s²/n), where t0.975 is the 97.5th percentile of the t distribution


with n − 1 degrees of freedom. Note that the Bayesian approach automatically incorporates the finite population correction (1 − n/N) in the inference.
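The t interval with its finite population correction is simple to compute. The following sketch is illustrative only: the sample values are simulated, and the critical value t0.975 for 29 degrees of freedom is entered as the known constant 2.045 to avoid a statistics dependency.

```python
import math
import random

random.seed(1)
N, n = 393, 30
y = [random.gauss(50, 10) for _ in range(n)]  # hypothetical sample values

ybar = sum(y) / n
s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)  # sample variance s^2

t975 = 2.045                     # 97.5th percentile of t with n - 1 = 29 df
fpc = 1 - n / N                  # finite population correction
scale = math.sqrt(fpc * s2 / n)

interval = (ybar - t975 * scale, ybar + t975 * scale)

# Without the fpc the interval is wider; the Bayesian interval is
# automatically narrower by the factor sqrt(1 - n/N).
scale_no_fpc = math.sqrt(s2 / n)
assert scale < scale_no_fpc
print(interval)
```

For n = 30 out of N = 393 the correction shrinks the interval by about 4 percent; it matters most when the sampling fraction n/N is large.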

Example 2. Stratified random sampling

More generally, under stratified random sampling the population is divided into H strata and nh units are selected from the population of Nh units in stratum h. Define zU as a set of stratum indicators, with components

zth = 1, if unit t is in stratum h; 0, otherwise.

This selection mechanism is ignorable providing the model for yt conditions on the stratum variables zt. A simple model that does this is

[yt | zth = 1, μh, σh²] ~ ind N(μh, σh²),   (4.3)

where N(a, b) denotes the normal distribution with mean a and variance b. For simplicity assume σh² is known, and assign the flat prior

p(μ1, ..., μH) ∝ const.

on the stratum means. Bayesian calculations similar to the first example lead to

[ȳU | data, zU] ~ N(ȳst, Σh Ph² (1 − nh/Nh) σh²/nh),

where ȳst = Σh Ph ȳh, Ph = Nh/N, and ȳh = sample mean in stratum h.

These Bayesian results lead to Bayes' probability intervals that are equivalent to standard confidence intervals from design-based inference for a stratified random sample. In particular, the posterior mean weights cases by the inverse of their inclusion probabilities, as in the Horvitz-Thompson estimator (Horvitz and Thompson, 1952):

ȳst = (1/N) Σ(t∈s) yt/πt, where πt = nh/Nh is the selection probability in stratum h.

The posterior variance equals the design-based variance of the stratified mean:

Var(ȳU | data, zU) = Σh Ph² (1 − nh/Nh) σh²/nh.
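The equality of the posterior mean and the Horvitz-Thompson form is easy to verify numerically. A minimal sketch with hypothetical stratum sizes, sample sizes, and assumed known stratum variances:

```python
import random
random.seed(2)

Nh = [200, 120, 73]            # hypothetical stratum sizes, N = 393
nh = [20, 6, 4]                # sample sizes per stratum
sigma2 = [9.0, 4.0, 16.0]      # assumed known stratum variances
N = sum(Nh)

# simulate sample values with hypothetical stratum means and sds
ys = [[random.gauss(mu, sd) for _ in range(n_)]
      for mu, sd, n_ in zip([10, 20, 30], [3, 2, 4], nh)]
ybar_h = [sum(s) / len(s) for s in ys]

# posterior mean: stratified mean sum_h (Nh/N) * ybar_h
y_st = sum(Nh[h] / N * ybar_h[h] for h in range(3))

# Horvitz-Thompson form: (1/N) * sum over sampled units of y_t / pi_t,
# with pi_t = nh/Nh in stratum h
y_ht = sum(y / (nh[h] / Nh[h]) for h in range(3) for y in ys[h]) / N
assert abs(y_st - y_ht) < 1e-9

# posterior variance: sum_h (Nh/N)^2 * (1 - nh/Nh) * sigma2_h / nh
var = sum((Nh[h] / N) ** 2 * (1 - nh[h] / Nh[h]) * sigma2[h] / nh[h]
          for h in range(3))
print(y_st, var)
```

The two estimators agree exactly because, within stratum h, every sampled unit carries the same weight Nh/nh.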


Binder (1982) demonstrates a similar correspondence asymptotically for a stratified nonparametric model with Dirichlet priors. With unknown variances, the posterior distribution of ȳU for this model with a uniform prior on log(σh²) is a mixture of t distributions, thus propagating the uncertainty from estimating the stratum variances.

Suppose that we assumed the model (4.1) and (4.2) of Example 1, with no stratum effects. With a flat prior on the mean, the posterior mean of ȳU is then the unweighted mean

E(ȳU | data) = ȳs = (1/n) Σ(t∈s) yt,

which is potentially very biased for ȳU if the selection rates nh/Nh vary across the strata. The problem is that inferences from this model are nonrobust to violations of the assumption of no stratum effects, and stratum effects are to be expected in most settings. Robustness considerations lead to the model (4.3) that allows for stratum effects.

From the design-based perspective, an estimator is design consistent if (irrespective of the truth of the model) it converges to the true population quantity as the sample size increases, holding design features constant. For stratified sampling, the posterior mean based on the stratified normal model (4.3) converges to ȳU, and hence is design consistent. For the normal model that ignores stratum effects, the posterior mean ȳs converges to the limit of Σh (nh/n) ȳh rather than to ȳU, and hence is not design consistent unless nh/Nh = const. I generally favor models that yield design-consistent estimates, to limit effects of model misspecification (Little, 1983a, b).

An extreme form of stratification is to sample systematically with interval B from a list ordered on some stratifying variable Z. The sample can be viewed as selecting one unit at random from implicit strata that are blocks of size B defined by the selection interval. This design is ignorable for Bayesian inference provided the model conditions on zU. The standard stratified sampling model (4.3) does not provide estimates of variance, since only one unit is selected per stratum, and inference requires a more structured model for the stratum effects. Modeling assumptions are also needed to derive sampling variances under the design-based approach in this situation; generally the approach taken is to derive variances from the differences of outcomes of pairs of units selected from adjacent strata, ignoring differences between adjacent strata.

Example 3. Two-stage sampling

Suppose the population is divided into C clusters, based for example on geographical areas. A simple form of two-stage sampling first selects a simple random sample of c clusters, and then selects a simple random sample of nc of the Nc units in each sampled cluster c. The inclusion mechanism is ignorable conditional on cluster information, but the model needs to account for within-cluster correlation in the population. A normal model that does this is


yct = outcome for unit t in cluster c, t = 1, ..., Nc; c = 1, ..., C,

[yct | μc, σ²] ~ ind N(μc, σ²),   [μc | μ, τ²] ~ iid N(μ, τ²).   (4.4)

Unlike (4.3), the cluster means cannot be assigned a flat prior, p(μc) ∝ const., because only a subset of the clusters is sampled; the uniform prior does not allow information from sampled clusters to predict means for nonsampled clusters. Maximum likelihood inference for models such as (4.4) can be implemented in SAS Proc Mixed (SAS, 1992).
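A quick way to see the within-cluster correlation that a two-level normal model such as (4.4) induces is to simulate from it. This sketch assumes hypothetical values μ = 0, τ² = 4, σ² = 1 (true intraclass correlation τ²/(τ² + σ²) = 0.8) and recovers the correlation with a one-way ANOVA estimator:

```python
import random
random.seed(3)

C, m = 200, 20                 # clusters and units per cluster (hypothetical)
mu, tau2, sigma2 = 0.0, 4.0, 1.0

clusters = []
for _ in range(C):
    mu_c = random.gauss(mu, tau2 ** 0.5)    # cluster mean ~ N(mu, tau2)
    clusters.append([random.gauss(mu_c, sigma2 ** 0.5) for _ in range(m)])

grand = sum(sum(c) for c in clusters) / (C * m)
means = [sum(c) / m for c in clusters]

msb = m * sum((mb - grand) ** 2 for mb in means) / (C - 1)   # between clusters
msw = sum((y - mb) ** 2 for c, mb in zip(clusters, means) for y in c) \
      / (C * (m - 1))                                        # within clusters

# ANOVA estimate of tau2/(tau2 + sigma2)
icc = (msb - msw) / (msb + (m - 1) * msw)
print(round(icc, 2))   # close to the true value 0.8
```

The ANOVA identity E(MSB) = m τ² + σ², E(MSW) = σ² is what makes the estimator work; a flat prior on the μc would discard exactly this between-cluster information.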

Example 4. Quota sampling

Market research firms often use quota sampling, where strata are formed based on characteristics of interest, and researchers select individuals purposively until a fixed number nh of respondents is obtained in each stratum h. This scheme assures that the distribution of the stratifying variable in the sample matches that in the population. However, the lack of control of selection of units within strata allows the selection mechanism to depend on unknown factors that might be associated with the survey outcomes, and hence be nonignorable. We might analyze the data assuming ignorable selection, but the possibility of unknown selection biases compromises the validity of the inference. Random sampling avoids this problem, and hence greatly enhances the plausibility of the inference, frequentist or Bayesian.

This concludes this brief overview of the Bayesian approach to survey inference. In Chapter 18 the Bayesian framework is extended to handle problems of survey nonresponse.


CHAPTER 5

Interpreting a Sample as Evidence about a Finite Population

Richard Royall

5.1. INTRODUCTION

Observations, when interpreted under a probability model, constitute statistical evidence. A key responsibility of the discipline of statistics is to provide science with theory and methods for objective evaluation of statistical evidence per se, i.e., for representing observations as evidence and measuring the strength of that evidence. Birnbaum (1962) named this problem area 'informative inference', and contrasted it to other important problem areas of statistics, such as those concerned with how beliefs should change in the face of new evidence, or with how statistical evidence can be used most efficiently, along with other types of information, in decision-making procedures. In this chapter we briefly sketch the basic general theory and methods for evidential interpretation of statistical data and then apply them to problems where the data consist of observations on a sample from a finite population.

The fundamental rule for interpreting statistical evidence is what Hacking (1965) named the law of likelihood. It states that

If hypothesis A implies that the probability that a random variable X takes the value x is pA, while hypothesis B implies that the probability is pB, then the observation X = x is evidence supporting A over B if pA > pB, and the likelihood ratio, pA/pB, measures the strength of that evidence.

This says simply that if an event is more probable under A than under B, then the occurrence of that event is evidence supporting A over B: the hypothesis that did the better job of predicting the event is better supported by its occurrence. It further states that the degree to which occurrence of the event supports A over

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.


B (the strength of the evidence) is quantified by the ratio of the two probabilities, i.e., the likelihood ratio.

When uncertainty about the hypotheses, before X = x is observed, is measured by probabilities Pr(A) and Pr(B), the law of likelihood can be derived from elementary probability theory. In that case the quantity pA is the conditional probability that X = x, given that A is true, Pr(X = x | A), and pB is Pr(X = x | B). The definition of conditional probability implies that

Pr(A | X = x) / Pr(B | X = x) = [pA Pr(A)] / [pB Pr(B)].

This formula shows that the effect of the statistical evidence (the observation X = x) is to change the probability ratio from Pr(A)/Pr(B) to Pr(A | X = x)/Pr(B | X = x). The likelihood ratio pA/pB is the exact factor by which the probability ratio is changed. If the likelihood ratio equals 5, it means that the observation X = x constitutes evidence just strong enough to cause a five-fold increase in the probability ratio. Note that the strength of the evidence is independent of the magnitudes of the probabilities, Pr(A) and Pr(B), and of their ratio. The same argument applies when pA and pB are not probabilities, but probability densities at the point x.

The likelihood ratio is a precise and objective numerical measure of the strength of statistical evidence. Practical use of this measure requires that we learn to relate it to intuitive verbal descriptions such as 'weak', 'fairly strong', 'very strong', etc. For this purpose the values of 8 and 32 have been suggested as benchmarks: observations with a likelihood ratio of 8 (or 1/8) constitute 'moderately strong' evidence, and observations with a likelihood ratio of 32 (or 1/32) are 'strong' evidence (Royall, 1997). These benchmark values are similar to others that have been proposed (Jeffreys, 1961; Edwards, 1972; Kass and Raftery, 1995). They are suggested by consideration of a simple experiment with two urns, one containing only white balls, and the other containing half white balls and half black balls. Suppose a ball is drawn from one of these urns and is seen to be white. While this observation surely represents evidence supporting the hypothesis that the urn is the all-white one (vs. the alternative that it is the half-white one), it is clear that the evidence is 'weak'. The likelihood ratio is 2. If the ball is replaced and a second draw is made, also white, the two observations together represent somewhat stronger evidence supporting the 'all-white' urn hypothesis: the likelihood ratio is 4. A likelihood ratio of 8, which we suggest describing as 'moderately strong' evidence, has the same strength as three consecutive white balls. Observation of five white balls is 'strong' evidence, and gives a likelihood ratio of 32.
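The urn benchmarks are easy to verify: under the all-white hypothesis each white draw has probability 1, under the half-white hypothesis probability 1/2, so k consecutive white draws give a likelihood ratio of 2^k. A minimal check:

```python
def likelihood_ratio(k):
    """LR supporting 'all white' over 'half white' after k white draws."""
    p_all = 1.0 ** k     # probability of k white draws from the all-white urn
    p_half = 0.5 ** k    # probability of k white draws from the half-white urn
    return p_all / p_half

assert likelihood_ratio(1) == 2    # 'weak'
assert likelihood_ratio(2) == 4
assert likelihood_ratio(3) == 8    # 'moderately strong'
assert likelihood_ratio(5) == 32   # 'strong'
```

Each extra white ball doubles the likelihood ratio, which is why the benchmarks 8 and 32 correspond to three and five consecutive white draws.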

A key concept of evidential statistics is that of misleading evidence. Observations with a likelihood ratio of pA/pB = 40 (40 times as probable under A as under B) constitute strong evidence supporting A over B. Such observations can occur when B is true, and when that happens they constitute strong misleading evidence. No error has been made: the evidence has been properly interpreted. The evidence itself is misleading. Statistical evidence, properly interpreted, can be misleading. But the nature of statistical evidence is such


that we cannot observe strong misleading evidence very often. There is a universal bound for the probability of misleading evidence: if A implies that X has density (or mass) function fA, while B implies fB, then for any k > 1, PrB(fA(X)/fB(X) ≥ k) ≤ 1/k. Thus when B is true, the probability of observing data giving a likelihood ratio of 40 or more in favour of A can be no greater than 0.025. Royall (2000) discusses this universal bound as well as much smaller ones that apply within many important parametric models.
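The universal bound is easy to check by simulation. This sketch uses hypothetical simple hypotheses (not from the chapter): ten Bernoulli trials with true success probability 0.5 under B, versus 0.7 under A, counting how often the likelihood ratio for A reaches 8 when B is true.

```python
import random
random.seed(4)

n, thA, thB = 10, 0.7, 0.5

def lik(x, th):
    # binomial likelihood of x successes in n trials (constant factor dropped)
    return th ** x * (1 - th) ** (n - x)

k = 8
trials = 100_000
misleading = 0
for _ in range(trials):
    x = sum(random.random() < thB for _ in range(n))  # data generated under B
    if lik(x, thA) / lik(x, thB) >= k:                # 'strong' evidence for A
        misleading += 1

freq = misleading / trials
assert freq <= 1 / k    # universal bound; in practice the frequency is far below it
print(freq)
```

Here the bound 1/8 = 0.125 is far from attained: only samples with nine or more successes mislead, and those are rare under B.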

A critical distinction in evidence theory and methods is that between the strength of the evidence represented by a given body of observations, which is measured by likelihood ratios, and the probabilities that a particular procedure for making observations (sampling plan, stopping rule) will produce observations that constitute weak or misleading evidence. The essential flaw in standard frequentist theory and methods such as hypothesis testing and confidence intervals, when they are used for evidential interpretation of data, is the failure to make such a distinction (Hacking, 1965, Ch. 7). Lacking an explicit concept of evidence like that embodied in the likelihood, standard statistical theory tries to use probabilities (Type I and Type II error probabilities, confidence coefficients, etc.) in both roles: (i) to describe the uncertainty in a statistical procedure before observations have been made; and (ii) to interpret the evidence represented by a given body of observations. It is the use of probabilities in the second role, which is incompatible with the law of likelihood, that produces the paradoxes pervading contemporary statistics (Royall, 1997, Ch. 5).

With respect to a model that consists of a collection of probability distributions indexed by a parameter θ, the statistical evidence in observations X = x supporting any value, θ1, vis-à-vis any other, θ2, is measured by the likelihood ratio, f(x; θ1)/f(x; θ2). Thus the likelihood function, L(θ) ∝ f(x; θ), is the mathematical representation of the statistical evidence under this model: if two instances of statistical evidence generate the same likelihood function, they represent evidence of the same strength with respect to all possible pairs (θ1, θ2), so they are equivalent as evidence about θ. This important implication of the law of likelihood is known as the likelihood principle. It in turn has immediate implications for statistical methods in the problem area of informative inference. As Birnbaum (1962) expressed it, 'One basic consequence is that reports of experimental results in scientific journals should in principle be descriptions of likelihood functions.' Note that the likelihood function is, by definition, the function whose ratios measure the strength of the evidence in the observations.

When the elements of a finite population are modelled as realisations of random variables, and a sample is drawn from that population, the observations presumably represent evidence about both the probability model and the population. This chapter considers the definition, construction, and use of likelihood functions for representing and interpreting the observations as evidence about (i) parameters in the probability model and (ii) characteristics of the actual population (such as the population mean or total) in problems where the sampling plan is uninformative (selection probabilities depend only on quantities whose values are known at the time of selection; see also the


discussion by Little in the previous chapter). Although there is broad agreement about the definition of likelihood functions for (i), consensus has not been reached on (ii). We will use simple transparent examples to examine and compare likelihood functions for (i) and (ii), and to study the probability of misleading evidence in relation to (ii). We suggest that Bjørnstad's (1996) assertion that 'Survey sampling under a population model is a field where the likelihood function has not been defined properly' is mistaken, and we note that his proposed redefinition of the likelihood function is incompatible with the law of likelihood.

5.2. THE EVIDENCE IN A SAMPLE FROM A FINITE POPULATION

We begin with the simplest case. Consider a condition (disease, genotype, behavioural trait, etc.) that is either present or absent in each member of a population of N = 393 individuals. Let xt be the zero-one indicator of whether the condition is present in the tth individual. We choose a sample, s, consisting of n = 30 individuals and observe their x-values: xt, t ∈ s.

5.2.1. Evidence about a probability

If we model the x's as realised values of iid Bernoulli(θ) random variables, X1, X2, ..., X393, then our 30 observations constitute evidence about the probability θ. If there are 10 instances of the condition (xt = 1) in the sample, then the evidence about θ is represented by the likelihood function L(θ) ∝ θ^10 (1 − θ)^20, shown by the solid curve in Figure 5.1, which we have standardised so that the maximum value is one. The law of likelihood explains how to interpret this function: the sample constitutes evidence about θ, and for any two values θ1 and θ2 the ratio L(θ1)/L(θ2) measures the strength of the evidence supporting θ1 over θ2. For example, our sample constitutes very strong evidence supporting θ = 1/4 over θ = 3/4 (the likelihood ratio is nearly 60 000), weak evidence for θ = 1/4 over θ = 1/2 (likelihood ratio 3.2), and even weaker evidence supporting θ = 1/3 over θ = 1/4 (likelihood ratio 1.68).
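These ratios can be checked directly from L(θ) ∝ θ^10 (1 − θ)^20; a quick sketch:

```python
def L(th):
    # likelihood for 10 successes and 20 failures (constant factor dropped)
    return th ** 10 * (1 - th) ** 20

# 'nearly 60 000': the ratio is exactly 3**10 = 59049
assert round(L(1/4) / L(3/4)) == 59049
assert round(L(1/4) / L(1/2), 1) == 3.2
assert round(L(1/3) / L(1/4), 2) == 1.68
```

The first ratio is exact because θ^10 (1 − θ)^20 at θ = 1/4 and θ = 3/4 differs only by swapping the roles of θ and 1 − θ, leaving the factor 3^10.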

5.2.2. Evidence about a population proportion

The parameter θ is the probability in a conceptual model for the process that generated the 393 population values x1, x2, ..., x393. It is not the same as the actual proportion with the condition in this population, which is Σt xt/393. The evidence about the proportion is represented by a second likelihood function (derived below), shown in the solid dots in Figure 5.1. This function is discrete, because the only possible values for the proportion are k/393, for k = 0, 1, 2, ..., 393. Since 10 ones and 20 zeros have been observed, population proportions corresponding to fewer than 10 ones (0/393, 1/393, ..., 9/393) and fewer than 20 zeros (374/393, ..., 392/393, 1) are incompatible with the sample, so the likelihood is zero at all of these points.


To facilitate interpretation of this evidence we have supplied some numerical summaries in Figure 5.1, indicating, for example, where the likelihood is maximised and the range of values where the standardised likelihood is at least 1/8. This range, the '1/8 likelihood interval', consists of the values of the proportion that are consistent with the sample in the sense that there is no alternative value that is better supported by a factor of 8 ('moderately strong evidence') or greater.
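Anticipating the hypergeometric form of the likelihood derived in Section 5.2.3, L̃(tU/N) ∝ C(tU, ts) C(N − tU, n − ts), the maximum and the 1/8 likelihood interval for N = 393 can be computed exactly:

```python
from math import comb

N, n, ts = 393, 30, 10   # population size, sample size, sample total

def lik(T):
    # likelihood of population total T, up to a constant
    return comb(T, ts) * comb(N - T, n - ts)

vals = [lik(T) for T in range(N + 1)]
mx = max(vals)
T_hat = vals.index(mx)
assert T_hat == 131 and round(T_hat / N, 3) == 0.333   # matches Figure 5.1

# 1/8 likelihood interval: proportions whose standardised likelihood >= 1/8
inside = [T for T in range(N + 1) if vals[T] >= mx / 8]
lo, hi = min(inside) / N, max(inside) / N
print(round(lo, 3), round(hi, 3))   # compare with Figure 5.1: (0.186, 0.509)
```

`math.comb` returns 0 when the lower argument exceeds the upper one, which automatically zeroes the likelihood at the incompatible proportions noted above.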

To clarify further the distinction between the probability and the proportion, let us suppose that the population consists of only N = 50 individuals, not 393. Now since the sample consists of 30 of the individuals, there are only 20 whose x-values remain unknown, so there are 21 values for the population proportion that are compatible with the observed data, 10/50, 11/50, ..., 30/50. The likelihoods for these 21 values are shown by the open dots in Figure 5.1. The likelihood function that represents the evidence about the probability does not depend on the size of the population that is sampled, but Figure 5.1 makes it clear that the function representing the evidence about the actual proportion in that population depends critically on the population size N.

5.2.3. The likelihood function for a population proportion or total

The discrete likelihoods in Figure 5.1 are obtained as follows. In a population of size N, the likelihood ratio that measures the strength of the evidence supporting one value of the population proportion versus another is the factor

Figure 5.1 Likelihood functions for probability of success (curved line) and for proportion of successes in finite populations under a Bernoulli probability model (population size N = 393, black dots; N = 50, white dots). The sample of n = 30 contains 10 successes. Numerical summaries shown in the figure: maximum at 0.333; 1/8 likelihood interval (0.186, 0.509); 1/32 likelihood interval (0.15, 0.562).

THE EVIDENCE IN A SAMPLE FROM A FINITE POPULATION 63

� � �

Page 87: Analysis of Survey Data (Wiley Series in Survey Methodology) · PDF fileEmail (for orders and ... List of Contributors xviii Chapter 1 Introduction 1 R. L. Chambers and C. J. Skinner

by which their probability ratio is changed by the observations. Now before the sample is observed, the ratio of the probability that the proportion equals k/N to the probability that it equals j/N is

Pr(T = k)/Pr(T = j) = [C(N, k) θ^k (1 − θ)^(N − k)] / [C(N, j) θ^j (1 − θ)^(N − j)],

where T = Σt Xt is the population total.

After observation of a sample of size n in which ts have the condition, the probability that the population proportion equals tU/N is just the probability that the total for the N − n non-sample individuals is tU − ts, so the ratio of the probability that the proportion equals k/N to the probability that it equals j/N is changed to

Pr(T = k | sample)/Pr(T = j | sample) = [C(N − n, k − ts) θ^(k − ts) (1 − θ)^(N − n − k + ts)] / [C(N − n, j − ts) θ^(j − ts) (1 − θ)^(N − n − j + ts)].

The likelihood ratio for proportions k/N and j/N is the factor by which the observations change their probability ratio. It is obtained by dividing this last expression by the one before it, and equals

[C(N − n, k − ts) C(N, j)] / [C(N − n, j − ts) C(N, k)].

Therefore the likelihood function is (Royall, 1976)

L(tU/N) ∝ C(N − n, tU − ts) / C(N, tU).

This is the function represented by the dots in Figure 5.1.

An alternative derivation is based more directly on the law of likelihood: an

hypothesis asserting that the actual population proportion equals tU/N is easily shown to imply (under the Bernoulli trial model) that the probability of observing a particular sample vector consisting of ts ones and n − ts zeros is proportional to the hypergeometric probability

C(tU, ts) C(N − tU, n − ts) / C(N, n).

Thus an alternative representation of the likelihood function is

L̃(tU/N) ∝ C(tU, ts) C(N − tU, n − ts).

It is easy to show that L̃(tU/N) ∝ L(tU/N), so that L and L̃ represent the same likelihood function.

It is also easy to show (using Stirling's approximation) that for any fixed sample the likelihood function for the population proportion, L(tU/N), converges to that for the probability, θ, as N → ∞. That is, for any sequences of values of the population total, t1N and t2N, for which t1N/N → θ1 and t2N/N → θ2, the likelihood ratio is


L(t1N/N) / L(t2N/N) → θ1^ts (1 − θ1)^(n − ts) / [θ2^ts (1 − θ2)^(n − ts)].

5.2.4. The probability of misleading evidence

According to the law of likelihood, the likelihood ratio measures the strength of the evidence in our sample supporting one value of the population proportion versus another. Next we examine the probability of observing misleading evidence. Suppose that the population total is actually t1, so the proportion is p1 = t1/N. For an alternative value, p2 = t2/N, what is the probability of observing a sample that constitutes at least moderately strong evidence supporting p2 over p1? That is, what is the probability of observing a sample for which the likelihood ratio L(p2)/L(p1) ≥ 8? As we have just seen, the likelihood ratio is determined entirely by the sample total ts, which has a hypergeometric probability distribution

Pr(ts = j | T = t1) = C(t1, j) C(N − t1, n − j) / C(N, n).

Thus we can calculate the probability of observing a sample that represents at least moderately strong evidence in favour of p2 over the true value p1. This probability is shown, as a function of p2, by the heavy lines in Figure 5.2 for a population of N = 393 in which the true proportion is p1 = 100/393 = 0.254, or 25.4 per hundred.

Note how the probability of misleading evidence varies as a function of p2. For alternatives very close to the true value (for example, p2 = 101/393), the probability is zero. This is because no possible sample of 30 observations

Figure 5.2 Probability of misleading evidence: probability that a sample of n = 30 from a population of N = 393 will produce a likelihood ratio of 8 or more supporting population proportion p over the true value, 100/393 = 0.254.


represents evidence supporting p2 = 101/393 over p1 = 100/393 by a factor as large as 8. This is true for all values of p2 from 101/393 through 106/393. For the next larger value, p2 = 107/393, the likelihood ratio supporting p2 over p1 can exceed 8, but this happens only when all 30 sample units have the trait, ts = 30, and the probability of observing such a sample, when 100 out of the population of 393 actually have the trait, is very small (4 × 10⁻²⁰).

As the alternative, p2, continues to increase, the probability of misleading evidence grows, reaching a maximum at values in an interval that includes 170/393 = 0.43. For any alternative p2 within this interval a sample in which 13 or more have the trait (ts ≥ 13) gives a likelihood ratio supporting p2 over the true value, 100/393, by a factor of 8 or more. The probability of observing such a sample is 0.0204. Next, as the alternative moves even farther from the true value, the probability of misleading evidence decreases. For example, at p2 = 300/393 = 0.76 only samples with ts ≥ 17 give likelihood ratios as large as 8 in favour of p2, and the probability of observing such a sample is only 0.000 15.

Figure 5.2 shows that when the true population proportion is p1 = 100/393 the probability of misleading evidence (likelihood ratio ≥ 8) does not exceed 0.028 at any alternative, a limit much lower than the universal bound, which is 1/8 = 0.125. For other values of the true proportion the probability of misleading evidence shows the same behaviour: it equals zero near the true value, rises with increasing distance from that value, reaches a maximum, then decreases.
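The curve in Figure 5.2 can be reproduced exactly: for each alternative total t2, sum the hypergeometric probabilities, under the true total t1 = 100, of the sample totals ts whose likelihood ratio for t2 over t1 is at least 8. A sketch:

```python
from math import comb

N, n, t1 = 393, 30, 100

def lik(T, s):
    # hypergeometric likelihood of population total T given sample total s
    return comb(T, s) * comb(N - T, n - s)

def pmf(s):
    # Pr(ts = s | T = t1)
    return comb(t1, s) * comb(N - t1, n - s) / comb(N, n)

def p_misleading(t2):
    # Pr(likelihood ratio >= 8 supporting t2 over the true t1)
    return sum(pmf(s) for s in range(n + 1)
               if lik(t2, s) >= 8 * lik(t1, s))

probs = [p_misleading(t2) for t2 in range(N + 1) if t2 != t1]
assert max(probs) <= 1 / 8          # far below the universal bound
print(round(max(probs), 3))
```

The computation confirms the behaviour described in the text: the probability is exactly zero for alternatives adjacent to 100/393 and stays well under the universal bound 1/8 everywhere.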

5.2.5. Evidence about the average count in a finite population

We have used the population size N = 393 in the above example for consistency with what follows, where we will examine the evidence in a sample from an actual population consisting of 393 short-stay hospitals (Royall and Cumberland, 1981) under a variety of models. Here the variate x is not a zero-one indicator, but a count: the number of patients discharged from a hospital in one month. We have observed the number of patients discharged from each of the n = 30 hospitals in a sample, and are interested in the total number of patients discharged from all 393 hospitals, or, equivalently, the average, x̄U = Σt xt/393.

First we examine the evidence under a Poisson model: the counts x1, x2, ..., x393 are modelled as realised values of iid Poisson(λ) random variables, X1, X2, ..., X393. Under this model the likelihood function for the actual population mean, x̄U, is proportional to the probability of the sample, given that X̄U = x̄U, which for a sample whose mean is x̄s is easily shown to be

L (

Just as with the proportion and the probability in the first problem that we considered, we must be careful to distinguish between the actual population mean (average), x̄_U, and the mean (expected value) in the underlying probability model, E(X) = λ. And just as in the first problem, (i) the likelihood for the


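The Poisson-model likelihood above can be evaluated directly. A minimal sketch for the hospital example (N = 393, n = 30, sample total 24 103 are from the text; the candidate grid and step size are implementation choices):

```python
from math import lgamma, log

N, n, t_s = 393, 30, 24103   # population size, sample size, sample total

def log_comb(a, b):
    # log of the binomial coefficient C(a, b), via log-gamma
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def log_lik(t_U):
    # log L for a candidate population total t_U = N * xbar_U (t_U >= t_s)
    return (log_comb(t_U, t_s)
            + t_s * log(n / N)
            + (t_U - t_s) * log(1 - n / N))

# Scan candidate population totals; the maximum is near t_s * N / n
candidates = range(t_s, 600_000, 10)
best = max(candidates, key=log_lik)
print(best / N)   # maximising population mean, close to the sample mean 803.4
```

As expected for the binomial-thinning form of the likelihood, the maximising value of the population mean essentially reproduces the sample mean x̄_s.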


finite population mean, L(x̄_U), is free of the model parameter, λ, and (ii) for any given sample, as N → ∞ the likelihood function for the population mean, L(x̄_U), converges to the likelihood function for the expected value λ, which is proportional to λ^{nx̄_s} e^{−nλ}. That is, for any positive values x̄_1 and x̄_2,

L(x̄_1)/L(x̄_2) → (x̄_1/x̄_2)^{nx̄_s} e^{−n(x̄_1 − x̄_2)}.

In our sample of n = 30 from the 393 hospitals the mean number of patients discharged per hospital is x̄_s = 24 103/30 = 803.4, and the likelihood function L(x̄_U) is shown by the dashed line in Figure 5.3.

The Poisson model for count data is attractive because it provides a simple explicit likelihood function for the population mean. But the fact that this model has only one parameter makes it too inflexible for most applications, and we will see that it is quite inappropriate for the present one. A more widely useful model is the negative binomial with parameters r and θ, in which the count has the same distribution as the number of failures before the rth success in iid Bernoulli(θ) trials. Within this more general model the Poisson(λ) distribution appears as the limiting case when the expected value r(1 − θ)/θ is fixed at λ, while r → ∞.

Under the negative binomial model the likelihood for the population mean is free of θ, but it does involve the nuisance parameter r:

L(x̄_U, r) ∝ P(sample | X̄_U = x̄_U) = [Π_{t∈s} C(x_t + r − 1, r − 1)] C(Nx̄_U − nx̄_s + (N − n)r − 1, (N − n)r − 1) / C(Nx̄_U + Nr − 1, Nr − 1).
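This likelihood, and the profile over the nuisance parameter r, can be computed with log-gamma functions. A sketch under stated assumptions: the individual hospital counts are not listed in this excerpt, so the sample below is hypothetical, and the r grid is a crude stand-in for a proper numerical maximisation:

```python
from math import lgamma

# Hypothetical sample of 30 counts standing in for the hospital discharge data
sample = [520, 640, 700, 750, 780, 800, 810, 830, 850, 900] * 3
N, n = 393, len(sample)
t_s = sum(sample)

def log_comb(a, b):
    # log C(a, b) via log-gamma; also valid for non-integer arguments such as r = 0.5
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def log_lik(t_U, r):
    # log L(xbar_U, r) for a candidate population total t_U = N * xbar_U
    lp = sum(log_comb(x + r - 1, r - 1) for x in sample)
    lp += log_comb(t_U - t_s + (N - n) * r - 1, (N - n) * r - 1)
    lp -= log_comb(t_U + N * r - 1, N * r - 1)
    return lp

def profile_log_lik(t_U, r_grid=(0.5, 1, 2, 4, 8, 16, 32)):
    # crude profile likelihood: maximise out the nuisance parameter r on a grid
    return max(log_lik(t_U, r) for r in r_grid)

# The profile likelihood is larger near the sample mean than far from it
near = profile_log_lik(t_s * N // n)
far = profile_log_lik(2 * t_s * N // n)
print(near > far)   # True
```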

[Figure 5.3 here. Caption: Likelihood functions for expected number and population mean number of patients discharged in population of N = 393 hospitals under negative binomial model. Dashed line shows likelihood for population mean under Poisson model. Sample of n = 30 hospitals with mean 803.4 patients/hospital. Annotations: max at 803.4; 1/8 LI (601, 1110); 1/32 LI (550, 1250); x-axis: number of patients discharged per hospital.]




The profile likelihood, L_p(x̄_U) = max_r L(x̄_U, r), is easily calculated numerically, and it too is shown in Figure 5.3 (the solid line). For this sample the more general negative binomial model gives a much more conservative evaluation of the evidence about the population mean, x̄_U, than the Poisson model.

Both of the likelihood functions in Figure 5.3 are correct. Each one properly represents the evidence about the population mean in this sample under the corresponding model. Which is the more appropriate? Recall that the negative binomial distribution approaches the Poisson as the parameter r approaches infinity. The sample itself constitutes extremely strong evidence supporting small values of r (close to one) over the very large values which characterise an approximate Poisson distribution. This is shown by the profile likelihood function for r, or equivalently, what is easier to draw and interpret, the profile likelihood for 1/√r, which places the Poisson distribution (r = ∞) at the origin. This latter function is shown in Figure 5.4. For comparison, Figure 5.4 also shows the profile likelihood function for 1/√r generated by a sample of 30 independent observations from a Poisson probability distribution whose mean equals the hospital sample mean. The evidence for 'extra-Poisson variability' in the sample of hospital discharge counts is very strong indeed, with values of 1/√r near one supported over values near zero (the Poisson model) by enormous factors.

Under the previous models (Bernoulli and Poisson), for a fixed sample the likelihood function for the finite population mean converges, as the population size grows, to the likelihood for the expected value. Whether the same result applies to the profile likelihood function under the negative binomial model is an interesting outstanding question.

[Figure 5.4 here. Caption: Profile likelihood for cv(E(X)) = 1/√r, negative binomial model. Two samples: (1) numbers of patients discharged from 30 hospitals, and (2) 30 iid Poisson variables. Annotations: hospital sample max at 0.956; 1/8 SI (0.817, 1.14); 1/32 SI (0.782, 1.2); x-axis: cv(E(X)).]




5.2.6. Evidence about a population mean under a regression model

For each of the 393 hospitals in this population we know the value of a potentially useful covariate: the number of beds, z. For our observed sample of n = 30 counts, let us examine the evidence about the population mean number of patients discharged, x̄_U, under a model that includes this additional information about hospital size. The nature of the variables, as well as inspection of a scatterplot of the sample, suggests that as a first approximation we consider a proportional regression model (expected number of discharges proportional to the number of beds, z), with variance increasing with z: E(X) = βz and var(X) = σ²z. Although a negative binomial model with this mean and variance structure would be more realistic for count data such as these, we will content ourselves for now with the simple and convenient normal regression model.

Under this model the likelihood function for the population mean depends on the nuisance parameter σ², but not the slope β:

L(x̄_U, σ²) ∝ Pr(sample | X̄_U = x̄_U; β, σ²) ∝ σ^{−n} exp{ −(1/2σ²) [ Σ_{t∈s} x_t²/z_t + (Nx̄_U − nx̄_s)²/(Nz̄_U − nz̄_s) − (Nx̄_U)²/(Nz̄_U) ] }.

Maximising over the nuisance parameter σ² produces the profile likelihood function

L_p(x̄_U) ∝ [1 + e_s²/(n − 1)]^{−n/2},

where

e_s = [x̄_U − (x̄_s/z̄_s) z̄_U] / u_s^{1/2},  with  u_s = [z̄_U (z̄_U − (n/N) z̄_s) / (n(n − 1) z̄_s)] Σ_{t∈s} (1/z_t) (x_t − (x̄_s/z̄_s) z_t)².

This profile likelihood function, with n replaced by (n − 1) in the exponent (as suggested by Kalbfleisch and Sprott, 1970), is shown in Figure 5.5.
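The quantities e_s, u_s and L_p are straightforward to compute. A minimal sketch, assuming hypothetical (beds, discharges) pairs and an invented population mean bed count, since the actual data are not listed in this excerpt:

```python
from math import sqrt

# Hypothetical sample of (beds z_t, discharges x_t) pairs
data = [(100 + 10 * t, 3.0 * (100 + 10 * t) + (50 if t % 2 else -50))
        for t in range(30)]
N, n = 393, len(data)
zbar_U = 260.0                       # assumed population mean of z

zbar_s = sum(z for z, _ in data) / n
xbar_s = sum(x for _, x in data) / n
b = xbar_s / zbar_s                  # ratio slope estimate xbar_s / zbar_s

u_s = (zbar_U * (zbar_U - (n / N) * zbar_s) / (n * (n - 1) * zbar_s)
       * sum((x - b * z) ** 2 / z for z, x in data))

def profile_lik(xbar_U):
    # L_p up to a constant, with n replaced by n - 1 in the exponent
    e = (xbar_U - b * zbar_U) / sqrt(u_s)
    return (1 + e ** 2 / (n - 1)) ** (-(n - 1) / 2)

peak = b * zbar_U                    # e_s = 0 here, so L_p is maximised
print(profile_lik(peak), profile_lik(peak) > profile_lik(peak + 100))  # 1.0 True
```

The maximum sits at the ratio-model prediction (x̄_s/z̄_s) z̄_U, mirroring the location of the peak in Figure 5.5.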

The sample whose evidence is represented in the likelihood functions in Figures 5.3, 5.4, and 5.5 is the 'best-fit' sample described by Royall and Cumberland (1981). It was obtained by ordering the 393 hospitals according to the size variable (z), then drawing a centred systematic sample from the ordered list. For this demonstration population the counts (the x_t) are actually known for all 393 hospitals, so we can find the true population mean and see how well our procedure for making observations and interpreting them as evidence under various probability models has worked in this instance. The total number of discharges in this population is actually 320 159, so the true mean is 814.65. Figures 5.3 and 5.5 show excellent results, with the exception of the clearly inappropriate Poisson model in Figure 5.3, whose overstatement of the evidence resulted in its 1/8 likelihood interval excluding the true value.




[Figure 5.5 here. Caption: Likelihood for population mean number of patients discharged in population of N = 393 hospitals: normal regression model. Sample of n = 30 hospitals with mean 803.4 patients/hospital. Annotations: max at 806.4; 1/8 LI (712, 901); 1/32 LI (681, 932); x-axis: number of patients discharged per hospital.]

5.3. DEFINING THE LIKELIHOOD FUNCTION FOR A FINITE POPULATION

Bjørnstad (1996) proposed a general definition of the 'likelihood function' that, in the case of a finite population total or mean, differs from the one derived from the law of likelihood, which we have adopted in this chapter. In our first example (a population of N under the Bernoulli(θ) probability model) a sample of n units in which the number with the trait is t_s is evidence about θ. That evidence is represented by the likelihood function

L(θ) = θ^{t_s} (1 − θ)^{n − t_s}.

The sample is also evidence about the population total t_U, represented by the likelihood function

L(t_U) = C(t_U, t_s) C(N − t_U, n − t_s).

Bjørnstad's function is the product of L(t_U) and a function that is independent of the sample, namely the model probability that T = t_U:

L_B(t_U; θ) ∝ L(t_U) Pr(T = t_U; θ) ∝ C(t_U, t_s) C(N − t_U, n − t_s) C(N, t_U) θ^{t_U} (1 − θ)^{N − t_U}.

Consider a population of N = 100 and a sample that consists of half of them (n = 50). If 20% of the sample have the trait (sample total t_s = 10), then our observations represent very strong evidence supporting the hypothesis that 20% of the population have the trait (population total t_U = 20) versus the hypothesis that 50% have it (t_U = 50): L(20)/L(50) = 1.88 × 10^8. This is what the law of likelihood says, because the hypothesis that t_U = 20 implies that the probability of our sample is 1.88 × 10^8 times greater than the probability implied by the hypothesis that t_U = 50, regardless of the value of θ. Compare this likelihood ratio with the values of the Bjørnstad function:

L_B(20; θ)/L_B(50; θ) = [(1 − θ)/θ]^{30}.

This is the ratio of the conditional probabilities, Pr(T = 20 | t_s = 10; θ)/Pr(T = 50 | t_s = 10; θ). It represents, not the evidence in the observations, but a synthesis of that evidence and the probabilities Pr(T = 20; θ) and Pr(T = 50; θ), and it is strongly influenced by these 'prior' probabilities. For example, when the parameter θ equals 1/2 the Bjørnstad ratio is L_B(20; 0.5)/L_B(50; 0.5) = 1. If this were a likelihood ratio, its value, 1, would signify that when θ is 1/2 the sample represents evidence of no strength at all in support of t_U = 20 over t_U = 50, which is quite wrong. The Bjørnstad function's ratio of 1 results from the fact that the very strong evidence in favour of

t_U = 20 in the sample, L(20)/L(50) = 1.88 × 10^8, is the exact reciprocal of the very large probability ratio in favour of t_U = 50 that is given by the model independently of the empirical evidence:

Pr(T = 50; θ = 1/2)/Pr(T = 20; θ = 1/2) = C(100, 50)/C(100, 20) = 1.88 × 10^8.

In our approach, the likelihood function is derived from the law of likelihood, and it provides a mathematical representation of the evidence in the observations (the empirical evidence) vis-à-vis the population total, under the adopted model. Our goal is analytic: to isolate and examine the empirical evidence per se. The Bjørnstad function represents something different, a synthesis of the empirical evidence with model probabilities. Whatever its virtues and potential uses in estimation and prediction, the function L_B(t_U; θ) does not represent what we seek to measure and communicate, namely, the evidence about t_U in the sample.
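The reciprocal relationship described above is easy to verify numerically; a short sketch (all numbers are taken from the example in the text):

```python
from math import comb

N, n, t_s = 100, 50, 10

def L(t_U):
    # likelihood of the population total under the law of likelihood
    return comb(t_U, t_s) * comb(N - t_U, n - t_s)

evidence = L(20) / L(50)              # likelihood ratio for t_U = 20 vs 50
prior = comb(N, 50) / comb(N, 20)     # model probability ratio at theta = 1/2

print(f"{evidence:.3g}")              # 1.88e+08
print(abs(evidence / prior - 1) < 1e-9)   # True: the Bjørnstad ratio is 1
```

The empirical evidence ratio and the model's 'prior' probability ratio cancel exactly, which is why the Bjørnstad function reports a ratio of 1 at θ = 1/2.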




PART B

Categorical Response Data


CHAPTER 6

Introduction to Part B

C. J. Skinner

6.1. INTRODUCTION

This part of the book deals with the analysis of survey data on categorical responses. Chapter 7 by Rao and Thomas provides a review of methods for handling complex sampling schemes for a wide range of methods of categorical data analysis. Chapter 8 by Scott and Wild extends the discussion of logistic regression in Chapter 7 in the specific context of case-control studies, where sampling is stratified by the categories of the binary response variable. In this chapter, we provide an introduction, as background to these two chapters. Our main emphasis will be on the basic methods used in Rao and Thomas's chapter.

Two broad ways of analysing categorical response data may be distinguished:

(i) analysis of tables;
(ii) analysis of unit-level data.

In the first case, the analysis effectively involves two stages. First, a table is constructed from the unit-level data. This will usually involve the cross-classification of two or more categorical variables. The elements of the table will typically consist of estimated proportions. These estimated proportions, together perhaps with associated estimates of the variances and covariances of these estimated proportions, may be considered as 'sufficient statistics', carrying all the relevant information in the data. This tabular information will then be analysed in some way, as the second stage of the analysis.

In the second case, there is no intermediate step of constructing a table. The basic unit-level data are analysed directly. A simple example arises in fitting a logistic regression model with a binary response variable and a single continuous covariate. Because the covariate is continuous it is not possible to construct sufficient statistics for the unit-level data in the form of a standard table. Instead the data are analysed directly. This approach may be viewed as a generalisation of the first. We deal with these two cases separately in the following two sections.

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.


6.2. ANALYSIS OF TABULAR DATA

The analysis of tabular data builds most straightforwardly on methods of descriptive surveys. Tables are most commonly formed by cross-classifying categorical variables, with the elements of the table consisting of proportions. These may be the (estimated) proportions of the finite population falling into the various cells of the table. Alternatively, the proportions may be domain proportions, either because the table refers only to units falling into a given domain or, for example in the case of a two-way table, because the proportions of interest are row proportions or column proportions. In all these cases, the estimation of either population proportions or domain proportions is a standard topic in descriptive sample surveys (Cochran, 1977, Ch. 3).

6.2.1. One-way classification

Consider first a one-way classification, formed from a single categorical variable, Y, taking possible values i = 1, …, I. For a finite population of size N, we let N_i denote the number of units in the population with Y = i. Assuming then that the categories of Y are mutually exclusive and exhaustive, we have Σ_i N_i = N. In categorical data analysis it is more common to analyse the

population proportions corresponding to the population counts N_i than the population counts themselves. These (finite) population proportions are given by π_i = N_i/N, i = 1, …, I, and may be interpreted as the probabilities of falling into the different categories of Y for a randomly selected member of the population (Bishop, Fienberg and Holland, 1975, p. 9). Instead of considering finite population proportions, it may be of more scientific interest to consider corresponding model probabilities in a superpopulation model. Suppose, for example, that the vector (N_1, …, N_I) is assumed to be the realisation of a multinomial random variable with parameters N and (π_1, …, π_I), so that the probability π_i corresponds to the finite population proportion N_i/N. Suppose also, for example, that it is of interest to test whether two categories are equiprobable. Then it might be argued that it would be more appropriate to test the equality of two model probabilities than the equality of two finite population proportions, because the latter proportions would not be exactly equal in a finite population except by rare chance (Cochran, 1977, p. 39). Rao and Thomas use the same notation π_i in Chapter 7 to denote either the finite population proportion N_i/N or the corresponding model probability. It is assumed that the analyst will have specified which one of these parameters is of interest, dependent upon the purpose of the analysis. Note that in either case Σ_i π_i = 1.

One reason why it is convenient to use the same notation π_i for both the finite population and model parameters is that it is natural to use the same point estimator for each. A common general approach in descriptive surveys is to estimate the ith proportion (or probability) by


π̂_i = N̂_i/N̂  (6.1)


where N̂_i = Σ_{t∈s} w_t I(y_t = i) and N̂ = Σ_{t∈s} w_t.

Here, it is assumed that Y is measured for a sample s, with y_t and w_t denoting the value of Y and the value of a survey weight respectively for the tth element of s, and with I(.) denoting the indicator function. The survey weight w_t may be equal to the reciprocal of the sample inclusion probability of sample unit t or may be a more complex weight (Chambers, 1996), reflecting for example poststratification to make use of auxiliary population information or to adjust for nonresponse in some way. Note that the multiplication of w_t by a constant leaves π̂_i unchanged. This can be useful in practice for secondary data analysts, since the weights provided to such survey data users are often arbitrarily scaled, for example to force the sum of the weights to equal the sample size. Such scaling has no impact on the estimator π̂_i.
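A minimal sketch of this estimator (the toy data are invented for illustration), with the scale invariance of the weights checked directly:

```python
def prop_estimate(y, w, category):
    # pi_hat_i: sum of weights for units with y_t = i, over the total weight
    num = sum(wt for yt, wt in zip(y, w) if yt == category)
    return num / sum(w)

# Toy sample: categories of Y with unequal survey weights
y = [1, 1, 2, 3, 1, 2]
w = [2.0, 1.0, 1.5, 0.5, 1.0, 2.0]

p1 = prop_estimate(y, w, 1)
# Rescaling every weight by a constant leaves the estimator unchanged
p1_scaled = prop_estimate(y, [10 * wt for wt in w], 1)
print(p1, p1 == p1_scaled)   # 0.5 True
```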

6.2.2. Multi-way classifications and log-linear models

The one-way classification above, based upon the single categorical variable Y, may be extended to a multi-way classification, formed by cross-classifying two or more categorical variables. Consider, for example, a 2 × 2 table, formed by cross-classifying two binary variables, A and B, each taking the values 1 or 2. The four cells of the table might then be labelled (a, b), a = 1, 2; b = 1, 2. For the purpose of defining log-linear models, as discussed by Rao and Thomas in Chapter 7, these cells may alternatively be labelled just with a single index i = 1, …, 4. We shall follow the convention, adopted by Rao and Thomas, of letting the index i correspond to the lexicographic ordering of the cells, as illustrated in Table 6.1.

The same approach may be adopted for tables of three or more dimensions. In general, we suppose the cells are labelled i = 1, …, I in lexicographic order, where I denotes the number of cells in the table. The cell proportions (or probabilities) π_i are then gathered together into the I × 1 vector π = (π_1, …, π_I)′. A model for the table may then be specified by representing π as a function of an r × 1 vector θ of parameters, that is writing

π = π(θ).  (6.2)

Because the vector π is subject to the constraint π′1 = 1, where 1 is the I × 1 vector of ones, the maximum necessary value of r is I − 1. For, in this case

Table 6.1 Alternative labels for cells of a 2 × 2 table.

Bivariate labels ordered lexicographically (a, b)   Single index i
(1,1)   1
(1,2)   2
(2,1)   3
(2,2)   4




we could define a saturated model, with θ as the first I − 1 elements of π and the Ith element of π expressed in terms of θ using π_I = 1 − (θ_1 + … + θ_{I−1}).

A log-linear model for the table is one which may be expressed as

log(π) = u(θ)1 + Xθ,  (6.3)

as in Equation (7.1), where log(π) denotes the I × 1 vector of log(π_i), X is a known I × r matrix and u(θ) is a normalising factor that ensures Σ_i π_i = 1. Note the key feature of this model that, apart from the normalising factor, the logarithms of the π_i may be expressed as linear combinations of the elements of the parameter vector θ.

Consider, for example, the model of independence in the 2 × 2 table discussed above. Denoting the probability of falling into cell (a, b) as π_{AB}(a, b), the independence of A and B implies that π_{AB}(a, b) may be expressed as a product π_A(a)π_B(b) of marginal probabilities. Taking logarithms, we have the additive property

log(π_{AB}(a, b)) = log(π_A(a)) + log(π_B(b)).

Since Σ_a π_A(a) = Σ_b π_B(b) = 1, the cell probabilities π_{AB}(a, b) can be expressed in terms of two parameters, say π_A(1) and π_B(1). To express the model of independence as a log-linear model, we first reparameterise (e.g. Agresti, 1990, Ch. 5) by transforming from π_A(1) and π_B(1) to the parameter vector θ = (θ_1, θ_2)′, where

θ_1 = log(π_A(1)) − [log(π_A(1)) + log(π_A(2))]/2,
θ_2 = log(π_B(1)) − [log(π_B(1)) + log(π_B(2))]/2.

Then, setting

u(θ) = [log(π_A(1)) + log(π_A(2))]/2 + [log(π_B(1)) + log(π_B(2))]/2

(which is a function of θ through the reparameterisation) and setting π_i = π_{AB}(a, b) with the correspondence between i and (a, b) in Table 6.1, we may express the logarithms of the cell probabilities in a matrix equation as

log(π) = u(θ)1 + Xθ,  with  X =
( 1   1 )
( 1  −1 )
(−1   1 )
(−1  −1 ).

This equation is then a special case of Equation (6.3) above. For further discussion of log-linear models and their interpretation see Agresti (1990, Ch. 5).
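A quick numeric sketch of this 2 × 2 independence model (the θ values are arbitrary illustrations), confirming that the exponential-family form with the design matrix above does produce cell probabilities satisfying independence:

```python
from math import exp

# Rows of X in lexicographic cell order (1,1), (1,2), (2,1), (2,2)
X = [(1, 1), (1, -1), (-1, 1), (-1, -1)]

def cell_probs(theta1, theta2):
    # log pi = u(theta) 1 + X theta; the normalising constant plays the role
    # of exp(u(theta)) and is chosen so that the probabilities sum to 1
    unnorm = [exp(x1 * theta1 + x2 * theta2) for x1, x2 in X]
    total = sum(unnorm)
    return [v / total for v in unnorm]

pi = cell_probs(0.3, -0.2)
# Independence holds: pi(1,1)*pi(2,2) == pi(1,2)*pi(2,1)
print(abs(pi[0] * pi[3] - pi[1] * pi[2]) < 1e-12)   # True
```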

Having defined the general log-linear model above, we now consider inference about the parameter vector θ. We discuss point estimation first. In standard treatments of log-linear models, a common approach is to use the maximum likelihood estimator (MLE) under the assumption that the sample cell counts are generated by multinomial sampling. The MLE of θ is then obtained by solving the following likelihood equations for θ:

X′π(θ) = X′p,




where p is the (unweighted) vector of sample cell proportions (Agresti, 1990, section 6.4). In the case of complex sampling schemes, the multinomial assumption will usually be unreasonable and the MLE may be inconsistent for θ. Instead, Rao and Thomas (Chapter 7) discuss how θ may be estimated consistently by the pseudo-MLE of θ, obtained by solving the same equations, but with p replaced by π̂ = (π̂_1, …, π̂_I)′, where π̂_i is a design-consistent estimator of π_i such as in (6.1) above.
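One way to solve these pseudo-likelihood equations numerically is simple gradient ascent on the pseudo-log-likelihood Σ_i π̂_i log π_i(θ), whose gradient is X′(π̂ − π(θ)). A self-contained sketch for the 2 × 2 independence model (the π̂ values are invented for illustration; a production fit would use Newton-type iterations instead):

```python
from math import exp

X = [(1, 1), (1, -1), (-1, 1), (-1, -1)]   # independence-model design matrix

def cell_probs(t1, t2):
    unnorm = [exp(x1 * t1 + x2 * t2) for x1, x2 in X]
    total = sum(unnorm)
    return [v / total for v in unnorm]

def pseudo_mle(pi_hat, steps=2000, lr=0.5):
    # ascend the pseudo-log-likelihood; gradient is X'(pi_hat - pi(theta))
    t1 = t2 = 0.0
    for _ in range(steps):
        pi = cell_probs(t1, t2)
        t1 += lr * sum(x1 * (ph - p) for (x1, _), ph, p in zip(X, pi_hat, pi))
        t2 += lr * sum(x2 * (ph - p) for (_, x2), ph, p in zip(X, pi_hat, pi))
    return t1, t2

pi_hat = [0.3, 0.2, 0.3, 0.2]          # design-consistent cell estimates
t1, t2 = pseudo_mle(pi_hat)
pi_fit = cell_probs(t1, t2)
# The likelihood equations X' pi(theta) = X' pi_hat hold at the solution
print(all(abs(sum(x[j] * (ph - p) for x, ph, p in zip(X, pi_hat, pi_fit))) < 1e-6
          for j in range(2)))   # True
```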

Let us next consider the estimation of standard errors or, more generally, the variance-covariance matrix of the pseudo-MLE of θ. Under multinomial sampling, the usual estimated covariance matrix of the MLE θ̂ is given by (Agresti, 1990, section 6.4.1; Rao and Thomas, 1988)

n^{−2} [X′P(θ̂)X]^{−1},

where n is the sample size upon which the table is based and P(θ) is the covariance matrix of p under multinomial sampling (with the log-linear model holding):

P(θ) = n^{−1} [D(π(θ)) − π(θ)π(θ)′],

where D(π) denotes the I × I diagonal matrix with diagonal elements π_i. Replacing the MLE of θ by its pseudo-MLE in this expression will generally fail to account adequately for the effect of a complex sampling design on standard errors. Instead it is necessary to use an estimator of the variance-covariance matrix of π̂ which makes appropriate allowance for the actual complex design. Such an estimator is denoted V̂ by Rao and Thomas (Chapter 7). There are a variety of standard methods from descriptive surveys which may be used to obtain such an estimator; see for example Lehtonen and Pahkinen (1996, Ch. 6) or Wolter (1985).

Given V̂, a suitable estimator for the covariance matrix of θ̂ is (Rao and Thomas, 1989)

n^{−2} [X′P(θ̂)X]^{−1} (X′V̂X) [X′P(θ̂)X]^{−1},

where θ̂ is now the pseudo-MLE of θ. Rao and Thomas (1988) provide a numerical illustration of how the standard errors obtained from the diagonal elements of this matrix may differ from the usual standard errors based upon multinomial sampling assumptions.

Rao and Thomas (Chapter 7) focus on methods for testing hypotheses about θ. Many hypotheses of interest in contingency tables may be expressed as a hypothesis about θ in a log-linear model. Rao and Thomas consider the impact of complex sampling schemes on the distribution of standard Pearson and likelihood ratio test statistics and discuss the construction of Rao-Scott adjustments to these standard procedures. As in the construction of appropriate standard errors, these adjustments involve the use of the matrix V̂. Rao and Thomas also consider alternative test procedures, such as the Wald test, which also make use of V̂, and make comparisons between the different methods.




6.2.3. Logistic models for domain proportions

A log-linear model for a cross-classification treats the cross-classifying variables symmetrically. In many applications, however, one of these variables will be considered as the response and the others as the explanatory variables (also called factors). The simplest case involves the cross-classification of a binary response variable Y by a single categorical variable X. Let the categories of Y be labelled as 0 and 1, the categories (also called levels) of X as 1, …, I, and let N_ji be the number of population units with X = i and Y = j. In order to study how Y depends upon X, it is natural to compare the proportions N_1i/N_i, where N_i = N_0i + N_1i, for i = 1, …, I. For example, suppose Y denotes smoking status (1 if smoker, 0 if non-smoker) and X denotes age group. Then N_1i/N_i is the proportion of people in age group i who smoke. If these proportions are equal then there is no dependence of smoking status on age group. On the other hand, if the proportions vary then we may wish to model how the N_1i/N_i depend upon i.

In contingency table terminology, if Y and X define the rows and columns of a two-way table then the N_1i/N_i are column proportions, as compared with the N_ji/N (N = Σ_i N_i), used to define a log-linear model, which are cell proportions. In survey sampling terminology, the subsets of the population defined by the categories of X are referred to as domains and the N_1i/N_i are referred to as domain proportions. See Section 7.4, where the notation π_1i = N_1i/N_i is used.

In a more general setting, there may be several categorical explanatory variables (factors). For example, we may be interested in how smoking status depends upon age group, gender and social class. In this case it is still possible to represent the cells in the cross-classification of these factors by a single index i = 1, …, I, by using a lexicographic ordering, as in Table 6.1. For example, if age group, gender and social class have five, two and five categories (levels) respectively, then I = 5 × 2 × 5 = 50 domains are required. In this case, each domain i refers to a cell in the cross-classification of the explanatory factors, i.e. a specific combination of the levels of the factors.

The dependence of a binary response variable on one or more explanatory variables may then be represented by modelling the dependence of the π_1i on the factors defining the domains. Just as π_i could refer to a finite population proportion or model probability, depending upon the scientific context, π_1i may refer to either a domain proportion N_1i/N_i or a corresponding model probability. For example, it might be assumed that N_1i is generated by a binomial superpopulation model (conditional on N_i): N_1i | N_i ~ Bin(N_i, π_1i). A common model for the dependence of π_1i on the explanatory factors is a logistic regression model, which is specified by Rao and Thomas in Section 7.4 as

log [π_1i/(1 − π_1i)] = x_i′β,  (6.4)

where x_i is an m × 1 vector of known constants, which may depend upon the factor levels defining domain i, and β is an m × 1 vector of regression parameters. In general m ≤ I and the model is saturated if m = I. For example, suppose that there are I = 4 domains defined by the cross-classification of




two binary factors. A saturated model in which π_1i may vary freely according to the levels of these two factors may be represented by this model with m = 4 parameters. On the other hand a model involving main effects of the two factors but no interaction (on the logistic scale) may be represented with a β vector of dimension m = 3. See Agresti (1990, section 4.3.2) for further discussion of such models.
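The main-effects model can be sketched concretely; the design vectors below follow the ±1 coding convention, and the β values are arbitrary illustrations, not from the source:

```python
from math import exp, log

# Domain design vectors x_i for the I = 4 domains formed by two binary
# factors (lexicographic order as in Table 6.1)
main_effects = [          # x_i = (intercept, factor A effect, factor B effect)
    (1, 1, 1),
    (1, 1, -1),
    (1, -1, 1),
    (1, -1, -1),
]
beta = (-0.5, 0.3, 0.1)   # m = 3: main effects, no interaction

def pi_1(x_i):
    # invert log[pi/(1 - pi)] = x_i' beta
    eta = sum(a * b for a, b in zip(x_i, beta))
    return 1.0 / (1.0 + exp(-eta))

def logit(p):
    return log(p / (1 - p))

probs = [pi_1(x) for x in main_effects]
# No interaction on the logistic scale: the log-odds difference between the
# levels of factor B is the same within each level of factor A
print(round(logit(probs[0]) - logit(probs[1]), 6) ==
      round(logit(probs[2]) - logit(probs[3]), 6))   # True
```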

Rao and Thomas discuss how β may be estimated by pseudo-MLE. This requires the use of consistent estimates of both the π_1i and the proportions N_i/N. These may be obtained from standard descriptive survey estimates of the N_1i and the N_i, as in (6.2). Rao and Thomas discuss appropriate methods for testing hypotheses about β, which require an estimated covariance matrix for the vector of estimated π_1i. These approaches and methods for estimating the standard errors of the pseudo-MLEs are discussed further in Roberts, Rao and Kumar (1987) and are extended, for example to polytomous Y, in Rao, Kumar and Roberts (1989).

6.3. ANALYSIS OF UNIT-LEVEL DATA

The approach to logistic regression described in the previous section can become impractical in many survey settings. First, survey data often include a large number of variables which an analyst may wish to consider using as explanatory variables, and this may result in small or zero sample sizes in many of the domains formed by cross-classifying these explanatory variables. This may lead in particular to unstable π̂_1i and problems in estimating the covariance matrix of the π̂_1i. Second, some potential explanatory variables may be continuous and it may be undesirable to convert these arbitrarily to categorical variables. It may therefore be preferable to undertake regression modelling at the unit level.

6.3.1. Logistic regression

The basic unit-level data consist of the values (xt , yt) for units t in the sample s,where xt is the vector of values of the explanatory variables and yt is the value(0 or 1) of the response variable. A conventional logistic regression modelassumes that yt is generated by a model, for fixed x t , in such a way that theprobability that yt 1, denoted E ( yt), obeys

log[μ_t(β)/(1 − μ_t(β))] = x_t′β.

This is essentially the same model as in (6.4), except that Pr(y_t = 1) can now vary from unit to unit, whereas before the same probability applied to all units within a given domain.

Under the conventional assumption that the sample y_t values independently follow this model, the MLE of β is obtained by solving

ANALYSIS OF UNIT-LEVEL DATA 81

Σ_{t∈s} u_t(β) = 0,  (6.5)

Page 105: Analysis of Survey Data (Wiley Series in Survey Methodology) · PDF fileEmail (for orders and ... List of Contributors xviii Chapter 1 Introduction 1 R. L. Chambers and C. J. Skinner

where u_t(β) = [y_t − μ_t(β)]x_t is the component of the score function for unit t. For a complex sampling scheme, this assumption is invalid and the MLE may be inconsistent. Instead, Rao and Thomas consider the pseudo-MLE obtained from solving the weighted version of these score equations:

Σ_{t∈s} w_t u_t(β) = 0,

where w_t is the survey weight for unit t. For standard Horvitz–Thompson weights, i.e. proportionate to the reciprocals of the inclusion probabilities, the pseudo-MLE will be design consistent for the census parameter β_U (as discussed in Section 1.2), the value of β solving the corresponding finite population score equations:

Σ_{t∈U} u_t(β) = 0.  (6.6)

If the model is true at the population level then this census parameter β_U will be model consistent for the model parameter β and the pseudo-MLE may be viewed as a consistent estimator (with respect to both the design and the model) of β. If the model is not true, only holding approximately, then the pseudo-MLE is still consistent for the census parameter. The extent to which this is a sensible parameter to make inference about is discussed further below in Section 6.3.2 and by Scott and Wild in Chapter 8.
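The weighted score equations above can be solved by Newton–Raphson in a few lines. The following sketch is not from the chapter (the function name, data and weights are illustrative); it simply iterates on the weighted score and information implied by the logistic model:

```python
import numpy as np

def pseudo_mle_logistic(X, y, w, iters=30):
    """Solve sum_t w_t [y_t - mu_t(beta)] x_t = 0 by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))              # mu_t(beta)
        score = X.T @ (w * (y - mu))                        # weighted score
        info = (X * (w * mu * (1.0 - mu))[:, None]).T @ X   # weighted information
        beta = beta + np.linalg.solve(info, score)
    return beta

# Illustrative data: intercept plus one covariate, unequal survey weights.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
prob = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x)))
y = (rng.uniform(size=n) < prob).astype(float)
w = rng.uniform(1.0, 3.0, size=n)       # weights unrelated to y given x
beta_hat = pseudo_mle_logistic(X, y, w)
```

Setting w to the reciprocals of the inclusion probabilities gives the pseudo-MLE; setting w = 1 recovers the ordinary MLE of (6.5).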

Rao and Thomas discuss the estimation of the covariance matrix of the pseudo-MLE of β as well as procedures for testing hypotheses about β. One advantage of the approach to fitting logistic regression models discussed in the previous section is that goodness-of-fit tests are directly generated by comparing the survey estimates of the domain proportions with the proportions predicted by the model. Such goodness-of-fit tests are not directly generated when the model is fitted to unit-level data, and Rao and Thomas only discuss testing hypotheses about β, treating the logistic model as given. See Korn and Graubard (1999, section 3.6) for further discussion of the assessment of the goodness of fit of logistic regression models with survey data.

Properties of pseudo-MLE are explored further by Scott and Wild (Chapter 8). They consider the specific kinds of sampling design that are used in case–control studies. A basic feature of these studies is the stratification of the sample by Y. In a case–control study, the binary response variable Y classifies individuals into two groups, the cases which possess some condition and for whom Y = 1, and the controls which do not possess the condition and for whom Y = 0. Scott and Wild give an example where the cases are children with meningitis and the controls are children without meningitis. The aim is to study how Y depends on a vector of explanatory variables via a logistic regression model, as above. Sampling typically involves independent samples from the cases and from the controls, often with very different sampling fractions if the cases are rare. The sampling fraction for rare cases may be

82 INTRODUCTION TO PART B



high, even 100%, whereas the sampling fraction for controls may be much lower. In Scott and Wild's example the ratio is 400 to 1.

Pseudo-MLE, weighting inversely by the probabilities of selection, provides one approach to consistent estimation of β in case–control studies, provided these probabilities are available, at least up to a scale factor. Alternatively, as Scott and Wild discuss, a special property of case–control sampling and the logistic regression model is that the MLE, obtained from solving (6.5) above and ignoring the unequal probabilities of selection, provides consistent estimates of the elements of β other than the intercept (if the model holds).
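This special property of case–control sampling can be checked by simulation. The sketch below is my own illustration, not from the chapter: it draws a finite population with rare cases, takes all cases and an equal number of controls, and fits the unweighted MLE. The slope is recovered, while the intercept is shifted by the log of the ratio of sampling fractions:

```python
import numpy as np

def fit_logistic(X, y, iters=30):
    # Unweighted MLE: Newton-Raphson on the logistic score equations.
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        info = (X * (mu * (1.0 - mu))[:, None]).T @ X
        beta = beta + np.linalg.solve(info, X.T @ (y - mu))
    return beta

rng = np.random.default_rng(1)
N = 200_000
x = rng.normal(size=N)
p = 1.0 / (1.0 + np.exp(-(-3.0 + 1.0 * x)))     # true model: rare cases
y = (rng.uniform(size=N) < p).astype(float)

cases = np.flatnonzero(y == 1.0)                # sampling fraction 1 for cases
controls = rng.choice(np.flatnonzero(y == 0.0), size=cases.size, replace=False)
s = np.concatenate([cases, controls])
Xs = np.column_stack([np.ones(s.size), x[s]])
beta_cc = fit_logistic(Xs, y[s])
# The slope estimate is close to 1.0; the intercept is offset by log(f1/f0),
# with f1 = 1 and f0 = cases.size / (N - cases.size).
```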

Scott and Wild refer to the pseudo-MLE as the design-weighted estimator and the MLE which ignores the sample selection as the sample-weighted estimator. They provide a detailed discussion of the relative merits of these two estimators for case–control studies. An introduction to this discussion is provided in the following section.

6.3.2. Some issues in weighting

The question of whether or not to use survey weights when fitting regression models, such as the logistic regression model above, has been widely discussed in the literature – see e.g. Pfeffermann (1993, 1996) and Korn and Graubard (1999, Ch. 4). Two key issues usually feature in these discussions. First, there is the issue of bias and efficiency – in certain senses the use of survey weights may reduce bias but also reduce efficiency. Second, there is the issue of the impact of model misspecification.

If the specified model is assumed correct and if sampling depends only upon the explanatory variables (i.e. the survey weights are unrelated to the response variable conditional upon the values of the explanatory variables), then it may be argued that the sampling will not induce bias in the unweighted estimators of the regression coefficients. The survey-weighted estimator will, however, typically have a greater variance than the unweighted estimator, and thus the latter estimator is preferred in this case (Pfeffermann, 1996). The loss of efficiency of the survey-weighted estimator will usually be of more concern, the greater the variation in the survey weights.

On the other hand, if sampling depends upon factors which may be related to the response variable, even after conditioning on the explanatory variables, then sampling may induce bias in the unweighted estimator. The use of survey weights provides a simple method to protect against this potential bias. There may remain an issue of trade-off in this case between bias and efficiency. The case–control setting with logistic regression is unusual, since the unweighted estimator does not suffer from bias, other than in the intercept, despite sampling being directly related to the response variable.

It is desirable to use diagnostic methods when specifying regression models, so that the selected model fits the data well. Nevertheless, however much care we devote to specifying a model well, it will usually be unreasonable to assume that the model is entirely correct. It is therefore desirable to protect against the possibility of some misspecification. A potential advantage of the



survey-weighted estimator is that, even under misspecification, it is consistent for a clearly defined census parameter β_U, which solves (6.6) in the case of logistic regression. This census parameter may be interpreted as defining a logistic regression model which best approximates the actual model in a certain sense in the population. Scott and Wild refer to this as the 'whole population' approximation. In the same way, the unweighted estimator will be design consistent for another value of β, the one which solves

Σ_{t∈U} π_t u_t(β) = 0,  (6.7)

where π_t is the inclusion probability of unit t. This parameter defines a logistic regression model which approximates the actual model in a different sense. As a result, the issue of whether survey weighting reduces bias is replaced by the question of which approximation is more scientifically sensible. A conceptual attraction of the whole population approximation is that it does not depend upon the sampling scheme. As indicated by (6.7), the second approximation depends upon the π_t, which seems unattractive since it is conceptually unnatural for the parameter of interest to depend upon the sampling scheme. There may, however, be other countervailing arguments. Consider, for example, a longitudinal survey to study the impact of birthweight on educational attainment at age 7. The survey designer might deliberately oversample babies of low birthweight and undersample babies of normal birthweight in order to achieve a good spread of birthweights for fitting the regression relationship. It may therefore be scientifically preferable to fit a model which best approximates the true relationship across the sample distribution of birthweights rather than across the population distribution. In the latter case the approximation will be dominated by the fit of the model for babies of normal birthweight (cf. Korn and Graubard, 1999, section 4.3). Further issues, relating specifically to binary response models, are raised by Scott and Wild. In their application they show that the whole population approximation only provides a good approximation to the actual regression relationship for values in the tail of the covariate distribution, whereas the second approximation provides a good approximation for values of the covariate closer to the centre of the population distribution. They conclude that the second approximation 'represents what is happening in the population better than' the whole population approximation and that the unweighted MLE is more robust to model misspecification.
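The difference between the two approximations can be made concrete with a small simulation (my own sketch, not from the chapter): the true relationship is deliberately non-logistic in x, and the two census parameters are obtained by solving (6.6) and (6.7), i.e. with unit weights and with weights π_t respectively:

```python
import numpy as np

def solve_weighted_score(X, y, w, iters=40):
    # Root of sum_t w_t [y_t - mu_t(beta)] x_t = 0 (Newton-Raphson).
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        info = (X * (w * mu * (1.0 - mu))[:, None]).T @ X
        beta = beta + np.linalg.solve(info, X.T @ (w * (y - mu)))
    return beta

rng = np.random.default_rng(2)
N = 50_000
x = rng.normal(size=N)
logit = -1.0 + 0.3 * x + 0.4 * x**2            # true model is not linear in x
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-logit))).astype(float)
X = np.column_stack([np.ones(N), x])

pi = np.clip(0.02 + 0.08 * x**2, 0.0, 1.0)     # oversample the tails of x

beta_whole = solve_weighted_score(X, y, np.ones(N))  # solves (6.6)
beta_pi = solve_weighted_score(X, y, pi)             # solves (6.7)
# The two best-approximating logistic models differ because the working
# model is misspecified; under a correct model they would coincide.
```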

Issues of weighting and alternative approaches to correcting for nonignorable sampling schemes are discussed further in Sections 9.3 and 11.6 and in Chapter 12.




CHAPTER 7

Analysis of Categorical Response Data from Complex Surveys: an Appraisal and Update

J. N. K. Rao and D. R. Thomas

7.1. INTRODUCTION

Statistics texts and software packages used by researchers in the social and health sciences and in other applied fields describe and implement classical methods of analysis based on the assumption of independent identically distributed data. In most cases, there is little discussion of the analysis difficulties posed by data collected from complex sample surveys involving clustering, stratification and unequal probability selection. However, it has now been well documented that applying classical statistical methods to such data without making allowance for the survey design features can lead to erroneous inferences. In particular, ignoring the survey design can lead to serious underestimation of standard errors of parameter estimates and associated confidence interval coverage rates, as well as inflated test levels and misleading model diagnostics. Rapid progress has been made over the last 20 years or so in the development of methods that account for the survey design and as a result provide valid inferences, at least for large samples.

This chapter will focus on the complex survey analysis of cross-classified count data using log–linear models, and the analysis of binary or polytomous response data using logistic regression models. In Chapter 4 of the book edited by Skinner, Holt and Smith (1989), abbreviated SHS, we provided an overview of methods for analyzing cross-classified count data. The methods reviewed included Wald tests, Rao–Scott first- and second-order corrections to classical chi-squared and likelihood ratio tests, as well as residual analysis to detect the

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.


nature of deviations from the null hypothesis. Other accounts of these developments in survey data analysis have been published. For example, the book by Lehtonen and Pahkinen (1995) builds on an earlier review by the authors (Rao and Thomas, 1988), and the recent book by Lohr (1999) includes a description of methods for analyzing regression data and count data from complex samples. The objective of this chapter is to update our 1989 review and to present an appraisal of the current status of methods for analyzing survey data with categorical variables.

To make this chapter self-contained, we begin with an overview of Rao–Scott and Wald tests for log–linear models, together with a description of some approaches not covered in Chapter 4 of SHS. These will include the jackknife procedures developed by Fay (1985) and the Bonferroni procedures applied to complex survey data by Thomas (1989), Korn and Graubard (1990) and Thomas, Singh and Roberts (1996). These developments are based on large-sample theory, and it cannot be assumed that asymptotic properties will always hold for the 'effective' sample sizes encountered in practice. Therefore, recent simulation studies will be reviewed in Section 7.3 in order to determine which classes of test procedures should be recommended in practice. In Sections 7.4 and 7.5 we present a discussion of logistic regression methods for binary or polytomous response variables, including recent general approaches based on quasi-score tests (Rao, Scott and Skinner, 1998). To conclude, we discuss some applications of complex survey methodology to other types of problems, including data subject to misclassification, toxicology data on litters of experimental animals, and marketing research data in which individuals can respond to more than one item on a list. We describe how methods designed for complex surveys can be extended and/or applied to provide convenient test procedures.

7.2. FITTING AND TESTING LOG–LINEAR MODELS

7.2.1. Distribution of the Pearson and likelihood ratio statistics

Consider a multi-way cross-classification of categorical variables for a sample (or domain) of size n, in which the cells in the table are numbered lexicographically as i = 1, ..., I, where I is the total number of cells. Let π denote an I × 1 vector of finite population cell proportions, or cell probabilities under a superpopulation model, and let p̂ denote a corresponding design-consistent estimator of π. For any survey design, a consistent estimate of the elements of π is given by p̂_i = N̂_i/N̂, where the N̂_i represent weighted-up survey counts, i.e., consistent estimates of the population cell counts N_i, and N̂ = Σ_i N̂_i; published tables often report weighted-up counts. A log–linear model on the cell proportions (or probabilities) can be expressed as

μ(θ) = u(θ)1 + Xθ,  (7.1)

where μ(θ) = log π(θ) is the I-vector of log cell proportions. Here X is a known I × r matrix of full rank r ≤ I − 1, such that X′1 = 0, where 0 and 1 are vectors of




zeros and ones, respectively, θ is an r-vector of parameters and u(θ) is a normalizing factor that ensures that π(θ)′1 = 1.

For the case of simple random sampling with replacement (frequently referred to in this context as multinomial sampling), likelihood equations can be solved to yield the maximum likelihood estimates (MLEs) of θ. For general designs, however, this approach cannot be implemented due to the difficulty of defining appropriate likelihoods. Rao and Scott (1984) obtained instead a 'pseudo-MLE' θ̂ of θ by solving the estimating equations

X′π(θ) = X′p̂,  (7.2)

namely the multinomial likelihood equations in which unweighted cell proportions have been replaced by p̂. Consistency of p̂ under the complex survey design ensures the consistency of θ̂ as an estimator of θ, and consequently ensures the consistency of π̂ = π(θ̂) as an estimator of π under the log–linear model (7.1).

A general Pearson statistic for testing the goodness of fit of model (7.1) can be defined as

X²_P = n Σ_{i=1}^{I} [p̂_i − π_i(θ̂)]²/π_i(θ̂),  (7.3)

and a statistic corresponding to the likelihood ratio is defined as

X²_LR = 2n Σ_{i=1}^{I} p̂_i log[p̂_i/π_i(θ̂)].  (7.4)
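As a quick illustrative computation (my own sketch; the proportions are made up), (7.3) and (7.4) are direct to evaluate once the design-weighted estimates p̂ and the model-fitted proportions π̂ are in hand:

```python
import numpy as np

def pearson_lr_stats(p_hat, pi_hat, n):
    """X2_P and X2_LR of (7.3)-(7.4) from estimated and fitted cell proportions."""
    x2_p = n * np.sum((p_hat - pi_hat) ** 2 / pi_hat)
    x2_lr = 2.0 * n * np.sum(p_hat * np.log(p_hat / pi_hat))
    return x2_p, x2_lr

p_hat = np.array([0.30, 0.25, 0.25, 0.20])    # design-weighted estimates
pi_hat = np.array([0.28, 0.27, 0.24, 0.21])   # fitted under a log-linear model
x2_p, x2_lr = pearson_lr_stats(p_hat, pi_hat, n=500)
```

Both statistics are zero for a perfect fit and are close to each other when the fit is good, as here.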

Rao and Scott (1984) showed that for general designs, X²_P and X²_LR are asymptotically distributed as a weighted sum, δ_1W_1 + ... + δ_{I−r−1}W_{I−r−1}, of I − r − 1 independent χ²_1 random variables, W_i. The weights, δ_i, are eigenvalues of a general design effects matrix, given by

Δ = (Z̃′PZ̃)^{−1}(Z̃′VZ̃),  (7.5)

where

Z̃ = [I − X(X′PX)^{−1}X′P]Z  (7.6)

and Z is any matrix of dimension I × (I − r − 1) such that (1, X, Z) is non-singular and that satisfies Z′1 = 0. The matrix P is defined by P = n^{−1}[D(π) − ππ′], where D(π) = diag(π), the I × I diagonal matrix with diagonal elements π_1, ..., π_I. Thus P has the form of a multinomial covariance matrix, corresponding to the covariance matrix of p̂ under a simple random sampling design. In this case, however, p̂ is a vector of consistent estimates of cell proportions under a complex survey design, with asymptotic covariance matrix under the complex survey design denoted by V = Var(p̂). When the actual design is simple random, V = P, in which case Δ reduces to the identity matrix so that all δ_i = 1, and both X²_P and X²_LR recover their classical χ²_{I−r−1} asymptotic distributions. As illustrated in Chapter 4 of SHS using data from
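To make (7.5)–(7.6) concrete, here is a small numpy check (not from the chapter; the matrices are toy choices satisfying the stated conditions) that the generalized design effects all equal one under simple random sampling, and are inflated uniformly when V is a scalar multiple of P:

```python
import numpy as np

# Toy 4-cell table: X and Z satisfy X'1 = 0, Z'1 = 0 and (1, X, Z) non-singular.
n = 100
pi = np.array([0.4, 0.3, 0.2, 0.1])
X = np.array([[1.0], [-1.0], [1.0], [-1.0]])
Z = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, -1.0], [-1.0, 1.0]])

P = (np.diag(pi) - np.outer(pi, pi)) / n                          # multinomial covariance
Zt = (np.eye(4) - X @ np.linalg.solve(X.T @ P @ X, X.T @ P)) @ Z  # Z-tilde of (7.6)

def deffs(V):
    """Generalized design effects: eigenvalues of the matrix in (7.5)."""
    delta = np.linalg.solve(Zt.T @ P @ Zt, Zt.T @ V @ Zt)
    return np.sort(np.linalg.eigvals(delta).real)

srs = deffs(P)               # simple random sampling: V = P
clustered = deffs(3.0 * P)   # uniform variance inflation by a factor of 3
```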




the Canada Health Survey, the generalized design effects, δ_i, obtained from (7.5) can differ markedly from one when the survey design is complex.

The result (7.5) can also be extended to tests on nested models. Rao and Scott (1984) considered the case when Xθ of the model (7.1) is partitioned as Xθ = X_1θ_1 + X_2θ_2, with X_1 and X_2 of full column rank r_1 and r_2, respectively, with r_1 + r_2 = r ≤ I − 1. The Pearson statistic for testing the nested hypothesis H_0(2|1): θ_2 = 0 is given by

X²_P(2|1) = n Σ_{i=1}^{I} [π_i(θ̂) − π_i(θ̂_1)]²/π_i(θ̂_1),  (7.7)

where the π_i(θ̂_1) are the pseudo-MLEs under the reduced model μ = u(θ_1)1 + X_1θ_1, obtained from the pseudo-likelihood equations X_1′π(θ_1) = X_1′p̂. A general form of the likelihood ratio statistic can also be defined, namely

X²_LR(2|1) = 2n Σ_{i=1}^{I} π_i(θ̂) log[π_i(θ̂)/π_i(θ̂_1)].  (7.8)

The asymptotic distributions of X²_P(2|1) and X²_LR(2|1) are weighted sums of r_2 independent χ²_1 variables. The design effects matrix defining the weights has the form of Equation (7.5), with Z̃ replaced by X̃_2, where X̃_2 is defined as in Equation (7.6), with X replaced by X_1 and Z replaced by X_2.

7.2.2. Rao–Scott procedures

Under multinomial sampling, tests of model fit and tests of nested hypotheses are obtained by referring the relevant Pearson or likelihood ratio statistic to χ²_d(α), the upper 100α% point of a chi-squared distribution on d degrees of freedom, where d = I − r − 1 for the test of model fit and d = r_2 for the test of a nested hypothesis. Under complex sampling, alternatives to the classical testing strategy are required to deal with the design-induced distortion of the asymptotic distribution of the Pearson and likelihood ratio statistics. The adjusted chi-squared procedures of Rao and Scott (1981, 1984) are described in this section, while Sections 7.2.3, 7.2.4 and 7.2.5 consider Wald tests and their variants, tests based on the Bonferroni inequality, and Fay's jackknifed tests, respectively.

First-order Rao–Scott procedures
The first-order Rao–Scott strategy is based on the observation that X²_P/δ̄ and X²_LR/δ̄ (where δ̄ is the mean of the generalized design effects, δ_i) have the same first moment as χ²_d, the asymptotic distribution of the Pearson statistic under multinomial sampling. Thus a first-order Rao–Scott test of model fit refers

X²_P(δ̂̄) = X²_P/δ̂̄  (7.9)

to the upper 100α% point of χ²_d, where d = I − r − 1, and δ̂̄ is a consistent estimator of δ̄ given by




δ̂̄ = tr(Δ̂)/d.  (7.10)

Here Δ̂ is obtained from Equation (7.5), with P replaced by its estimate P̂, given by n^{−1}[diag(π(θ̂)) − π(θ̂)π(θ̂)′], and with V replaced by V̂, the consistent estimator of the covariance matrix of p̂ under the actual design (or superpopulation model). The tests based on X²_P(δ̂̄) and X²_LR(δ̂̄) are asymptotically correct when the individual design effects, δ_i, are equal, and simulation evidence to be discussed later confirms that they work well in practice when the variation among the δ_i is small. The second-order Rao–Scott procedure outlined below is designed for cases when the variation among the design effects is expected to be appreciable.
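A minimal sketch of the first-order test of (7.9)–(7.10), assuming estimated generalized design effects are already available (the function name and inputs are illustrative):

```python
import numpy as np
from scipy.stats import chi2

def first_order_test(x2, delta_hat, alpha=0.05):
    """First-order Rao-Scott test of (7.9)-(7.10)."""
    d = delta_hat.size
    delta_bar = delta_hat.mean()      # tr(Delta-hat)/d
    x2_corr = x2 / delta_bar
    return x2_corr, bool(x2_corr > chi2.ppf(1.0 - alpha, d))

x2_corr, reject = first_order_test(21.0, np.array([1.5, 2.0, 2.5]))
```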

One of the main advantages of the first-order procedures is that they do not always require the full estimate, V̂, of the covariance matrix. The trace function of Equation (7.10) can often be expressed in closed form as a function of the estimated design effects of specific cell and marginal probabilities, which may be available from published data. Closed form expressions are given in section 4.3 of SHS for a simple goodness-of-fit test on a one-way array, the test of homogeneity of vectors of proportions, the test of independence in a two-way table, as well as for specific tests on log–linear models for multi-way tables for which closed form MLEs exist. In the latter case, δ̂̄ depends only on the estimated cell and marginal design effects. Moreover, if the cell and marginal design effects are all equal to D, then δ̂̄ reduces to D provided the model is correct (Nguyen and Alexander, 1989). For testing association in an A × B contingency table, Shin (1994) studied the case where only the estimated marginal row and column design effects, D̂_a(1) and D̂_b(2), are available, and proposed an approximation to δ̂̄, denoted δ̂̄_min, that depends only on D̂_a(1) and D̂_b(2), a = 1, ..., A; b = 1, ..., B. For δ̂̄, we have (under independence) the closed form expression

(A − 1)(B − 1)δ̂̄ = Σ_aΣ_b (1 − p̂_{a·}p̂_{·b}) d̂_{ab} − Σ_a (1 − p̂_{a·}) D̂_a(1) − Σ_b (1 − p̂_{·b}) D̂_b(2),  (7.11)

while Shin's (1994) approximation takes the form

(A − 1)(B − 1)δ̂̄_min = Σ_aΣ_b (1 − p̂_{a·}p̂_{·b}) min{D̂_a(1), D̂_b(2)} − Σ_a (1 − p̂_{a·}) D̂_a(1) − Σ_b (1 − p̂_{·b}) D̂_b(2),  (7.12)

where p̂_{ab}, p̂_{a·} and p̂_{·b} are the estimated cell, row and column proportions, and d̂_{ab} is the estimated cell design effect of p̂_{ab}. This approximation turned out to be quite close to the true δ̄ and performed better than the simpler approximation based only on the marginal design effects proposed earlier by Holt, Scott and Ewings (1980) when applied to some survey data obtained from a longitudinal study of drinking behaviour among US women living in non-institutional settings.




Second-order Rao–Scott procedures
When an estimate of the full covariance matrix V = Var(p̂) is available, a more accurate correction to X²_P or X²_LR can be obtained by matching the first two moments of the weighted sum of χ²_1 random variables to a single chi-squared distribution. Unlike the first-order correction, the second-order correction accounts for variation among the generalized design effects, δ_i. For a test of model fit, it is implemented by referring

X²_P(δ̂̄, â²) = X²_P(δ̂̄)/(1 + â²)  and  X²_LR(δ̂̄, â²) = X²_LR(δ̂̄)/(1 + â²)  (7.13)

to the upper 100α% point of a χ²_{d̃} random variable with fractional degrees of freedom d̃ = d/(1 + â²), where â² is a measure of the variation among the design effects, given by the expressions

â² = d tr(Δ̂²)/[tr(Δ̂)]² − 1 = Σ_{i=1}^{d} (δ̂_i − δ̂̄)²/(d δ̂̄²).  (7.14)

From Equation (7.14), it can be seen that â has the form of a coefficient of variation. The simulation results of Thomas and Rao (1987) for simple goodness of fit showed that second-order corrected tests do a better job of controlling Type I error when the variation among design effects is appreciable. The simulation results for tests of independence in two-way tables discussed in Section 7.4 confirm this conclusion.
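The second-order correction of (7.13)–(7.14) is a one-liner on top of the first-order one; the sketch below (interface mine) returns the corrected statistic and the fractional degrees of freedom:

```python
import numpy as np
from scipy.stats import chi2

def second_order_test(x2, delta_hat, alpha=0.05):
    """Second-order Rao-Scott correction of (7.13)-(7.14)."""
    d = delta_hat.size
    delta_bar = delta_hat.mean()
    a2 = np.sum((delta_hat - delta_bar) ** 2) / (d * delta_bar ** 2)  # (7.14)
    x2_corr = (x2 / delta_bar) / (1.0 + a2)                           # (7.13)
    df = d / (1.0 + a2)                       # fractional degrees of freedom
    return x2_corr, df, bool(x2_corr > chi2.ppf(1.0 - alpha, df))
```

With equal design effects, â² = 0 and the procedure reduces to the first-order test.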

F-based variants of the Rao–Scott procedures
For a simple goodness-of-fit test, a heuristic large-sample argument was given by Thomas and Rao (1987) to justify referring F-based versions of the first- and second-order Rao–Scott tests to F-distributions. An extension of this argument to the log–linear case yields the following variants.

F-based first-order tests  Refer

FX²_P(δ̂̄) = X²_P(δ̂̄)/d  (7.15)

to an F-distribution on d and νd degrees of freedom, where ν is the degrees of freedom for estimating the covariance matrix V̂. The statistic FX²_LR(δ̂̄) is defined similarly.

F-based second-order tests  Denoted FX²_P(δ̂̄, â²) and FX²_LR(δ̂̄, â²), these second-order procedures are similar to the first-order F-based tests in that they use the same test statistics, namely X²_P(δ̂̄)/d and X²_LR(δ̂̄)/d, but refer them instead to an F-distribution on d̃ and νd̃ degrees of freedom, again with d̃ = d/(1 + â²).

Rao–Scott procedures for nested hypotheses
For testing nested hypotheses on log–linear models, first-order and second-order Rao–Scott procedures, similar to (7.9) and (7.13), are reported in




section 4.4 of SHS. By analogy with the above strategy, F-based versions of these tests can also be defined.

7.2.3. Wald tests of model fit and their variants

Log–linear form
Wald tests, like the second-order Rao–Scott procedures, cannot be implemented without an estimate, V̂, of the full covariance matrix of p̂ under the actual survey design. Wald tests can be constructed for all of the cases considered in Chapter 4 of SHS, and in theory Wald tests provide a general solution to the problem of the analysis of survey data. In practice, however, they have some serious limitations, as will be discussed. As an example, we consider a Wald test of goodness of fit of the log–linear model (7.1). The saturated model corresponding to model (7.1) can be written as

μ(θ, φ) = u(θ, φ)1 + Xθ + Zφ,  (7.16)

where Z is an I × (I − r − 1) matrix as defined earlier (with the added constraint that Z′X = 0), and φ is a vector of (I − r − 1) parameters. The goodness-of-fit hypothesis can be expressed as H_0: φ = 0, which is equivalent to the hypothesis Z′ log(π) = 0. The corresponding log–linear Wald statistic can therefore be written as

X²_W(LL) = φ̂′[V̂(φ̂)]^{−1}φ̂,  (7.17)

where φ̂ = (Z′Z)^{−1}Z′ log(p̂) and the estimated covariance matrix of φ̂ is given by

V̂(φ̂) = (Z′Z)^{−1}Z′D(p̂)^{−1}V̂D(p̂)^{−1}Z(Z′Z)^{−1},  (7.18)

where D(p̂) = diag(p̂). An asymptotically correct Wald test is obtained by referring X²_W(LL) to the upper α-point of a χ²_{I−r−1} random variable.

Residual form
For a test of independence on an A × B table, the null hypothesis of independence can be expressed in residual form as

H_0: h_{ab} = π_{ab} − π_{a·}π_{·b} = 0, a = 1, ..., (A − 1); b = 1, ..., (B − 1).  (7.19)

Rao and Thomas (1988) described a corresponding residual form of the Wald statistic, given by

X²_W(R) = ĥ′[V̂(ĥ)]^{−1}ĥ,  (7.20)

where the elements of ĥ are defined by the estimated residuals ĥ_{ab} = p̂_{ab} − p̂_{a·}p̂_{·b}, and V̂(ĥ) is a consistent estimate of the covariance matrix of ĥ under the general survey design. Residual forms of the Wald test can be constructed for testing the fit of any log–linear model, by using the expressions given by Rao and Scott (1984) for the variances of estimated residuals for log–linear models of the form (7.1). Thus X²_W(LL) and X²_W(R) are direct competitors.
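The residual-form statistic (7.20) is just a quadratic form in the estimated residuals. A sketch (the numbers are made up; in practice V̂(ĥ) would come from a design-based variance estimator):

```python
import numpy as np
from scipy.stats import chi2

def residual_wald(h_hat, V_h):
    """Residual-form Wald statistic (7.20) with a chi-squared reference."""
    x2_w = float(h_hat @ np.linalg.solve(V_h, h_hat))
    return x2_w, float(chi2.sf(x2_w, h_hat.size))

h_hat = np.array([0.03, -0.01])                  # estimated residuals
V_h = np.array([[4e-4, 1e-4], [1e-4, 3e-4]])     # estimated covariance of h-hat
x2_w, p_value = residual_wald(h_hat, V_h)
```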




As discussed in Section 7.3 on simulation studies, both versions of the Wald test are very susceptible to serious Type I error inflation even for moderate-sized tables, and should be used in practice only for small tables and large values of ν, the degrees of freedom for estimating V̂.

F-based Wald procedures
Thomas and Rao (1987) described an F-based version of the Wald test that exhibits better control of test levels at realistic values of ν. For a Wald test of model fit, this procedure consists of referring

FX²_W = (ν − d + 1)X²_W/(νd)  (7.21)

to an F-distribution on d and ν − d + 1 degrees of freedom, where d = I − r − 1 for the test of fit of model (7.1). The FX²_W test can be justified under the heuristic assumption that V̂ has a Wishart distribution. F-based versions of X²_W(LL) and X²_W(R) will be denoted FX²_W(LL) and FX²_W(R), respectively. This simple strategy for taking account of the variation in V̂ improves the Type I error control of both procedures.
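The F correction of (7.21) is a simple rescaling; a sketch (interface mine) using scipy's F distribution:

```python
from scipy.stats import f as f_dist

def f_based_wald(x2_w, d, nu, alpha=0.05):
    """F-corrected Wald test of (7.21): refer to F on d and nu - d + 1 df."""
    f_stat = (nu - d + 1) * x2_w / (nu * d)
    return f_stat, bool(f_stat > f_dist.ppf(1.0 - alpha, d, nu - d + 1))

f_stat, reject = f_based_wald(x2_w=10.0, d=5, nu=50)
```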

Wald tests on the parameter vector
Scott, Rao and Thomas (1990) developed a unified weighted least squares approach to estimating θ in the log–linear model (7.1) and to developing Wald tests on θ. They used the following log–linear model:

log(p̂) = u(θ)1 + Xθ + ε,  (7.22)

where ε is the error vector with asymptotic mean 0 and singular asymptotic covariance matrix D(π)^{−1}V D(π)^{−1}. By appealing to the optimal theory for linear models having singular covariance matrices, Scott, Rao and Thomas (1990) obtained an asymptotically best linear unbiased estimator of θ and a consistent estimator of its asymptotic covariance matrix, together with Wald tests of general linear hypotheses on the model parameter vector θ.

7.2.4. Tests based on the Bonferroni inequality

Log–linear form
Thomas and Rao's (1987) simulation results show that the Type I error control of a Wald test rapidly worsens as the dimension, d, of the test increases. Since the hypothesis of goodness of fit of log–linear model (7.1), H_0: φ = 0, can be replaced by d simultaneous hypotheses, H_{0i}: φ_i = 0, i = 1, ..., d, it is natural to consider whether or not d simultaneous tests might provide better control than the Wald test, or even the F-based Wald tests. A log–linear Bonferroni procedure, Bf(LL), with asymptotic error rate bounded above by the nominal test level, α, is given by the rule

reject H_0 if |φ̂_i| > [v̂(φ̂_i)]^{1/2} t_ν(α/2d) for any i = 1, ..., d,  (7.23)




where φ̂_i is the survey estimate of φ_i, v̂(φ̂_i) denotes the estimated variance of φ̂_i, given by the ith diagonal element of the covariance matrix (7.18), and t_ν(α/2d) represents the upper 100(α/2d)% point of a t-distribution on ν degrees of freedom, namely the degrees of freedom for estimating the error variance. The use of a t-distribution in place of the asymptotic standard normal distribution is a heuristic device that was shown to yield much better control of overall error rates by Thomas (1989), in his study of simultaneous confidence intervals for proportions under cluster sampling.
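The rule (7.23) in code (a sketch; the inputs are illustrative estimates and standard errors):

```python
import numpy as np
from scipy.stats import t as t_dist

def bonferroni_test(phi_hat, se, nu, alpha=0.05):
    """Bonferroni rule (7.23): flag each |phi_i|/se_i against t_nu(alpha/2d)."""
    d = phi_hat.size
    t_crit = t_dist.ppf(1.0 - alpha / (2.0 * d), nu)
    return np.abs(phi_hat) / se > t_crit, t_crit

flags, t_crit = bonferroni_test(np.array([0.1, -0.9]), np.array([0.2, 0.2]), nu=30)
```

H_0 is rejected if any flag is set; the critical value grows with d, which is what bounds the overall error rate.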

Residual formThe hypothesis of goodness of fit of log�linear model (7.1) can also beexpressed in the residual form Since H0 can be replaced by the dsimultaneous hypotheses , a residual form of the Bonferronitest is given by

    |ĥ_i| ≥ t_ν(α/2d) [v̂(ĥ_i)]^(1/2)  for any i = 1, ..., d,     (7.24)

where v̂(ĥ_i) denotes the estimated variance of the residual ĥ_i.

Caveats  Wald procedures based on the log-linear form of the hypothesis are invariant to the choice of Z, provided it is of full column rank and satisfies Z'X = 0. Similarly, Wald procedures based on the residual form of the independence hypothesis are invariant to the choice of the residuals to be dropped from the specification of the null hypothesis, a total of A + B − 1 in the case of a two-way A × B table. Unfortunately, these invariances do not hold for the Bonferroni procedures. The residual procedure, Bf(R), can be modified to include all residuals, e.g., AB residuals in the case of a two-way table, but in this case there is a risk that the Bonferroni procedure will become overly conservative. An alternative to the log-linear Bonferroni procedure, Bf(LL), can be derived by using a Scheffé-type argument to get simultaneous confidence intervals on the u_i. By inverting these intervals, a convenient testing procedure of the form (7.23) can then be obtained, with the critical constant t_ν(α/2d) replaced by the corresponding Scheffé constant. Unfortunately, it can be shown that this procedure is very conservative. As an example, for a 3 × 3 table and large ν, the asymptotic upper bound on the Type I error of this procedure corresponding to a nominal 5% level is 0.2%, far too conservative in practice.

The only practical approach is for the analyst to recognize that the Bonferroni procedure Bf(LL) of Equation (7.23) yields results that are specific to the choice of Z. This is quite acceptable in many cases, since the choice of Z is frequently determined by the nature of the data. For example, orthogonal polynomial contrasts are often used to construct design matrices for ordinal categorical variables. Design matrices obtained by dropping a column from an identity matrix are usually recommended when the categories of a categorical variable represent comparisons to a control category.

FITTING AND TESTING LOG-LINEAR MODELS 93


7.2.5. Fay's jackknifed tests

In a series of papers, Fay (1979, 1984, 1985) proposed and developed jackknifed versions of X²_P and X²_LR, which can be denoted X_PJ and X_LRJ, respectively. The jackknifed tests are related to the Rao–Scott corrected tests, and can be regarded as an alternative method of removing, or at least reducing, the distortion in the complex survey distributions of X²_P and X²_LR characterized by the weights δ_i. Though more computationally intensive than other methods, Fay's procedures have the decided advantage of simplicity. Irrespective of the analytical model to be tested, and subject to some mild regularity conditions, all that is required is an algorithm for computing X²_P or X²_LR. Software for implementing the jackknife approach is available, and the simulation studies described in Section 7.3 have shown that jackknifed tests are competitive from the point of view of control of test levels, and of power. A brief outline of the procedure will be given here, together with an example.

The form of X_PJ and X_LRJ
It was noted in Section 7.2.1 that consistent estimators of the I finite population proportions in a general cross-tabulation can be obtained as ratios of the population count estimates, written in vector form as N̂ = (N̂_1, ..., N̂_I)'. Fay (1985) considered the class of sample replication methods represented by replicate estimates N̂(j, k), which satisfy a mild consistency condition and for which an estimator, V̂, of the covariance matrix of N̂ is given by

(7.25)

A number of replication schemes can be represented this way, including jackknife replicates for stratified cluster sampling, with j = 1, ..., J strata and n_j independently sampled clusters in stratum j. In this case the W(j, k) are given by (7.26), where it is assumed that N̂ can be represented in a form with components Z(j, k) that are i.i.d. within strata.
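For stratified cluster sampling, the usual delete-one-cluster jackknife covariance estimator takes the form V̂ = Σ_j [(n_j − 1)/n_j] Σ_k (N̂(j, k) − N̂)(N̂(j, k) − N̂)'. A sketch under that standard form follows (names are illustrative, and this standard form is an assumption, not a transcription of (7.25)):

```python
import numpy as np

def jackknife_cov(N_full, N_reps):
    """Stratified delete-one-cluster jackknife covariance estimate.

    N_full : (I,) full-sample estimate of the I cell counts.
    N_reps : dict mapping stratum j -> (n_j, I) array whose kth row is the
             replicate estimate with cluster k of stratum j deleted.
    Assumes the standard form
        V = sum_j (n_j - 1)/n_j * sum_k (N(j,k) - N)(N(j,k) - N)'.
    """
    I = len(N_full)
    V = np.zeros((I, I))
    for reps in N_reps.values():
        n_j = reps.shape[0]
        dev = reps - np.asarray(N_full)      # (n_j, I) replicate deviations
        V += (n_j - 1) / n_j * dev.T @ dev   # stratum contribution
    return V
```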

Let X²_P(j, k) and X²_P represent general chi-squared statistics based on the (j, k)th replicate and the full sample, respectively. The form of these statistics is given by Equation (7.3), except that the sample size n is replaced by the sum of the cell estimates in N̂(j, k) and N̂, respectively. Fay's (1985) jackknife statistic, X_PJ, is then defined as

(7.27)

where

(7.28)


(7.29)

and

(7.30)

In Equation (7.27), K^(1/2) is taken to be (K)^(1/2) for K positive, and 0 otherwise. An equivalent expression for X_LRJ, the jackknifed version of X²_LR, is obtained by replacing X²_P by X²_LR in Equations (7.27) through (7.30). Similarly, jackknifed statistics X_PJ(2|1) and X_LRJ(2|1) for testing a nested hypothesis can be obtained by replacing X²_P in the above equations by X²_P(2|1) and X²_LR(2|1), respectively.

The distribution of X_PJ and X_LRJ
Fay (1985) showed that X_PJ and X_LRJ are both asymptotically distributed as a function of weighted sums of d = I − r − 1 independent χ²_1 variables. The weights themselves are functions of the δ_i featured in the asymptotic distributions of X²_P and X²_LR. When the δ_i are all equal, the asymptotic distribution simplifies to

(7.31)

Numerical investigations and simulation experiments have shown that (7.31) is a good approximation to the distribution of the jackknifed statistics even when the δ_i are not equal. In practice, therefore, the jackknife test procedure consists of rejecting the null hypothesis whenever X_PJ (or X_LRJ) exceeds the upper 100α% critical value determined from (7.31).

Example 1
The following example is taken from the program documentation of PC-CPLX (Fay, 1988), a program written to analyze log-linear models using the jackknife methodology. The data come from an experimental analysis of the response rates of school teachers to a survey; thus the response variable (RESPONSE) was binary (response/no response). Two variables represent experimental treatments designed to improve response, namely PAY, a binary variable with levels paid/unpaid, and FOLLOWUP, with levels phone/mail. Other variables entered into the log-linear model analysis were RGRADE (elementary/secondary/combined) and STATUS (public/private). The schools from which the teachers were sampled were treated as clusters, and the 10 participating states were treated as separate strata for the public schools. All private schools were aggregated into an eleventh stratum. The reported analysis involves 2630 teachers; state strata contained an average of 20 clusters (schools) while the private school stratum contained 70 clusters. Thus the average cluster size was close to 10. Differential survey weights were not included in the analysis, so that the design effect can be attributed to the clustering alone.

The following three hierarchical log-linear models were fitted, specified below in terms of the highest-order interactions that they contain:



Model 1: {PAY FOLLOWUP RGRADE STATUS}, {RESPONSE RGRADE STATUS}
Model 2: {PAY FOLLOWUP RGRADE STATUS}, {RESPONSE RGRADE STATUS}, {RESPONSE PAY}
Model 3: {PAY FOLLOWUP RGRADE STATUS}, {RESPONSE RGRADE STATUS}, {RESPONSE PAY}, {RESPONSE FOLLOWUP}

Table 7.1 shows the results of two comparisons. The first compares model 3 to model 1, to test the joint relationship of RESPONSE to the two experimental variables PAY and FOLLOWUP. The second compares model 3 to model 2, to test the relationship between RESPONSE and FOLLOWUP alone. Results are shown for the jackknifed likelihood and first- and second-order Rao–Scott procedures for testing nested hypotheses. F-based versions of the Rao–Scott procedures are omitted in view of the large number of degrees of freedom for variance estimation (over 200) in this survey.

It can be seen from Table 7.1 that the p-values obtained from the uncorrected X²_P and X²_LR procedures are very low, in contrast to the p-values generated by the Fay and Rao–Scott procedures, which account for the effect of the stratified cluster design. The values of δ̂. for the two model comparisons are 4.32 and 4.10, for model 3 versus model 1 and model 3 versus model 2, respectively. This is indicative of a strong clustering effect. Thus the strong 'evidence' for the joint effect of the RESPONSE by PAY and RESPONSE by FOLLOWUP interactions provided by the uncorrected tests is erroneous. The comparison of model 3 to model 2 shows that, when the effect of the design is accounted for, there is marginal evidence for a RESPONSE by FOLLOWUP interaction alone. The p-values generated by Fay's jackknifed likelihood test (denoted X_LRJ(2|1) for nested hypotheses) and the Rao–Scott tests are quite similar. In this example, the value of â for the two-degree-of-freedom test of model 3 versus model 1 is very low at 0.09, so that the second-order Rao–Scott procedure differs very little from the first-order procedure. The nested comparison of model 3 to model 2 involves only one degree of freedom; thus the first-order Rao–Scott test is asymptotically valid in this case, and the second-order test is not applicable.

Table 7.1 Tests of nested log-linear models for the teacher response study.

                         Model 3 vs model 1             Model 3 vs model 2
Statistic                Value   Est. p-value (d = 2)   Value   Est. p-value (d = 1)
X²_P(2|1)                11.80   0.003                  11.35   < 0.001
X²_LR(2|1)               11.86   0.003                  11.41   < 0.001
X_LRJ(2|1)                0.27   0.276                   0.85   0.109
X²_P(2|1; δ̂.)             2.73   0.255                   2.76   0.097
X²_P(2|1; δ̂., â)          2.71   0.258                   n.a.   n.a.


7.3. FINITE SAMPLE STUDIES

The techniques reported in Section 7.2 rely for their justification on asymptotic results. Thus simulation studies of the various complex survey test procedures that have been proposed are essential before definitive recommendations can be made. Early simulation studies featuring tests on categorical data from complex surveys were reported by Brier (1980), Fellegi (1980), Koehler and Wilson (1986) and others, but these studies were limited in scope and designed only to examine the specific procedures developed by these authors. The first study designed to examine a number of competing procedures was that of Thomas and Rao (1987), who focused on testing the simple hypothesis of goodness of fit for a vector of cell proportions, i.e., H0: π = π0, under a mixture Dirichlet-multinomial model of two-stage cluster sampling. They found that the Rao–Scott family of tests generally provided good control of Type I error and adequate power. For large values of a, the coefficient of variation of the design effects δ_i, second-order corrections were best. First-order corrections were somewhat liberal in this case, FX²_P(δ̂.) providing better control than X²_P(δ̂.). Fay's jackknifed statistic, X_PJ, also performed well, but tended to be somewhat liberal for small numbers of clusters. They also found that while the Wald test of goodness of fit was extremely liberal and should not be used, the F-based version was a much better performer, providing adequate control of Type I error except for small numbers of clusters.

More recently, Thomas, Singh and Roberts (1996) undertook a large simulation study of tests of independence in two-way tables under cluster sampling, and included in their comparison a number of test procedures not included in the Thomas and Rao (1987) study. An independent replication of this study, which used a different method of simulating the cluster sampling, has been reported (in Spanish) by Servy, Hachuel and Wojdyla (1998). The results of both studies will be reviewed in this section.

7.3.1. Independence tests under cluster sampling

Thomas, Singh and Roberts (1996) developed most of their conclusions by studying a 3 × 3 table, and then verified their results on a 3 × 4 and a 4 × 4 table. They developed a flexible model of cluster sampling, based on a modified logistic normal distribution, that enabled them to study a variety of clustering situations under a partially controlled experimental design. All cluster sizes were set equal to 20, and sample sizes consisting of from 15 to 70 clusters were simulated. Servy, Hachuel and Wojdyla (1998) set all cluster sizes equal to 5, and examined 3 × 3 tables only.
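The modified logistic normal model is specific to the cited study, but the general idea of generating clustered multinomial counts from cluster-level random effects can be sketched as follows (illustrative only; this is not the authors' exact model):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_clustered_table(pi, n_clusters, cluster_size, sigma):
    """Generate cell counts for an I-cell table under a logistic-normal
    style cluster model: each cluster perturbs the baseline log-probabilities
    by i.i.d. N(0, sigma^2) effects, maps back via a softmax, and then
    draws a multinomial sample of the cluster's units.
    (Illustrative sketch, not the exact model of Thomas, Singh and
    Roberts (1996).)"""
    logits = np.log(np.asarray(pi, dtype=float))
    counts = np.zeros(len(pi), dtype=int)
    for _ in range(n_clusters):
        z = rng.normal(0.0, sigma, size=len(pi))    # cluster random effect
        p = np.exp(logits + z)
        p /= p.sum()                                # softmax -> cell probabilities
        counts += rng.multinomial(cluster_size, p)  # within-cluster draw
    return counts

# a 3x3 table (9 cells), 30 clusters of size 20
counts = simulate_clustered_table(np.full(9, 1/9), n_clusters=30,
                                  cluster_size=20, sigma=0.5)
```

Larger sigma induces stronger within-cluster correlation, and hence larger design effects in the resulting table.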

Test statistics
With the exception of Fay's statistic, studied by Thomas, Singh and Roberts (1996) but not by Servy, Hachuel and Wojdyla (1998), all versions of the procedures discussed in Section 7.2 were examined, together with the following:

FINITE SAMPLE STUDIES 97


(i) Conservative versions of the F-based Rao–Scott tests, in which FX²_P(δ̂.) and FX²_LR(δ̂.) are referred to F-distributions on d = (A − 1)(B − 1) and ν − d + 1 degrees of freedom, where A and B are the row and column dimensions of the table, respectively, and ν = L − 1, with L the number of clusters.

(ii) A heuristic modification of a Wald statistic suggested by Morel (1989), designed to improve test performance for small numbers of clusters. It is denoted X²_W(LL; M). Details are given in the technical report by Thomas, Singh and Roberts (1995).

(iii) A family of modified Wald statistics proposed by Singh (1985). These are variants obtained by first expressing the Wald statistic, having d = (A − 1)(B − 1) degrees of freedom, in terms of its spectral decomposition, and then discarding the d − T smallest eigenvalues (and corresponding eigenvectors) that account for no more than e of the original eigenvalue sum. The preset constant e is in the range 0.01 to 0.1. The test procedure consists of referring the statistics, denoted Q_T(LL) and Q_T(R), to the upper 100α% point of a chi-squared distribution on T degrees of freedom. F-based versions, FQ_T(LL) and FQ_T(R), can be defined as in Equation (7.21), with d replaced by T.

(iv) A second family of modified Wald statistics designed to increase the stability of Wald tests by adding to each eigenvalue of the kernel covariance matrix a small quantity that tends to zero as L → ∞. The small quantity is defined as k times the estimated standard error of the corresponding eigenvalue, where k is selected in the range 0.25 to 1.0. Under suitable assumptions, this results in a modified Wald statistic that is a scalar multiple of the original. These statistics are denoted X²_W(LL; EV1) and X²_W(R; EV1), and are referred to chi-squared distributions on d degrees of freedom. F-based versions, denoted FX²_W(LL; EV1) and FX²_W(R; EV1), can be obtained as before.

Further details are given in the technical report by Thomas, Singh and Roberts (1995).
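The truncation step in the Singh procedure (iii) can be sketched as follows, assuming the Wald statistic is written as the quadratic form h'V̂⁻¹h with estimated kernel V̂ (names are illustrative):

```python
import numpy as np

def singh_statistic(h, V, e=0.05):
    """Q_T: spectral truncation of the Wald statistic h' V^{-1} h.

    Discards the d - T smallest eigenvalues of V (and their eigenvectors)
    that together account for no more than a fraction e of the eigenvalue
    sum, then forms the quadratic form on the retained T components.
    Returns (Q_T, T). Illustrative sketch of Singh's (1985) idea.
    """
    lam, U = np.linalg.eigh(V)          # eigenvalues in ascending order
    cum_share = np.cumsum(lam) / lam.sum()
    # number of trailing (smallest) eigenvalues whose cumulative share <= e
    drop = int(np.searchsorted(cum_share, e, side='right'))
    lam_k, U_k = lam[drop:], U[:, drop:]
    proj = U_k.T @ h                    # coordinates on retained eigenvectors
    Q_T = float(np.sum(proj**2 / lam_k))
    return Q_T, lam_k.size
```

Discarding near-zero eigenvalues stabilizes the inverse, which is the source of the standard Wald test's erratic behavior when the covariance matrix is poorly estimated.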

7.3.2. Simulation results

Type I error control was estimated using empirical significance levels (ESLs), namely the proportion of simulation trials leading to rejection of the (true) null hypothesis of independence. Similarly, the empirical powers (EPs) of the various test procedures were estimated as the proportion of trials leading to rejection of the null when the simulations were generated under the alternative hypothesis. The results discussed below relate to the 5% nominal α-level.
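For example, an ESL and its Monte Carlo standard error can be computed from the simulated p-values as follows (a minimal sketch; the same function applied to trials generated under the alternative gives the EP):

```python
import numpy as np

def empirical_significance_level(p_values, alpha=0.05):
    """ESL: proportion of simulation trials rejecting a true null,
    together with its binomial Monte Carlo standard error."""
    p = np.asarray(p_values, dtype=float)
    esl = float(np.mean(p < alpha))
    mc_se = float(np.sqrt(esl * (1.0 - esl) / p.size))
    return esl, mc_se

esl, se = empirical_significance_level([0.01, 0.20, 0.03, 0.80], alpha=0.05)
```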

Though the main recommendations of both studies are very similar, there are some interesting differences. Servy, Hachuel and Wojdyla (1998) found that for almost all procedures, their ESL estimates of Type I error were lower than those of Thomas, Singh and Roberts (1996), and as a result they focused their recommendations on the relative powers of the competing procedures. In


contrast, Thomas, Singh and Roberts (1996) put primary emphasis on Type I error control. Some general conclusions are listed below, followed by a ranking of the performance of the viable test procedures. Differences between the two studies are specifically noted.

General conclusions

• The standard Wald procedures, X²_W(R) and X²_W(LL), yield seriously inflated ESLs, a finding that is consistent with the earlier study by Thomas and Rao (1987). Further, their variants Q_T(·) and X²_W(·; EV1) are also seriously inflated. These procedures are not viable in practice.

• There are no differences of practical importance between ESLs for tests based on X²_P and tests based on X²_LR. Only the former are listed.

• ESLs for X_LRJ are lower than for X_PJ, the reverse of Thomas and Rao's (1987) finding for the goodness-of-fit test. The difference is small, however, less than one percentage point. For samples consisting of small numbers of clusters, ESLs for both X_PJ and X_LRJ are unacceptably inflated (> 9% for L = 15), confirming the earlier finding of Thomas and Rao (1987).

• Thomas, Singh and Roberts (1996) found that the F-based log-linear Wald procedure, FX²_W(LL), and the log-linear Bonferroni procedure, Bf(LL), provide better control of Type I error than their residual-based versions, FX²_W(R) and Bf(R). Servy, Hachuel and Wojdyla (1998) found the reverse to be true.

• Though they provide good control of Type I error, the powers of the residual statistics Bf(R) and FQ_T(R) are very low. They should not be used.

• Servy, Hachuel and Wojdyla (1998) noted erratic Type I error control for several test procedures for samples containing 15 clusters, the smallest number of clusters considered.

Rankings of the viable procedures

Table 7.2 gives rankings (based on Thomas, Singh and Roberts, 1996) of the test procedures providing adequate Type I error control. The best-performing tests appear at the top of their respective columns. Adequate control is defined as an ESL between 3% and 7%, for the nominal 5% test level. Similarly, only statistics exhibiting reasonable power levels are included.

ESLs in Table 7.2 are averages over three sets of cluster numbers (L = 15, 30 and 70) and all values of the other study parameters (see Thomas, Singh and Roberts, 1996), for two different groupings of a, the study parameter that has greatest impact on Type I error control. The power rankings shown in Table 7.2 are average EP values over L = 30 and 70 and over all other study parameters, for one specific form of the alternative hypothesis.

In their technical report, Servy, Hachuel and Wojdyla (1998) also provided a ranked list of 'best' procedures. Their list contains essentially the same procedures as Table 7.2, with two exceptions. First, the settings of the parameters e and k featured in the Singh and eigenvalue-adjusted procedures were different,


Table 7.2 Rankings of test performance, based on Type I error and power.

Average ESLs Average EPs

(a) For L > 30, to avoid error rate inflation for small L .

and second, Servy, Hachuel and Wojdyla (1998) included the Morel modified Wald procedure, X²_W(LL; M), in their list. They gave the second-order Rao–Scott statistic, X²_P(δ̂., â), a higher power ranking than did Thomas, Singh and Roberts (1996), but an important point on which both studies agreed is that the log-linear Bonferroni procedure, Bf(LL), is the most powerful procedure overall. Thomas, Singh and Roberts (1996) also noted that Bf(LL) provides the highest power and most consistent control of Type I error over tables of varying size (3 × 3, 3 × 4 and 4 × 4).

7.3.3. Discussion and final recommendations

The benefits of Bonferroni procedures in the analysis of categorical data from complex surveys were noted by Thomas (1989), who showed that Bonferroni simultaneous confidence intervals for population proportions, coupled with log or logit transformations, provide better coverage properties than competing procedures. This is consistent with the preeminence of Bf(LL) for tests of independence. Nevertheless, it is important to bear in mind the caveat that the log-linear Bonferroni procedure is not invariant to the choice of basis for the interaction terms in the log-linear model.

The runner-up to Bf(LL) is the Singh procedure, FQ_T(LL), which was highly rated in both studies with respect to power and Type I error control. However, some practitioners might be reluctant to use this procedure because of the difficulty of selecting the value of e. For example, Thomas, Singh and Roberts (1996) recommend e = 0.05, while Servy, Hachuel and Wojdyla (1998) recommend e = 0.1. Similar comments apply to the adjusted eigenvalue procedure, FX²_W(LL; EV1). Nevertheless, an advantage of FX²_W(LL; EV1) is that it is a simple multiple of the Wald procedure FX²_W(LL), and it is reasonable to recommend it as a stable improvement on FX²_W(LL).


If the uncertainties of the above procedures are to be avoided, the choice of test comes down to the Rao–Scott family, Fay's jackknifed tests, or the F-based Wald test FX²_W(LL). Table 7.2 can provide the necessary guidance, depending on the degree of variation among design effects. There is a choice of Rao–Scott procedures available whatever the variation among design effects. If full survey information is not available, then the first-order Rao–Scott tests might be the only option. Fay's jackknife procedures are viable alternatives when full survey information is available, provided the number of clusters is not small. These jackknifed tests are natural procedures to choose when survey variance estimation is based on a replication strategy. Finally, FX²_W(LL), the F-based Wald test based on the log-linear representation of the hypothesis, provides adequate control and relatively high power provided that the variation in design effects is not extreme. It should be noted that in both studies, all procedures derived from the F-based Wald test exhibited low power for small numbers of clusters, so some caution is required if these procedures are to be used.

7.4. ANALYSIS OF DOMAIN RESPONSE PROPORTIONS

Logistic regression models are commonly used to analyze the variation of subpopulation (or domain) proportions associated with a binary response variable. Suppose that the population of interest consists of I domains corresponding to the levels of one or more factors. Let N̂_i and N̂_1i be survey estimates of the ith domain size, N_i, and the ith domain total, N_1i, of a binary (0, 1) response variable, and let p_1i = N_1i/N_i. A consistent estimate of the vector of domain response proportions is denoted by p̂_1 = (p̂_11, ..., p̂_1I)', with p̂_1i = N̂_1i/N̂_i. The asymptotic covariance matrix of p̂_1, denoted V_1, is consistently estimated by V̂_1.

A logistic regression model on the response proportions is given by

    log[p_1i/(1 − p_1i)] = x_i'θ,  i = 1, ..., I,     (7.32)

where x_i is an m-vector of known constants derived from the factor levels, with x_i1 = 1, and θ is an m-vector of regression parameters.

The pseudo-MLE θ̂ is obtained by solving the estimating equations specified by Roberts, Rao and Kumar (1987), namely

    X'D(N̂)[p̂_1 − p_1(θ)] = 0,     (7.33)

where 0 is the m-vector of zeros, X = (x_1, ..., x_I)', D(N̂) = diag(N̂_1, ..., N̂_I), and p_1(θ) is the I-vector with elements p_1i(θ).
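The estimating equations (7.33) can be solved by Newton–Raphson. A sketch follows, assuming the weighted-score form Σ_i N̂_i [p̂_1i − p_1i(θ)] x_i = 0 with the logistic link (7.32); names are illustrative:

```python
import numpy as np

def fit_domain_logit(X, N_hat, p1_hat, n_iter=25):
    """Pseudo-MLE for the domain-level logistic model (7.32), solving
    sum_i N_hat_i * (p1_hat_i - p1_i(theta)) * x_i = 0 by Newton-Raphson.

    X      : (I, m) design matrix (rows x_i, first column all ones)
    N_hat  : (I,) estimated domain sizes
    p1_hat : (I,) survey-estimated domain response proportions
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))   # p1(theta) under (7.32)
        score = X.T @ (N_hat * (p1_hat - p))   # estimating equations
        W = N_hat * p * (1.0 - p)              # Fisher-type weights
        H = X.T @ (W[:, None] * X)             # expected information
        theta += np.linalg.solve(H, score)
    return theta

# two equal-sized domains, intercept-only model: the solution is the
# logit of the (weighted) mean proportion
X = np.array([[1.0], [1.0]])
theta = fit_domain_logit(X, np.array([100.0, 100.0]), np.array([0.3, 0.5]))
```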

Equations (7.33) are obtained from the likelihood equations under independent binomial sampling by replacing n_i by N̂_i and the sample proportion n_1i/n_i by the ratio estimator p̂_1i, where n_i and n_1i are the sample size and the sample total of the binary response variable from the ith domain. A general Pearson statistic for testing the goodness of fit of the model (7.32) is given by

ANALYSIS OF DOMAIN RESPONSE PROPORTIONS 101


    X²_P1 = Σ_i n_i [p̂_1i − p_1i(θ̂)]² / {p_1i(θ̂)[1 − p_1i(θ̂)]},     (7.34)

and a statistic corresponding to the likelihood ratio is given by

    X²_LR1 = 2 Σ_i n_i { p̂_1i log[p̂_1i/p_1i(θ̂)] + (1 − p̂_1i) log[(1 − p̂_1i)/(1 − p_1i(θ̂))] }.     (7.35)
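Both statistics can be computed directly from the domain quantities. A sketch, assuming the standard binomial forms of the Pearson and likelihood ratio statistics with n_i the domain sample sizes (names illustrative):

```python
import numpy as np

def domain_fit_statistics(n, p_hat, p_fit):
    """X2_P1 and X2_LR1 for a domain-level logistic fit, assuming the
    standard binomial forms of the two statistics.

    n     : domain sample sizes
    p_hat : survey-estimated domain proportions
    p_fit : model-fitted proportions p1_i(theta_hat)
    """
    n, p_hat, p_fit = map(np.asarray, (n, p_hat, p_fit))
    x2p = float(np.sum(n * (p_hat - p_fit) ** 2 / (p_fit * (1.0 - p_fit))))
    with np.errstate(divide='ignore', invalid='ignore'):
        # guard the log terms at p_hat = 0 or 1, where they vanish
        terms = (np.where(p_hat > 0, p_hat * np.log(p_hat / p_fit), 0.0)
                 + np.where(p_hat < 1,
                            (1 - p_hat) * np.log((1 - p_hat) / (1 - p_fit)),
                            0.0))
    x2lr = float(2.0 * np.sum(n * terms))
    return x2p, x2lr
```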

Roberts, Rao and Kumar (1987) showed that both X²_P1 and X²_LR1 are asymptotically distributed under the model as a weighted sum, Σ δ_1i W_i, of independent χ²_1 variables W_i, where the weights δ_1i, i = 1, ..., I − m, are eigenvalues of a generalized design effects matrix, Δ_1, given by

(7.36)

where

(7.37)

Here Z_1 is any I × (I − m) full rank matrix that satisfies Z_1'X = 0, and D(a) denotes the diagonal matrix with diagonal elements a_i. Under independent binomial sampling, Δ_1 = I and hence δ_1i = 1 for all i, so that both X²_P1 and X²_LR1 are asymptotically χ²_{I−m}. A first-order Rao–Scott correction refers

    X²_P1(δ̂_1.) = X²_P1/δ̂_1.     (7.38)

to the upper 100α% point of χ²_{I−m}, where

    δ̂_1. = Σ_i δ̂_1i/(I − m),     (7.39)

and Δ̂_1 is given by (7.36) with p_1 replaced by p̂_1 and V_1 by V̂_1. Similarly, a second-order Rao–Scott correction refers

    X²_P1(δ̂_1., â_1) = X²_P1(δ̂_1.)/(1 + â_1²)     (7.40)

to the upper 100α% point of a χ² with degrees of freedom ν_1 = (I − m)/(1 + â_1²), where â_1 is the coefficient of variation of the δ̂_1i. A simple conservative test, depending only on the estimated domain design effects d̂_1i, is obtained by replacing δ̂_1. by an upper bound d̂. computed from the d̂_1i. The modified first-order corrections X²_P1(d̂.) and X²_LR1(d̂.) can be readily implemented from published tables provided the estimated proportions and their design effects are known.

Roberts, Rao and Kumar (1987) also developed first-order and second-order corrections for nested hypotheses, as well as model diagnostics to detect any outlying domain response proportions and influential points in the factor space, taking account of the design features. They also obtained a linearization estimator of Var(θ̂) and the standard errors of 'smoothed' estimates p_1i(θ̂).
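Given estimated eigenvalues δ̂_1i, the first- and second-order corrections follow the familiar Rao–Scott recipe. A sketch, assuming the standard correction formulas (names illustrative):

```python
import numpy as np
from scipy.stats import chi2

def rao_scott_corrections(x2, deltas, alpha=0.05):
    """First- and second-order Rao-Scott corrections of a chi-squared
    statistic x2, given the estimated eigenvalue design effects deltas.
    Returns the p-values of the two corrected tests (standard Rao-Scott
    form; a sketch, not a transcription of (7.38)-(7.40))."""
    deltas = np.asarray(deltas, dtype=float)
    df = deltas.size                       # I - m in the notation above
    d_bar = deltas.mean()                  # first-order correction factor
    a2 = deltas.var() / d_bar**2           # squared CV of the deltas
    x2_first = x2 / d_bar
    p_first = float(chi2.sf(x2_first, df))
    x2_second = x2_first / (1.0 + a2)      # second-order (Satterthwaite)
    nu = df / (1.0 + a2)                   # adjusted degrees of freedom
    p_second = float(chi2.sf(x2_second, nu))
    return p_first, p_second
```

When the δ̂_1i are all equal the two corrections coincide, as the chapter notes for the one-degree-of-freedom case.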


Example 2
Roberts, Rao and Kumar (1987) and Rao, Kumar and Roberts (1989) applied the previous methods to data from the October 1980 Canadian Labour Force Survey. The sample consisted of males aged 15–64 who were in the labour force and not full-time students. Two factors, age and education, were chosen to explain the variation in employment rates via logistic regression models. Age group levels were formed by dividing the interval [15, 64] into 10 groups, with the jth age group being the interval [10 + 5j, 14 + 5j] for j = 1, ..., 10, and then using the mid-point of each interval, A_j = 12 + 5j, as the value of age for all persons in that age group. Similarly, the levels of education, E_k, were formed by assigning to each person a value based on the median years of schooling, resulting in the following six levels: 7, 10, 12, 13, 14 and 16. The resulting age by education cross-classification provided a two-way table of I = 60 estimated cell proportions or employment rates, p̂_1jk, j = 1, ..., 10; k = 1, ..., 6.

A logistic regression model involving linear and quadratic age effects and a linear education effect provided a good fit to the two-way table of estimated employment rates, namely a model of the form

    log[p_1jk/(1 − p_1jk)] = θ_1 + θ_2 A_j + θ_3 A_j² + θ_4 E_k.

The following values were obtained for testing the goodness of fit of the above model: X²_P1 = 99.8, X²_LR1 = 102.5, X²_P1(·) = 23.4 and X²_LR1(·) = 23.9.

If the sample design is ignored and the value of X²_P1 or X²_LR1 is referred to the upper 5% point of χ² with I − m = 56 degrees of freedom, the model with linear education effect and linear and quadratic age effects could be rejected. On the other hand, the value of X²_P1(·) or X²_LR1(·), when referred to the upper 5% point of a χ² with the appropriate degrees of freedom, is not significant. The simple conservative test X²_P1(d̂.) or X²_LR1(d̂.), depending only on the domain design effects, is also not significant, and d̂. = 2.04 is close to the first-order correction δ̂_1. = 1.88.

Roberts, Rao and Kumar (1987) also applied their model diagnostic tools to the Labour Force Survey data. On the whole, their investigation suggested that the impact of cells flagged by the diagnostics is not large enough to warrant their deletion.

Example 3
Fisher (1994) also applied the previous methods to investigate whether the use of personal computers during interviewing by telephone (treatment) versus in-person on-paper interviewing (control) has an effect on labour force estimates. For this purpose, he used split panel data from the US Current Population Survey (CPS), obtained by randomly splitting the CPS sample into two panels and then administering the 'treatment' to respondents in one of the panels and the 'control' to those in the other panel. Three other binary factors, sex, race and ethnicity, were also included. A logistic regression model containing only the main effects of the four factors fitted the four-way table of estimated unemployment rates well. A test of the nested hypothesis that the


'treatment' main effect was absent given the model containing all four main effects was rejected, suggesting that the use of a computer during interviewing does have an effect on labour force estimates.

Rao, Kumar and Roberts (1989) studied several extensions of logistic regression. They extended the previous results to Box–Cox-type models involving power transformations of domain odds ratios, and illustrated their use on data from the Canadian Labour Force Survey. The Box–Cox approach would be useful in those cases where it could lead to additive models on the transformed scale while the logistic regression model would not provide as good a fit without interaction terms. Methods for testing equality of parameters in two logistic regression models, corresponding to consecutive time periods, were also developed and applied to data from the Canadian Labour Force Survey. Finally, they studied a general class of polytomous response models and developed Rao–Scott adjusted Pearson and likelihood ratio tests which they applied to data from the Canada Health Survey (1978–9).

In this section we have discussed methods for analyzing domain response proportions. We turn next to logistic regression analysis of unit-specific sample data.

7.5. LOGISTIC REGRESSION WITH A BINARY RESPONSE VARIABLE

Suppose that (x_t, y_t) denote the values of a vector of explanatory variables, x, and a binary response variable, y, associated with the tth population unit, t = 1, ..., N. We assume that for a given x_t, y_t is generated from a model with mean E(y_t) = μ_t(θ) = g(x_t, θ) and 'working' variance var(y_t) = v_0t = v_0(μ_t). Specifically, we assume a logistic regression model so that

μ_t(θ) = exp(x_t′θ) / {1 + exp(x_t′θ)}   (7.41)

and v_0(μ_t) = μ_t(1 − μ_t). Our interest is in estimating the parameter vector θ and in testing hypotheses on θ using the sample data {(x_t, y_t), t ∈ s}. A pseudo-MLE

θ̂ of θ is obtained by solving

û(θ) = Σ_{t∈s} w_ts u_t(θ) = 0   (7.42)

where the w_ts are survey weights that may depend on s, e.g., when the design weights, w_t, are adjusted for post-stratification, and (see Section 3.4.5 of SHS)

u_t(θ) = {y_t − μ_t(θ)} x_t.   (7.43)

The estimator θ̂ is a consistent estimator of the census parameter θ_N obtained as the solution of the census estimating equations Σ_{t=1}^{N} u_t(θ) = 0.
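For concreteness, the pseudo-MLE defined by (7.42) and (7.43) can be computed by survey-weighted Newton-Raphson iteration. A sketch in Python (our illustration with synthetic data and arbitrary weights, not code from the chapter):

```python
import numpy as np

def pseudo_mle_logistic(X, y, w, iters=25):
    """Solve sum_{t in s} w_t {y_t - mu_t(theta)} x_t = 0, as in (7.42)-(7.43),
    by Newton-Raphson.  X: n x p design matrix, y: binary responses,
    w: survey weights."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))          # mu_t(theta), Eq. (7.41)
        score = X.T @ (w * (y - mu))                   # weighted score u-hat(theta)
        J = (X * (w * mu * (1 - mu))[:, None]).T @ X   # weighted information matrix
        theta += np.linalg.solve(J, score)
    return theta

# synthetic population and hypothetical survey weights, for illustration only
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_theta = np.array([-1.0, 0.8])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ true_theta))).astype(float)
w = rng.uniform(0.5, 2.0, size=n)

theta_hat = pseudo_mle_logistic(X, y, w)
print(theta_hat)   # roughly recovers (-1.0, 0.8)
```

Because the weights here are generated independently of y given x, the weighted and unweighted fits estimate the same model; with informative weights the two would differ.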

Following Binder (1983), a linearization estimator of the covariance matrix V(θ̂) of θ̂ is given by

V̂_L(θ̂) = [J(θ̂)]⁻¹ V̂{û(θ̂)} [J(θ̂)]⁻¹   (7.44)

where J(θ̂) = −Σ_{t∈s} w_ts ∂u_t(θ̂)/∂θ′ is the observed information matrix and V̂{û(θ̂)} is the estimated covariance matrix of the estimated total û(θ) evaluated at

θ = θ̂. The estimator V̂{û(θ̂)} should take account of post-stratification and other adjustments. It is straightforward to obtain a resampling estimator of V(θ̂) that takes account of post-stratification adjustment. For stratified multi-stage sampling, a jackknife estimator of V(θ̂) is given by

V̂_J(θ̂) = Σ_{j=1}^{J} [(n_j − 1)/n_j] Σ_{k=1}^{n_j} (θ̂_(jk) − θ̂)(θ̂_(jk) − θ̂)′   (7.45)

where J is the number of strata, n_j is the number of sampled primary clusters from stratum j and θ̂_(jk) is obtained in the same manner as θ̂ when the data from the (jk)th sample cluster are deleted, but using jackknife survey weights w_ts(jk) (see Rao, Scott and Skinner, 1998). Bull and Pederson (1987) extended (7.44) to the case of a polytomous response variable, but without allowing for post-stratification adjustment as in Binder (1983). Again it is straightforward to define a jackknife variance estimator for this case.
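Once the replicate estimates θ̂_(jk) are in hand (each obtained by re-solving (7.42) with the jackknife weights, which we take as given here), assembling the jackknife covariance matrix of (7.45) is a few lines. A minimal sketch (our code, with toy replicates):

```python
import numpy as np

def jackknife_cov(theta_hat, theta_reps, strata):
    """Stratified delete-one-cluster jackknife, as in (7.45):
    V_J = sum_j ((n_j - 1)/n_j) sum_k (theta_(jk) - theta)(theta_(jk) - theta)'.
    theta_reps[k] is the estimate with the kth sampled cluster deleted;
    strata[k] is the stratum that cluster belongs to."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    strata = np.asarray(strata)
    V = np.zeros((theta_hat.size, theta_hat.size))
    for j in np.unique(strata):
        idx = np.flatnonzero(strata == j)
        n_j = idx.size
        for k in idx:
            d = np.asarray(theta_reps[k], dtype=float) - theta_hat
            V += (n_j - 1) / n_j * np.outer(d, d)
    return V

# toy example: two strata, two deleted-cluster replicates each
V = jackknife_cov([0.0, 0.0],
                  [[0.1, 0.0], [-0.1, 0.0], [0.0, 0.2], [0.0, -0.2]],
                  strata=[1, 1, 2, 2])
print(V)
```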

Wald tests of hypotheses on θ are readily obtained using either V̂_L(θ̂) or V̂_J(θ̂). For example, suppose θ′ = (θ₁′, θ₂′), where θ₁ is r₁ × 1 and θ₂ is r₂ × 1, with r₁ + r₂ = r, and we are interested in testing the nested hypothesis H₀: θ₂ = θ₂₀. Then the Wald statistic

X²_W = (θ̂₂ − θ₂₀)′ [V̂(θ̂₂)]⁻¹ (θ̂₂ − θ₂₀)   (7.46)

is asymptotically χ² with r₂ degrees of freedom under H₀, where V̂(θ̂₂) may be taken as either V̂_L(θ̂₂) or

V̂_J(θ̂₂). A drawback of Wald tests is that they are not invariant to non-linear transformations of θ. Further, one has to fit the full model (7.41) before testing H₀, which could be a disadvantage if the full model contains a large number, r, of parameters. This would be the case with a factorial structure of explanatory variables containing a large number of potential interactions, θ₂, if we wished to test for the absence of interaction terms, i.e., H₀: θ₂ = 0. Rao, Scott and Skinner (1998) proposed quasi-score tests to circumvent the problems associated with Wald tests. These tests are invariant to non-linear transformations of θ and we need only fit the simpler model under H₀, which is a considerable advantage if the dimension of θ is large, as noted above.

Let θ̃ = (θ̃₁′, θ₂₀′)′ be the solution of û₁(θ) = 0, where û(θ) = [û₁(θ)′, û₂(θ)′]′ is partitioned in the same manner as θ. Then the quasi-score test of H₀: θ₂ = θ₂₀ is based on the statistic

X²_QS = ũ₂′ [V̂(ũ₂)]⁻¹ ũ₂   (7.47)

where ũ₂ = û₂(θ̃) and V̂(ũ₂) is either a linearization estimator or a jackknife estimator.
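In code, the Wald statistic (7.46) and the quasi-score statistic (7.47) are the same quadratic form applied to different ingredients. A sketch (ours; the estimates and covariance matrix below are hypothetical numbers for illustration):

```python
import numpy as np

def quadratic_form_test(v, V):
    """Compute v' V^{-1} v, the common form of (7.46) (with
    v = theta2_hat - theta2_0 and V an estimate of its covariance) and
    (7.47) (with v = u2_tilde and V an estimate of its covariance).
    The result is referred to chi-squared with r2 = len(v) d.f."""
    v = np.asarray(v, dtype=float)
    return float(v @ np.linalg.solve(V, v))

# hypothetical theta2_hat - theta2_0 and covariance estimate (r2 = 2)
x2_w = quadratic_form_test([0.9, -0.4],
                           np.array([[0.04, 0.01], [0.01, 0.09]]))
print(x2_w)   # compare with the chi-squared(2) upper 5% point, 5.99
```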


The linearization estimator, V̂_L(ũ₂), is the estimated covariance matrix of the estimated total

Σ_{t∈s} w_ts {u_2t(θ) − B̂ u_1t(θ)}

evaluated at θ = θ̃ and B̂ = Ĵ₂₁(θ̃)[Ĵ₁₁(θ̃)]⁻¹, where

J(θ) = [ J₁₁(θ)  J₁₂(θ)
         J₂₁(θ)  J₂₂(θ) ]

corresponds to the partition of θ as (θ₁′, θ₂′)′. Again, V̂_L(ũ₂) should take account of post-stratification and other adjustments. The jackknife estimator V̂_J(ũ₂) is similar to (7.45), with θ̂ and θ̂_(jk) changed to ũ₂ and ũ₂(jk), where ũ₂(jk)

is obtained in the same manner as ũ₂, i.e., when the data from the (jk)th sample cluster are deleted, but using the jackknife survey weights w_ts(jk). Under H₀, X²_QS is asymptotically χ² with r₂ degrees of freedom, so that X²_QS provides a valid test of H₀.

A difficulty with both X²_W and X²_QS is that when the effective degrees of freedom for estimating V(θ̂₂) or V(ũ₂) are not large, the tests become unstable when the dimension of θ₂ is large, because of instability of [V̂(θ̂₂)]⁻¹ and [V̂(ũ₂)]⁻¹. To circumvent the degrees of freedom problem, Rao, Scott and Skinner (1998) proposed alternative tests including an F-version of the Wald test (see also Morel, 1989), and Rao–Scott corrections to naive Wald or score tests that ignore survey design features, as in the case of X²_P and X²_LR of Section 7.2, as well as Bonferroni tests. We refer the reader to Rao, Scott and Skinner (1998) for details.

It may be noted that with unit-level data {(x_t, y_t), t ∈ s} we can only test nested hypotheses on θ given the model (7.41), unlike the case of domain proportions which permits testing of model fit as well as nested hypotheses given the model.

7.6. SOME EXTENSIONS AND APPLICATIONS

In this section we briefly mention some recent applications and extensions of Rao–Scott and related methodology.

7.6.1. Classification errors

Rao and Thomas (1991) studied chi-squared tests in the presence of classification errors, which often occur with survey data. Following the important work of Mote and Anderson (1965) for multinomial sampling, they considered the cases of known and unknown misclassification probabilities for simple goodness of fit. In the latter case two models were studied: (i) the misclassification rate is the same for all categories; (ii) for ordered categories, misclassification only occurs between adjacent categories, at a constant rate. First- and second-order Rao–Scott corrections to Pearson statistics that account for the survey design


features were obtained. A model-free approach using two-phase sampling was also developed. In two-phase sampling, error-prone measurements are made on a large first-phase sample selected according to a specified design and error-free measurements are then made on a smaller subsample selected according to another specified design (typically SRS or stratified SRS). Rao–Scott corrected Pearson statistics were proposed under double sampling for both the goodness-of-fit test and the tests of independence in a two-way table. Rao and Thomas (1991) also extended Assakul and Proctor's (1967) method of testing of independence in a two-way table with known misclassification probabilities to general survey designs. They developed Rao–Scott corrected tests using a general methodology that can also be used for testing the fit of log-linear models on multi-way contingency tables. More recently, Skinner (1998) extended the methods of Section 7.4 to the case of longitudinal survey data subject to classification errors.

7.6.2. Biostatistical applications

Cluster-correlated binary response data often occur in biostatistical applications; for example, toxicological experiments designed to assess the teratogenic effects of chemicals often involve animal litters as experimental units. Several methods that take account of intracluster correlations have been proposed, but most of these methods assume specific models for the intracluster correlation, e.g., the beta-binomial model. Rao and Scott (1992) developed a simple method, based on conceptual design effects and effective sample size, that can be applied to problems involving independent groups of clustered binary data with group-specific covariates. It assumes no specific models for the intracluster correlation, in the spirit of Zeger and Liang (1986). The method can be readily implemented using any standard computer program for the analysis of independent binary data after a small amount of pre-processing.

Let n_ij denote the number of units in the jth cluster of the ith group, i = 1, ..., I, and let y_ij denote the number of 'successes' among the n_ij units, with Σ_j y_ij = y_i and Σ_j n_ij = n_i. Treating the y_i as independent binomial B(n_i, p_i) leads to erroneous inferences due to the clustering, where p_i is the success probability in the ith group. Denoting the design effect of the ith estimated proportion, p̂_i = y_i/n_i, by D̂_i, and the effective sample size by ñ_i = n_i/D̂_i, the Rao–Scott (1992) method simply treats ỹ_i = y_i/D̂_i as independent binomial B(ñ_i, p_i), which leads to asymptotically correct inferences. The method was applied to a variety of biostatistical problems; in particular, testing homogeneity of proportions, estimating dose response models and testing for trends in proportions, computing the Mantel–Haenszel chi-squared test statistic for independence in a series of 2 × 2 tables and estimating the common odds ratio and its variance when the independence hypothesis is rejected. Obuchowski (1998) extended the method to comparisons of correlated proportions.
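The pre-processing step just described can be sketched in a few lines. In the sketch below (our code; the data are invented, and the design effect is estimated with a standard with-replacement cluster variance formula, which is one of several possibilities), one group of clustered binary data is reduced to an adjusted pair (ỹ_i, ñ_i):

```python
import numpy as np

def rao_scott_adjust(clusters):
    """Rao-Scott (1992)-style adjustment for one group of clustered binary
    data.  clusters: list of (successes, size) pairs, one per cluster.
    Returns (y_tilde, n_tilde) = (y/D, n/D), where D is the estimated
    design effect of the group proportion."""
    y_ij = np.array([c[0] for c in clusters], dtype=float)
    n_ij = np.array([c[1] for c in clusters], dtype=float)
    m = len(clusters)                  # number of clusters
    y, n = y_ij.sum(), n_ij.sum()
    p = y / n                          # group proportion p_i
    # cluster-level variance estimate of p (with-replacement approximation)
    v = m / (m - 1) * np.sum((y_ij - n_ij * p) ** 2) / n ** 2
    deff = v / (p * (1 - p) / n)       # estimated design effect D_i
    return y / deff, n / deff

# invented clustered data: (successes, cluster size) for five clusters
y_t, n_t = rao_scott_adjust([(3, 5), (0, 4), (4, 4), (1, 5), (2, 6)])
print(y_t, n_t)   # then treat y_t as B(n_t, p_i) in standard software
```

Note that the adjustment leaves the estimated proportion unchanged (ỹ_i/ñ_i = y_i/n_i) and only shrinks the effective amount of information.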

Rao and Scott (1999a) proposed a simple method for analyzing grouped count data exhibiting overdispersion relative to a Poisson model. This method is similar to the previous method for clustered binary data.


7.6.3. Multiple response tables

In marketing research surveys, individuals are often presented with questions that allow them to respond to more than one of the items on a list, i.e., multiple responses may be observed from a single respondent. Standard methods cannot be applied to tables of aggregated multiple response data because of the multiple response effect, similar to a clustering effect in which each independent respondent plays the role of a cluster. Decady and Thomas (1999, 2000) adapted the first-order Rao–Scott procedure to such data and showed that it leads to simple approximate methods based on easily computed, adjusted chi-squared tests of simple goodness of fit and homogeneity of response probabilities. These adjusted tests can be calculated from the table of aggregate multiple response counts alone, i.e., they do not require knowledge of the correlations between the aggregate multiple response counts. This is not true in general; for example, the test of equality of proportions will require either the full dataset or an expanded table of counts in which each of the multiple response items is treated as a binary variable. Nevertheless, the first-order Rao–Scott approach is still considerably easier to apply than the bootstrap approach recently proposed by Loughin and Scherer (1998).


CHAPTER 8

Fitting Logistic Regression Models in Case–Control Studies with Complex Sampling

Alastair Scott and Chris Wild

8.1. INTRODUCTION

In this chapter we deal with the problem of fitting logistic regression models in surveys of a very special type, namely population-based case–control studies in which the controls (and sometimes the cases) are obtained through a complex multi-stage survey. Although such surveys are special, they are also very common. For example, we are currently involved in the analysis of a population-based case–control study funded by the New Zealand Ministry of Health and the Health Research Council looking at meningitis in young children. The study population consists of all children under the age of 9 in the Auckland region. There were about 250 cases of meningitis over the three-year duration of the study and all of these are included. In addition, a sample of controls was drawn from the remaining children in the study population by a complex multi-stage design. At the first stage, a sample of 300 census mesh blocks (each containing roughly 50 households) was drawn with probability proportional to the number of houses in the block. Then a systematic sample of 20 households was selected from each chosen mesh block and children from these households were selected for the study with varying probabilities that depend on age and ethnicity as in Table 8.1. These probabilities were chosen to match the expected frequencies among the cases. Cluster sample sizes varied from one to six and a total of approximately 250 controls was achieved. This corresponds to a sampling fraction of about 1 in 400, so that cases are sampled at a rate that is 400 times that for controls.

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.


Table 8.1 Selection probabilities.

             Maori    Pacific Islander    Other

1 year        0.29          0.70           0.10
3 years       0.15          0.50           0.07
5 years       0.15          0.31           0.04
8 years       0.15          0.17           0.04

A similar population-based case–control study of the effects of artificial sweeteners on bladder cancer is described in Graubard, Fears and Gail (1989). Again all cases occurring during the study were included, but the control sample was drawn using Waksberg's two-stage telephone sampling technique. A sample of exchanges, each containing 100 telephone numbers, is chosen at the first stage and then a simple random sample of households is drawn from each of the selected exchanges.

Complex sampling may also be used in the selection of cases. For example, we have recently helped with the analysis of a study in which cases (patients who had suffered a mild stroke) were selected through a two-stage design with doctors' practices as primary sampling units. However, this is much less common than with controls.

As we said at the outset, these studies are a very special sort of survey, but they share the two key features that make the analysis of survey data distinctive. The first feature is the lack of independence. In our example, we would expect substantial intracluster correlation because of unmeasured socio-economic variables, factors affecting living conditions (e.g. mould on the walls of a house), environmental exposures and so on. Ignoring this will lead to standard errors that are too small and confidence intervals that are too narrow. The other distinctive feature is the use of unequal selection probabilities. In case–control studies the selection probabilities can be extremely unbalanced and they are based directly on the values of the response variable, so that we have informative sampling at its most extreme.

In recent years, efficient semi-parametric procedures have been developed for handling the variable selection probabilities (see Scott and Wild (2001) for a survey of recent work). 'Semi-parametric' here means that a parametric model is specified for the response as a function of potential explanatory variables, but that the joint distribution of the explanatory variables is left completely free. This is important because there are usually many potential explanatory variables (more than 100 in some of the studies in which we are involved) and it would be impossible to model their joint behaviour, which is of little interest in its own right. However, the problem of clustering is almost completely ignored, in spite of the fact that a large number of studies use multi-stage sampling. The paper by Graubard, Fears and Gail (1989) is one of the few that even discuss the problem. Most analyses simply ignore the problem and use a program designed for simple (or perhaps stratified) random sampling of cases and controls. This chapter is an attempt to remedy this neglect.


In the next section we give a summary of standard results for simple case–control studies, where cases and controls are each selected by simple random sampling or by stratified random sampling. In Section 8.3, we extend these results to cover cases when the controls (and possibly the cases) are selected using an arbitrary probability sampling design. We investigate the properties of these methods through simulation studies in Section 8.4. The robustness of the methods in situations when the assumed model does not hold exactly is explored in Section 8.5, and some possible alternative approaches are sketched in the final section.

8.2. SIMPLE CASE–CONTROL STUDIES

We shall take it for granted throughout this chapter that our purpose is to make inferences about the parameters of a superpopulation model, since interest is centred on the underlying process that produces the cases and not on the composition of a particular finite population at a particular point in time. Suppose that the finite population consists of values {(x_t, y_t), t = 1, ..., N}, where y_t is a binary response and x_t is the vector of potential explanatory variables. The x_t can be generated by any mechanism, but the y_t are assumed to be conditionally independent given the x_t, with Pr(Y_t = y | {x_j}) = Pr(Y_t = y | x_t). These assumptions underlie all standard methods for the analysis of case–control data from population-based studies. For simplicity, we will work with the logistic regression model, logit{Pr(Y = 1 | x; β)} = x′β (see Sections 6.3.1 and 7.5), but similar results can be obtained for any binary regression model. We denote Pr(Y = 1 | x; β) by p₁(x; β) and often shorten it to p₁(x). We also set p₀(x) = 1 − p₁(x).

If we had data on the whole population, that is what we would analyse, treating it as a random sample from the process that produces cases and controls. The finite population provides the score (first derivative of the log-likelihood) equations

(1/N) Σ_{t=1}^{N} {y_t − p₁(x_t; β)} x_t = (N₁/N) [(1/N₁) Σ_{cases} p₀(x_t; β) x_t] − (N₀/N) [(1/N₀) Σ_{controls} p₁(x_t; β) x_t] = 0   (8.1)

where N₁ is the number of cases in the finite population and N₀ is the number of controls. As N → ∞, Equations (8.1) converge to

W₁ E₁{X p₀(X; β)} − W₀ E₀{X p₁(X; β)} = 0   (8.2)

where W_i = Pr(Y = i) and E_i{·} denotes the expected value over the distribution of X conditional on Y = i (i = 0, 1). Under the model, Equations (8.2) have solution β.

Standard logistic regression applies if our data come from a simple random sample of size n from the finite population. For rare events, as is typical in biostatistics and epidemiology, enormous efficiency gains can be obtained by stratifying on the value of Y and drawing a random sample of size n_i from the stratum defined by Y = i (i = 0, 1), with n₁ ≈ n₀.


Since the estimating equations (8.2) simply involve population means for the case and control subpopulations, they can be estimated directly from case–control data using the corresponding sample means. This leads to a design-weighted (pseudo-MLE, Horvitz–Thompson) estimator, β̂_HT, that satisfies

W₁ Ê₁{X p₀(X; β)} − W₀ Ê₀{X p₁(X; β)} = 0,   (8.3)

where Ê_i denotes the design-weighted sample mean over the sampled units with Y = i,

as in Sections 6.3.1 and 7.5. This method can be used when the W_i are known or can be estimated consistently (by the corresponding population proportions, for example). More efficient estimates, however, are obtained using the semi-parametric maximum likelihood estimator, β̂_ML say. This is essentially obtained by ignoring the sampling scheme (see Prentice and Pyke, 1979) and solving the prospective score equations (i.e. those that would be appropriate if we had a simple random sample from the whole population)

Σ_{cases} p₀(x_t; β) x_t − Σ_{controls} p₁(x_t; β) x_t = 0,   (8.4)

where the sums are now over the sampled cases and controls.

If n_i/n → λ_i as n = n₁ + n₀ → ∞, these equations tend to

λ₁ E₁{X p₀(X; β)} − λ₀ E₀{X p₁(X; β)} = 0   (8.5)

which have solution β + k e₁, where k = log[λ₁W₀/(λ₀W₁)] and e₁ = (1, 0, ..., 0)′. We see that only the intercept term is affected by the case–control sampling. The intercept term can be corrected simply by using k as an offset in the model, but if we are only interested in the coefficients of the risk factors, we do not even need to know the relative stratum weights. More generally, Scott and Wild (1986) show that, for any positive ρ₁ and ρ₀, equations of the form

ρ₁ E₁{X p₀(X; b)} − ρ₀ E₀{X p₁(X; b)} = 0   (8.6)

have the unique solution b = β + k e₁ with k = log[ρ₁W₀/(ρ₀W₁)], provided that the model contains a constant term (i.e. the first component of x is 1). This can be seen directly by expanding (8.6). Suppose, for simplicity, that X is continuous with joint density function f(x). Then, noting that the conditional density of X given that Y = i is f(x | Y = i) = p_i(x; β) f(x)/W_i, (8.6) is equivalent to

(ρ₁/W₁) ∫ x [e^{x′β} / {(1 + e^{x′β})(1 + e^{x′b})}] f(x) dx = (ρ₀/W₀) ∫ x [e^{x′b} / {(1 + e^{x′β})(1 + e^{x′b})}] f(x) dx.

The result then follows immediately.

The increase in efficiency of β̂_ML (or, more generally, estimates based upon

(8.6)) over β̂_HT can be modest when, as in Scott and Wild (1989), we have only one explanatory variable and the ratio of the sampling rates between cases and controls is not too extreme. However, the differences become bigger as cases become rarer in the population and the sampling rates become less balanced, and also when we have more than one covariate. We have seen 50% efficiency gains in some simple case–control problems. In more complex stratified


samples, design weighting can be very inefficient indeed (see Lawless, Kalbfleisch and Wild, 1999).

Most case–control studies incorporate stratification on other variables, such as age and ethnicity as in our motivating example. This is one aspect of complex survey design that is taken into account in standard case–control methodology. A common assumption is that relative risk is constant across these strata but that absolute levels of risk vary from stratum to stratum, leading to a model of the form

logit{Pr(Y = 1 | x, stratum h; β)} = β₀h + x₁′β₁   (8.7)

for observations in the hth stratum, where x now contains dummy variables for all the strata as well as the measured covariates, x₁ say. There is no problem adapting the design-weighted (pseudo-MLE, Horvitz–Thompson) approach to this problem; we simply add a dummy variable for each stratum to x and replace the sample means in (8.3) by their stratified equivalents. The maximum likelihood solution for β₁ is even simpler, assuming simple random sampling of cases and controls within each stratum; we simply fit the model in (8.7) as if we had a simple random sample from the whole population, ignoring the stratification completely. If we want estimates of the β₀h we need to adjust the output by an additive constant depending on the relative sampling fractions of cases and controls in the hth stratum, but if we are interested only in β₁, we do not even need to know these sampling fractions.
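The intercept-shift result underlying this section, that a prospective logistic fit to case–control data recovers the risk-factor coefficients while the constant is shifted by k = log[λ₁W₀/(λ₀W₁)], can be checked quickly by simulation. A sketch with a single stratum (our synthetic setup, not the chapter's):

```python
import numpy as np

rng = np.random.default_rng(1)

# population: logit Pr(Y = 1 | x) = b0 + b1*x, with rare cases
b0, b1 = -5.0, 1.0
N = 400_000
x = rng.normal(size=N)
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-(b0 + b1 * x)))).astype(float)

# case-control sample: all cases plus an equal number of SRS controls
cases = np.flatnonzero(y == 1)
controls = rng.choice(np.flatnonzero(y == 0), size=cases.size, replace=False)
s = np.concatenate([cases, controls])
Xs = np.column_stack([np.ones(s.size), x[s]])
ys = y[s]

# ordinary prospective logistic fit by Newton-Raphson, ignoring the design
theta = np.zeros(2)
for _ in range(30):
    mu = 1 / (1 + np.exp(-Xs @ theta))
    J = (Xs * (mu * (1 - mu))[:, None]).T @ Xs
    theta += np.linalg.solve(J, Xs.T @ (ys - mu))

W1 = y.mean()                              # Pr(Y = 1) in the population
k = np.log((0.5 * (1 - W1)) / (0.5 * W1))  # k = log[lambda1*W0/(lambda0*W1)]
print(theta)   # slope close to b1; intercept close to b0 + k
```

The fitted slope is close to b1 even though cases are hugely over-sampled, while the fitted intercept sits near b0 + k rather than b0.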

In some studies, we want to model the stratum constants as functions of the variables in x₁. For example, if the population is stratified by age, we might still want to model the effect of age by some smooth function. Again, adapting the design-weighted approach is completely straightforward and only requires the specification of the appropriate sampling weights. Extending the maximum likelihood approach to this case is considerably more difficult, however, and a fully efficient solution has only been obtained relatively recently (for details see Scott and Wild, 1997; Breslow and Holubkov, 1997).

8.3. CASE–CONTROL STUDIES WITH COMPLEX SAMPLING

Now consider situations in which controls (and possibly cases) are obtained using a complex sampling plan involving multi-stage sampling. Our only assumption is that the population is stratified into cases and controls and that samples of cases and controls are drawn independently. Note that this assumption does not hold exactly in our motivating example since a control could be drawn from the same cluster as a case. However, cases are sufficiently rare for this possibility to be ignorable in the analysis. (In fact, it occurred once in the study.) One common option is simply to ignore the sample design (see Graubard, Fears and Gail, 1989, for examples) and use a standard logistic regression program, just as if we had a simple (or stratified) case–control study. Underestimates of variance are clearly a worry with this strategy; the coverage frequencies of nominally 95% confidence intervals dropped to about 80% in


some of our simulations. In fact, estimates obtained in this way need not even be consistent. To obtain consistency, we need (1/n₀) Σ_{controls} x_t p₁(x_t) to converge to E₀{X p₁(X)} (with a similar condition for cases), which will be true for self-weighting designs but not true in general.

The estimating equations (8.2) involve separate means for the case and control subpopulations. Both these terms can be estimated using standard survey sampling techniques for estimating means from complex samples. Variances and covariances of the estimated means, which are required for sandwich estimates of Cov(β̂), can also be obtained by standard survey sampling techniques. Such analyses can be carried out routinely using specialist sampling packages such as SUDAAN or with general packages such as SAS or Stata that include a survey option. This approach is a direct generalisation of the design-weighted (Horvitz–Thompson) approach to the analysis of simple case–control data which, as we have seen, can be quite inefficient.

An alternative is to apply standard sampling methods to (8.6), with appropriate choices for ρ₁ and ρ₀, rather than to (8.2). This leads to an estimator, β̂ say, that satisfies

û(β) = ρ₁ Ê₁(β) − ρ₀ Ê₀(β) = 0   (8.8)

where Ê₁(β) is the sample estimate of the stratum mean E₁{X p₀(X; β)}, etc. The covariance matrix of β̂ can then be obtained by standard linearisation arguments. This leads to an estimated covariance matrix

Ĉov(β̂) = [Ĵ(β̂)]⁻¹ V̂{û(β̂)} [Ĵ(β̂)]⁻¹, with Ĵ(β) = −∂û(β)/∂β′,   (8.9)

where V̂{û(β̂)} is the standard survey estimate of the covariance matrix of û(β̂), just as in Equation (7.44). (Note that û(β) is just a linear combination of two estimated means.) All of this can also be carried out straightforwardly in SUDAAN or similar packages simply by specifying the appropriate weights (i.e. by scaling the case weights and control weights separately so that the sum of the case weights is proportional to ρ₁ and the sum of the control weights is proportional to ρ₀).
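The weight-scaling trick in the parenthesis can be written down directly. A small sketch (our code; any survey package that accepts user-supplied weights could then be given the rescaled weights):

```python
import numpy as np

def scale_case_control_weights(w, is_case, rho1, rho0):
    """Rescale survey weights so the case weights sum to rho1 and the
    control weights sum to rho0 (up to a common constant), which makes a
    standard weighted logistic fit solve equations of the form (8.8)."""
    w = np.asarray(w, dtype=float).copy()
    is_case = np.asarray(is_case, dtype=bool)
    w[is_case] *= rho1 / w[is_case].sum()
    w[~is_case] *= rho0 / w[~is_case].sum()
    return w

# toy weights: two cases, two controls, rho1 = rho0 = 0.5
w = scale_case_control_weights([1, 2, 3, 4], [True, True, False, False],
                               rho1=0.5, rho0=0.5)
print(w)   # case weights now sum to 0.5, control weights to 0.5
```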

We still have to decide on appropriate values for ρ₁ and ρ₀. For simple random samples of cases and controls, maximum likelihood, in which we set ρ_i = n_i/n for i = 0, 1, is the most efficient strategy (Cosslett, 1981). For more

complex schemes, using the sample proportions will no longer be fully efficient and we might expect weights based on some form of equivalent sample sizes to perform better. We have carried out some simulations to investigate this. Our limited experience so far suggests that setting ρ_i = n_i/n for i = 0, 1 leads to only a moderate loss of efficiency unless design effects are extreme.

We have not said anything specific about stratification beyond the basic division into cases and controls so far in this discussion. Implicitly, we are assuming that stratification on other variables such as age or ethnicity is handled by specifying the appropriate weights when we form the sample estimators Ê_i (i = 0, 1). One advantage of this approach is that, if we want to model the stratum constants in terms of other variables, then this is taken care


of automatically. There is another approach to fitting model (8.6), which is to ignore the different selection probabilities of units in different strata. This mimics the maximum likelihood method of the previous section more closely and usually leads to more efficient estimates of the coefficients in β₁. However, further adjustments are necessary if we wish to model the stratum constants. We explore the difference in efficiency between the two approaches in our simulations in the next section. Note that we can again implement the method simply in a package like SUDAAN by choosing the weights appropriately.

8.4. EFFICIENCY

In this section, we describe the results of some of the simulations that we have done to test and compare the methods discussed previously. The simulations reported here use two explanatory variables, a continuous variable X₁ ~ normal(0, 1) and an independent binary variable X₂ ~ binomial(1, 0.5). We generated data from a linear logistic model with β₁ = β₂ = 0.5. With these values, an increase of one standard deviation in X₁ corresponds to a 65% increase in risk, and the effect of exposure (X₂ = 1) is also a 65% increase in risk. Values of X₂ were kept constant within a cluster and values of X₁ were highly correlated within a cluster (ρ = 0.9). We used an intercept of β₀ = −6.1 which produces a population proportion of approximately 1 case in every 300 individuals, a little less extreme than the 1 in 400 in the meningitis study. The number of cases in every simulation study is 200. The number of control clusters is always fixed. The number of controls in each study averages 200, but the actual number is random in some studies (when 'ethnicities' are subsampled at different rates; see later). All simulations used 1000 generated datasets.
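One standard way to generate covariates with the within-cluster structure just described (X₂ constant within a cluster, X₁ with intracluster correlation 0.9 via a shared cluster effect) is sketched below. This is our own construction for illustration, not necessarily the authors' code:

```python
import numpy as np

def make_cluster_covariates(cluster_sizes, rho=0.9, seed=0):
    """X1 ~ N(0, 1) with intracluster correlation rho (shared cluster
    effect plus independent unit effect); X2 ~ Bin(1, 0.5), drawn once
    per cluster and held constant within it."""
    rng = np.random.default_rng(seed)
    x1, x2, cl = [], [], []
    for j, m in enumerate(cluster_sizes):
        a = rng.normal(scale=np.sqrt(rho))              # cluster effect
        x1.extend(a + rng.normal(scale=np.sqrt(1 - rho), size=m))
        x2.extend([rng.integers(0, 2)] * m)             # constant in cluster
        cl.extend([j] * m)
    return np.array(x1), np.array(x2), np.array(cl)

# 60% of clusters of size 2 and 40% of size 4, as in the first simulation
x1, x2, cl = make_cluster_covariates([2] * 600 + [4] * 400)
print(x1.var(), x2.mean())   # marginal variance near 1, exposure rate near 0.5
```

The decomposition var(X₁) = ρ + (1 − ρ) = 1 keeps the marginal distribution standard normal while fixing the within-cluster correlation at ρ.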

The first row of Table 8.2 shows results for a population in which 60% of the clusters are of size 2 and the remaining 40% of size 4. We see from Table 8.2 that, under these conditions, the sample-weighted estimator of β₁ is 60% more efficient than the population-weighted estimator, whereas for β₂ there is a 20% increase in efficiency. It is possible that the gain in efficiency is smaller for the

Table 8.2 Relative efficiencies (cf. design weighting).

    Stratified sampling           β₁                      β₂
                            Sample    Ethnic      Sample    Ethnic
                            weighted  weighted    weighted  weighted
    None
      2s and 4s             160       –           120       –
    Of 'ethnic groups'
      2s and 4s (whole)     149       150         118       122
      2s and 4s (random)    152       145         119       114
      4s and 8s (whole)     186       210         141       156
      4s and 8s (random)    189       177         127       122


binary variable because it represents a smaller change in risk. We also investigated the effect of using a standard logistic regression program ignoring clustering. Members of the same cluster are extremely closely related here, so we would expect the coverage frequency of nominally 95% confidence intervals to drop well below 95% if the clustering was ignored. We found that coverage frequencies for β₁ and β₂ dropped to 80% and 82% respectively when clustering was ignored. When we corrected for clustering in both the sample-weighted and population-weighted variance estimates, the coverage frequencies were close to 95%.
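
Correcting for clustering amounts to replacing the naive variance with a sandwich estimator whose 'meat' sums score contributions within clusters before forming the outer product. A hedged sketch (the function name is ours, not the chapter's):

```python
import numpy as np

def cluster_sandwich(X, y, w, beta, cluster):
    """Sandwich variance for a weighted logistic fit: 'bread' is the
    inverse weighted information; 'meat' is the outer product of
    cluster-summed score contributions."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    U = X * (w * (y - p))[:, None]             # per-unit score contributions
    ids, inv = np.unique(cluster, return_inverse=True)
    Uc = np.zeros((len(ids), X.shape[1]))
    np.add.at(Uc, inv, U)                      # sum scores within clusters
    bread = np.linalg.inv((X * (w * p * (1.0 - p))[:, None]).T @ X)
    return bread @ (Uc.T @ Uc) @ bread
```

With one unit per cluster this reduces to the usual heteroscedasticity-robust (HC) variance.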

In Section 8.3, we discussed the possible benefits of using weights that reflect 'effective sample' size in place of sample weights. We repeated the preceding simulation (using the clusters of size 2 and 4), down-weighting the controls by a factor of 3 (which is roughly equal to the average design effect). This produced a 5% increase in the efficiency of β₁ and only a 1% increase in the efficiency of β₂ compared with sample weighting.

The rows of Table 8.2 headed 'Stratified sampling of "ethnic groups"' show the results of an attempt to mimic (on a smaller scale) the stratified sampling of controls within clusters in the meningitis study. We generated our population so that 60% of the individuals belong to ethnic group 1, and 20% belong to each of groups 2 and 3. This was done in two ways. Where '(random)' has been appended, the ethnic groups have been generated independently of cluster. Where '(whole)' has been appended, all members of a cluster have the same ethnicity. All members of a sampled cluster in groups 2 and 3 were retained, while members of group 1 were retained with probability 0.33. We varied the sizes of the clusters, with either 60% twos and 40% fours, or 60% fours and 40% eights, before subsampling. The subsampling (random removal of group 1 individuals) left cluster sample sizes ranging from one to the maximum cluster size. The models now also include dummy variables for ethnic group and we can estimate these in two ways. The columns headed 'Ethnic weighted' show results when cases have sample weights, whereas controls have weights which are the sample weights times 1.8 for group 1 or 0.6 for groups 2 or 3. This reconstitutes relative population weightings so that contrasts in ethnic coefficients are correctly estimated.

In this latter set of simulations the increase in efficiency for sample weighting over population weighting is even more striking. Reweighting to account for different sampling rates in the different ethnic groups has led to similar efficiencies. This is important as the type of 'ethnic weighting' we have done in the control group would allow us to model group differences rather than being forced to fit a complete set of dummy variables. The coverages for β₂ are good for all methods that take account of the clustering. The coverages for β₁ dropped a little, however, in the '4s and 8s' clusters with design weighting. We can interpret this as meaning that, with more clustering and additional parameters to estimate, our effective sample sizes have dropped and, with smaller sample sizes, asymptotic approximations are less accurate.


8.5. ROBUSTNESS

In the previous section we considered efficiency when we were fitting the correct model. In this section, we look at what happens when the model is misspecified. In this case, the estimator converges in probability to b*, the solution to the population version of (8.6), i.e.

    μ₁E₁[{1 − p₁(x; b)}x] = μ₀E₀[p₁(x; b)x]

where Eᵢ denotes expectation over the distribution of x among cases (i = 1) or controls (i = 0).

If the true model is logistic with parameter β, then b* = β + κe₁ for a scalar κ, where e₁ is the first unit vector. Only the intercept is affected by the choice of weightings μᵢ used in the estimating equations. It is the remaining regression coefficients that are of primary interest, however, since they determine the relative risk of becoming a case associated with a change in the values of explanatory variables. These remaining coefficients are unaffected by the choice of weightings.

By contrast, if the model is misspecified, every element of b* depends upon the weightings used. Choosing design weights (μᵢ = Wᵢ) leads to B, the quantity we would be estimating if we fitted the model to the whole population. Since all models are almost certainly misspecified to some extent, many survey statisticians (see Kish and Frankel, 1974) would suggest that a reasonable aim is to estimate B. We adopted this perspective uncritically in Scott and Wild (1986, 1989), suggesting as a consequence that, although maximum likelihood estimation was more efficient, design weighting was more robust because it alone led to consistent estimates of B in the presence of model misspecification. This has been quoted widely by both biostatisticians and survey samplers. However, a more detailed appraisal of the nature of B suggests that using sample weights (μᵢ = nᵢ/n), which is maximum likelihood under the model, may often be more robust as well as being more efficient.

Table 1 of Scott and Wild (1989) showed results from fitting a linear logistic regression model when the true model was quadratic with logit{Pr(Y = 1|x)} = β₀ + β₁x + β₂x². Any model can be expanded in a Taylor series on a logistic scale, and a quadratic approximation should give a reasonable idea of what happens when the working logistic model is not too badly misspecified. Two population curves were investigated, with the extent of the curvature being chosen so that an analyst would fail to detect the curvature about 50% of the time with samples of size n₁ = n₂ = 200. Scott and Wild (1989) assumed a standard normal(0, 1) distribution for the covariate X and the population curves logit{Pr(Y = 1|x)} = β₀ + 4x − 0.6x² with β₀ = −5.8, plotted in Figure 8.1(a), and logit{Pr(Y = 1|x)} = β₀ + 2x + 0.3x² with β₀ = −5.1, plotted in Figure 8.1(b). We will refer to the former curve as the negative quadratic and the latter as the positive quadratic. Both models result in about 1 case in every 20 individuals (i.e. W₁ = 0.05). By reducing the value of β₀ we reduce the value of W₁. Figure 8.1(c) plots the negative quadratic curve for β₀ = −9.21 and Figure 8.1(d) plots the positive quadratic curve for β₀ = −9.44. These values correspond to 1 case in every 400 individuals. For each population curve,


Figure 8.1 Approximations to population curve.

we have plotted two linear approximations. The solid line corresponds to B, the 'whole-population approximation'. The dashed-line approximation corresponds to using sample weights (maximum likelihood).

The slope of the logit curve tells us about the relative risk of becoming a case associated with a change in x. Using vertical arrows, we have indicated the position at which the slope of the linear approximation agrees with the slope of the curve. The results are clearest in the more extreme situation of 1 case in 400 (cf. the meningitis study). We see that B (design weights) is telling us about

[Figure 8.1 comprises four panels of logit P₁(x) against x, each with two linear approximations: the design-weighted (whole-population) fit and the sample-weighted fit. Top row (1 in 20): (a) negative quadratic, logit P₁(x) = b₀ + 4x − 0.6x² with b₀ = −5.8; (b) positive quadratic, logit P₁(x) = b₀ + 2x + 0.3x² with b₀ = −5.1. Bottom row (1 in 400): (c) b₀ = −9.21; (d) b₀ = −9.44. The reported slope pairs (design vs. sample b₁) are 2.70 vs. 2.47 and 2.38 vs. 2.81 for the 1-in-20 panels, and 3.25 vs. 2.77 and 1.89 vs. 2.61 for the 1-in-400 panels. The sample and design weights for WLS are plotted beneath each panel.]


changes in risk only in the extreme tail of the covariate distribution, while the sample-weighted maximum likelihood estimator is telling us about the effects of changes in x closer to the centre of the population. In other words, the sample-weighted estimator represents what is happening in the population better than the design-weighted one. The same thing is happening with 1 case in 20, but to a lesser extent. This should not be surprising as the relative weightings are less severe here.

Why does the whole-population approximation provide a worse description of the effects of changes in x on risk? To gain some insight we use a discrete approximation to a normal distribution of X and look at the asymptotic weights employed in a weighted least squares linear fit to the true logit at x, namely log{Pr(Y = 1|x)/Pr(Y = 0|x)} (weighted least squares with the appropriate weights is asymptotically equivalent to maximum likelihood when the model is true). Note that weighted least squares uses weights that are inversely proportional to variances, and recall that the asymptotic variance of the estimated logit at x is proportional to 1/{Nf(x)p₁(x)p₀(x)}, where f(x) denotes the frequency of x in the population. Thus, the weight at x corresponding to design weighting (μᵢ = Wᵢ) is proportional to f(x)p₁(x)p₀(x). Similarly, when n₁ = n₀, the weight corresponding to sample weighting (μᵢ = nᵢ/n) is proportional to {f(x|0) + f(x|1)}p̃₁(x)p̃₀(x), where p̃ᵢ(x) is obtained from pᵢ(x) by adjusting the intercept for the case–control sampling fractions (i.e. p̃₁(x) represents the smoothed sample proportion, and p₁(x) the fitted population proportion, of cases at x). These weights are plotted at the bottom of each graph in Figure 8.1, with sample weights in grey and design (population) weights in black. We see that as cases become rarer (going from 1 in 20 to 1 in 400), the weights used move away from the centre of the X-distribution, with the design weights moving more quickly out into extreme regions than the sample weights, and therefore approximating the logit curve only for extreme (and rare) high-risk individuals.
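
These weight profiles are easy to compute. The sketch below is our own illustration, using an assumed linear logistic p₁(x) rather than the chapter's quadratic curves; it normalises both profiles so their locations can be compared:

```python
import numpy as np

def weight_profiles(beta0=-9.44, beta1=2.0):
    """Asymptotic WLS weight profiles for a linear logistic p1(x) with
    X ~ N(0,1): design weighting uses f(x)p1(x)p0(x); sample weighting
    (n1 = n0) uses {f(x|0) + f(x|1)} ptil1(x) ptil0(x)."""
    x = np.linspace(-3.0, 4.0, 141)
    f = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)    # f(x)
    p1 = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))   # Pr(Y=1|x)
    design = f * p1 * (1.0 - p1)                      # design-weight profile
    W1 = np.sum(f * p1) / np.sum(f)                   # marginal case rate
    f1 = f * p1 / W1                                  # f(x|1), cases
    f0 = f * (1.0 - p1) / (1.0 - W1)                  # f(x|0), controls
    ptil1 = f1 / (f1 + f0)                            # sample case proportion at x
    sample = (f0 + f1) * ptil1 * (1.0 - ptil1)        # sample-weight profile
    return x, design / design.sum(), sample / sample.sum()
```

For these (rare-case) parameter values the design profile peaks noticeably further out in the tail of X than the sample profile, mirroring the pattern in Figure 8.1.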

We have argued above that, for misspecified models, using sample weights often estimates a more relevant quantity than using design weights. The weights also offer some insight into the efficiency losses, and slower asymptotic convergence, of the design-weighted estimates.

Figure 8.2 relates the weights to f(x|1), the X-distribution among cases, and f(x|0), the X-distribution among controls. These densities were obtained for the more extreme (1 in 400) models in Figure 8.1, which are curved, but almost identical behaviour is observed when we look at fitting correctly specified linear models. In the following we assume a case–control sample with n₁ = n₀ for simplicity. Under case–control sampling, the sample odds at xⱼ are of the form nⱼ₁/nⱼ₀, where nⱼ₁ is the number of cases in the sample at xⱼ. Multiplication by W₁/W₀ converts this to an estimate of the population odds at xⱼ. Clearly, these quantities are most reliably estimated in X-regions where there are substantial numbers of both cases and controls in the population, i.e. where f(x|1) and f(x|0) overlap. We see that this is the region from which the sample-weighted estimates draw most heavily. In situations where x has a strong effect and cases are rare, as shown in Figure 8.2, the design-weighted estimate puts its large weights where cases are most common. These weights can


Figure 8.2 Comparing weights to marginal densities.

be shown to be proportional to f(x|1)Pr(Y = 0|x). Thus, in situations where cases are rare across the support of X, the weight profile is almost identical to f(x|1), as in Figure 8.2. Heavy weight is being given to unreliable sample odds estimates formed at xⱼ-values where cases are common in the sample but controls are rare. One would expect this to lead to precisely the sort of behaviour we have seen over many simulations, namely less efficient estimates and slower convergence to asymptotic behaviour.

These arguments lead us to the conclusion that design-weighted estimates are both less efficient and no more robust than maximum likelihood sample-weighted estimates.

8.6. OTHER APPROACHES

One alternative approach is to modify the subsampling method suggested by Hinkins, Oh and Scheuren (1997). Instead of trying to develop specialist programs to handle the complex sample, we modify the sample so that it can be processed with standard programs. In our case, this means producing a subsample of controls (cases) that looks like a simple random sample from the population of controls (cases). Then we have simple case–control data which can be analysed using a standard logistic regression program. Can we produce such a subsample? The answer is 'Yes' for most designs (see Hinkins, Oh and Scheuren, 1997), but the resulting subsample size is usually much smaller than the original. Obviously this results in a substantial loss of efficiency, since we are throwing away many (sometimes most) of the observations. We increase power by repeating the process independently many times and averaging the estimates over the repetitions. Note that the repetitions, although conditionally independent, are not unconditionally independent when averaged over the distribution of the original sample. In spite of this, it is relatively simple to estimate standard errors and derive other properties of the final estimates. Details are given in Rao and Scott (1999b).
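
The repeat-and-average idea can be sketched generically (a toy illustration, not the authors' implementation; keeping one unit per cluster is just one simple way of producing an SRS-like subsample):

```python
import numpy as np

def repeated_subsample_estimate(data, draw_srs_like, estimate, reps=200, seed=0):
    """Draw many subsamples that mimic simple random samples, apply a
    standard estimator to each, and average over the repetitions.
    (Function and argument names here are hypothetical.)"""
    rng = np.random.default_rng(seed)
    ests = np.array([estimate(draw_srs_like(data, rng)) for _ in range(reps)])
    return ests.mean(axis=0), ests

# toy usage: estimate a mean from a cluster sample by keeping one unit
# per cluster, so each subsample looks like an SRS of such units
clusters = [np.array([1.0, 1.2]), np.array([3.0, 2.8]), np.array([5.0, 5.2])]

def one_per_cluster(data, rng):
    return np.array([c[rng.integers(len(c))] for c in data])

est, _ = repeated_subsample_estimate(clusters, one_per_cluster, np.mean)
```

Averaging over repetitions recovers most of the efficiency lost by any single small subsample, although, as noted above, the repetitions are not unconditionally independent.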


[Figure 8.2 comprises two panels for the 1-in-400 models: (a) negative quadratic, logit P₁(x) = −9.21 + 4x − 0.6x²; (b) positive quadratic, logit P₁(x) = −9.44 + 2x + 0.3x². Each panel plots the case and control densities f(x|1) and f(x|0) together with the sample and design weights for WLS.]


We have done some simulations to investigate the relative efficiency of this method. Preliminary results suggest that the method works well in relatively simple situations. With a small number of strata and small cluster sizes we observed efficiencies comparable with those obtained using the sample-weighted method of Section 8.3. Unfortunately, this method broke down completely when we attempted to apply it to the example discussed in the introduction. With a maximum cluster sample size of six, a simple random subsample of controls can contain at most n₀/6 units. This meant that our subsamples only contained about 40 controls. As a consequence, some of the age × ethnic strata were empty in some of the subsamples and the logistic regression program often failed to converge.

Another approach would be to build a mixed logistic model for Pr(Yₜ = y|xₜ) containing random cluster effects. In principle, fitting such a model is covered by the methods in Scott and Wild (2001). In practice, this requires numerical integration at each step of an iterative procedure, but software to make this feasible has not yet been developed. Work is proceeding on developing such software. It is important to note that the meaning of the regression coefficients is different with this approach. In the rest of this chapter, we are dealing with marginal (or population-averaged) models, which are appropriate when we are interested in the effects of changes in the explanatory variables on the whole population rather than on specific individuals. Random effects models are subject specific and are appropriate when we are interested in the effects of changes in the explanatory variables on specific individuals. This is a huge and important topic, but it is peripheral to the main points of this chapter and we do not pursue it further here. The interested reader is referred to the comprehensive discussions in Neuhaus, Kalbfleisch and Hauck (1991) or Diggle, Liang and Zeger (1994) for more details.

ACKNOWLEDGEMENTS

We wish to thank Joanna Stewart and Nick Garrett for introducing us to the problem, and our research assistant Jolie Hutchinson, who did most of the programming.


PART C

Continuous and General Response Data


CHAPTER 9

Introduction to Part C

R. L. Chambers

The three chapters that make up Part C of this book discuss methods for fitting a regression model using sample survey data. In all three the emphasis is on situations where the sampling procedure may be informative, i.e. where the distribution of the population values of the sample inclusion indicators depends on the response variable in the regression model. This model may be parametrically specified, as in Chapter 12 (Pfeffermann and Sverchkov), or may be nonparametric, as in Chapter 10 (Bellhouse, Goia and Stafford) and Chapter 11 (Chambers, Dorfman and Sverchkov).

To focus our discussion of the basic issues that arise in this situation, consider a population made up of N units, with scalar variables Y and X defined for these units (the extension to multivariate X is straightforward). A sample of n units is taken and their values of Y and X are observed. The aim is to use these data to model the regression of Y on X in the population. Let gU(x) denote this population regression function. That is, gU(x) is the expected value of Y given X = x under the superpopulation joint distribution of Y and X. If a parametric specification exists for gU(x), the problem reduces to estimating the values of the parameters underlying this specification. More generally, such a specification may not exist, and the concern then is with estimating gU(x) directly.

9.1. THE DESIGN-BASED APPROACH

The traditional design-based approach to this problem weights the sample data in inverse proportion to their sample inclusion probabilities when fitting gU(x). In particular, if an estimation procedure for gU(x) can be defined when the values of Y and X for the entire population are known, then its extension to the general unequal probability sampling situation is simply accomplished by replacing all (unknown) population-based quantities in this procedure by appropriate estimates based on the sample data. If the population-level estimation procedure is maximum likelihood, this represents an application of the so-called

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.


pseudo-likelihood approach (see Chapter 2). In Chapter 10, Bellhouse, Goia and Stafford apply this idea to nonparametric estimation of both the population density fU(y) of Y and the population regression function gU(x).

Nonparametric density estimation and nonparametric regression estimation are two well-established areas of statistical methodology. See Silverman (1986) and Härdle (1990). In both areas, kernel-based methods of estimation are well known and widely available in statistical software. These methods were originally developed for simple random sampling from infinite populations. For example, given a population consisting of N independent and identically distributed observations y₁, y₂, …, y_N from a population distribution with density fU(y), the kernel estimate of this density is

    f̂U(y) = (Nh)⁻¹ Σ_{t=1}^{N} K((y − yₜ)/h)      (9.1)

where K is its kernel function (usually defined as a zero-symmetric unimodal density function) and h > 0 is its bandwidth. The bandwidth h is the most important determinant of the efficiency of f̂U(y). In particular, the bias of (9.1) increases and its variance decreases as h increases. Asymptotic results indicate that the optimal value of h (in terms of minimising the asymptotic mean squared error of f̂U(y)) is proportional to the inverse of the fifth root of N.

Modifying this estimator for the case where data from a complex sampling scheme are available (rather than the population values) is quite straightforward from a design-based viewpoint. One just replaces f̂U(y) by

    f̂s(y) = (h Σ_{t∈s} wₜ)⁻¹ Σ_{t∈s} wₜ K((y − yₜ)/h)      (9.2)

where wₜ is the sample weight associated with sampled unit t. More generally, Bellhouse, Goia and Stafford define a binned version of f̂s(y) in Chapter 10, and develop asymptotic design-based theory for this estimator. Their estimator (10.1) reduces to (9.2) when the bins correspond to the distinct Y-values observed in the sample. Since larger bins are essentially equivalent to increasing the bandwidth, there is a strong relationship between the bin size b and the bandwidth h, and Bellhouse and Stafford (1999) show that, analogous to the random sampling case, a good choice is b = 1.25h. In Section 10.4 Bellhouse, Goia and Stafford suggest a bootstrap algorithm for correcting the bias in (9.2).
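
A direct transcription of (9.2) with a Gaussian kernel (an illustrative sketch; the function name is ours):

```python
import numpy as np

def weighted_kde(y_obs, w, y_grid, h):
    """Sample-weighted kernel density estimate (9.2), Gaussian kernel:
    fhat_s(y) = sum_t w_t K((y - y_t)/h) / (h * sum_t w_t)."""
    u = (y_grid[:, None] - y_obs[None, :]) / h
    K = np.exp(-u**2 / 2.0) / np.sqrt(2.0 * np.pi)   # kernel evaluations
    return (K * w[None, :]).sum(axis=1) / (h * w.sum())
```

The binned version would group the yₜ into bins of width b ≈ 1.25h and apply the same formula to the bin midpoints, with the weights summed within each bin.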

Extension of kernel density estimation to nonparametric estimation of the population regression function gU(x) is straightforward. The population-based kernel estimator of this function is

    ĝU(x) = b₀ₓ      (9.3)

where the coefficients {bₖₓ} are the solution to the local polynomial estimating equation


    Σ_{t=1}^{N} K(h⁻¹(xₜ − x)) {yₜ − Σ_{k=0}^{p} bₖₓ(xₜ − x)ᵏ} (xₜ − x)ʲ = 0,   j = 0, 1, …, p.      (9.4)

That is, this approach estimates gU(x) via a weighted least squares fit of a polynomial regression function of order p to the population data, with the weight for unit t in the population determined by its 'distance' K(h⁻¹(xₜ − x)) from x. A popular choice is p = 1, in which case the solution to (9.4) is often referred to as the local linear regression of Y on X.

With unequally weighted sample survey data, (9.3) and (9.4) are replaced by

    ĝs(x) = b₀ₓ      (9.5)

and

    Σ_{t∈s} wₜ K(h⁻¹(xₜ − x)) {yₜ − Σ_{k=0}^{p} bₖₓ(xₜ − x)ᵏ} (xₜ − x)ʲ = 0,   j = 0, 1, …, p.      (9.6)

Again, binned versions of these formulae are easily derived (see Section 10.5). In this case binning is carried out on X, with Y-values in (9.6) replaced by their sample-weighted average values within each bin. Since the estimator (9.5) is just a weighted sum of these weighted averages, its variance (conditional on the estimated bin proportions for X) can be estimated via a sandwich-type estimator. See Equation (10.10). Whether this variance is the right one for (9.5) is a moot point. In general both the estimated bin proportions and the kernel weights K(h⁻¹(xₜ − x)) are random variables, and their contribution to the overall variability of (9.5) is ignored in (10.10).
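
The sample-weighted local linear case (p = 1) of (9.5)–(9.6) can be written in a few lines (an illustrative sketch with a Gaussian kernel; names are ours):

```python
import numpy as np

def local_linear(x_obs, y_obs, w, x0, h):
    """Sample-weighted local linear fit: solve the weighted least
    squares problem (9.6) with p = 1 at the point x0 and return
    b_0x, the estimate of g_U(x0)."""
    u = (x_obs - x0) / h
    k = w * np.exp(-u**2 / 2.0)                 # kernel weight times sample weight
    Z = np.column_stack([np.ones_like(x_obs), x_obs - x0])
    A = (Z * k[:, None]).T @ Z                  # 2x2 weighted normal equations
    b = np.linalg.solve(A, (Z * k[:, None]).T @ y_obs)
    return b[0]
```

Because the fit is exact for linear data, the estimator reproduces a straight-line regression function regardless of the weights, which is a convenient sanity check.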

9.2. THE SAMPLE DISTRIBUTION APPROACH

This approach underpins the development in Chapter 11 (Chambers, Dorfman and Sverchkov) and Chapter 12 (Pfeffermann and Sverchkov). See Section 2.3 for a discussion of the basic idea. In particular, given knowledge of the method used to select the sample and a specification of the population distribution of Y, one can use Bayes' theorem to infer the marginal distribution of this variable for any sampled unit. This sample distribution then forms the basis for inference about the population distribution.

Suppose the population values of Y can be assumed to be independent and identically distributed observations from a distribution with density fU(y). This density function may, or may not, be parametrically specified. Let I denote the indicator random variable defined by whether a generic population unit is selected into the sample or not, and define Δy to be a small interval centred around y. Then

    Pr(Y ∈ Δy | I = 1) = Pr(I = 1 | Y ∈ Δy) Pr(Y ∈ Δy) / Pr(I = 1).


The sample density fs(y) for Y is obtained by letting Δy shrink to the point y, leading to

    fs(y) = Pr(I = 1 | Y = y) fU(y) / Pr(I = 1).      (9.7)

When the random variables I and Y are independent of one another, the sample and population densities of Y are identical. In such cases one can apply standard inferential methods to the sample data. In general, however, these two random variables will not be independent, and inference then needs to be based on the distribution of the sample data.
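
The effect of dependence between I and Y is easy to demonstrate. In the sketch below (our illustration, not from the chapter), fU is standard normal and the inclusion probability increases with y, so by (9.7) the sample density is tilted towards large values:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000
y_pop = rng.normal(size=N)              # population values, f_U = N(0, 1)
pi = 1.0 / (1.0 + np.exp(-y_pop))       # Pr(I = 1 | Y = y), increasing in y
I = rng.random(N) < pi                  # informative selection indicators
y_sample = y_pop[I]
# By (9.7), f_s(y) = Pr(I=1|Y=y) f_U(y) / Pr(I=1): the sample is
# tilted towards large y, so the sample mean exceeds the population mean.
shift = y_sample.mean() - y_pop.mean()
```

Standard unweighted inference applied to `y_sample` would therefore be biased for the population distribution.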

In Chapter 11, Chambers, Dorfman and Sverchkov extend this sample-based approach to nonparametric estimation of the population regression function gU(x). The development in this chapter uses relationships between population and sample moments that hold under this approach to develop 'plug-in' and 'estimating equation' based estimators for gU(x) given four data scenarios corresponding to common types of sample data availability. These estimators are then contrasted with the simple inverse π-weighted approach to estimation of gU(x), as described, for example, in Chapter 10. Under the sample-based framework, this π-weighted approach is inconsistent in general. However, it can be made consistent via a 'twicing' argument, which adjusts the estimate by subtracting a consistent estimate of its bias.

Finally, in Chapter 12 Pfeffermann and Sverchkov extend this idea to parametric estimation. In particular, they focus on the important case where the population values are independently distributed and follow a generalised linear model (GLM). The conditional version of (9.7) appropriate to this situation is

    fs(y|x) = Pr(I = 1 | Y = y, X = x) fU(y|x) / Pr(I = 1 | X = x)      (9.8)

where fU(y|x) denotes the value at y of the population conditional density of Y given X = x. In order to use this relationship, it is necessary to model the conditional probability that a population unit is sampled given its values for Y and X. Typically, we characterise sample inclusion for a population unit in terms of its sample inclusion probability π. In addition to the standard situation where this probability depends on the population values zU of an auxiliary variable Z, it can also depend on the population values yU of Y and xU of X. In general, therefore, π = π(zU, yU, xU), so the values of π in the population correspond to realised values of random variables. Using an iterated expectation argument, Pfeffermann and Sverchkov (Equation (12.1)) show that Pr(I = 1 | Y = y, X = x) = EU(π | Y = y, X = x), where EU denotes expectation with respect to the population distribution of relevant quantities. It follows that (9.8) can be equivalently expressed as

    fs(y|x) = {EU(π | Y = y, X = x) / EU(π | X = x)} fU(y|x).      (9.9)

In order to use this relationship to estimate the parameters of the GLM defining fU(y|x), Pfeffermann and Sverchkov make the further assumption


that the sample data values can be treated as realisations of independent random variables (although not true in general, asymptotic results presented in Pfeffermann, Krieger and Rinott (1998) justify this assumption as a first-order approximation). This allows the use of (9.9) to define a sample-based likelihood for the parameters of fU(y|x) that depends on both the GLM specification for this conditional density and the ratio EU(π | Y = y, X = x)/EU(π | X = x).

Let θ denote the (unknown) vector-valued parameter characterising the conditional population distribution of Y given X. The sample likelihood estimating equations for θ defined by (9.9) are

    Σ_{t∈s} ∂ log fU(yₜ|xₜ; θ)/∂θ + Σ_{t∈s} ∂ log EU(πₜ | yₜ, xₜ)/∂θ − Σ_{t∈s} ∂ log EU(πₜ | xₜ)/∂θ = 0      (9.10)

where these authors note that (i) the conditional expectation EU(πₜ | yₜ, xₜ) will be only 'marginally' dependent on θ (and so the second term of (9.10) can be ignored), and (ii) the conditional expectation EU(πₜ | xₜ) in the third term of (9.10) can be replaced by an empirical estimate based on the sample data (in effect leading to a sample-based quasi-likelihood approach).

In order to show how EU(πₜ | xₜ) can be estimated, Pfeffermann and Sverchkov use a crucial identity (see Equation (12.4)) linking population and sample-based moments. From this follows

    EU(πₜ | xₜ) = 1 / Es(πₜ⁻¹ | xₜ)      (9.11)

where the expectation operator Es in the denominator on the right hand side of (9.11) denotes expectation conditional on being selected into the sample. It can be estimated by the simple expedient of fitting an appropriate regression model for the sample πₜ⁻¹ values in terms of the sample xₜ values. See Section 12.3.

The sample-based likelihood approach is not the only way one can use (9.9) to develop an estimating equation for θ. Pfeffermann and Sverchkov observe that when the population-level 'parameter equations'

    Σ_{t=1}^{N} ∂ log fU(yₜ|xₜ; θ)/∂θ = 0

are replaced by their sample equivalent and (12.4) applied, these equations can be expressed as

    Σ_{t∈s} Es{πₜ⁻¹ ∂ log fU(yₜ|xₜ; θ)/∂θ | xₜ} / Es(πₜ⁻¹ | xₜ) = 0.

They therefore suggest that θ can also be estimated by solving

    Σ_{t∈s} qₜ(xₜ) ∂ log fU(yₜ|xₜ; θ)/∂θ = 0      (9.12)


where the weights qₜ(xₜ) are defined as

    qₜ(xₜ) = πₜ⁻¹ / Es(πₜ⁻¹ | xₜ).      (9.13)

As before, the conditional expectation in the denominator above is unknown, and is estimated from the sample data.

Pfeffermann and Sverchkov refer to the conditional weights (9.13) as adjusted weights and argue that they are superior to the standard weights that underpin the pseudo-likelihood approach to estimation of θ. Their argument is based on the observation that the pseudo-likelihood approach to estimating θ corresponds to replacing the conditional expectation in the denominator of (9.13) by an unconditional expectation, and hence includes effects due to sample variation in values of X.
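As a concrete illustration, the adjusted weights of (9.13) can be computed by estimating the denominator E_s(w_t | x_t) from the sample itself, for example by regressing the sample weights on x. The sketch below uses a simple linear regression; the function names, the toy data and the choice of a linear model for E_s(w_t | x_t) are illustrative assumptions, not part of the chapter.

```python
# Sketch of the adjusted weights q_t(x_t) = w_t / E_s(w_t | x_t) of (9.13).
# E_s(w_t | x_t) is estimated here by a simple linear regression of the
# sample weights on x; the names and the linear model are assumptions.

def fit_linear(x, w):
    """Least-squares fit of w on x; returns (intercept, slope)."""
    n = len(x)
    xbar = sum(x) / n
    wbar = sum(w) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxw = sum((xi - xbar) * (wi - wbar) for xi, wi in zip(x, w))
    slope = sxw / sxx
    return wbar - slope * xbar, slope

def adjusted_weights(x, w):
    """q_t = w_t / estimated E_s(w_t | x_t), via the fitted regression."""
    a, b = fit_linear(x, w)
    return [wt / (a + b * xt) for xt, wt in zip(x, w)]

# Toy illustration: sample weights that drift with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [10.0, 12.0, 14.0, 16.0, 18.0]
q = adjusted_weights(x, w)
```

When the weights are an exact linear function of x, as in this toy example, all the adjusted weights equal one: the variation in w is entirely explained by x and carries no extra information.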

These authors also suggest that variance estimation for estimators obtained by solving the sample-based likelihood estimating equations (9.10) can be defined by assuming that the left hand sides of these equations define a score function and then calculating the inverse of the expected information matrix corresponding to this function. For estimators defined by an estimating equation like (9.12), they advocate either sandwich-type variance estimation (assuming the weights in these equations are treated as fixed constants) or bootstrap methods. See Section 12.4.

9.3. WHEN TO WEIGHT?

In Section 6.3.2 Skinner discusses whether survey weights are relevant for parametric inference. Here we address the related problem of deciding whether to use the sample distribution-based methods explored in Part C of this book. These methods are significantly more complex than standard methods. However, they are applicable where the sampling method is informative, that is, where sample inclusion probabilities depend on the values of the variable Y. Both Chapter 11 (Chambers, Dorfman and Sverchkov) and Chapter 12 (Pfeffermann and Sverchkov) contain numerical results which illustrate the behaviour of the different approaches under informative sampling.

These results indicate that the more complicated sample distribution-based estimators should be used instead of standard unweighted estimators if the sampling procedure is informative. They also show that use of sample distribution-based estimators when the sampling method is noninformative can lead to substantial loss of efficiency. Since in practice it is unlikely that a secondary analyst of the survey data will have enough background information about the sampling method to be sure about whether it is informative or noninformative, the question then becomes one of deciding, on the basis of the evidence in the available sample data, whether they have been obtained via an informative sampling method.


Two basic approaches to resolving this problem are presented in what follows. In Chapter 11, Chambers, Dorfman and Sverchkov consider this problem from the viewpoint of nonparametric regression estimation. They suggest two test methods. The first corresponds to the test statistic approach, with a Wald-type statistic used to test whether there is a significant difference in the fits of the estimated regression functions generated by the standard and sample distribution-based approaches at a number of values of X. Their second (and more promising) approach is to test the hypothesis that the coefficients of a polynomial approximation to the difference between the population and sample regression functions are all zero. In Chapter 12, Pfeffermann and Sverchkov tackle the problem by developing a Hotelling-type test statistic for the expected difference between the estimating functions defining the standard and sample distribution-based estimators. As they point out, 'the sampling process can be ignored for inference if the corresponding parameter equations are equivalent'. Numerical results for these testing methods presented in Chapters 11 and 12 are promising.


CHAPTER 10

Graphical Displays of Complex Survey Data through Kernel Smoothing

D. R. Bellhouse, C. M. Goia and J. E. Stafford

10.1. INTRODUCTION

Many surveys contain measurements made on a continuous scale. Due to the precision at which the data are recorded for the survey file and the size of the sample, there will be multiple observations at many of the distinct values. This feature of large-scale survey data has been exploited by Hartley and Rao (1968, 1969) in their scale-load approach to the estimation of finite population parameters. Here we exploit this same feature of the data to obtain kernel density estimates for a continuous variable y and to graphically examine relationships between a variate y and covariate x through local polynomial regression. In the case of independent and identically distributed random variables, kernel density estimation in various contexts is discussed, for example, by Silverman (1986), Jones (1989), Scott (1992) and Simonoff (1996). Local polynomial regression techniques are described in Härdle (1990), Wand and Jones (1995), Fan and Gijbels (1996), Simonoff (1996) and Eubank (1999). Here we examine how these kernel smoothing techniques may be applied in the survey context. Further, we examine the effect of the complex design on estimates and second-order moments. With particular reference to research in complex surveys, Buskirk (1999) has examined density estimation techniques using a model-based approach and Bellhouse and Stafford (1999) have examined the same techniques under both design- and model-based approaches. Further, Korn and Graubard (1998b) provided a technique for local regression smoothing based on data at the micro level, but did not provide any properties for their procedures. Bellhouse and Stafford (2001) have provided a design-based approach to local polynomial regression. Bellhouse and Stafford (1999, 2001) have analyzed data from the Ontario Health Survey. Chesher (1997) has

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.


applied nonparametric regression techniques to model data from the British National Food Survey and Härdle (1990) has done the same for the British Family Expenditure Survey. In these latter cases it does not appear that the survey weights were used in the analysis.

One method of displaying the distribution of continuous data, univariate or bivariate, is through the histogram or binning of the data. This is particularly useful when dealing with large amounts of data, as is the case in large-scale surveys. In view of the multiple observations per distinct value and the large size of the sample that are both characteristic of survey data, we can bin to the precision of the data, at least for histograms on one variable. On their own, univariate histograms are usually easy to interpret. However, interpretation can be difficult when comparing several histograms, especially when they are superimposed. Similarly, bivariate histograms, even on their own, can be difficult to interpret. We overcome these difficulties in interpretation by smoothing the binned data through kernel density estimation. What is gained in increased interpretability through smoothing is balanced by an increase in the bias of the density estimate of the variable under study. The basic techniques of smoothing binned data are described in Section 10.2 and applications of the methodology to the 1990 Ontario Health Survey are presented in Section 10.3. A discussion of bias adjustments to the density estimate is given in Section 10.4.

Rather than binning to the precision of the data in y, we could instead bin on a covariate x. For each distinct value of x, we compute an estimate of the average response in y. We can then investigate the relationship between y and x through local polynomial regression. The properties of the estimated relationship are examined in Section 10.5 and applications to the Ontario Health Survey are discussed in Section 10.6.

10.2. BASIC METHODOLOGY FOR HISTOGRAMS AND SMOOTHED BINNED DATA

For density estimation from a finite population it is necessary to define the function to be estimated. We take the point of view that the quantity of interest is a model function obtained through asymptotic arguments based on a nested sequence of finite populations of sizes N_1, N_2, N_3, . . . with N_l → ∞ as l → ∞. For a given population l we can define the finite population distribution function F_{N_l}(y) as the empirical distribution function for that finite population. Then as l → ∞ we assume F_{N_l}(y) → F(y), where F(y) is a smooth function. The function of interest is the density f(y) = F′(y) rather than the empirical distribution function for the given finite population.

For independent and identically distributed observations on a single variable, f(y) is the model density function. One avenue for estimation of f(y) is the construction of a histogram. This is a straightforward procedure. The range of the data is divided into I nonoverlapping intervals or bins, and either the number or the sample proportion of the observations that fall into each bin is


calculated. This sample proportion for the ith bin, where i = 1, . . . , I, is an unbiased estimate of p_i, the probability that an observation taken from the underlying probability distribution falls in the ith bin. The same sample proportion divided by the bin size b can be used as an estimate of f(y) at all y-values within the bin. The methodology can be easily extended to two variables. Bins are constructed for each variable, I for the first variable and J for the second, yielding a cross-classification of IJ bins in total. In this case p_{ij} is the probability that an observation taken from the underlying bivariate distribution falls in the (i, j)th bin, where i = 1, . . . , I and j = 1, . . . , J.

A minor modification is needed to obtain binned data for large-scale surveys. In this case survey weights are attached to each of the sample units. These are used to obtain the survey estimates p̂_i of p_i and p̂_{ij} of p_{ij}, where p_i and p_{ij} are now finite population bin proportions. Within the appropriate bin interval, the histogram estimate of the univariate density function is p̂_i/b for any y in the ith bin. Similarly, p̂_{ij} divided by the bin area provides a bivariate density estimate. For simplicity of presentation we assume throughout that the bin sizes b are equal for all bins. This assumption can be relaxed.

We denote the raw survey data by {y_s, w_s}, where y_s is the vector of observations and w_s is the vector of survey weights. If binning is carried out then the data are reduced to {m, p̂}, where m = (m_1, m_2, . . . , m_I) is the vector of bin midpoints and p̂ = (p̂_1, p̂_2, . . . , p̂_I) is the vector of survey estimates of bin proportions. When binning has been carried out to the level of precision of the data, then, for the purpose of calculating density estimates, there is no loss in information in going from {y_s, w_s} to {m, p̂}.
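The reduction from {y_s, w_s} to {m, p̂} can be sketched as follows: each observation contributes its survey weight to the bin containing it, and the weighted bin totals are normalised. A minimal sketch with assumed names and toy data:

```python
# Reduce raw survey data {y_s, w_s} to binned form {m, p_hat}: bin midpoints
# and weighted bin proportions. Equal bin width b is assumed, as in the text;
# the function name, bin layout and toy data are illustrative.

def bin_weighted(ys, ws, lo, b, n_bins):
    totals = [0.0] * n_bins
    for y, w in zip(ys, ws):
        # Index of the bin containing y; clip the upper edge into the last bin.
        i = min(int((y - lo) / b), n_bins - 1)
        totals[i] += w
    wsum = sum(ws)
    m = [lo + (i + 0.5) * b for i in range(n_bins)]   # bin midpoints
    p_hat = [t / wsum for t in totals]                # weighted proportions
    return m, p_hat

# Toy survey data: five observations with unequal weights.
ys = [1.2, 1.4, 2.1, 2.8, 3.3]
ws = [1.0, 2.0, 1.0, 1.0, 1.0]
m, p_hat = bin_weighted(ys, ws, lo=1.0, b=1.0, n_bins=3)
```

The estimated proportions sum to one by construction, so {m, p̂} is ready for the smoothing step of (10.1).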

Visual comparison of density estimates is much easier once the histogram has been smoothed. The smoothed histogram, examined by Bellhouse and Stafford (1999), is also a density estimate. For complex surveys based on the data {m, p̂}, the smoothed histogram for a single variable y is given by

$$\hat f(y) = \frac{1}{h} \sum_{i=1}^{I} \hat p_i\, K\!\left(\frac{y - m_i}{h}\right). \qquad (10.1)$$

In (10.1), K(x) is the kernel evaluated at the point x and h is the bandwidth or smoothing parameter to be chosen. The kernel estimator at a value y is a weighted sum of the estimated bin proportions, with higher weights placed on the bins that are closer to the value y. The kernel K(x) is typically chosen to be a probability density function with mean value 0 and finite higher moments, in particular

$$\sigma_2 = \int x^2 K(x)\, dx < \infty. \qquad (10.2)$$

In the examples that we provide we have chosen the kernel to be the standard normal distribution. The estimate in (10.1) depends on both the bin size b and the window width h. In practical terms for complex surveys, the smallest value that b can take will be the precision to which the data have been recorded. For the variables body mass index and desired body mass index in the Ontario


Health Survey, discussed in Section 10.3, the smallest value for b is 0.1. As noted already, there is no loss in information between {y_s, w_s} and {m, p̂} when the level of binning is at the precision of the data. As a result, kernel density estimation from the raw data while taking into account the survey weights is equivalent to (10.1) with the smallest possible bin size.
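A direct implementation of the smoothed histogram (10.1) with a standard normal kernel, working only from the binned data {m, p̂}; the names and toy values below are assumptions for illustration:

```python
# Smoothed histogram (10.1): f(y) = (1/h) * sum_i p_hat_i * K((y - m_i)/h),
# with K the standard normal density, computed from binned data {m, p_hat}.
import math

def normal_kernel(u):
    """Standard normal density, the kernel K used in the chapter's examples."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def smoothed_histogram(y, m, p_hat, h):
    """Kernel-weighted sum of estimated bin proportions, as in (10.1)."""
    return sum(p * normal_kernel((y - mi) / h) for mi, p in zip(m, p_hat)) / h

# Toy binned data: midpoints and weighted bin proportions.
m = [1.5, 2.5, 3.5]
p_hat = [0.5, 1.0 / 3.0, 1.0 / 6.0]
fy = smoothed_histogram(2.0, m, p_hat, h=0.8)
```

Since each kernel integrates to one, the estimate integrates to Σ p̂_i = 1, so it is a bona fide density estimate whatever the bin size.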

Quantile estimates can be obtained from f̂(y) using methodology given in Bellhouse and Stafford (1999). Denote the estimated αth quantile by ŷ_α. The estimate may be obtained as the solution to the equation

$$\hat F(\hat y_\alpha) = \alpha$$

where F̂ is the distribution function estimate associated with f̂.

The solution is obtained through Newton–Raphson iteration by setting

$$\hat y_\alpha^{(i)} = \hat y_\alpha^{(i-1)} - \frac{\hat F(\hat y_\alpha^{(i-1)}) - \alpha}{\hat f(\hat y_\alpha^{(i-1)})}$$

for i = 1, 2, 3, . . . until convergence is achieved. If the kernel is a standard normal density, then F̂(y) has the same functional form as f̂(y) with the density replaced by the standard normal cumulative distribution function. An initial estimate may be obtained through a histogram estimate. Bellhouse (2000) has shown that the estimated variance of ŷ_α may be expressed as

$$\hat v(\hat y_\alpha) = \frac{\hat v(\hat F(\hat y_\alpha))}{[\hat f(\hat y_\alpha)]^2}$$

where v̂(F̂(ŷ_α)), the estimated variance of the proportion of the observations less than or equal to ŷ_α, can be obtained as the survey variance estimate of a proportion. In a limited empirical study carried out by Bellhouse (2000), this variance estimate compared favourably to the Woodruff (1952) procedure.
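The Newton–Raphson quantile iteration can be sketched directly: with a standard normal kernel, F̂ is the p̂-weighted sum of normal CDFs at the bin midpoints, and each step divides the CDF residual by f̂. The names, toy data and stopping rule below are assumptions:

```python
# Newton-Raphson for the alpha-th quantile of the smoothed histogram: with a
# standard normal kernel, F_hat has the same form as f_hat with the density
# replaced by the normal CDF. Names and toy data are illustrative.
import math

def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def normal_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def f_hat(y, m, p_hat, h):
    return sum(p * normal_pdf((y - mi) / h) for mi, p in zip(m, p_hat)) / h

def F_hat(y, m, p_hat, h):
    return sum(p * normal_cdf((y - mi) / h) for mi, p in zip(m, p_hat))

def quantile(alpha, m, p_hat, h, y0, tol=1e-8):
    """Solve F_hat(y) = alpha by Newton-Raphson starting from y0."""
    y = y0
    for _ in range(100):
        step = (F_hat(y, m, p_hat, h) - alpha) / f_hat(y, m, p_hat, h)
        y -= step
        if abs(step) < tol:
            break
    return y

m = [1.5, 2.5, 3.5]
p_hat = [0.5, 1.0 / 3.0, 1.0 / 6.0]
y_med = quantile(0.5, m, p_hat, h=0.8, y0=2.5)
```

The starting value y0 plays the role of the histogram-based initial estimate mentioned in the text.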

The population may consist of a number D of disjoint subpopulations or domains, such as gender or age groups. Then a histogram or a smoothed histogram f̂_d(y) may be calculated for each domain d, d = 1, . . . , D. By superimposing plots of f̂_d(y) for d = 1, . . . , D we can make visual comparisons of the domains. We could also make visual comparisons of different variables for the same domain. This provides a much less cluttered view than superimposing histograms, and is visually more effective than trying to compare separate plots.

For two variables, x and y, the smoothed two-dimensional (2-D) histogram is given by

$$\hat f(x, y) = \frac{1}{h_1 h_2} \sum_{i=1}^{I} \sum_{j=1}^{J} \hat p_{ij}\, K\!\left(\frac{x - m_{1i}}{h_1}, \frac{y - m_{2j}}{h_2}\right) \qquad (10.3)$$

where K(x, y) is the kernel evaluated at the point (x, y), where h_1 and h_2 are the bandwidths in the x direction and y direction respectively, and where m_{1i} and m_{2j} are the midpoints of the ith and jth bins defined for x and y respectively. Again, K(x, y) is usually a density function; we choose K(x, y) to be the product of two standard normal densities. A 3-D plot of (10.3) may be easier to interpret than the associated histogram. Additionally, contour plots based on


(10.3) may also be obtained, thus providing more easily accessible information about the relationship between x and y.
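Equation (10.3) with a product standard normal kernel is a small extension of the one-dimensional case; the grid of midpoints and the p̂_{ij} values below are invented for illustration:

```python
# Bivariate smoothed histogram (10.3) with a product standard normal kernel:
# f(x, y) = (1/(h1*h2)) * sum_ij p_hat[i][j] * K((x - m1i)/h1) * K((y - m2j)/h2).
# Midpoints and bin proportions are toy values.
import math

def normal_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def smoothed_histogram_2d(x, y, m1, m2, p_hat, h1, h2):
    total = 0.0
    for i, m1i in enumerate(m1):
        for j, m2j in enumerate(m2):
            total += (p_hat[i][j] * normal_kernel((x - m1i) / h1)
                      * normal_kernel((y - m2j) / h2))
    return total / (h1 * h2)

m1 = [1.5, 2.5]
m2 = [1.5, 2.5]
p_hat = [[0.4, 0.1],
         [0.1, 0.4]]          # mass concentrated on the diagonal
fxy = smoothed_histogram_2d(2.0, 2.0, m1, m2, p_hat, h1=0.8, h2=0.8)
```

Evaluating this estimate over a grid of (x, y) values gives exactly the surface that would be fed to a 3-D or contour plotting routine.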

For a given value of y, the finite population mean square error of a density estimator f̂(y) of f(y) is E(f̂(y) − f(y))², where the operator E denotes expectation with respect to the sampling design. A measure of global performance of f̂(y) is its integrated mean square error, given by ∫ E(f̂(y) − f(y))² dy, where the integral is taken over the range of values of y. The limiting value of this measure, taken in the increasing sequence of finite populations and as the bandwidth and bin size decrease, is the asymptotic integrated mean square error (AIMSE). In the case of independent and identically distributed random variables Silverman (1986) and Scott (1992), among others, provide formal definitions of the AIMSE. We examine the AIMSE under complex sampling designs for two reasons. The first is to see the effect of the design on this global measure in comparison to the standard case. The second is that the ideal window size h may be obtained through the AIMSE.

Under the finite population expectation operator, Bellhouse and Stafford (1999) have obtained

(10.4)

as the AIMSE for the smoothed histogram estimator of f(y) in the survey context, where σ_2 is given in (10.2). In (10.4) higher order terms in 1/n, h and b are ignored. Further in (10.4), d_i = var(p̂_i)/[p_i(1 − p_i)/n] is the design effect for the ith bin, and d_{ij} = cov(p̂_i, p̂_j)/[−p_i p_j/n] is the covariance design effect when i ≠ j. When d_i = d_{ij} = 1, (10.4) reduces to that obtained by Jones (1989) in the standard case. The first three terms in (10.4) comprise the asymptotic integrated variance and the last term is the asymptotic integrated square bias. It may be noted that the complex design has no effect on the bias but changes the variance by factors related to the design effects in the survey. The same is true for the AIMSE of the histogram estimate of the density function. The bias term in (10.4) increases with the bin size b. Smoothing the histogram generally leads to an increase in bias of the estimate of the density function, i.e. the smoothed histogram has greater bias than the histogram estimate.

For smoothed histograms the bin size b and the window width h are related. Jones (1989) sets b = ah in (10.4) for the case when d_i = d_{ij} = 1 and the term R(f)/n is negligible. This yields an AIMSE depending on a, denoted by AIMSE_a. Jones then examined the ratio AIMSE_a/AIMSE_0, where AIMSE_0 is the value for the ideal kernel density estimate. Jones found that the relationship b = 1.25h was reasonable. Using the same approach but without the assumption that d_i = d_{ij} = 1, Bellhouse and Stafford (1999) have obtained the same relationship for b and h. Consequently, the ideal window size does not change when going from the standard case to complex surveys.


10.3. SMOOTHED HISTOGRAMS FROM THE ONTARIO HEALTH SURVEY

We illustrate our techniques with results from the Ontario Health Survey. This survey was carried out in 1990 in order to measure the health status of the population and to collect data relating to the risk factors of major causes of morbidity and mortality in the Province of Ontario. A total sample size of 61 239 people was obtained by stratified two-stage cluster sampling. Each public health unit in the province, further divided into rural and urban areas, formed the strata, a total of 86 in number. The first-stage units within a stratum were enumeration areas. These enumeration areas, taken from the 1986 Census of Canada, are the smallest geographical units from which census counts can be obtained automatically. An average of 46 enumeration areas was chosen within each stratum. At the second stage of sampling, within an enumeration area dwellings were selected, approximately 15 from an urban enumeration area and 20 from a rural enumeration area. Information was collected on members of the household within the dwelling, which constituted the cluster. See Ontario Ministry of Health (1992) for a detailed description of the survey.

Several health characteristics were measured in the Ontario Health Survey. We will focus on two continuous variables from the survey, body mass index (BMI) and desired body mass index (DBMI). For this survey the measurements were taken to three digits of accuracy and vary between 7.0 and 45.0. The BMI is a measure of weight status and is calculated from the weight in kilograms divided by the square of the height in meters. The DBMI is the same measure with actual weight replaced by desired weight. The index is not applicable to adolescents, adults over 65 years of age and pregnant or breast feeding women (see, for example, US Department of Health and Human Services, 1990). Consequently, there is a reduction in sample size when only the applicable cases are analyzed. A total of 44 457 responses was obtained for BMI and 41 939 for DBMI. With over 40 000 responses and approximately 380 distinct values for the response, there can be multiple observations at each distinct value. Other variables that will be examined in the analysis that follows are age, sex and the smoking status of the respondent.

We constructed density estimates f̂_d(y) for each of the domains based on gender, age and smoking status for both BMI and DBMI. The ease of interpretation of superimposed smoothed histograms may be seen in Figures 10.1, 10.2 and 10.3.

Figure 10.1 shows the estimated distributions of BMI and DBMI for females alone. The superimposed images are distributions for two variables on a single domain. It is easily seen that for women BMI has a greater mean and variance than DBMI. There is also greater skewness in the BMI distribution than in the DBMI distribution. Inspection of the graph also lends some support to the notion that women generally want to lose weight whatever their current weight may be.

For a single variable we can also superimpose distributions obtained from several domains. We calculated the density estimates for BMI based on



Figure 10.1 Comparison of the distributions of BMI and DBMI for females.


Figure 10.2 Distributions of BMI for smoking status.



Figure 10.3 Change in the distributions of BMI for increasing age.

smoking status, taking males and females together at all ages. In particular, we constructed density estimates f̂_d(y) for each of the domains defined by the groups 'never smoked', 'former smoker', 'current occasional smoker' and 'current daily smoker'. These superimposed distributions are shown in Figure 10.2. As superimposed histograms, the equivalent picture would have been quite cluttered. From Figure 10.2, however, it may be easily seen that three of the four distributions are very similar. Although the distribution for the group 'former smoker' has the same shape as the other three, it is shifted to the right of the others. Consequently, giving up smoking is associated with a mean shift in the distribution of the BMI. Weight gain associated with giving up smoking is well known and has been reported both more generally (US Department of Health and Human Services, 1990) and with particular reference to the Ontario Health Survey (Østbye et al., 1995).

In comparisons of the BMI over domains defined by age groups, it has been found that BMI tends to increase with age except at the upper end of the range near age 65 (Østbye et al., 1995). Figure 10.3 shows in a dynamic way how BMI changes with age, again with both males and females together. The graph at the top left is the distribution of BMI for the age group 20–24. As each new five-year age group is added (25–29, 30–34, . . . , 60–64), the solid line shows the current density estimate and the dotted lines show all the density estimates at earlier ages. It may be seen from Figure 10.3 that BMI tends to increase with


age until the upper end of the age range, and the variance also increases with age until about the age of 40.

As noted from Figure 10.1, BMI and DBMI may be related. This may be pursued further by examining the estimate of the joint density function. Figure 10.4 shows a 3-D plot and a contour plot of f̂(x, y) for females, where x is BMI and y is DBMI. The speculation that was made from Figure 10.1 is seen more clearly in the contour plots. In Figure 10.4, if a vertical line is drawn from the BMI axis then one can visually obtain an idea of the DBMI distribution conditional on the value of BMI where the vertical line crosses the BMI axis. When the vertical line is drawn from BMI = 20 then DBMI = 20 is in the upper tail of the distribution. At BMI = 28 or 29, the upper end of the contour plot falls below DBMI = 28. This illustrates that women generally want to weigh less for any given weight. It also shows that average desired weight loss increases with the value of BMI.

10.4. BIAS ADJUSTMENT TECHNIQUES

The density estimate f̂(y), as defined in (10.1), is biased for f(y), with the bias given by Bias[f̂(y)] = E[f̂(y)] − f(y). This is a model bias rather than a finite population bias, since f(y) is obtained from a limiting value for a nested sequence of increasing finite populations. If we can obtain an approximation to the bias then we can get a bias-adjusted estimate f̃(y) from

$$\tilde f(y) = \hat f(y) - \mathrm{Bias}[\hat f(y)]. \qquad (10.5)$$

An approximation to the bias can be obtained in at least one of two ways: obtain either a mathematical or a bootstrap approximation to the bias at each y.

The mathematical approximation to the bias of the estimator in (10.1) is the same as that shown in Silverman (1986), namely

$$\mathrm{Bias}[\hat f(y)] \approx \left(\frac{\sigma_2 h^2}{2} + \frac{b^2}{24}\right) f''(y). \qquad (10.6)$$

Since f″(y) is unknown we substitute f̂″(y). Recall that h is the window width of the kernel smoother and b is the bin width of the histogram. When K(y) is a standard normal kernel, then σ_2 = 1 and

$$\hat f''(y) = \frac{1}{h^3} \sum_{i=1}^{I} \hat p_i \left[\left(\frac{y - m_i}{h}\right)^2 - 1\right] \phi\!\left(\frac{y - m_i}{h}\right) \qquad (10.7)$$

where φ denotes the standard normal density.

The bias-corrected estimate is obtained from (10.5) and (10.6) after replacing f″(y) by (10.7).
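Putting (10.5)–(10.7) together for a standard normal kernel (for which φ″(u) = (u² − 1)φ(u) and σ_2 = 1) gives the following sketch; the binned data are invented for illustration:

```python
# Mathematical bias adjustment: estimate f''(y) from the binned data as in
# (10.7), plug it into the bias approximation (h^2/2 + b^2/24) f''(y) of
# (10.6) with sigma_2 = 1, and subtract as in (10.5). Toy data assumed.
import math

def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def f_hat(y, m, p_hat, h):
    """Smoothed histogram (10.1) with a standard normal kernel."""
    return sum(p * phi((y - mi) / h) for mi, p in zip(m, p_hat)) / h

def f_hat_dd(y, m, p_hat, h):
    """Second derivative of (10.1), using phi''(u) = (u^2 - 1) phi(u)."""
    return sum(p * (((y - mi) / h) ** 2 - 1.0) * phi((y - mi) / h)
               for mi, p in zip(m, p_hat)) / h ** 3

def bias_adjusted(y, m, p_hat, h, b):
    """Bias-corrected estimate (10.5) with the bias approximated by (10.6)."""
    bias = (h * h / 2.0 + b * b / 24.0) * f_hat_dd(y, m, p_hat, h)
    return f_hat(y, m, p_hat, h) - bias

m = [1.5, 2.5, 3.5]
p_hat = [0.5, 1.0 / 3.0, 1.0 / 6.0]
adj = bias_adjusted(2.0, m, p_hat, h=0.8, b=1.0)
```

Near a mode f″ is negative, so the correction raises the estimate there, which is exactly the direction in which smoothing flattens peaks.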

To develop the appropriate bootstrap algorithm, it is necessary to look at the parallel 'Real World' and then construct the appropriate 'Bootstrap World', as outlined in Efron and Tibshirani (1993, p. 87). For the 'Real World' assume that we have a density f. The 'Real World' algorithm is:



Figure 10.4 Joint distribution of BMI and DBMI for females.

(a) Obtain a sample {y_1, y_2, . . . , y_n} from f.
(b) Using the midpoints m = (m_1, m_2, . . . , m_K) bin the data to obtain p̂ = (p̂_1, p̂_2, . . . , p̂_K).


(c) Smooth the binned data, or equivalently compute the kernel density estimate on the binned data.

(d) Repeat steps (a)–(c) B times to get f̂_1, f̂_2, . . . , f̂_B.
(e) Compute f̄ = B⁻¹ Σ_b f̂_b. The bias is estimated by f̄ − f.

Here in the estimate of bias an expected value is replaced by a mean, thus avoiding what might be a complicated analytical calculation. However, in practice this bias estimate cannot be computed because f is unknown and the simulation cannot be conducted. The solution is then to replace f with f̂ in the first step and mimic the rest of the algorithm. This strategy is effective if the relationship between f and the mean f̄ is similar to the relationship between f̂ and the corresponding bootstrap quantities given in the algorithm; that is, if the bias in the bootstrap mimics the bias in the 'Real World'. Assuming that we have only the binned data {m, p̂}, the algorithm for the 'Bootstrap World' is:

(a) Smooth the binned data {m, p̂} to get f̂ as in (10.1) and obtain a sample {y*_1, y*_2, . . . , y*_n} from f̂.

(b) Bin the data {y*_1, y*_2, . . . , y*_n} to get {m, p̂*}.
(c) Obtain a kernel density estimate f̂* on this binned data by smoothing {m, p̂*}.
(d) Repeat steps (a)–(c) B times to get f̂*_1, f̂*_2, . . . , f̂*_B.
(e) Compute f̄* = B⁻¹ Σ_b f̂*_b. The bias is estimated by f̄* − f̂.

The sample in step (a) may be obtained by the rejection method in Rice (1995, p. 91) or by the smooth bootstrap technique in Efron and Tibshirani (1993). From (10.5) and step (e), the bias-corrected estimate is f̃(y) = 2f̂(y) − f̄*(y).
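The 'Bootstrap World' steps can be sketched as below. Sampling from f̂ uses the smooth bootstrap device of drawing a bin midpoint with probability p̂_i and adding kernel noise; this device, the nearest-midpoint re-binning and all names are illustrative assumptions rather than the authors' implementation.

```python
# 'Bootstrap World' bias adjustment: sample from the smoothed histogram,
# re-bin, re-smooth, average over B replicates, and set the bias-corrected
# estimate to 2*f_hat(y) - mean of the bootstrap estimates.
import math
import random

def phi(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def f_hat(y, m, p_hat, h):
    """Smoothed histogram (10.1) with a standard normal kernel."""
    return sum(p * phi((y - mi) / h) for mi, p in zip(m, p_hat)) / h

def rebin(ys, m):
    """Bin a sample back onto the midpoints m (nearest-midpoint rule)."""
    counts = [0] * len(m)
    for y in ys:
        i = min(range(len(m)), key=lambda k: abs(y - m[k]))
        counts[i] += 1
    n = len(ys)
    return [c / n for c in counts]

def bootstrap_adjusted(y, m, p_hat, h, n, B, rng):
    boot = []
    for _ in range(B):
        # Smooth bootstrap draw from f_hat: midpoint + kernel noise.
        ys = [rng.choices(m, weights=p_hat)[0] + h * rng.gauss(0.0, 1.0)
              for _ in range(n)]
        boot.append(f_hat(y, m, rebin(ys, m), h))
    return 2.0 * f_hat(y, m, p_hat, h) - sum(boot) / B

rng = random.Random(1)
m = [1.5, 2.5, 3.5]
p_hat = [0.5, 1.0 / 3.0, 1.0 / 6.0]
adj = bootstrap_adjusted(2.0, m, p_hat, h=0.8, n=500, B=20, rng=rng)
```

The subtraction replaces the analytic bias (10.6) with its Monte Carlo counterpart f̄* − f̂, so no second derivative of the density is needed.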

We first illustrate the effect of the bias adjustment techniques with simulated data. We generated data from a standard normal distribution, binned the data and then smoothed the binned data. The results of this exercise are shown in Figure 10.5, where the standard normal density is shown as the solid line, the kernel density estimate based on the raw data is the dotted line and the smoothed histogram is the dashed line. The smoothed histogram is given for the ideal window size and then decreasing window sizes. The ideal window size h is obtained from the relationship b = 1.25h given in both Jones (1989) for the standard case and Bellhouse and Stafford (1999) for complex surveys. It is evident from Figure 10.5 that the smoothed histogram has increased bias even though the ideal window size is used. This is due to iterating the smoothing process: smoothing, binning and smoothing again. The bias may be reduced to some extent by picking the window size to be smaller (top right plot). However, the bottom two plots in Figure 10.5 show that as the window size decreases the behaviour of the smoothed histogram becomes dominated by the bins and no further reduction in bias can be satisfactorily made. A more effective bias adjustment can be produced from the bootstrap scheme described above. The application of this scheme is shown in Figure 10.6, which shows bias-adjusted versions of the top two plots in Figure 10.5. In Figure 10.6 the solid line is the


Figure 10.5 Density estimates of standard normal data.

Figure 10.6 Bias-adjusted density estimates for standard normal data.

kernel density estimate based on the raw data, the dotted line is the smoothedhistogram and the dashed line is the bias-adjusted smoothed histogram.
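The effect of the correction can be reproduced with a minimal Python sketch. The sample size, grid, bin width b = 0.25 and the plain iid resampling bootstrap below are illustrative choices only; the chapter's scheme draws its bootstrap samples as in steps (a)–(e) above rather than by naive resampling:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(500)            # raw data, standard normal

edges = np.linspace(-4.0, 4.0, 33)      # 32 bins of width b = 0.25
mids = 0.5 * (edges[:-1] + edges[1:])
grid = np.linspace(-3.5, 3.5, 141)

def smoothed_hist(sample, h):
    """Bin the data, then Gaussian-kernel smooth the binned proportions."""
    props = np.histogram(sample, bins=edges)[0] / len(sample)
    k = np.exp(-0.5 * ((grid[:, None] - mids[None, :]) / h) ** 2)
    k /= h * np.sqrt(2 * np.pi)
    return k @ props

h = 0.25 / 1.25                         # ideal window from b = 1.25h
f_hat = smoothed_hist(y, h)

# Bootstrap bias correction: f_tilde = 2*f_hat - mean of bootstrap replicates.
B = 200
reps = np.stack([smoothed_hist(rng.choice(y, size=len(y), replace=True), h)
                 for _ in range(B)])
f_tilde = 2 * f_hat - reps.mean(axis=0)

area = float(f_tilde.sum() * (grid[1] - grid[0]))
print(area)                             # area under f_tilde, close to 1
```

The correction simply subtracts the bootstrap estimate of the smoothing-iteration bias from the smoothed histogram.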

The plots in Figure 10.7 show the mathematical approximation approach (dotted line) and the bootstrap approach (dashed line) to bias adjustment applied to DBMI for the whole sample. We used a bin size larger than the precision of the data so that we could better illustrate the effect of bias adjustment. It appears that the bootstrap technique provides a better adjustment in the centre of the distribution than the adjustment based on mathematical approximation. The reverse occurs in the tails of the distribution, especially in both left tails where the bootstrap provides a negative estimate of the density function.

Figure 10.7 Bias-adjusted density estimates of BMI and DBMI.


10.5. LOCAL POLYNOMIAL REGRESSION

Local polynomial regression is a nonparametric technique used to discover the relationship between the variate of interest y and the covariate x. Suppose that x has I distinct values, or that it can be categorized into I bins. Let x_i be the value of x representing the ith distinct value or the ith bin, and assume that the values of x_i are equally spaced. The finite population mean for the variate of interest y at x_i is ȳ_Ui. On using the survey weights and the measurements on y from the survey data, we calculate the estimate ŷ_i of ȳ_Ui. For large surveys, a plot of ŷ_i against x_i may be more informative and less cluttered than a plot of the raw data. The survey estimates ŷ_i have variance–covariance matrix V. The estimate of V is V̂.

As in density estimation we take an asymptotic approach to determine the function of interest. As before, we assume that we have a nested sequence of increasing finite populations. Now we assume that, for a given range on x in the sequence of populations, I → ∞ and the spacing between the x_i approaches zero. Further, the limiting population mean at x, denoted by ȳ_x or m(x), is assumed continuous in x. The limiting function m(x) is the function of interest.

We can investigate the relationship between y and x through local polynomial regression of the ŷ_i on the x_i. The regression relationship is obtained on plotting the fit m̂(x) against x for each particular choice of x. The fit m̂(x) = β̂_0 is the estimated intercept parameter in the weighted least squares objective function

Σ_{i=1}^{I} p̂_i K((x_i − x)/h) [ŷ_i − β_0 − β_1(x_i − x) − ⋯ − β_p(x_i − x)^p]²    (10.8)

where K(x) is the kernel evaluated at a point x, and h is the window width. The weights in this procedure are p̂_i K((x_i − x)/h), where p̂_i is the estimate of the proportion p_i of observations with distinct value x_i. Korn and Graubard (1998b) have considered an objective function similar to (10.8) for the raw data with the p̂_i replaced by the sampling weights, but provided no properties for their procedure.

The estimate m̂(x) can be written as

m̂(x) = e₁′ (X_x′ W_x X_x)⁻¹ X_x′ W_x ŷ    (10.9)

and its variance estimate as

v̂ar{m̂(x)} = e₁′ (X_x′ W_x X_x)⁻¹ X_x′ W_x V̂ W_x X_x (X_x′ W_x X_x)⁻¹ e₁    (10.10)

where

ŷ = (ŷ₁, ..., ŷ_I)′ and W_x = diag{p̂_i K((x_i − x)/h)}

and

X_x is the I × (p + 1) matrix whose ith row is (1, (x_i − x), ..., (x_i − x)^p), with e₁ the (p + 1) × 1 vector (1, 0, ..., 0)′.

On assuming that ŷ is approximately multivariate normal, confidence bands for m(x) can be obtained from (10.9) and (10.10).

The asymptotic bias and variance of m̂(x), as defined by Wand and Jones (1995), turn out to be the same as those obtained in the standard case by Wand and Jones (1995) for the fixed design framework, as opposed to the random design framework. Details are given in Bellhouse and Stafford (2001). In addition to the asymptotic framework assumed here, we also need to assume that ŷ_i is asymptotically unbiased for ȳ_Ui in the sense of Särndal, Swensson and Wretman (1992, pp. 166–7).
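In outline, (10.9) and (10.10) can be computed directly as weighted least squares quantities. The sketch below uses a Gaussian kernel and invented toy inputs (a linear age trend with independent bin estimates), so it is an illustration of the formulas rather than a reproduction of the chapter's analysis:

```python
import numpy as np

def local_poly(x0, x, yhat, p_hat, Vhat, h, deg=1):
    """Local polynomial fit at x0: returns (m_hat, var_hat).
    x: bin values; yhat: survey estimates of the bin means; p_hat: estimated
    bin proportions; Vhat: estimated covariance matrix of yhat."""
    u = x - x0
    K = np.exp(-0.5 * (u / h) ** 2)             # Gaussian kernel (a choice)
    W = np.diag(p_hat * K)                      # weights p_hat_i * K((x_i - x0)/h)
    X = np.vander(u, deg + 1, increasing=True)  # rows (1, u_i, ..., u_i^deg)
    A = np.linalg.solve(X.T @ W @ X, X.T @ W)   # (X'WX)^{-1} X'W
    m_hat = (A @ yhat)[0]                       # e1'(X'WX)^{-1}X'W yhat, as in (10.9)
    var_hat = (A @ Vhat @ A.T)[0, 0]            # sandwich form, as in (10.10)
    return m_hat, var_hat

# Toy check on an exactly linear trend: the local linear fit recovers it.
x = np.arange(18, 66, dtype=float)
yhat = 20 + 0.1 * x
m, v = local_poly(40.0, x, yhat, np.full(x.size, 1 / x.size),
                  np.eye(x.size) * 0.01, h=7.0)
print(m, v > 0)
```

Pointwise confidence bands then follow as m̂(x) ± z·v̂ar{m̂(x)}^{1/2} under the normality assumption above.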

10.6. REGRESSION EXAMPLES FROM THE ONTARIO HEALTH SURVEY

In minimizing (10.8) to obtain local polynomial regression estimates, there are two possibilities for binning on x. The first is to bin to the accuracy of the data, so that ŷ_x is calculated at each distinct outcome of x. In other situations it may be practical to pursue a binning on x that is rougher than the accuracy of the data.
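The second option, coarse equal-width binning with weighted bin estimates, can be sketched as follows. The data, weights and bin width here are invented for illustration; they do not come from the Ontario Health Survey:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(15.0, 42.0, 1000).round(1)   # covariate recorded to 0.1
y = 0.9 * x + rng.normal(0.0, 2.0, x.size)
w = rng.uniform(1.0, 3.0, x.size)            # survey weights (illustrative)

# Bin x more coarsely than its recorded precision: width 0.5, midpoints as x_i.
edges = np.linspace(15.0, 42.0, 55)          # 54 bins of width 0.5
idx = np.clip(np.digitize(x, edges) - 1, 0, 53)
xi, yi = [], []
for i in range(54):
    m = idx == i
    if m.any():                              # keep nonempty bins only
        xi.append(edges[i] + 0.25)           # bin midpoint represents the bin
        yi.append(np.average(y[m], weights=w[m]))   # weighted bin estimate
xi, yi = np.array(xi), np.array(yi)
print(xi.size, yi.size)
```

The pairs (x_i, ŷ_i) produced this way are exactly the inputs the local polynomial smoother of Section 10.5 expects.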

When there are only a few distinct outcomes of x, binning on x is done in a natural way. For example, in investigating the relationship between BMI and age, the age of the respondent was reported only at integral values. The solid dots in Figure 10.8 are the survey estimates of the average BMI (ŷ_i) for women at each of the ages 18 through 65 (x_i). The solid and dotted lines show the plot of m̂(x) against x using bandwidths h = 7 and h = 14 respectively. It may be seen from Figure 10.8 that BMI increases approximately linearly with age until around age 50. The increase slows in the early fifties, peaks at age 55 or so, and then begins to decrease. On plotting the trend lines only for BMI and DBMI for females as shown in Figure 10.9, it may be seen that, on average, women desire to reduce their BMI at every age by approximately two units.

Figure 10.8 Age trend in BMI for females.

On examining Figure 10.8 it might be conjectured that the trend line follows a second-degree polynomial. Since the full set of data was available, a second-degree polynomial was fitted to the data using SUDAAN. The 95% confidence bands for the general trend line m(x) were calculated according to (10.9) and (10.10). Figure 10.10 shows the second-degree polynomial line superimposed on the confidence bands for m(x). Since the polynomial line falls at the upper limit of the band for women in their thirties and outside the band for women in their sixties, the plot indicates that a second-degree polynomial may not adequately describe the trend.

In other situations it is practical to construct bins on x wider than the precision of the data. To investigate the relationship between what women desire for their weight (DBMI, the y_i) and what women actually weigh (BMI, the x_i), the x_i were grouped. Since the data were very sparse for values of BMI below 15 and above 42, these data were removed from consideration. The remaining groups were 15.0 to 15.2, 15.3 to 15.4 and so on, with the value of x_i chosen as the middle value in each group. The binning was done in this way to obtain a wide range of equally spaced nonempty bins. For each group the survey estimate ŷ_i was calculated. The solid dots in Figure 10.11 show the survey estimates of women's DBMI for each grouped value of their respective BMI. The scatter at either end of the line reflects the sampling variability due to low sample sizes. The plot shows a slight desire to gain weight when the BMI is at 15. This desire is reversed by the time the BMI reaches 20, and the gap between the desire (DBMI) and reality (BMI) widens as BMI increases.

Figure 10.9 Age trends in BMI and DBMI for females.

Figure 10.10 Confidence bands for the age trend in BMI for females.

Figure 10.11 BMI trend in DBMI for females.


ACKNOWLEDGEMENTS

Some of the work in this chapter was done while the second author was an M.Sc. student at the University of Western Ontario. She is now a Ph.D. student at the University of Toronto. J. E. Stafford is now at the University of Toronto. This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada.


CHAPTER 11

Nonparametric Regression with Complex Survey Data

R. L. Chambers, A. H. Dorfman and M. Yu. Sverchkov

11.1. INTRODUCTION

The problem considered here is one familiar to analysts carrying out exploratory data analysis (EDA) of data obtained via a complex sample survey design. How does one adjust for the effects, if any, induced by the method of sampling used in the survey when applying EDA methods to these data? In particular, are adjustments to standard EDA methods necessary when the analyst's objective is identification of 'interesting' population (rather than sample) structures?

A variety of methods for adjusting for complex sample design when carrying out parametric inference have been suggested. See, for example, Skinner, Holt and Smith (1989) (abbreviated to SHS), Pfeffermann (1993) and Breckling et al. (1994). However, comparatively little work has been done to date on extending these ideas to EDA, where a parametric formulation of the problem is typically inappropriate.

We focus on a popular EDA technique, nonparametric regression or scatterplot smoothing. The literature contains a limited number of applications of this type of analysis to survey data, usually based on some form of sample weighting. The design-based theory set out in Chapter 10, with associated references, provides an introduction to this work. See also Chesher (1997).

The approach taken here is somewhat different. In particular, it is model based, building on the sample distribution concept discussed in Section 2.3. Here we develop this idea further, using it to motivate a number of methods for adjusting for the effect of a complex sample design when estimating a population regression function. The chapter itself consists of seven sections. In Section 11.2 we describe the sample distribution-based approach to inference, and the different types of survey data configurations for which we develop estimation methods. In Section 11.3 we set out a number of key identities that

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.


allow us to re-express the population regression function of interest in terms of related sample regression quantities. In Section 11.4 we use these identities to suggest appropriate smoothers for the sample data configurations described in Section 11.2. The performances of these smoothers are compared in a small simulation study reported in Section 11.5. In Section 11.6 we digress to explore diagnostics for informative sampling. Section 11.7 provides a conclusion with a discussion of some extensions to the theory.

Before moving on, it should be noted that the development in this chapter is an extension of Smith (1988) and Skinner (1994); see also Pfeffermann, Krieger and Rinott (1998) and Pfeffermann and Sverchkov (1999). The notation we employ is largely based on Skinner (1994).

To keep the discussion focused, we assume throughout that nonsampling error, from whatever source (e.g. lack of coverage, nonresponse, interviewer bias, measurement error, processing error), is not a problem as far as the survey data are concerned. We are only interested in the impact of the uncertainty due to the sampling process on nonparametric smoothing of these data. We also assume a basic familiarity with nonparametric regression concepts, comparable to the level of discussion in Härdle (1990).

11.2. SETTING THE SCENE

Since we are interested in scatterplot smoothing, we suppose that two (scalar) random variables Y and X can be defined for a target population U of size N and that values of these variables are observed for a sample taken from U. We are interested in estimating the smooth function g_U(x) equal to the expected value of Y given X = x over the target population U. Sample selection is assumed to be probability based, with π denoting the value of the sample inclusion probability for a generic population unit. We assume that the sample selection process can be (at least partially) characterized in terms of the values of a multivariate sample design variable Z (not necessarily scalar and not necessarily continuous). For example, Z can contain measures of size, stratum indicators and cluster indicators. In the case of ignorable sampling, π is completely determined by the values in Z. In this chapter, however, we generalize this to allow π to depend on the population values of Y, X and Z. The value π is therefore itself the realization of a random variable, which we also denote by π. Define the sample inclusion indicator I, which, for every unit in U, takes the value 1 if that unit is in the sample and is zero otherwise. The distribution of I for any particular population unit is completely specified by the value of π for that unit, and so

Pr(I = 1 | Y = y, X = x, Z = z, π = p) = Pr(I = 1 | π = p) = p.

11.2.1. A key assumption

In many cases it is possible to assume that the population values of the row vector (Y, X, Z) are jointly independent and identically distributed (iid).


Unfortunately, the same is usually not true for the sample values of these variables. However, since the methods developed in this chapter depend, to a greater or lesser extent, on some form of exchangeability for the sample data, we make the following assumption:

Random indexing: The population values of the random row vector (Y, X, Z, I, π) are iid.

That is, the values of Y, X and Z for any two distinct population units are generated independently, and, furthermore, the subsequent values of I and π for a particular population unit only depend on that unit's values of Y, X and Z. Note that in general this assumption does not hold, e.g. where the population values of Y and X are clustered. In any case the joint distribution of the bivariate random variable (I, π) will depend on the population values of Z (and sometimes on those of Y and X as well), so an iid assumption for (Y, X, Z, I, π) fails. However, in large populations the level of dependence between values of (I, π) for different population units will be small given their respective values of Y, X and Z, and so this assumption will be a reasonable one. A similar assumption underpins the parametric estimation methods described in Chapter 12, and is, to some extent, justified by asymptotics described in Pfeffermann, Krieger and Rinott (1998).

11.2.2. What are the data?

The words 'complex survey data' mask the huge variety of forms in which survey data appear. A basic problem with any form of survey data analysis therefore is identification of the relevant data for the analysis.

The method used to select the sample will have typically involved a combination of complex sample design procedures, including multi-way stratification, multi-stage clustering and unequal probability sampling. In general, the information available to the survey data analyst about the sampling method can vary considerably and hence we consider below a number of alternative data scenarios. In many cases we are secondary analysts, unconnected with the organization that actually carried out the survey, and therefore denied access to sample design information on confidentiality grounds. Even if we are primary analysts, however, it is often the case that this information is not easily accessible because of the time that has elapsed since the survey data were collected.

What is generally available, however, is the value of the sample weight associated with each sample unit. That is, the weight that is typically applied to the value of a sample variable before summing these values in order to 'unbiasedly' estimate the population total of the variable. For the sake of simplicity, we shall assume that this sample weight is either the inverse of the sample inclusion probability of the sample unit, or a close proxy. Our dataset therefore includes the sample values of these inclusion probabilities. This leads us to:
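The weighting convention just described (weight ≈ 1/π) can be illustrated with a small sketch; the population, size measure and Poisson design below are invented for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000
y = rng.gamma(2.0, 5.0, N)                 # population values of the variable
z = y + rng.normal(0.0, 5.0, N)            # size measure correlated with y

# Unequal-probability (Poisson) sampling with pi roughly proportional to z.
pi = np.clip(0.05 * z / z.mean(), 0.001, 1.0)
sampled = rng.random(N) < pi

# Inverse-probability (Horvitz-Thompson) estimate of the population total of y.
ht_total = np.sum(y[sampled] / pi[sampled])
naive_total = y[sampled].mean() * N        # ignores the design: biased here
print(ht_total / y.sum(), naive_total / y.sum())
```

The weighted estimate is close to the true total, while the unweighted expansion is pulled upward because large-y units are oversampled.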


Data scenario A: Sample values of Y, X and π are known. No other information is available.

This scenario is our base scenario. We envisage that it represents the minimum information set where methods of data analysis which allow for complex sampling are possible. The methods described in Chapter 10 and Chapter 12 are essentially designed for sample data of this type.

The next situation we consider is where some extra information about how the sampled units were selected is also available. For example, if a stratified design was used, we know the strata to which the different sample units belong. Following standard practice, we characterize this information in terms of the values of a vector-valued design covariate Z known for all the sample units. Thus, in the case where only stratum membership is known, Z corresponds to a set of stratum indicators. In general Z will consist of a mix of such indicators and continuous size measures. This leads us to:

Data scenario B: Sample values of Y, X, Z and π are known. No other information is available.

Note that π will typically be related to Z. However, this probability need not be completely determined by Z.

We now turn to the situation where we not only have access to sample data, but also have information about the nonsampled units in the population. The extent of this information can vary considerably. The simplest case is where we have population summary information on Z, say the population average z̄_U. Another type of summary information we may have relates to the sample inclusion probabilities π. We may know that the method of sampling used corresponds to a fixed size design, in which case the population average of π is n/N. Both these situations are combined in:

Data scenario C: Sample values of Y, X, Z and π are known. The population average z̄_U of Z is known, as is the fact that the population average of π is n/N.

Finally, we consider the situation where we have access to the values of both Z and π for all units in the population, e.g. from a population frame. This leads to:

Data scenario D: Sample values of Y, X, Z and π are known, as are the nonsample values of Z and π.

11.2.3. Informative sampling and ignorable sample designs

A key concern of this chapter is where the sampling process somehow confounds standard methods for inference about the population characteristics of interest. It is a fundamental (and often unspoken) 'given' that such standard methods assume that the distribution of the sample data and the corresponding population distribution are the same, so inferential statements about the former


apply to the latter. However, with data collected via complex sample designsthis situation no longer applies.

A sample design where the distribution of the sample values and population values for a variable Y differ is said to be informative about Y. Thus, if an unequal probability sample is taken, with inclusion probabilities proportional to a positive-valued size variable Z, then, provided Y and Z are positively correlated, the sample distribution of Y will be skewed to the right of its corresponding population distribution. That is, this type of unequal probability sampling is informative.

An extreme type of informative sampling discussed in Chapter 8 by Scott and Wild is case–control sampling. In its simplest form this is where the variable Y takes two values, 0 (a control) and 1 (a case), and sampling is such that all cases in the population (of which there are n ≪ N) are selected, with a corresponding random sample of n of the controls also selected. Obviously the population proportion of cases is n/N. However, the corresponding sample proportion (0.5) is very different.

In some cases an informative sampling design may become uninformative given additional information. For example, data collected via a stratified design with nonproportional allocation will typically be distributed differently from the corresponding population distribution. This difference is more marked the stronger the relationship between the variable(s) of interest and the stratum indicator variables. Within a stratum, however, there may be no difference between the population and sample data distributions, and so the overall difference between these distributions is completely explained by the difference in the sample and population distributions of the stratum indicator variable.

It is standard to characterize this type of situation by saying a sampling method is ignorable for inference about the population distribution of a variable Y given the population values of another variable Z if Y is independent of the sample indicator I given the population values of Z. Thus, if Z denotes the stratum indicator referred to in the previous paragraph, and if sampling is carried out at random within each stratum, then it is easy to see that I and Y are independent within a stratum and so this method of sampling is ignorable given Z.

In the rest of this chapter we explore methods for fitting the population regression function g_U(x) in situations where an informative sampling method has been used. In doing so, we consider both ignorable and nonignorable sampling situations.

11.3. RE-EXPRESSING THE REGRESSION FUNCTION

In this section we develop identities which allow us to re-express g_U(x) in terms of sample-based quantities as well as quantities which depend on Z. These identities underpin the estimation methods defined in Section 11.4.

We use f_U(w) to denote the value of the population density of a variable W at the value w, and f_s(w) to denote the corresponding value of the sample density


of this variable. This sample density is defined as the density of the conditional variable W | I = 1. See also Chapter 12. We write this (conditional) density as f_s(w) = f_U(w | I = 1). To reduce notational clutter, conditional densities f(w | V = v) will be denoted f(w | v). We also use E_U(W) to denote the expectation of W over the population (i.e. with respect to f_U) and E_s(W) to denote the expectation of W over the sample (i.e. with respect to f_s). Since development of expressions for the regression of Y on one or more variables will be our focus, we introduce special notation for this case. Thus, the population and sample regressions of Y on another (possibly vector-valued) variable W will be denoted g_U(w) = E_U(Y | W = w) = E_U(Y | w) and g_s(w) = E_s(Y | W = w) = E_s(Y | w) respectively below.

We now state two identities. Their proofs are straightforward given the definitions of I and π and the random indexing assumption of Section 11.2.1:

f_s(w | π) = f_U(w | π)    (11.1)

and

f_s(π) = π f_U(π) / E_U(π).    (11.2)

Consequently

f_s(w) = ∫ f_U(w | π) f_s(π) dπ = E_U(π | w) f_U(w) / E_U(π)    (11.3)

and so

E_s(W) = E_U(πW) / E_U(π).

Recollect that g_U(x) is the regression of Y on X at X = x in the population. Following an application of Bayes' theorem, one can then show

g_U(x) = E_s{π⁻¹ f_s(x | π) g_s(x, π)} / E_s{π⁻¹ f_s(x | π)}.    (11.4)

From the right hand side of (11.4) we see that g_U(x) can be expressed in terms of the ratio of two sample-based unconditional expectations. As we see later, these quantities can be estimated from the sample data, and a plug-in estimate of g_U(x) obtained.
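These identities can be checked informally by simulation. For instance, the identity E_s(W) = E_U(πW)/E_U(π) implies E_U(Y) = E_s(π⁻¹Y)/E_s(π⁻¹), i.e. inverse-probability weighting undoes the informativeness of the design. A minimal Monte Carlo sketch under an invented informative design:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000
y = rng.exponential(1.0, N)                 # population values of Y

# Informative design: inclusion probability increases with y.
pi = np.clip(0.02 * (0.5 + y), 0.0, 1.0)
s = rng.random(N) < pi
y_s, pi_s = y[s], pi[s]

pop_mean = y.mean()                                  # E_U(Y), about 1
naive = y_s.mean()                                   # biased upward under this design
adjusted = np.sum(y_s / pi_s) / np.sum(1.0 / pi_s)   # E_s(pi^-1 Y) / E_s(pi^-1)
print(round(pop_mean, 2), round(naive, 2), round(adjusted, 2))
```

The unweighted sample mean overstates E_U(Y), while the weighted ratio recovers it, which is the content of the identities above in their simplest form.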

11.3.1. Incorporating a covariate

So far, no attempt has been made to incorporate information from the design covariate Z. However, since the development leading to (11.4) holds for arbitrary X, and in particular when X and Z are amalgamated, and since g_U(x) = E_U(g_U(x, Z) | x), we can apply (11.4) twice to obtain

g_U(x) = E_s{π⁻¹ f_s(x | π) E_s(g_U(x, Z) | x, π)} / E_s{π⁻¹ f_s(x | π)}    (11.5a)

where

g_U(x, z) = E_s{π⁻¹ f_s(x, z | π) g_s(x, z, π)} / E_s{π⁻¹ f_s(x, z | π)}.    (11.5b)

An important special case is where the method of sampling is ignorable given Z; that is, the random variables Y and X are independent of the sample indicator I (and hence π) given Z. This implies that g_U(x, z) = g_s(x, z) and hence

g_U(x) = E_s{π⁻¹ f_s(x | π) E_s(g_s(x, Z) | x, π)} / E_s{π⁻¹ f_s(x | π)}.    (11.6)

Under ignorability given Z, it can be seen that E_s(g_s(x, Z) | x, π) = g_s(x, π), and hence (11.6) reduces to (11.4). Further simplification of (11.4) using this ignorability then leads to

g_U(x) = E_s{π⁻¹ f_s(x | Z) g_s(x, Z)} / E_s{π⁻¹ f_s(x | Z)},    (11.7)

which can be compared to (11.4).

11.3.2. Incorporating population information

The identities (11.4), (11.5) and (11.7) all express g_U(x) in terms of sample moments. However, there are situations where we have access to population information, typically about Z and π. In such cases we can weave this information into estimation of g_U(x) by expressing this function in terms of estimable population and sample moments.

To start, note that

f_U(x) = E_U(π) E_s{π⁻¹ f_s(x | π)}

and so we can rewrite (11.4) as

g_U(x) = E_U(π) E_s{π⁻¹ f_s(x | π) g_s(x, π)} / f_U(x).    (11.8)

The usefulness of this reexpression of (11.4) depends on whether the ratio of population moments on the right hand side of (11.8) can be evaluated from the available data. For example, suppose all we know is that E_U(π) = n/N, and that f_s(x | π) = f_s(x). Here n is the sample size. Then (11.8) reduces to

g_U(x) = (n/N) E_s{π⁻¹ g_s(x, π)}.

Similarly, when population information on both π and Z is available, we can replace (11.5) by

g_U(x) = E_U(π) E_s{π⁻¹ f_s(x | π) E_s(g_U(x, Z) | x, π)} / f_U(x)    (11.9a)

where

g_U(x, z) = E_U(π) E_s{π⁻¹ f_s(x, z | π) g_s(x, z, π)} / f_U(x, z).    (11.9b)


The expressions above are rather complicated. Simplification does occur, however, when the sampling method is ignorable given Z. As noted earlier, in this case g_U(x, z) = g_s(x, z), so g_U(x) = E_U(g_s(x, Z) | x). However, since f_U(x | z) = f_s(x | z) it immediately follows that

g_U(x) = E_U{f_s(x | Z) g_s(x, Z)} / E_U{f_s(x | Z)}.    (11.10)

A method of sampling where f_U(y | x) = f_s(y | x), and so g_U(x) = g_s(x), is noninformative. Observe that ignorability given Z is not the same as being noninformative, since it does not generally lead to g_U(x) = g_s(x). For this we also require that the population and sample distributions of Z are the same, i.e. f_U(z) = f_s(z).

We now combine these results on g_U(x) obtained in the previous section with the data scenarios earlier described to develop estimators that capitalize on the extent of the survey data that are available.

11.4. DESIGN-ADJUSTED SMOOTHING

11.4.1. Plug-in methods based on sample data only

The basis of the plug-in approach is simple. We replace sample-based quantities in an appropriately chosen representation of g_U(x) by corresponding sample estimates. Effectively this is a method of moments estimation of g_U(x). Thus, in scenario A in Section 11.2.2 we only have sample data on Y, X and π. The identity (11.4) seems most appropriate here, since it depends only on the sample values of Y, X and π. Our plug-in estimator of g_U(x) is

ĝ_U(x) = Σ_{t∈s} π_t⁻¹ f̂_s(x | π_t) ĝ_s(x, π_t) / Σ_{t∈s} π_t⁻¹ f̂_s(x | π_t)    (11.11)

where f̂_s(x | π) denotes the value at x of a nonparametric estimate of the conditional density of the sample X-values given π, and ĝ_s(x, π) denotes the value at (x, π) of a nonparametric smooth of the sample Y-values against the sample X- and π-values. Both these nonparametric estimates can be computed using standard kernel-based methods; see Silverman (1986) and Härdle (1990).
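A compact sketch of this recipe using Gaussian product kernels is given below. The bandwidths and the toy check are arbitrary; the check exploits the fact that with constant π the estimator collapses to an ordinary kernel smooth of Y on X:

```python
import numpy as np

def nw_smooth(xq, pq, xs, ps, ys, hx, hp):
    """Nadaraya-Watson smooth of ys against (xs, ps), evaluated at (xq, pq)."""
    k = (np.exp(-0.5 * ((xq - xs) / hx) ** 2)
         * np.exp(-0.5 * ((pq - ps) / hp) ** 2))
    return np.sum(k * ys) / np.sum(k)

def f_cond(xq, pq, xs, ps, hx, hp):
    """Kernel estimate of the conditional density of X at xq given pi = pq."""
    kp = np.exp(-0.5 * ((pq - ps) / hp) ** 2)
    kx = np.exp(-0.5 * ((xq - xs) / hx) ** 2) / (hx * np.sqrt(2 * np.pi))
    return np.sum(kp * kx) / np.sum(kp)

def g_plugin(xq, xs, ps, ys, hx, hp):
    """Plug-in estimator in the spirit of (11.11): a pi-weighted average of
    smooths, weighted by the estimated conditional density of X given pi."""
    num = den = 0.0
    for t in range(len(xs)):
        wt = f_cond(xq, ps[t], xs, ps, hx, hp) / ps[t]
        num += wt * nw_smooth(xq, ps[t], xs, ps, ys, hx, hp)
        den += wt
    return num / den

rng = np.random.default_rng(4)
n = 300
xs = rng.uniform(0.0, 1.0, n)
ys = 2.0 * xs + rng.normal(0.0, 0.05, n)
ps = np.full(n, 0.1)    # constant pi: reduces to an ordinary smooth of y on x
print(round(g_plugin(0.5, xs, ps, ys, 0.1, 0.05), 2))   # close to 1.0 here
```

Under an informative design the π_t vary, and the f̂_s(x | π_t)/π_t weights are what shift the smooth from the sample regression towards the population one.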

Under scenario B we have extra sample information, consisting of the sample values of Z. If these values explain a substantial part of the variability in π, then it is reasonable to assume that the sampling method is ignorable given Z, and representation (11.7) applies. Our plug-in estimator of g_U(x) is consequently

ĝ_U(x) = Σ_{t∈s} π_t⁻¹ f̂_s(x | z_t) ĝ_s(x, z_t) / Σ_{t∈s} π_t⁻¹ f̂_s(x | z_t).    (11.12)

If the information in Z is not sufficient to allow one to assume ignorability, then one can fall back on the two-level representation (11.5). That is, one first computes an estimate of the population regression of Y on X and Z,

ĝ_U(x, z) = Σ_{t∈s} π_t⁻¹ f̂_s(x, z | π_t) ĝ_s(x, z, π_t) / Σ_{t∈s} π_t⁻¹ f̂_s(x, z | π_t),    (11.13a)


and then smooths this estimate further (as a function of Z) against X and π to obtain (11.13b), where the smooth in its numerator denotes the value at (x, π) of a sample smooth of the values ĝU(x, zt) against the sample X- and π-values.

11.4.2. Examples

The precise form and properties of these estimators will depend on the nature of the relationship between Y, X, Z and π. To illustrate, we consider two situations, corresponding to different sample designs.

Stratified sampling on Z

We assume a scenario B situation where Z is a mix of stratum indicators Z1 and auxiliary covariates Z2. We further suppose that sampling is ignorable within a stratum, so (11.12) applies. Let h index the overall stratification, with sh denoting the sample units in stratum h. Then (11.12) leads to the estimator (11.14),

where f̂sh denotes a nonparametric density estimate based on the sample data from stratum h. In some circumstances, however, we will be unsure whether it is reasonable to assume ignorability given Z. For example, it could be the case that π is actually a function of Z = (Z1, Z2) and an unobserved third variable Z3 that is correlated with Y and X. Here the two-stage estimator (11.13) is appropriate, leading to (11.15a)

and hence

where p̂h(π) denotes an estimate of the probability that a sample unit with inclusion probability π is in stratum h, and the outer smooth denotes the value at (x, π) of a nonparametric smooth of the sample ĝz1(x, z2)-values defined by (11.15a) against the sample (X, π)-values. Calculation of (11.15) requires 'smoothing within smoothing' and so will be computer intensive. A further complication is that the sample ĝz1(x, z2) values smoothed in (11.15b) will typically be discontinuous between strata, so that standard methods of smoothing may be inappropriate.
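A sketch of the stratified plug-in idea may help fix notation: a per-stratum density estimate and smooth, weighted by the inverse within-stratum sampling fraction Nh/nh. The data layout and names below are hypothetical, and Gaussian kernels with unit bandwidth are used purely for brevity.

```python
import math

def gauss(u):
    # standard normal kernel
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def stratified_plugin(x, strata):
    # strata: list of dicts with keys 'N' (stratum size) and sample data 'xs', 'zs', 'ys'
    # ratio-of-sums form: per-stratum density and smooth, each term weighted by N_h/n_h
    num = den = 0.0
    for st in strata:
        n_h = len(st['xs'])
        w_h = st['N'] / n_h        # inverse within-stratum inclusion probability
        for zt in st['zs']:
            # fhat_sh(x | z_t): conditional density estimate within stratum h
            joint = sum(gauss(x - xu) * gauss(zt - zu)
                        for xu, zu in zip(st['xs'], st['zs']))
            marg = sum(gauss(zt - zu) for zu in st['zs'])
            f = joint / marg
            # ghat_sh(x, z_t): within-stratum smooth of y against (x, z)
            ww = [gauss(x - xu) * gauss(zt - zu)
                  for xu, zu in zip(st['xs'], st['zs'])]
            g = sum(a * b for a, b in zip(ww, st['ys'])) / sum(ww)
            num += w_h * f * g
            den += w_h * f
    return num / den
```

As with the unstratified version, a constant Y is reproduced exactly regardless of the stratum weights.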


ĝU(x, z) = [Σ_{t∈s} πt⁻¹ f̂s(x, z | πt) ĝs(x, z, πt)] / [Σ_{t∈s} πt⁻¹ f̂s(x, z | πt)]    (11.13a)

ĝU(x) = [Σ_{t∈s} πt⁻¹ f̂s(x | πt) g̃{ĝU(x, ·); x, πt}] / [Σ_{t∈s} πt⁻¹ f̂s(x | πt)]    (11.13b)

ĝU(x) = [Σ_h (Nh/nh) Σ_{t∈sh} f̂sh(x | z2t) ĝsh(x, z2t)] / [Σ_h (Nh/nh) Σ_{t∈sh} f̂sh(x | z2t)]    (11.14)

ĝz1(x, z2) = [Σ_{t∈s} p̂h(πt) f̂sh(x, z2 | πt) ĝs(x, z2, πt)] / [Σ_{t∈s} p̂h(πt) f̂sh(x, z2 | πt)]    (11.15a)

ĝU(x) = [Σ_{t∈s} πt⁻¹ f̂s(x | πt) g̃{ĝz1(x, z2); x, πt}] / [Σ_{t∈s} πt⁻¹ f̂s(x | πt)]    (11.15b)


Array Sampling on X

Suppose the random variable X takes n distinct values {xt; t = 1, 2, ..., n} on the population U. Suppose furthermore that Z is univariate, taking mt distinct (and strictly positive) values {zjt; j = 1, 2, ..., mt} when X = xt, and that we have Y = X + Z. The population values of Y and Z so formed can thus be thought of as defining an array, with each row corresponding to a distinct value of X. Finally suppose that the sampling method chooses one population unit for each value xt of X (so the overall sample size is n) with probability

Given this set-up, inspection of the sample data (which includes the values of π) allows one to immediately observe that gs(x, z) = x + z and to calculate the

realized value of πst as zst / Σj zjt, where zst and πst denote the sample values of Z and π corresponding to X = xt. Furthermore, the method of sampling is ignorable

given Z, so gU(x, z) = gs(x, z) and hence gU(x) = x + μZ, where μZ = EU(Z). If the total population size N were known, an obvious (and efficient) estimator of μZ would be

and so gU(x) could be estimated with considerable precision. However, we do not know N and so are in scenario B. The above method of sampling ensures that every distinct value of X in the sample is observed once and only once. Using (11.12), our estimator of gU(x) then becomes

This is an approximately unbiased estimator of gU(x). To see this we note that, by construction, each value of πst represents an independent realization from a distribution defined on the values {πjt} with probabilities {πjt}, and so EU(πst⁻¹) = mt. Hence the numerator and denominator sums in (11.16) have expectations Σt Σj zjt and N = Σt mt respectively, so the ratio is approximately μZ and the estimator is approximately unbiased for gU(x) = x + μZ.

11.4.3. Plug-in methods which use population information

We now turn to the data scenarios where population information is available. To start, consider scenario C. This corresponds to having additional summary

160 NONPARAMETRIC REGRESSION WITH COMPLEX SURVEY DATA

πjt = zjt / Σj zjt

μ̂Z = N⁻¹ Σ_{t∈s} πst⁻¹ zst

ĝU(x) = x + [Σ_{t∈s} πst⁻¹ zst] / [Σ_{t∈s} πst⁻¹]    (11.16)

EU{ĝU(x)} ≈ x + [Σt Σj zjt] / [Σt mt] = x + μZ = gU(x)
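The claim that EU(πst⁻¹) = mt is easy to check numerically: draw one unit per row with probability proportional to z and average the inverse inclusion probabilities over many draws. A small self-contained sketch (the function names are hypothetical):

```python
import random

def draw_unit(zs, rng):
    # choose one unit from a row with probability proportional to z (the array design)
    total = sum(zs)
    r = rng.random() * total
    acc = 0.0
    for j, z in enumerate(zs):
        acc += z
        if r <= acc:
            return j
    return len(zs) - 1

def mean_inverse_pi(zs, reps=20000, seed=1):
    # average of 1/pi over repeated draws; should approach m_t = len(zs)
    rng = random.Random(seed)
    total = sum(zs)
    acc = 0.0
    for _ in range(reps):
        j = draw_unit(zs, rng)
        acc += total / zs[j]        # 1/pi_jt = total / z_jt
    return acc / reps
```

The analytic identity is immediate, since E(πst⁻¹) = Σj πjt · πjt⁻¹ = mt; the simulation simply confirms it.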


population information, typically population average values, available for Z and π. More formally, we know the value of the population size N and one or both of the values of the population averages z̄U and π̄U.

How this information can generally be used to improve upon direct sample-based estimation of gU(x) is not entirely clear. The fact that this information can be useful, however, is evident from the array sampling example described in the previous section. There we see that, given the sample data, this function can be estimated very precisely once either the population mean of Z is known or, if the method of sampling is known, we know the population size N. This represents a considerable improvement over the estimator (11.16), which is only approximately unbiased for this value.

However, it is not always the case that such dramatic improvement is possible. For example, suppose that in the array sampling situation we are not told (i) the sample values of Z and (ii) that sampling is proportional to Z within each row of the array. The only population information is the value of N. This is precisely the situation described following (11.8), and so we could use the estimator (11.17),

where ĝs(x, π) is the estimated sample regression of Y on X and π. It is not immediately clear why (11.17) should in general represent an improvement over (11.16). Note that the regression function ĝs(x, π) in (11.17) is not 'exact' (unlike gs(x, z) = x + z) and so (11.17) will include an error arising from this approximation.

Finally, we consider scenario D. Here we know the population values of Z and π. In this case we can use the two-level representation (11.9) to define a plug-in estimator of gU(x) if we have reason to believe that sampling is not ignorable given Z. However, as noted earlier, this approach seems overly complex. Equation (11.10) represents a more promising alternative provided that the sampling is ignorable given Z, as will often be the case for this type of scenario (detailed population information available). This leads to the estimator (11.18).

Clearly, under stratified sampling on Z, (11.18) takes the form (11.19),

which can be compared to (11.14). In contrast, knowing the population values of Z under array sampling on X means we know the values mt and z̄t for each t and so we can compute a precise estimate of gU(x) from the sample data.


z̄U = N⁻¹ Σ_{t∈U} zt,    π̄U = N⁻¹ Σ_{t∈U} πt

ĝU(x) = N⁻¹ Σ_{t∈s} πt⁻¹ ĝs(x, πt)    (11.17)

ĝU(x) = [Σ_{t∈U} f̂s(x | zt) ĝs(x, zt)] / [Σ_{t∈U} f̂s(x | zt)]    (11.18)

ĝU(x) = [Σ_h Σ_{t∈Uh} f̂sh(x | z2t) ĝsh(x, z2t)] / [Σ_h Σ_{t∈Uh} f̂sh(x | z2t)]    (11.19)


11.4.4. The estimating equation approach

The starting point for this approach is Equation (11.4), which can be equivalently written as (11.20),

providing Es(π⁻¹ | X = x) > 0. That is, gU(x) can always be represented as the solution of the equation (11.21).

Replacing the left hand side of (11.21) by a kernel-based estimate of the regression function value leads to the estimating equation (11.22), where hs(x) is the bandwidth,

which has the solution

This is a standard Nadaraya-Watson-type estimator of the sample regression of Y on X, but with kernel weights modified by multiplying them by the inverses of the sample inclusion probabilities.
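The π-weighted Nadaraya-Watson form can be written directly; here is a sketch with a Gaussian kernel (the kernel choice and the function name are assumptions, not the chapter's):

```python
import math

def nw_weighted(x, xs, ys, pis, h):
    # Nadaraya-Watson smooth with kernel weights multiplied by 1/pi_t
    w = [math.exp(-0.5 * ((x - xt) / h) ** 2) / pt for xt, pt in zip(xs, pis)]
    return sum(wt * yt for wt, yt in zip(w, ys)) / sum(w)
```

Because the inverse probabilities appear in both numerator and denominator, rescaling all the πt by a common constant leaves the estimate unchanged.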

The Nadaraya-Watson nonparametric regression estimator is known to be inefficient compared to local polynomial alternatives when the sample regression function is reasonably smooth (Fan, 1992). Since this will typically be the case, a popular alternative solution to (11.22) is one that parameterizes ĝU(x) as being locally linear in x. That is, we write (11.24)

and hence replace (11.22) by (11.25).

The parameters â(x) and b̂(x) are obtained by weighted local linear least squares estimation. That is, they are the solutions to (11.26).

Following arguments outlined in Jones (1991) it can be shown that, for either (11.23) or (11.24), a suitable bandwidth hs(x), in terms of minimizing the mean squared error Es{ĝU(x) − gU(x)}², must be of order n⁻¹/⁵.

One potential drawback with this approach is that there seems no straightforward way to incorporate population auxiliary information. These estimators


gU(x) = Es(π⁻¹Y | X = x) / Es(π⁻¹ | X = x)    (11.20)

Es{π⁻¹(Y − gU(x)) | X = x} = 0    (11.21)

Σ_{t∈s} K{(x − xt)/hs(x)} πt⁻¹ {yt − ĝU(x)} = 0    (11.22)

ĝU(x) = [Σ_{t∈s} K{(x − xt)/hs(x)} πt⁻¹ yt] / [Σ_{t∈s} K{(x − xt)/hs(x)} πt⁻¹]    (11.23)

ĝU(x) = â(x) + b̂(x) x    (11.24)

Σ_{t∈s} K{(x − xt)/hs(x)} πt⁻¹ {yt − â(x) − b̂(x) xt} = 0    (11.25)

Σ_{t∈s} K{(x − xt)/hs(x)} πt⁻¹ {yt − â(x) − b̂(x) xt} (1, xt)′ = (0, 0)′    (11.26)
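Solving the pair of weighted least squares equations for â(x) and b̂(x) amounts to a 2×2 linear solve at each evaluation point. A sketch (Gaussian kernel and names assumed); note that a local linear fit reproduces an exactly linear relationship whatever the weights, which is the efficiency property referred to above:

```python
import math

def local_linear(x, xs, ys, pis, h):
    # weighted local linear fit at x: solve the 2x2 normal equations for a(x), b(x)
    w = [math.exp(-0.5 * ((x - xt) / h) ** 2) / pt for xt, pt in zip(xs, pis)]
    s0 = sum(w)
    s1 = sum(wt * xt for wt, xt in zip(w, xs))
    s2 = sum(wt * xt * xt for wt, xt in zip(w, xs))
    t0 = sum(wt * yt for wt, yt in zip(w, ys))
    t1 = sum(wt * xt * yt for wt, xt, yt in zip(w, xs, ys))
    det = s0 * s2 - s1 * s1
    a = (s2 * t0 - s1 * t1) / det
    b = (s0 * t1 - s1 * t0) / det
    return a + b * x        # ghat_U(x) = a(x) + b(x) x
```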


are essentially scenario A estimators. One ad hoc solution to this is to replace the inverse sample inclusion probabilities in (11.22) and (11.26) by sample weights which reflect this information (e.g. weights that are calibrated to known population totals). However, it is not obvious why this modification should lead to improved performance for either (11.23) or (11.24).

11.4.5. The bias calibration approach

Suppose ĝs(x) is a standard (i.e. unweighted) nonparametric estimate of the sample regression of Y on X at x. The theory outlined in Section 11.3 shows that this estimate will generally be biased for the value of the population regression of Y on X at x if the sample and population regression functions differ. One way around this problem therefore is to nonparametrically bias-calibrate this estimate. That is, we compute the sample residuals, rt = yt − ĝs(xt), and re-smooth these against X using a methodology that gives a consistent estimator of their population regression on X. This smooth is then added to ĝs(x). For example, if (11.11) is used to estimate this residual population regression, then our final estimate of gU(x) is (11.27),

where ĝsR(x, π) denotes the value at (x, π) of a sample smooth of the residuals rt against the sample X- and π-values. Other forms of (11.27) can be easily written down, based on alternative methods of consistently estimating the population regression of the residuals at x.
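The two-stage recipe can be sketched directly: an unweighted smooth, followed by a π-weighted re-smooth of its residuals, added back as a bias correction. This is a simplified sketch (Gaussian kernel, plain weighted Nadaraya-Watson in place of the density-based smooth of (11.11), hypothetical names):

```python
import math

def nw(x, xs, ys, h, weights=None):
    # Nadaraya-Watson smooth; optional extra multiplicative weights per point
    if weights is None:
        weights = [1.0] * len(xs)
    w = [math.exp(-0.5 * ((x - xt) / h) ** 2) * wt for xt, wt in zip(xs, weights)]
    return sum(a * b for a, b in zip(w, ys)) / sum(w)

def twiced(x, xs, ys, pis, h):
    # stage 1: standard (unweighted) smooth of y on x
    base = nw(x, xs, ys, h)
    # stage 2: residuals r_t = y_t - ghat_s(x_t), re-smoothed with pi-weights,
    # then added back as a bias correction
    resid = [yt - nw(xt, xs, ys, h) for xt, yt in zip(xs, ys)]
    correction = nw(x, xs, resid, h, weights=[1.0 / p for p in pis])
    return base + correction
```

If the first-stage smooth is already unbiased (e.g. a constant regression), the residuals vanish and the correction is zero, so the calibration costs nothing.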

This approach is closely connected to the concept of 'twicing' or double smoothing for bias correction (Tukey, 1977; Chambers, Dorfman and Wehrly, 1993).

11.5. SIMULATION RESULTS

Some appreciation for the behaviour of the different nonparametric regression methods described in the previous section can be obtained by simulation. We therefore simulated two types of populations, both of size N = 1000, and for each we considered two methods of sampling, one noninformative and the other informative. Both methods had n = 100, and were based on application of the procedure described in Rao, Hartley and Cochran (1962), with inclusion probabilities as defined below.

The first population simulated was defined by the equations:


ĝU(x) = ĝs(x) + [Σ_{t∈s} πt⁻¹ f̂s(x | πt) ĝsR(x, πt)] / [Σ_{t∈s} πt⁻¹ f̂s(x | πt)]    (11.27)

[Equations (11.28a)-(11.28c), defining the population type 1 generating model for Y, X and Z, are not legible in this copy; they imply gU(x) = EU(Y | X = x) = 1 + x + x².]


where the underlying random variates in (11.28) are independent standard normal. It is straightforward to see that gU(x) = EU(Y | X = x) = 1 + x + x². Figure 11.1 shows a realization of this population. The two sampling procedures we used for this population were based on inclusion probabilities that were approximately proportional to the population values of Z (PPZ: an informative sampling method) and X (PPX: a noninformative sampling method). These probabilities were defined by

and

respectively. The second population type we simulated reflected the heterogeneity that is typical of many real populations and was defined by the equations

where the error term is a standard normal variate, X is distributed as Gamma(2) independently of this error, and Z is a binary variable that takes the values 1 and 0 with


Figure 11.1 Scatterplot of Y vs. X for a population of type 1. Solid line is gU (x ).


PPZ:  πt = 100{zt − minU(z) + 0.1} / Σ_{u∈U}{zu − minU(z) + 0.1}

PPX:  πt = 100{xt − minU(x) + 0.1} / Σ_{u∈U}{xu − minU(x) + 0.1}

[Equations (11.29a)-(11.29c), defining the population type 2 generating model for Y, X and Z, are not fully legible in this copy; they imply gU(x) = EU(Y | X = x) = 30 + x + 0.0002x².]


probabilities 0.4 and 0.6 respectively, independently of both Y and X. Figure 11.2 shows a realization of this population. Again, straightforward calculation shows that for this case gU(x) = EU(Y | X = x) = 30 + x + 0.0002x².

As with the first population type (11.28), we used two sampling methods with this population, corresponding to inclusion probabilities proportional to Z (PPZ: an informative sampling method) and proportional to X (PPX: a noninformative sampling method). These probabilities were defined by

and

respectively. Note that PPZ above corresponds to a form of stratified sampling, in that all population units with Z = 1 have a sample inclusion probability that is three times greater than that of population units with Z = 0.
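The three-to-one ratio can be verified directly. The formula below, πt proportional to zt + 0.5 and scaled to sum to n, is a reconstruction of the garbled display above (an assumption on our part), but it reproduces the stated property:

```python
def ppz_probs(zs, n):
    # inclusion probabilities proportional to z + 0.5 (reconstructed form: the
    # shift gives both z = 0 and z = 1 units a positive probability)
    shifted = [z + 0.5 for z in zs]
    total = sum(shifted)
    return [n * s / total for s in shifted]
```

With Z = 1 the shifted value is 1.5 and with Z = 0 it is 0.5, so the two inclusion probabilities differ by exactly a factor of three, and the probabilities sum to the sample size n.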

Each population type was independently simulated 200 times and for each simulation two samples were independently drawn using PPZ and PPX inclusion probabilities. Data from these samples were then used to fit a selection of the nonparametric regression smooths described in the previous section. These methods are identified in Table 11.1. The naming convention used in this table


Figure 11.2 Scatterplot of Y vs. X for a population of type 2. Solid line is gU (x ).


PPZ:  πt = 100(zt + 0.5) / Σ_{u∈U}(zu + 0.5)

PPX:  πt = 100 xt / Σ_{u∈U} xu


Table 11.1 Nonparametric smoothing methods evaluated in the simulation study.

Name Estimator type Definition

M(πs)             Scenario A plug-in              (11.11)
M(Zs, πs)         Scenario B plug-in              (11.12)
M(Zstrs, πstrs)   Stratified scenario B plug-in   (11.14)
M(Z)              Scenario D plug-in              (11.18)
M(Zstr)           Stratified scenario D plug-in   (11.19)
Elin(πs)          Weighted local linear smooth    (11.24)
Elin + Elin(πs)   Unweighted + weighted smooth    (11.27)b
Elin              Unweighted local linear smooth  (11.24)a
Elin + Elin       Unweighted + unweighted smooth  (11.27)b

a Based on (11.25) with πt ≡ 1.
b These are bias-calibrated or 'twiced' methods, representing two variations on the form defined by (11.27). In each case the first component of the method's name is the name of the initial (potentially biased) smooth and the second component (after '+') is the name of the second smooth that is applied to the residuals from the first. Note that the 'twiced' standard smooth Elin + Elin was only investigated in the case of PPZ (i.e. informative) sampling.

denotes a plug-in (moment) estimator by 'M' and one based on solution of an estimating equation by 'E'. The amount of sample information required in order to compute an estimate is denoted by the symbols that appear within the parentheses associated with its name. Thus, M(πs) denotes the plug-in estimator defined by (11.11) that requires one to know the sample values of π and nothing more (the sample values of Y and X are assumed to be always available). In contrast, M(Z) is the plug-in estimator defined by (11.18) that requires knowledge of the population values of Z before it can be computed. Note the twiced estimator Elin + Elin. This uses an unweighted locally linear smooth at each stage. In contrast, the twiced estimator Elin + Elin(πs) uses a weighted local linear smoother at the second stage. We included Elin + Elin in our study for the PPZ sampling scenarios only, to see how much bias from informative sampling would be soaked up just by reapplication of the smoother. We also investigated the performance of the Nadaraya-Watson weighted smooth in our simulations. However, since its performance was uniformly worse than that of the locally linear weighted smooth Elin(πs), we have not included it in the results we discuss below.

Each smooth was evaluated at the 5th, 6th, ..., 95th percentiles of the population distribution of X and two measures of goodness of fit were calculated. These were the mean error, defined by the average difference between the smooth and the actual values of gU(x) at these percentiles, and the root mean squared error (RMSE), defined by the square root of the average of the squares of these differences. These summary measures were then averaged over the 200 simulations. Tables 11.2 to 11.5 show these average values.


Page 190: Analysis of Survey Data (Wiley Series in Survey Methodology) · PDF fileEmail (for orders and ... List of Contributors xviii Chapter 1 Introduction 1 R. L. Chambers and C. J. Skinner

Table 11.2 Simulation results for population type 1 (11.28) under PPZ sampling.

Method

Bandwidth coefficient

0.5 1 2 3 4 5 6 7

Average mean error

M(πs)            0.20  0.09  0.31  0.35  0.17  0.09  0.29  0.39
M(Zs, πs)        1.51  1.79  2.09  2.23  2.33  2.39  2.43  2.44
M(Z)             0.22  0.02  0.36  0.38  0.18  0.08  0.26  0.35
Elin(πs)         0.28  0.24  0.40  0.64  0.84  0.93  0.95  0.91
Elin + Elin(πs)  0.27  0.17  0.13  0.17  0.25  0.36  0.45  0.52
Elin             1.85  1.91  2.13  2.38  2.56  2.64  2.64  2.59
Elin + Elin      1.82  1.83  1.86  1.91  1.99  2.09  2.18  2.24

Average root mean squared error

M(πs)            3.80   2.48  2.21  3.54  4.81  5.66  6.13  6.36
M(Zs, πs)        14.08  5.03  2.93  2.65  2.65  2.71  2.77  2.83
M(Z)             3.36   2.28  2.16  3.54  4.79  5.60  6.03  6.24
Elin(πs)         2.50   1.94  1.48  1.40  1.43  1.48  1.54  1.61
Elin + Elin(πs)  2.95   2.27  1.67  1.44  1.33  1.26  1.25  1.29
Elin             2.93   2.57  2.45  2.60  2.74  2.84  2.88  2.90
Elin + Elin      3.30   2.77  2.36  2.26  2.67  2.32  2.40  2.48

Table 11.3 Simulation results for population type 1 (11.28) under PPX sampling.

Method

Bandwidth coefficient

0.5 1 2 3 4 5 6 7

Average mean error

M(πs)            0.01  0.04  0.05  2.25  0.05  0.10  0.18  0.25
M(Zs, πs)        0.17  0.29  0.31  0.44  0.53  0.57  0.58  0.56
M(Z)             1.13  1.19  1.24  0.95  0.52  0.12  0.14  0.26
Elin(πs)         0.09  0.14  0.36  0.62  0.82  0.92  0.93  0.89
Elin + Elin(πs)  0.05  0.06  0.11  0.18  0.26  0.35  0.43  0.50
Elin             0.09  0.14  0.35  0.58  0.73  0.78  0.75  0.68

Average root mean squared error

M(πs)            3.79   2.93  1.92  4.34  1.41  1.37  1.44  1.55
M(Zs, πs)        12.71  5.47  2.26  1.69  1.60  1.61  1.66  1.74
M(Z)             4.05   2.49  2.59  3.81  4.98  5.77  6.21  6.42
Elin(πs)         2.46   1.92  1.47  1.38  1.42  1.49  1.57  1.64
Elin + Elin(πs)  2.96   2.29  1.70  1.45  1.33  1.27  1.27  1.32
Elin             2.46   1.93  1.48  1.37  1.39  1.45  1.53  1.63



Table 11.4 Simulation results for population type 2 (11.29) under PPZ sampling.

Method

Bandwidth coefficient

0.5 1 2 3 4 5 6 7

Average mean error

M(πs)            1.25  1.01  1.05  1.17  1.35  1.45  1.48   1.52
M(Zs, πs)        7.90  9.49  8.97  9.05  9.40  9.74  10.11  10.46
M(Zstrs, πstrs)  1.72  1.11  1.10  1.28  1.38  1.42  1.39   1.36
M(Z)             1.12  0.95  0.99  1.11  1.29  1.39  1.42   1.45
M(Zstr)          1.67  1.06  1.06  1.25  1.35  1.39  1.37   1.33
Elin(πs)         1.52  1.29  1.44  1.58  1.70  1.80  1.86   1.93
Elin + Elin(πs)  5.28  5.57  5.58  5.99  1.07  6.02  5.93   5.81
Elin             8.06  8.28  8.90  9.41  9.75  9.99  10.15  10.30
Elin + Elin      0.20  0.21  0.30  0.41  0.43  0.51  0.64   0.78

Average root mean squared error

M(πs)            28.19  17.11  10.40  8.98   8.18   7.68   7.44   7.38
M(Zs, πs)        75.21  41.54  23.44  16.65  15.03  14.92  15.21  15.79
M(Zstrs, πstrs)  15.78  12.20  9.66   8.61   8.15   7.90   7.73   7.67
M(Z)             28.47  17.51  10.34  8.90   8.07   7.55   7.27   7.20
M(Zstr)          15.75  12.12  9.56   8.54   8.07   7.79   7.61   7.52
Elin(πs)         15.41  12.77  10.58  9.49   8.87   8.54   8.39   8.41
Elin + Elin(πs)  17.60  14.86  12.38  11.43  10.82  10.40  10.13  9.90
Elin             17.15  15.48  14.50  14.36  14.46  14.72  15.05  15.47
Elin + Elin      16.05  12.94  9.91   8.72   8.00   7.58   7.33   7.21

All kernel-based methods used an Epanechnikov kernel. In order to examine the impact of bandwidth choice on the different methods, results are presented for a range of bandwidths. These values are defined by a bandwidth coefficient, corresponding to the value of b in the bandwidth formula

Examination of the results shown in Tables 11.2-11.5 shows a number of features:

1. RMSE-optimal bandwidths differ considerably between estimation methods, population types and methods of sampling. In particular, twicing-based estimators tend to perform better at longer bandwidths, while plug-in methods seem to be more sensitive to bandwidth choice than methods based on estimating equations.

2. For population type 1 with PPZ (informative) sampling we see a substantial bias for the unweighted methods Elin and Elin + Elin, and consequently large RMSE values. Somewhat surprisingly, the plug-in method M(Zs, πs) also displays a substantial bias even though its RMSE values are


bandwidth = b × sd(Xs) × n⁻¹/⁵


Table 11.5 Simulation results for population type 2 (11.29) under PPX sampling.

Method

Bandwidth coefficient

0.5 1 2 3 4 5 6 7

Average mean error

M(πs)            0.09   0.07  0.03  0.09  0.11  0.09  0.13  0.19
M(Zs, πs)        16.62  0.58  0.50  0.51  0.51  0.59  0.59  0.54
M(Zstrs, πstrs)  0.36   0.47  0.72  0.80  0.82  0.75  0.60  0.34
M(Z)             0.09   0.46  0.66  0.71  0.77  0.70  0.50  0.28
M(Zstr)          0.31   0.42  0.66  0.74  0.76  0.71  0.56  0.29
Elin(πs)         0.27   0.49  0.83  1.06  1.19  1.27  1.32  1.37
Elin + Elin(πs)  0.43   0.37  0.32  0.18  0.02  0.08  0.18  0.31
Elin             0.27   0.47  0.61  0.71  0.73  0.65  0.48  0.30

Average root mean squared error

M(πs)            20.38  14.44  10.03  8.26   7.37  7.01  6.97  7.10
M(Zs, πs)        217.7  29.72  12.04  10.13  9.43  8.94  8.74  8.79
M(Zstrs, πstrs)  12.06  9.67   7.89   7.18   6.92  6.97  7.19  7.51
M(Z)             15.94  10.23  7.88   7.01   6.62  6.52  6.66  6.99
M(Zstr)          11.92  9.47   7.70   6.97   6.71  6.75  6.96  7.26
Elin(πs)         12.79  9.90   7.82   7.08   6.69  6.49  6.42  6.44
Elin + Elin(πs)  15.76  12.49  9.52   8.40   7.61  7.08  6.75  6.58
Elin             12.79  9.94   8.03   7.42   7.16  7.10  7.23  7.51

not excessive. Another surprise is the comparatively poor RMSE performance of the plug-in method M(Z) that incorporates population information. Clearly the bias-calibrated method Elin + Elin(πs) is the best overall performer in terms of RMSE, followed by Elin(πs).

3. For population type 1 with PPX (noninformative) sampling there is little to choose between any of the methods as far as bias is concerned. Again, we see that the plug-in method M(Z) that incorporates population information is rather unstable, and the overall best performer is the bias-calibrated method Elin + Elin(πs). Unsurprisingly, for this situation there is virtually no difference between Elin and Elin(πs).

4. For population type 2 and PPZ (informative) sampling there are further surprises. The plug-in method M(Zs, πs), the standard method Elin and the bias-calibrated method Elin + Elin(πs) all display bias. In the case of M(Zs, πs) this reflects behaviour already observed for population type 1. Similarly, it is not surprising that Elin is biased under informative sampling. However, the bias behaviour of Elin + Elin(πs) is hard to understand, especially when compared to the good performance of the unweighted twiced method Elin + Elin. In contrast, for this population all the plug-in methods (with the exception of M(Zs, πs)) work well. Finally,



we see that in this population longer bandwidths are definitely better, with the optimal bandwidth coefficients for all methods except Elin and M(Zs, πs) most likely greater than b = 7.

5. Finally, for population type 2 and PPX (noninformative) sampling, all methods (with the exception of M(Zs, πs)) have basically similar bias and RMSE performance. Although the bias performance of M(Zs, πs) is unremarkable, we see that in terms of RMSE, M(Zs, πs) is again a relatively poor performer. Given that the PPX sampling method is noninformative, it is a little surprising that the unweighted smoother Elin performs worse than the weighted smoother Elin(πs). In part this may be due to the heteroskedasticity inherent in population type 2 (see Figure 11.2), with the latter smoother giving less weight to the more variable sample values.

6. The preceding analysis has attempted to present an overview of the performance of the different methods across a variety of bandwidths. In practice, however, users may well adopt a default bandwidth that seems to work well in a variety of situations. In the case of local linear smoothing (the underlying smoothing method for all the methods for which we present results), this corresponds to using a bandwidth coefficient of b = 3. Consequently it is also of some interest to just look at the performances of the different methods at this bandwidth. Here we see a much clearer picture. For population type 1, Elin(πs) is the method of choice, while M(Zstr) is the method of choice for population type 2.

It is not possible to come up with a single recommendation for which design-adjusted smoothing method to use on the basis of these (very limited) simulation results. Certainly they indicate that the bias-calibrated method Elin + Elin(πs), preferably with a larger bandwidth than would be natural, seems an acceptable general purpose method of nonparametrically estimating a population regression function. However, it can be inefficient in cases where plug-in estimators like M(Zstrs, πstrs) and M(Zstr) can take advantage of nonlinearities in population structure caused by stratum shifts in the relationship between Y and X. In contrast, the plug-in method M(Zs, πs) that combines sample information on both π and Z seems too unstable to be seriously considered, while both the basic plug-in method M(πs) and the more complex population Z-based M(Z) are rather patchy in their performance: both are reasonable with population type 2, but are rather unstable with population type 1. In the latter case this seems to indicate that it is not always beneficial to include auxiliary information in estimation of population regression functions, even when the sampling method (like PPZ) is ignorable given this auxiliary information.

11.6. TO WEIGHT OR NOT TO WEIGHT? (WITH APOLOGIES TO SMITH, 1988)

The results obtained in the previous section indicate that some form of design adjustment is advisable when the sampling method is informative. However,



adopting a blanket rule to always use a design-adjusted method may lead to loss of efficiency when the sampling method is actually noninformative. On the other hand, it is not always the case that inappropriate design adjustment leads to efficiency loss; compare Elin and Elin(πs) in Table 11.5.

Is there a way of deciding whether one should use design-adjusted (e.g. weighting) methods? Equivalently, can one, on the basis of the sample data, check whether the sample is likely to have been drawn via an informative sampling method?

A definitive answer to this question is unavailable at present. However, we describe two easily implementable procedures that can provide some guidance.

To start, we observe that if a sampling method is noninformative, i.e. if fU(y|x) = fs(y|x), then design adjustment is unnecessary and we can estimate gU(x) by simply carrying out a standard smooth of the sample data. As was demonstrated in Pfeffermann, Krieger and Rinott (1998) and Pfeffermann and Sverchkov (1999), noninformativeness is equivalent to conditional independence of the sample inclusion indicator I and Y given X in the population, which in turn is equivalent to the identity

This identity holds if the sample values of π and Y are independent given X. Hence, one way of checking whether a sample design is noninformative is to test this conditional independence hypothesis using the sample data. If the sample size is very large, this can be accomplished by partitioning the sample distributions of π, Y and X, and performing chi-square tests of independence based on the categorized values of π and Y within each category of X. Unfortunately this approach depends on the overall sample size and the degree of categorization, and it did not perform well when we applied it to the sample data obtained in our simulation study.
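The categorization idea can be prototyped with a plain Pearson chi-square statistic on a two-way table of binned π against binned Y within one X category. This is a sketch; the binning scheme (and the choice of test) is left open:

```python
def chi_square_stat(table):
    # Pearson chi-square statistic for an r x c contingency table
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            exp = rows[i] * cols[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat
```

A table whose rows are exactly proportional (independence) gives a statistic of zero, while strong dependence between binned π and binned Y inflates it.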

Another approach, and one that reflects practice in this area, is to only use a design-adjusted method if it leads to significantly different results compared to a corresponding standard (e.g. unweighted) method. Thus, if we let ĝs(x) denote the standard estimate of gU(x), with ĝU(x) denoting the corresponding design-adjusted estimate, we can compute a jackknife estimate v̂J(x) of var{ĝs(x) − ĝU(x)} and hence calculate the standardized difference

We would then only use ĝU(x) if W(x) > 2. A multivariate version of this test, based on a Wald statistic version of (11.30), is easily defined. In this case we would be testing for a significant difference between these two estimates at k different x-values of interest. This would require calculation of a jackknife estimate of the variance-covariance matrix of the vector of differences between the ĝ-values for the standard method and those for the design-adjusted method at these k different x-values. The Wald statistic value could then be compared to, say, the 95th percentile of a chi-squared distribution with k degrees of


Es(π⁻¹ | Y = y, X = x) = Es(π⁻¹ | X = x)

W(x) = |ĝs(x) − ĝU(x)| / {v̂J(x)}¹ᐟ²    (11.30)


freedom. Table 11.6 shows how this approach performed with the sample data obtained in our simulations. We see that it works well at identifying the fact that the PPX schemes are noninformative. However, its ability to detect the informativeness of the PPZ sampling schemes is not as good, particularly in the case of population type 2. We hypothesize that this poor performance is due to the heteroskedasticity implicit in this population's generation mechanism.

The problem with the Wald statistic approach is that it does not test the hypothesis that is really of interest to us in this situation. This is the hypothesis that g_U(x) = g_s(x). In addition, nonrejection of the noninformative sampling 'hypothesis' may result purely because the variance of the design-adjusted estimator is large compared to the bias of the standard estimator.

A testing approach which focuses directly on whether g_U(x) = g_s(x) can be motivated by extending the work of Pfeffermann and Sverchkov (1999). In particular, from (11.21) we see that this equality is equivalent to

E_s[w_t{y_t − g_s(x_t)} | x_t] = 0.    (11.31)

The twicing approach of Section 11.4.5 estimates the left hand side of (11.31), replacing g_s(x) by the standard sample smooth ĝ_s(x). Consequently, testing g_U(x) = g_s(x) is equivalent to testing whether the twicing adjustment is significantly different from zero.

A smooth function can be approximated by a polynomial, so we can always write

E_s[w_t{y_t − g_s(x_t)} | x_t = x] = Σ_{j≥0} a_j x^j.    (11.32)

Table 11.6 Results for tests of informativeness. Proportion of samples where the null hypothesis of noninformativeness is rejected at the (approximate) 5% level of significance.

                          PPX sampling (noninformative)      PPZ sampling (informative)
Test method               Population type 1  Population type 2  Population type 1  Population type 2
Wald statistic test(a)    0.015              0.050              1.000              0.525
Correlation test(b)       0.010              0.000              0.995              0.910

(a) The Wald statistic test was carried out by setting the two estimators to Elin and Elin(s), both with bandwidth coefficient b = 3. The x-values where these two estimators were compared were the deciles of the sample distribution of X.
(b) The correlation test was carried out using Elin with bandwidth coefficient b = 3. A sample was rejected as being 'informative' if any one of the subhypotheses H0, H1 and H2 was rejected at the 5% level.


We can test H: g_U(x) = g_s(x) by testing whether the coefficients a_j of the regression model (11.32) are identically zero. This is equivalent to testing the sequence of subhypotheses

H0: Corr_s[w_t, y_t − g_s(x_t)] = 0
H1: Corr_s[w_t{y_t − g_s(x_t)}, x_t] = 0
H2: Corr_s[w_t{y_t − g_s(x_t)}, x_t²] = 0
...

In practice g_s(x) in these subhypotheses is replaced by ĝ_s(x). Assuming normality of all quantities, a suitable test statistic for each of these subhypotheses is then the corresponding t-value generated by the cor.test function within S-Plus (Statistical Sciences, 1990). We therefore propose that the hypothesis H be rejected at a '95% level' if any one of the absolute values of these t-statistics exceeds 2. Table 11.6 shows the results of applying this procedure to the sample data in our simulations, with the testing restricted to the subhypotheses H0, H1 and H2. Clearly the test performs rather well across all sampling methods and population types considered in our simulations. However, even in this case we see that there are still problems with identifying the informativeness of PPZ sampling for population type 2.
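A minimal sketch of this correlation-based test, assuming the residuals from the standard smooth and the sampling weights are already available. Here H0 correlates the weights with the residuals and Hj (j ≥ 1) correlates the weighted residuals with x**j; the fixed cutoff of 2 mirrors the text's approximate 95% level. The function names are ours.

```python
import numpy as np

def corr_t(u, v):
    """t-statistic for the null of zero correlation between u and v."""
    n = len(u)
    r = np.corrcoef(u, v)[0, 1]
    return r * np.sqrt((n - 2) / (1.0 - r ** 2))

def informativeness_test(w, resid, x, max_power=2):
    """Judge the sampling informative if any |t| exceeds 2 across the
    subhypotheses H0 (weights vs. residuals) and Hj (weighted residuals
    vs. x**j for j = 1, ..., max_power)."""
    pairs = [(w, resid)] + [(w * resid, x ** j) for j in range(1, max_power + 1)]
    return any(abs(corr_t(u, v)) > 2 for u, v in pairs)
```

When the weights are a monotone function of the residuals (a strongly informative scheme), the H0 correlation alone triggers rejection.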

11.7. DISCUSSION

The preceding development has outlined a framework for incorporating information about sample design, sample selection and auxiliary population information into nonparametric regression applied to sample survey data. Our results provide some guidance on choosing between the different estimators that are suggested by this framework.

An important conclusion that can be drawn from our simulations is that it pays to use some form of design-adjusted method when the sample data have been drawn using an informative sampling scheme, and there is little loss of efficiency when the sampling scheme is noninformative. However, we see that there is no single design-adjusted method that stands out. Local linear smoothing incorporating inverse selection probability weights seems to be an approach that provides reasonable efficiency at a wide range of bandwidths. But locally linear plug-in methods that capitalize on 'stratum structure' in the data can improve considerably on such a weighting approach. For nonparametric regression, as with any other type of analysis of complex survey data, it helps to find out as much as possible about the population structure the sample is supposed to represent. Unfortunately there appears to be no general advantage from incorporating information on an auxiliary variable into estimation, and in fact there can be a considerable loss of efficiency, probably due to the use of higher dimensional smoothers with such data.


An important ancillary question we have also considered is identifying situations where in fact the sampling method is informative. Here an examination of the correlations between weighted residuals from a standard (unweighted) nonparametric smooth and powers of the X variable is a promising diagnostic tool.

Our results indicate there is a need for much more research in this area. The most obvious need is the development of algorithms for choosing an optimal bandwidth (or bandwidths) to use with the different design-adjusted methods described in this chapter. In this context Breunig (1999, 2001) has investigated optimal bandwidth choice for nonparametric density estimation from stratified and clustered sample survey data, and it should be possible to apply these ideas to the nonparametric regression methods investigated here. The extension to clustered sample data is an obvious next step.


CHAPTER 12

Fitting Generalized Linear Models under Informative Sampling

Danny Pfeffermann and M. Yu. Sverchkov

12.1. INTRODUCTION

Survey data are frequently used for analytic inference on population models. Familiar examples include the computation of income elasticities from household surveys, the analysis of labor market dynamics from labor force surveys and studies of the relationships between disease incidence and risk factors from health surveys.

The sampling designs underlying sample surveys induce a set of weights for the sample data that reflect unequal selection probabilities, differential nonresponse or poststratification adjustments. When the weights are related to the values of the outcome variable, even after conditioning on the independent variables in the model, the sampling process becomes informative and the model holding for the sample data is different from the model holding in the population (Rubin, 1976; Little, 1982; Sugden and Smith, 1984). Failure to account for the effects of informative sampling may yield large biases and erroneous conclusions. The books edited by Kasprzyk et al. (1989) and by Skinner, Holt and Smith (SHS hereafter) (1989) contain a large number of examples illustrating the effects of ignoring informative sampling processes. See also the review articles by Pfeffermann (1993, 1996).

In fairly recent articles by Krieger and Pfeffermann (1997), Pfeffermann, Krieger and Rinott (1998) and Pfeffermann and Sverchkov (1999), the authors propose a new approach for inference on population models from complex survey data. The approach consists of approximating the parametric distribution of the sample data (or moments of that distribution) as a function of the population distribution (moments of this distribution) and the sampling weights, and basing the inference on that distribution. The (conditional) sample

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.


probability density function (pdf) f_s(y_t | x_t) of an outcome variable Y, corresponding to sample unit t, is defined as f(y_t | x_t; t ∈ s), where s denotes the sample and x_t = (x_t0, x_t1, ..., x_tk) represents the values of auxiliary variables X_0, ..., X_k (usually x_t0 = 1 for all t). Denoting the corresponding population pdf (before sampling) by f_U(y_t | x_t), an application of Bayes' theorem yields the relationship

f_s(y_t | x_t) = f(y_t | x_t; t ∈ s) = Pr(t ∈ s | y_t, x_t) f_U(y_t | x_t) / Pr(t ∈ s | x_t).    (12.1)

Note that unless Pr(t ∈ s | y_t, x_t) = Pr(t ∈ s | x_t) for all possible values y_t, the sample and population pdfs differ, in which case the sampling process is informative. Empirical results contained in Krieger and Pfeffermann (1997) and Pfeffermann and Sverchkov (1999), based on simulated and real datasets, illustrate the potentially better performance of regression estimators and tests of distribution functions derived from use of the sample distribution as compared to the use of standard inverse probability weighting. For a brief discussion of the latter method (with references) and other more recent approaches that address the problem of informative sampling, see the articles by Pfeffermann (1993, 1996).

The main purpose of the present chapter is to extend the proposed methodology to likelihood-based inference, focusing in particular on situations where the models generating the population values belong to the family of generalized linear models (GLM; McCullagh and Nelder, 1989). The GLM consists of three components:

1. The random component: independent observations y_t, each drawn from a distribution belonging to the exponential family with mean μ_t.

2. The systematic component: covariates x_0t, ..., x_kt that produce a linear predictor η_t = Σ_{j=0}^{k} β_j x_jt.

3. A link function η_t = h(μ_t) between the random and systematic components.

The unknown parameters consist of the vector β = (β_0, β_1, ..., β_k) and possibly additional parameters defining the other moments of the distribution of y_t. A trivial example of the GLM is the classical regression model with the distribution of y_t being normal, in which case the link function is the identity function. Another familiar example considered in the empirical study of this chapter and in Chapters 6, 7, and 8 is logistic regression, in which case the link function is h(μ_t) = log[μ_t/(1 − μ_t)]. See the book by McCullagh and Nelder (1989) for many other models in common use. The GLM is widely used for modeling discrete data and for situations where the expectations μ_t are nonlinear functions of the covariates.
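The three components can be written down directly; the sketch below shows the logistic case named in the text (the function names are ours, chosen for illustration):

```python
import numpy as np

def linear_predictor(beta, X):
    """Systematic component: eta_t = sum_j beta_j * x_tj (column 0 of X is all 1s)."""
    return X @ beta

def logit_link(mu):
    """Logistic link h(mu_t) = log[mu_t / (1 - mu_t)]."""
    return np.log(mu / (1.0 - mu))

def inverse_logit(eta):
    """Mean implied by the logistic link: mu_t = 1 / (1 + exp(-eta_t))."""
    return 1.0 / (1.0 + np.exp(-eta))
```

For the classical normal regression model the link is the identity, so the linear predictor itself gives the mean.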

In Section 12.2 we outline the main features of the sample distribution. Section 12.3 considers three plausible sets of estimating equations for fitting the GLM to the sample data that are based on the sample distribution and discusses their advantages over the use of the randomization (design-based) distribution. A simple test statistic for testing the informativeness of the sampling process is developed. Section 12.4 proposes appropriate variance


estimators. Section 12.5 contains simulation results used to assess the performance of the point estimators and the test statistic proposed in Sections 12.3 and 12.4. We close in Section 12.6 with a brief summary and possible extensions.

12.2. POPULATION AND SAMPLE DISTRIBUTIONS

12.2.1. Parametric distributions of sample data

Let U = {1, ..., N} define the finite population of size N. In what follows we consider single-stage sampling with selection probabilities π_t = Pr(t ∈ s), t = 1, ..., N. In practice, π_t may depend on the population values y_U = (y_1, ..., y_N) of the outcome variable, the values x_U = [x_1, ..., x_N] of the auxiliary variables and the values z_U = [z_1, ..., z_N] of design variables used for the sample selection but not included in the working model under consideration. We express this by writing π_t = g_t(y_U, x_U, z_U) for some function g_t.

The probabilities π_t are generally not the same as the probabilities Pr(t ∈ s | y_t, x_t) defining the sample distribution in (12.1), because the latter probabilities condition on (y_t, x_t) only. Nonetheless, by regarding π_t as a random variable, the following relationship holds:

Pr(t ∈ s | y_t, x_t) = E_U(π_t | y_t, x_t)    (12.2)

where E_U(.) defines the expectation operator under the corresponding population distribution. The relationship (12.2) follows by defining I_t to be the sample inclusion indicator and writing Pr(t ∈ s | y_t, x_t) = E(I_t | y_t, x_t) = E[E(I_t | y_U, x_U, z_U) | y_t, x_t] = E(π_t | y_t, x_t). Substituting (12.2) in (12.1) yields an alternative, and often more convenient, expression for the sample pdf,

f_s(y_t | x_t) = E_U(π_t | y_t, x_t) f_U(y_t | x_t) / E_U(π_t | x_t).    (12.3)

It follows from (12.3) that for a given population pdf, the marginal sample pdf is fully determined by the conditional expectation π̄(y_t, x_t) = E_U(π_t | y_t, x_t). The paper by Pfeffermann, Krieger and Rinott (1998, hereafter PKR) contains many examples of pairs [f_U(y, x), π̄(y, x)] for which the sample pdf is of the same family as the population pdf, although possibly with different parameters. In practice, the form of the expectations π̄(y_t, x_t) is often unknown, but it can generally be identified and estimated from the sample data (see below).

For independent population measurements PKR establish an asymptotic independence of the sample measurements with respect to the sample distribution, under commonly used sampling schemes for selection with unequal probabilities. The asymptotic independence assumes that the population size increases, holding the sample size fixed. Thus, the use of the sample distribution permits in principle the application of standard statistical procedures like likelihood-based inference and residual analysis to the sample data.

The representation (12.3) also enables the establishment of relationships between moments of the population and the sample distributions. Denote the expectation operators under the two distributions by E_U and E_s and let


w_t = 1/π_t define the sampling weights. Pfeffermann and Sverchkov (1999, hereafter PS) develop the following relationship for pairs of (vector) random variables (u_t, v_t):

E_U(u_t | v_t) = E_s(w_t u_t | v_t) / E_s(w_t | v_t).    (12.4)

For u_t = y_t, v_t = x_t, the relationship (12.4) can be utilized for semi-parametric estimation of moments of the population distribution in situations where the parametric form of this distribution is unknown. The term semi-parametric estimation refers in this context to the estimation of population moments (for example, regression relationships) from the corresponding sample moments defined by (12.4), using least squares, the method of moments, or other estimation procedures that do not require full specification of the underlying distribution. See PS for examples. Notice, also, that by defining u_t = π_t and v_t = (y_t, x_t) or v_t = x_t in (12.4), one obtains

E_U(π_t | y_t, x_t) = 1/E_s(w_t | y_t, x_t);  E_U(π_t | x_t) = 1/E_s(w_t | x_t).    (12.5)

Equation (12.5) shows how the conditional population expectations of π_t that define the sample distribution in (12.3) can be evaluated from the sample data. The relationship (12.4) can be used also for nonparametric estimation of regression models; see Chapter 11. The relationships between the population and the sample distributions and between the moments of the two distributions are exploited in subsequent sections.
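The relationships (12.4) and (12.5) are easy to check by simulation. The sketch below draws a population, applies Poisson sampling with π_t increasing in y_t (so the scheme is informative), and confirms that weighting as in (12.4) recovers the population mean while the naive sample mean does not; all specifics (population size, selection model) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population and an informative Poisson sampling scheme (pi_t depends on y_t).
N = 200_000
y = rng.normal(size=N)
pi = 1.0 / (1.0 + np.exp(-y))     # selection probability, increasing in y
s = rng.random(N) < pi            # Poisson sampling: independent inclusion draws
w = 1.0 / pi[s]                   # sampling weights w_t = 1 / pi_t, as in (12.5)

# (12.4) with u_t = y_t and no conditioning: E_U(y) = E_s(w y) / E_s(w).
pop_mean = y.mean()                        # ~0 by construction
weighted = np.sum(w * y[s]) / np.sum(w)    # close to pop_mean
naive = y[s].mean()                        # biased upward under this scheme
```

The weighted sample moment reproduces the population moment, while the unweighted mean inherits the selection bias.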

12.2.2. Distinction between the sample and the randomization distributions

The sample distribution defined by (12.3) is different from the randomization (design) distribution underlying classical survey sampling inference. The randomization distribution of a statistic is the distribution of the values for this statistic induced by all possible sample selections, with the finite population values held fixed. The sample distribution, on the other hand, accounts for the (superpopulation) distribution of the population values and the sample selection process, but the sample units are held fixed. Modeling of the sample distribution therefore requires the specification of the population distribution, but it permits the computation of the joint and marginal distributions of the sample measurements. This is not possible under the randomization distribution because the values of the outcome variable (and possibly also the auxiliary variables) are unknown for units outside the sample. Consequently, the use of this distribution for inference on models is restricted mostly to estimation problems, and probabilistic conclusions generally require asymptotic normality assumptions.

Another important advantage of the use of the sample distribution is that it allows conditioning on the values of auxiliary variables measured for the sample units. In fact, the definition of the sample distribution already uses a conditional formulation. The use of the randomization distribution for conditional inference is very limited at present; see Rao (1999) for discussion and illustrations.


The sample distribution is also different from the familiar combined distribution, defined over all possible realizations of the finite population measurements (the population distribution) and all possible sample selections for a given population (the randomization p distribution). The combined distribution is often used for comparing the performance of design-based estimators in situations where direct comparisons of randomization variances or mean square errors are not feasible. The obvious difference between the sample distribution and the combined distribution is that the former conditions on the selected sample (and values of auxiliary variables measured for units in the sample), whereas the latter accounts for all possible sample selections.

Finally, rather than conditioning on the selected sample when constructing the sample distribution (and hence the sample likelihood), one could compute instead the joint distribution of the selected sample and the corresponding sample measurements. Denote by y_s = {y_t, t ∈ s} the outcome variable values measured for the sample units and by x_s = {x_t, t ∈ s} and x_s̄ = {x_t, t ∉ s} the values of the auxiliary variables corresponding to the sampled and nonsampled units. Assuming independence of the population measurements and independent sampling of the population units (Poisson sampling), the joint pdf of (s, y_s) given (x_s, x_s̄) can be written as

f(s, y_s | x_s, x_s̄) = [∏_{t∈s} π̄(y_t, x_t) f_U(y_t | x_t)/π̄(x_t)] [∏_{t∈s} π̄(x_t)] ∏_{t∉s} [1 − π̄(x_t)]    (12.6)

where π̄(y_t, x_t) = E_U(π_t | y_t, x_t) and π̄(x_t) = E_U(π_t | x_t). Note that the product of the terms in the first set of square brackets on the right hand side of (12.6) is the joint sample pdf, f_s(y_s | x_s, s), for units in the sample as obtained from (12.3). The use of (12.6) for likelihood-based inference has the theoretical advantage of employing the information on the sample selection probabilities for units outside the sample, but it requires knowledge of the expectations π̄(x_t) = E_U(π_t | x_t) for all t ∈ U and hence the values x_s̄. This information is not needed when inference is based on the sample pdf, f_s(y_s | x_s, s). When the values x_s̄ are unknown, it is possible in theory to regard the values {x_t, t ∉ s} as random realizations from some pdf g(x_t) and replace the expectations π̄(x_t) for units t ∉ s by the unconditional expectations π̄(t) = ∫ π̄(x_t) g(x_t) dx_t. See, for example, Rotnitzky and Robins (1997) for a similar analysis in a different context. However, modeling the distribution of the auxiliary variables might be formidable and the resulting likelihood f(s, y_s | x_s) could be very cumbersome.

12.3. INFERENCE UNDER INFORMATIVE PROBABILITY SAMPLING

12.3.1. Estimating equations with application to the GLM

In this section we consider four different approaches for defining estimating equations under informative probability sampling. We compare the various approaches empirically in Section 12.5.


Suppose that the population measurements (y_U, x_U) = {(y_t, x_t), t = 1, ..., N} can be regarded as N independent realizations from some pdf f_{y,x}. Denote by f_U(y | x; β) the conditional pdf of y_t given x_t. The true value of the vector parameter β = (β_0, β_1, ..., β_k) is defined as the unique solution of the equations

W_U(β) = Σ_{t=1}^{N} E_U(d_Ut | x_t) = 0    (12.7)

where d_Ut = (d_Ut,0, d_Ut,1, ..., d_Ut,k)′ = ∂ log f_U(y_t | x_t; β)/∂β is the tth score function. We refer to (12.7) as the 'parameter equations' since they define the vector parameter β. For the GLM defined in the introduction with the distribution of y_t belonging to the exponential family, f_U(y; θ, φ) = exp{[yθ − b(θ)]/a(φ) + c(y, φ)}, where a(.) > 0, b(.) and c(.) are known functions, and φ is known. It follows that E(y) = b′(θ) = μ(θ), so that if θ_t = h(x_t′β) for some function h(.) with derivative g(.),

d_Ut,j = {y_t − μ[h(x_t′β)]} g(x_t′β) x_tj / a(φ),  j = 0, 1, ..., k.    (12.8)

The 'census parameter' (Binder, 1983) corresponding to (12.7) is defined as the solution B_U of the equations

W̃_U(B_U) = Σ_{t=1}^{N} d_Ut = 0.    (12.9)

Note that (12.9) defines the maximum likelihood estimating equations based on all the population values.

Let s = {1, ..., n} denote the sample, assumed to be selected by some probability sampling scheme with first-order selection probabilities π_t = Pr(t ∈ s). (The sample size n can be random.) The first approach that we consider for specifying the estimating equations involves redefining the parameter equations with respect to the sample distribution f_s(y_t | x_t) (Equation (12.3)), rather than the population distribution as in (12.7). Assuming that the form of the conditional expectations E_U(π_t | y_t, x_t) is known (see Section 12.3.2) and that the expectations E_U(π_t | x_t) = ∫_y E_U(π_t | y, x_t) f_U(y | x_t; β) dy are differentiable with respect to β, the parameter equations corresponding to the sample distribution are

W_1(β) = Σ_{t∈s} E_s[∂ log f_s(y_t | x_t; β)/∂β | x_t] = Σ_{t∈s} E_s[{d_Ut + ∂ log E_s(w_t | x_t)/∂β} | x_t] = 0.    (12.10)

(The second equation follows from (12.3) and (12.5).) The parameters are estimated under this approach by solving the equations

W_{1,s}(β) = Σ_{t∈s} {d_Ut + ∂ log E_s(w_t | x_t)/∂β} = 0.    (12.11)

Note that (12.11) defines the sample likelihood equations.


The second approach uses the relationship (12.4) in order to convert the population expectations in (12.7) into sample expectations. Assuming a random sample of size n from the sample distribution, the parameter equations then have the form

W_2(β) = Σ_{t∈s} E_s(q_t d_Ut | x_t) = 0    (12.12)

where q_t = w_t/E_s(w_t | x_t). The vector β is estimated under this approach by solving the equations

W_{2,s}(β) = Σ_{t∈s} q_t d_Ut = 0.    (12.13)

The third approach is based on the property that if β solves Equations (12.7), then it also solves the equations

Σ_{t=1}^{N} E_U(d_Ut) = E_x[Σ_{t=1}^{N} E_U(d_Ut | x_t)] = 0,

where the expectation E_x is over the population distribution of the x_t. Application of (12.4) to each of the terms E_U(d_Ut) (without conditioning on x_t) yields the following parameter equations for a random sample of size n from the sample distribution:

W_3(β) = Σ_{t∈s} [E_s(w_t d_Ut)/E_s(w_t)] = 0.    (12.14)

The corresponding estimating equations are

W_{3,s}(β) = Σ_{t∈s} w_t d_Ut = 0.    (12.15)

An interesting feature of the equations in (12.15) is that they coincide with the pseudo-likelihood equations as obtained when estimating the census equations (12.9) by the Horvitz–Thompson estimators. (For the concept and uses of pseudo-likelihood see Binder, 1983; SHS; Godambe and Thompson, 1986; Pfeffermann, 1993; and Chapter 2.) Comparing (12.13) with (12.15) shows that the former equations use the adjusted weights q_t = w_t/E_s(w_t | x_t) instead of the standard weights w_t used in (12.15). As discussed in PS, the weights q_t account for the net sampling effects on the target conditional distribution of y_t | x_t, whereas the weights w_t account also for the sampling effects on the marginal distribution of x_t. In particular, when w_t is a deterministic function of x_t, so that the sampling process is noninformative, q_t = 1 and Equations (12.13) reduce to the ordinary likelihood equations (see (12.16) below). The use of (12.15) on the other hand may yield highly variable estimators in such cases, depending on the variability of the x_t.
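The adjusted weights q_t are simple to form when x is discrete, estimating E_s(w_t | x_t) by the within-cell mean of the weights; the sketch below (with hypothetical data) also illustrates the reduction to q_t = 1 when w is a function of x alone.

```python
import numpy as np

def q_weights(w, x):
    """Adjusted weights q_t = w_t / E_s(w_t | x_t), with the conditional
    expectation estimated by the within-cell mean of the weights (discrete x)."""
    q = np.empty_like(w, dtype=float)
    for k in np.unique(x):
        cell = (x == k)
        q[cell] = w[cell] / w[cell].mean()
    return q

# When w is a deterministic function of x (noninformative sampling),
# every q_t equals 1 and (12.13) reduces to the ordinary likelihood equations.
x = np.array([0, 0, 1, 1, 2, 2])
w = np.array([2.0, 2.0, 5.0, 5.0, 1.0, 1.0])   # w depends on x only
print(q_weights(w, x))                          # → [1. 1. 1. 1. 1. 1.]
```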

The three separate sets of estimating equations defined by (12.11), (12.13), and (12.15) all account for the sampling effects. On the other hand, ignoring the


sampling process results in the use of the ordinary (face value) likelihood equations

W_{4,s}(β) = Σ_{t∈s} d_Ut = 0.    (12.16)

We consider the solution to (12.16) as a benchmark for the assessment of the performance of the other estimators.

Comment The estimating equations proposed in this section employ the scores d_Ut = ∂ log f_U(y_t | x_t; β)/∂β. However, similar equations can be obtained for other functions d_Ut; see Bickel et al. (1993) for examples of alternative definitions.

12.3.2. Estimation of E_s(w_t | x_t)

The estimating equations defined by (12.11) and (12.13) contain the expectations E_s(w_t | x_t) that depend on the unknown parameters β. When the w_t are continuous, as in probability proportional to size (PPS) sampling with a continuous size variable, the form of these expectations can be identified from the sample data by the following three-step procedure that utilizes (12.5):

1. Regress w_t against (y_t, x_t) to obtain an estimate of E_s(w_t | y_t, x_t).
2. Integrate ∫_y E_U(π_t | y, x_t) f_U(y | x_t; β) dy = ∫_y [1/E_s(w_t | y, x_t)] f_U(y | x_t; β) dy to obtain an estimate of E_U(π_t | x_t) as a function of β.
3. Compute E_s(w_t | x_t) = 1/E_U(π_t | x_t).

(The computations in steps 2 and 3 use the estimates obtained in the previous step.) The articles by PKR and PS contain several plausible models for E_U(π_t | y_t, x_t) and examples for which the integral in step 2 can be carried out analytically. In practice, however, the specific form of the expectation E_U(π_t | y_t, x_t) will usually be unknown, but the expectation E_s(w_t | y_t, x_t) can be identified and estimated in this case from the sample. (The expectation depends on unknown parameters that are estimated in step 1.)
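Step 1 of the procedure can be sketched as a simple regression of the weights on the outcome and the covariates; the linear specification here is purely illustrative (PKR and PS consider richer forms, e.g. exponential), and the function name is ours.

```python
import numpy as np

def fit_es_w_given_yx(w, y, x):
    """Step 1: regress w_t on (1, y_t, x_t) to estimate E_s(w_t | y_t, x_t).
    Returns the least squares coefficients (intercept, slope on y, slope on x)."""
    Z = np.column_stack([np.ones_like(y), y, x])
    coef, *_ = np.linalg.lstsq(Z, w, rcond=None)
    return coef
```

The fitted coefficients are then carried into steps 2 and 3 to produce E_U(π_t | x_t) and hence E_s(w_t | x_t).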

Comment 1 Rather than estimating the coefficients indexing the expectation E_s(w_t | y_t, x_t) from the sample (step 1), these coefficients can be considered as additional unknown parameters, with the estimating equations extended accordingly. This, however, may complicate the solution of the estimating equations and also result in identifiability problems under certain models. See PKR for examples and discussion.

Comment 2 For the estimating equations (12.13) that use the weights q_t = w_t/E_s(w_t | x_t), the estimation of the expectation E_s(w_t | x_t) can be carried out by simply regressing w_t against x_t, thus avoiding steps 2 and 3. This is so because in this case there is no need to express the expectation as a function of the parameters indexing the population distribution.


The discussion so far focuses on the case where the sample selection probabilities are continuous. The evaluation of the expectation E_s(w_t | x_t) in the case of discrete selection probabilities is simpler. For example, in the empirical study of this chapter we consider the case of logistic regression with a discrete independent variable x and three possible values for the dependent variable y. For this case the expectation E_s(w_t | x_t = k) is estimated as

Ê_s(w_t | x_t = k) = 1/Ê_U(π_t | x_t = k),
Ê_U(π_t | x_t = k) = Σ_{a=0}^{2} Pr(y_t = a | x_t = k) Ê_U(π_t | x_t = k, y_t = a) = Σ_{a=0}^{2} Pr(y_t = a | x_t = k)/w̄_ak,    (12.17)

where w̄_ak = [Σ_{t∈s} w_t I(y_t = a, x_t = k)]/[Σ_{t∈s} I(y_t = a, x_t = k)]. Here I(A) is the indicator function for the event A. Substituting the logistic function for Pr(y_t = a | x_t = k) in the last expression of (12.17) yields the required specification. The estimators w̄_ak are considered as fixed numbers when solving the estimating equations.
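In code, the cell means w̄_ak and the resulting estimate of E_s(w_t | x_t = k) in (12.17) look like this; `prob_y_given_x` stands in for the fitted logistic probabilities and is a user-supplied assumption of this sketch.

```python
import numpy as np

def cell_mean_weights(w, y, x):
    """w_bar[(a, k)]: mean sampling weight among sample units with y_t = a, x_t = k."""
    return {(a, k): w[(y == a) & (x == k)].mean()
            for a in np.unique(y) for k in np.unique(x)}

def es_w_given_x(prob_y_given_x, w_bar, k, a_vals=(0, 1, 2)):
    """(12.17): E_s(w_t | x_t = k) = 1 / E_U(pi_t | x_t = k), where
    E_U(pi_t | x_t = k) = sum_a Pr(y_t = a | x_t = k) / w_bar[(a, k)]."""
    eu_pi = sum(prob_y_given_x(a, k) / w_bar[(a, k)] for a in a_vals)
    return 1.0 / eu_pi
```

As in the text, the w̄_ak would then be held fixed while the estimating equations are solved for the logistic coefficients.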

For the estimating equations (12.13), the expectations E_s(w_t | x_t = k) in this example are estimated by

Ê_s(w_t | x_t = k) = [Σ_{t∈s} w_t I(x_t = k)]/[Σ_{t∈s} I(x_t = k)],    (12.18)

rather than using (12.17), which depends on the unknown logistic coefficients (see Comment 2 above).

For an example of the evaluation of the expectation E_s(w_t | x_t) with discrete selection probabilities but continuous outcome and explanatory variables, see PS (section 5.2).

12.3.3. Testing the informativeness of the sampling process

The estimating equations developed in Section 12.3.1 for the case of informative sampling involve the use of the sampling weights in various degrees of complexity. It is clear therefore that when the sampling process is in fact noninformative, the use of these equations yields more variable estimators than the use of the ordinary score function defined by (12.16). See Tables 12.2 and 12.4 below for illustrations. For the complex sampling schemes in common use, the sample selection probabilities are often determined by the values of several design variables, in which case the informativeness of the selection process is not always apparent. This raises the need for test procedures as a further indication of whether the sampling process is ignorable or not. Several tests have been proposed in the past for this problem. The common


feature of these tests is that they compare the probability-weighted estimators of the target parameters to the ordinary (unweighted) estimators that ignore the sampling process; see Pfeffermann (1993) for review and discussion. For the classical linear regression model, PS propose a set of test statistics that compare the moments of the sample distribution of the regression residuals to the corresponding moments of the population distribution. The use of these tests is equivalent to testing that the correlations under the sample distribution between powers of the regression residuals and the sampling weights are all zero. In Chapter 11 the tests developed by PS are extended to situations where the moments of the model residuals are functions of the regressor variables, as under many of the GLMs in common use.

A drawback of these test procedures is that they involve the use of a series of tests with dependent test statistics, such that the interpretation of the results of these tests is not always clear-cut. For this reason, we propose below a single alternative test that compares the estimating equations that ignore the sampling process to estimating equations that account for it. As mentioned before, the question arising in practice is whether to use the estimating equations (12.16) that ignore the sample selection or one of the estimating equations (12.11), (12.13), or (12.15) that account for it, so that basing the test on these equations is very natural.

In what follows we restrict attention to the comparison of the estimating equations (12.13) and (12.16) (see Comment below). From a theoretical point of view, the sampling process can be ignored for inference if the corresponding parameter equations are equivalent, that is, if Es(qt dUt) = Es(dUt). Denoting Rn(θ) = (1/n) Σ_{t∈s} (qt − 1) dUt(θ), the null hypothesis is therefore

H0 : Es[Rn(θ)] = 0.   (12.19)

Note that dim(Rn) = k + 1 = dim(θ). If θ were known, the hypothesis could be tested by use of the Hotelling test statistic,

H(θ) = n Rn(θ)′ Sn^(-1) Rn(θ),   (12.20)

where Sn = [1/(n − 1)] Σ_{t∈s} [(qt − 1) dUt(θ) − Rn(θ)] [(qt − 1) dUt(θ) − Rn(θ)]′.

In practice, θ is unknown and the scores dUt in Rn(θ) have to be evaluated at a sample estimate of θ. In principle, any of the estimates defined in Section 12.3.1 could be used for this purpose since under H0 all the estimators are consistent for θ, but we find that the use of the solution of (12.16) that ignores the sampling process is the simplest and yields the best results.


Let dUt(θ̂) define the value of dUt evaluated at θ̂, the solution of (12.16), and let R̂n and Ŝn be the corresponding values of Rn and Sn obtained after substituting θ̂ for θ in (12.20). The test statistic is therefore

Ĥ = n R̂n′ Ŝn^(-1) R̂n.   (12.21)

Note that Σ_{t∈s} dUt(θ̂) = 0 by virtue of (12.16), so R̂n = (1/n) Σ_{t∈s} qt dUt(θ̂). The random variables (qt − 1) dUt(θ̂) are no longer independent since Σ_{t∈s} dUt(θ̂) = 0, but utilizing the property that Es(qt | xt) = 1 implies that under the null hypothesis var[(qt − 1) dUt] = var(qt dUt) − var(dUt), thus justifying the use of Ŝn as an estimator of var[(qt − 1) dUt] in the construction of the test statistic in (12.21).
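Given the unit-level scores dUt evaluated at the ordinary MLE and the weights qt, the Hotelling-type statistic of (12.21) can be sketched as follows; the function name and the toy inputs are illustrative assumptions, with the scores and weights taken as precomputed:

```python
import numpy as np

def hotelling_informativeness(scores, q):
    """Hotelling-type statistic as in (12.21): contrasts the q-weighted and
    unweighted estimating equations, evaluated at the ordinary MLE."""
    scores = np.asarray(scores, dtype=float)   # n x p rows: dU_t at theta-hat
    q = np.asarray(q, dtype=float)
    n, p = scores.shape
    r_t = (q - 1.0)[:, None] * scores          # (q_t - 1) dU_t
    r_bar = r_t.mean(axis=0)                   # equals (1/n) sum q_t dU_t when sum dU_t = 0
    s_n = (r_t - r_bar).T @ (r_t - r_bar) / (n - 1.0)
    return n * r_bar @ np.linalg.solve(s_n, r_bar)

# toy illustration: weights generated independently of the scores (H0 holds)
rng = np.random.default_rng(1)
scores = rng.normal(size=(500, 2))
scores -= scores.mean(axis=0)                  # mimic sum of scores = 0 at the MLE
q = np.exp(rng.normal(scale=0.3, size=500))
q /= q.mean()                                  # mimic E_s(q_t) = 1
h = hotelling_informativeness(scores, q)
```

Large values of `h` relative to a chi-square distribution with p degrees of freedom would indicate an informative sampling process.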

Comment The Hotelling test statistic uses the estimating equations (12.13) for the comparison with (12.16) and here again, one could use instead the equations defined by (12.11) or (12.15): that is, replace qt dUt in the definition of R̂n and Ŝn by the score of (12.11), or by wt dUt, respectively. The use of (12.11) is more complicated since it requires evaluation of the expectation Es(wt | xt) as a function of θ (see Section 12.3.2). The use of (12.15) is the simplest but it yields inferior results to the use of (12.13) in our simulation study.

12.4. VARIANCE ESTIMATION

Having estimated the model parameters by any of the solutions of the estimating equations in Section 12.3.1, the question arising is how to estimate the variances of these estimators. Unless stated otherwise, the true (estimated) variances are with respect to the sample distribution for a given sample of units, that is, the variance under the pdf obtained by the product of the sample pdfs (12.3). Note also that since the estimating equations are only for the θ-parameters, with the coefficients indexing the expectations Es(wt | yt, xt) held fixed at their estimated values, the first four variance estimators below do not account for the variability of the estimated coefficients.

For the estimator θ̂1s defined by the solution to the estimating equations (12.11), that is, the maximum likelihood estimator under the sample distribution, a variance estimator can be obtained from the inverse of the information matrix evaluated at this estimator. Thus,

V̂(θ̂1s) = [− Σ_{t∈s} ∂² log fs(yt | xt; θ) / ∂θ ∂θ′]^(-1), evaluated at θ = θ̂1s.   (12.22)

For the estimators θ̂2s solving (12.13), we use a result from Bickel et al. (1993). By this result, if for the true vector parameter θ0 the left hand side of an estimating equation Wn(θ) = 0 can be approximated as Wn(θ0) = n^(-1) Σ_{t∈s} ψ(yt, xt; θ0) + Op(n^(-1/2)) for some function ψ satisfying E(ψ) = 0 and E(ψ²) < ∞, then under some additional regularity conditions on the order of convergence of certain functions,


Var(θ̂n − θ0) ≅ n^(-1) [W′(θ0)]^(-1) Var[ψ(yt, xt; θ0)] {[W′(θ0)]^(-1)}′,   (12.23)

where θ̂n is the solution of Wn(θ) = 0 and W(θ) = 0 is the parameter equation, with W′(θ0) = ∂W(θ)/∂θ evaluated at θ0 and assumed to be nonsingular.

For the estimating equations (12.13), ψ(yt, xt; θ) = qt dUt, implying that the variance of θ̂2s solving (12.13) can be estimated as

V̂(θ̂2s) = [Σ_{t∈s} ∂(qt dUt)/∂θ]^(-1) {Σ_{t∈s} [qt dUt(θ̂2s)] [qt dUt(θ̂2s)]′} {[Σ_{t∈s} ∂(qt dUt)/∂θ]^(-1)}′,   (12.24)

where the derivatives are likewise evaluated at θ̂2s and dUt(θ̂2s) is the value of dUt evaluated at θ̂2s. Note that since Es(qt dUt | xt) = 0 (and also Σ_{t∈s} qt dUt(θ̂2s) = 0), the estimator (12.24) estimates the conditional variance, that is, the variance with respect to the conditional sample distribution of the outcome y.
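The estimator (12.24), and its unconditional counterpart (12.25) below, share a sandwich structure: an outer matrix of weighted score derivatives around a middle matrix of outer products of the weighted scores. A generic numpy sketch, with the weighted scores and the outer ("bread") matrix assumed supplied by the caller; the names are illustrative:

```python
import numpy as np

def sandwich_variance(weighted_scores, a_matrix):
    """Sandwich estimator of the form A^{-1} B (A^{-1})' underlying
    (12.24)/(12.25): B is the sum of outer products of weighted scores."""
    u = np.asarray(weighted_scores, dtype=float)  # n x p rows: a_t * dU_t(theta-hat)
    a = np.asarray(a_matrix, dtype=float)         # p x p: sum_t a_t * d(dU_t)/d(theta)
    b = u.T @ u                                   # middle ("meat") term
    a_inv = np.linalg.inv(a)
    return a_inv @ b @ a_inv.T

# toy check with a diagonal bread matrix
u = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
a = 2.0 * np.eye(2)
v = sandwich_variance(u, a)
```

Passing qt-weighted scores gives the conditional estimator of (12.24) and wt-weighted scores the unconditional one.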

The estimating equations (12.15) have been derived in Section 12.3.1 by two different approaches, implying therefore two separate variance estimators. Under the first approach, these equations estimate the parameter equations (12.14), which are defined in terms of the unconditional sample expectation (the expectation under the sample distribution of {(yt, xt), t ∈ s}). Application of the result from Bickel et al. (1993) mentioned before yields the following variance estimator (compare with (12.24)):

V̂(θ̂3s) = [Σ_{t∈s} ∂(wt dUt)/∂θ]^(-1) {Σ_{t∈s} [wt dUt(θ̂3s)] [wt dUt(θ̂3s)]′} {[Σ_{t∈s} ∂(wt dUt)/∂θ]^(-1)}′,   (12.25)

where the derivatives are evaluated at θ̂3s. For this case Es(wt dUt) = 0 (and also Σ_{t∈s} wt dUt(θ̂3s) = 0), so that (12.25) estimates the unconditional variance over the joint sample distribution of {(yt, xt), t ∈ s}.

Under the second approach, the estimating equations (12.15) are the randomization unbiased estimators of the census equations (12.9). As such, the variance of θ̂3s can be evaluated with respect to the randomization distribution over all possible sample selections, with the population values held fixed. Following Binder (1983), the randomization variance is estimated as

V̂R(θ̂3s) = [Σ_{t∈s} ∂(wt dUt)/∂θ]^(-1) V̂R[Σ_{t∈s} wt dUt(θ)] {[Σ_{t∈s} ∂(wt dUt)/∂θ]^(-1)}′,   (12.26)

where Σ_{t∈s} ∂(wt dUt)/∂θ is design (randomization) consistent for the corresponding population quantity and V̂R[Σ_{t∈s} wt dUt(θ)] is an estimator of the randomization variance of Σ_{t∈s} wt dUt(θ), evaluated at θ̂3s.

In order to illustrate the difference between the variance estimators (12.25)

and (12.26), consider the case where the sample is drawn by Poisson sampling such that units are selected into the sample independently by Bernoulli trials


with probabilities of success πt = Pr(t ∈ s). Simple calculations imply that for this case the randomization variance estimator (12.26) has the form

V̂R(θ̂3s) = [Σ_{t∈s} ∂(wt dUt)/∂θ]^(-1) {Σ_{t∈s} (1 − πt) [wt dUt(θ̂3s)] [wt dUt(θ̂3s)]′} {[Σ_{t∈s} ∂(wt dUt)/∂θ]^(-1)}′,   (12.27)

with the derivatives evaluated at θ̂3s. Thus, the difference between the estimator defined by (12.25) and the randomization variance estimator (12.26) is in this case in the weighting of the products [wt dUt(θ̂3s)] [wt dUt(θ̂3s)]′ by the weights (1 − πt) in the latter estimator. Since 0 < (1 − πt) < 1, the randomization variance estimators are smaller than the variance estimators obtained under the sample distribution. This is expected since the randomization variances measure the variation around the (fixed) population values, and if some of the selection probabilities are large, a correspondingly large portion of the population is included in the sample (in high probability), thus reducing the variance.

Another plausible variance estimation procedure is the use of bootstrap samples. As mentioned before, under general conditions on the sample selection scheme listed in PKR, the sample measurements are asymptotically independent with respect to the sample distribution, implying that the use of the (classical) bootstrap method for variance estimation is well founded. In contrast, the use of the bootstrap method for variance estimation under the randomization distribution is limited, and often requires extra modifications; see Sitter (1992) for an overview of bootstrap methods for sample surveys. Let θ̂s stand for any of the preceding estimators and denote by θ̂s(b) the estimator computed from bootstrap sample b (b = 1, …, B), drawn by simple random sampling with replacement from the original sample (with the same sample size). The bootstrap variance estimator of θ̂s is defined as

V̂B(θ̂s) = (1/B) Σ_{b=1}^{B} [θ̂s(b) − θ̄s] [θ̂s(b) − θ̄s]′,   (12.28)

where θ̄s = (1/B) Σ_{b=1}^{B} θ̂s(b).

It follows from the construction of the bootstrap samples that the estimator (12.28) estimates the unconditional variance of θ̂s. A possible advantage of the use of the bootstrap variance estimator in the present context is that it accounts in principle for all the sources of variation. This includes the identification of the form of the expectations Es(wt | yt, xt) when unknown, and the estimation of the vector of coefficients indexing that expectation, which is carried out for each of the bootstrap samples but not accounted for by the other variance estimation methods unless the coefficients are considered as part of the unknown model parameters (see Section 12.3.2).
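The bootstrap estimator (12.28) only requires re-solving the chosen estimating equations on replicates drawn by simple random sampling with replacement. A sketch; `solve_fn` stands for any of the estimators of Section 12.3.1 and is an assumed interface, not the authors' code:

```python
import numpy as np

def bootstrap_variance(data, solve_fn, n_boot=100, seed=0):
    """Bootstrap variance as in (12.28): SRS with replacement from the
    original sample, same sample size, re-estimating on every replicate."""
    rng = np.random.default_rng(seed)
    n = len(data[0])
    reps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resampled unit indices
        reps.append(solve_fn(*[np.asarray(a)[idx] for a in data]))
    reps = np.asarray(reps, dtype=float)
    centred = reps - reps.mean(axis=0)
    return centred.T @ centred / n_boot

# toy check: the "estimator" is a weighted mean of y
y = np.linspace(0.0, 1.0, 200)
w = np.ones(200)
v = bootstrap_variance((y, w), lambda y, w: np.average(y, weights=w))
```

Resampling whole units (yt, xt, wt) together is what lets the replicates reflect the estimation of the weight expectations as well.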


12.5. SIMULATION RESULTS

12.5.1. Generation of population and sample selection

In order to assess and compare the performance of the parameter estimators, variance estimators, and the test statistic proposed in Sections 12.3 and 12.4, we designed a Monte Carlo study that consists of the following stages:

A. Generate a univariate population of x-values of size N = 3000, drawn independently from the discrete U[1, 5] probability function, Pr(X = j) = 0.2, j = 1, …, 5.

B. Generate corresponding y-values from the logistic probability function,

Pr(yt = 1 | xt) = [exp(β10 + β11 xt)]/C,
Pr(yt = 2 | xt) = [exp(β20 + β21 xt)]/C,   (12.29)
Pr(yt = 3 | xt) = 1 − Pr(yt = 1 | xt) − Pr(yt = 2 | xt),

where C = 1 + exp(β10 + β11 xt) + exp(β20 + β21 xt). Stages A and B were repeated independently R = 1000 times.

C. From every population generated in stages A and B, draw a single sample using the following sampling schemes (one sample under each scheme):

Ca. Poisson sampling: units are selected independently with probabilities πt = n zt / Σ_{u=1}^{N} zu, where n = 300 is the expected sample size and the values zt are computed in two separate ways:

values zt are computed in two separate ways:

(12.30)

The notation Int[·] defines the integer value and ut ~ U(0, 1).

Cb. Stratified sampling: the population units are stratified based on either the values zt(1) (scheme Cb(1)) or the values zt(2) (scheme Cb(2)), yielding a total of 13 strata in each case. Denote by S(h)(j) the strata defined by the values zt(j), such that for units t ∈ S(h)(j), zt(j) = z(h)(j), j = 1, 2. Let N(h)(j) represent the corresponding strata sizes. The selection of units within the strata was carried out by simple random sampling without replacement (SRSWR), with the sample sizes n(h)(j) fixed in advance. The sample sizes were determined so that the selection probabilities are similar to the corresponding selection probabilities under the Poisson sampling scheme and Σh n(h)(j) = 300, j = 1, 2.
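Stages A, B, and Ca above can be sketched as follows. The exact definition of the size values zt in (12.30) is not reproduced here, so an illustrative choice that depends on both yt and xt (hence informative, as in scheme Ca(1)) is used instead:

```python
import numpy as np

rng = np.random.default_rng(12345)
N, n = 3000, 300
b10, b11, b20, b21 = 1.0, 0.3, 0.5, 0.5        # illustrative coefficients for (12.29)

# Stage A: x from the discrete uniform distribution on {1,...,5}
x = rng.integers(1, 6, size=N)

# Stage B: y in {1,2,3} from the multinomial logistic model (12.29)
e1 = np.exp(b10 + b11 * x)
e2 = np.exp(b20 + b21 * x)
c = 1.0 + e1 + e2
p = np.column_stack([e1 / c, e2 / c, 1.0 - e1 / c - e2 / c])
u = rng.random(N)
y = 1 + (u > p[:, 0]).astype(int) + (u > p[:, 0] + p[:, 1]).astype(int)

# Stage Ca: Poisson sampling with expected size n; z is an illustrative
# informative size variable (NOT the chapter's z_t(1) from (12.30))
z = np.floor(y * x + rng.random(N)) + 1.0
pi = n * z / z.sum()                           # selection probabilities
sample = rng.random(N) < pi                    # independent Bernoulli trials
w = 1.0 / pi[sample]                           # sampling weights of selected units
```

Replacing `z` by a function of `x` alone would mimic the noninformative scheme Ca(2).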

The following points are worth noting:

1. The sampling schemes that use the values zt(1) are informative, as the selection probabilities depend on the y-values. The sampling schemes that


use the values zt(2) are noninformative since the selection probabilities depend only on the x-values and the inference is targeted at the population model of the conditional probabilities of yt given xt defined by (12.29).

2. For the Poisson sampling schemes, EU[πt | {(yu, xu), u = 1, …, N}] depends only on the values (yt, xt) when zt = zt(1), and only on the value xt when zt = zt(2). (With large populations, the totals Σ_{t=1}^{N} zt(j) can be regarded as fixed.) For the stratified sampling schemes, however, the selection probabilities depend on the strata sizes N(h)(j), which are random (they vary between populations), so that they depend in principle on all the population values {(yt, xt), t = 1, …, N}, although with large populations the variation of the strata sizes between populations will generally be minor.

3. The stratified sampling scheme Cb corresponds to a case-control study whereby the strata are defined based on the values of the outcome variable (y) and possibly some of the covariate variables. See Chapter 8. (Such sampling schemes are known as choice-based sampling in the econometric literature.) For the case where the strata are defined based only on y-values generated by a logistic model that contains intercept terms and the sampling fractions within the strata are fixed in advance, it is shown in PKR that the standard MLE of the slope coefficients (that is, the MLE that ignores the sampling scheme) coincides with the MLE under the sample distribution. As illustrated in the empirical results below, this is no longer true when the stratification depends also on the x-values. (As pointed out above, under the design Cb the sampling fractions within the strata have some small variation. We considered also the case of a stratified sample with fixed sampling fractions within the strata and obtained almost identical results as for the scheme Cb.) In order to assess the performance of the estimators derived in the previous sections in situations where the stratification depends only on the y-values, we consider also a third stratified sampling scheme:

Cc. Stratified sampling: the population units are stratified based on the values yt (three strata); select nh units from stratum h by SRSWR with the sample sizes fixed in advance, Σh nh = 300.

12.5.2. Computations and results

The estimators of the logistic model coefficients βk = (βk0, βk1), k = 1, 2, in (12.29), obtained by solving the estimating equations (12.11), (12.13), (12.15), and (12.16), have been computed for each of the samples drawn by the sampling methods described in Section 12.5.1, yielding four separate sets of estimators. The expectations Es(wt | xt) have been estimated using the procedures described in Section 12.3.2. For each point estimator we computed the corresponding variance estimator as defined by (12.22), (12.24), and (12.25) for the first three estimators and by use of the inverse information matrix (ignoring the sampling process) for the ordinary MLE. In addition, we computed for each of the point estimators the bootstrap variance estimator defined by (12.28). Due


to computation time constraints, the bootstrap variance estimators are based on only 100 samples for each of 100 parent samples. Finally, we computed for each sample the Hotelling test statistic (12.21) for sample informativeness developed in Section 12.3.3.
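Since the estimators of (12.13), (12.15), and (12.16) differ only in the unit weights attached to the scores (1, qt, or wt), a single weighted multinomial-logit fit covers all three for model (12.29); (12.11) requires the additional modeling of Section 12.3.2. A sketch using scipy; the function name and the illustrative coefficient values are assumptions, not the chapter's code:

```python
import numpy as np
from scipy.optimize import minimize

def fit_weighted_mnlogit(y, x, a):
    """Solve sum_t a_t dU_t(theta) = 0 for model (12.29) by maximizing the
    a_t-weighted log-likelihood; theta = (b10, b11, b20, b21). Taking
    a_t = 1, q_t or w_t gives the (12.16), (12.13) and (12.15) solutions."""
    y, x, a = (np.asarray(v, dtype=float) for v in (y, x, a))

    def neg_loglik(theta):
        b10, b11, b20, b21 = theta
        l1 = b10 + b11 * x
        l2 = b20 + b21 * x
        log_c = np.log1p(np.exp(l1) + np.exp(l2))   # log C of (12.29)
        ll = np.where(y == 1, l1, 0.0) + np.where(y == 2, l2, 0.0) - log_c
        return -(a * ll).sum()

    return minimize(neg_loglik, np.zeros(4), method="BFGS").x

# sanity check on data generated from (12.29) with unit weights
rng = np.random.default_rng(7)
x = rng.integers(1, 6, size=5000).astype(float)
true = np.array([1.0, 0.3, 0.5, 0.5])
e1, e2 = np.exp(true[0] + true[1] * x), np.exp(true[2] + true[3] * x)
c = 1.0 + e1 + e2
u = rng.random(5000)
y = 1 + (u > e1 / c) + (u > (e1 + e2) / c)
theta_hat = fit_weighted_mnlogit(y, x, np.ones(5000))
```

With the weights held fixed, the weighted log-likelihood is concave in theta, so the BFGS solution is the root of the weighted estimating equations.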

The results of the simulation study are summarized in Tables 12.1–12.5. These tables show for each of the five sampling schemes and each point estimator the mean estimate, the empirical standard deviation (Std) and the mean of the Std estimates (denoted Mean Std est. in the tables) over the 1000 samples; and the mean of the bootstrap Std estimates (denoted Mean Std est. Boot in the tables) over the 100 samples. Table 12.6 compares the theoretical and empirical distribution of the Hotelling test statistic under H0 for the two noninformative sampling schemes defined by Ca(2) and Cb(2).

The main conclusions from the results set out in Tables 12.1–12.5 are as follows:

Table 12.1 Means, standard deviations (Std) and mean Std estimates of logistic regression coefficients. Poisson sampling, informative scheme Ca(1).

Coefficients          β10    β11    β20    β21

MLE (sample pdf, Equation (12.11)):
True values           1.00   0.30   0.50   0.50
Mean estimate         1.03   0.30   0.54   0.50
Empirical Std         0.62   0.18   0.61   0.18
Mean Std est.         0.58   0.18   0.58   0.18
Mean Std est. Boot    0.56   0.18   0.56   0.18

Q-weighting (Equation (12.13)):
True values           1.00   0.30   0.50   0.50
Mean estimate         0.99   0.31   0.51   0.51
Empirical Std         0.63   0.19   0.63   0.19
Mean Std est.         0.59   0.18   0.59   0.18
Mean Std est. Boot    0.68   0.21   0.66   0.20

W-weighting (Equation (12.15)):
True values           1.00   0.30   0.50   0.50
Mean estimate         1.00   0.31   0.53   0.51
Empirical Std         0.67   0.21   0.67   0.20
Mean Std est.         0.63   0.19   0.62   0.19
Mean Std est. Boot    0.65   0.21   0.64   0.20

Ordinary MLE (Equation (12.16)):
True values           1.00   0.30   0.50   0.50
Mean estimate         0.29   0.43   0.06   0.59
Empirical Std         0.60   0.19   0.60   0.18
Mean Std est.         0.59   0.18   0.58   0.18
Mean Std est. Boot    0.57   0.18   0.58   0.18


Table 12.2 Means, standard deviations (Std) and mean Std estimates of logistic regression coefficients. Poisson sampling, noninformative scheme Ca(2).

Coefficients          β10    β11    β20    β21

MLE (sample pdf, Equation (12.11)):
True values           1.00   0.30   0.50   0.50
Mean estimate         1.01   0.31   0.53   0.50
Empirical Std         0.66   0.20   0.67   0.20
Mean Std est.         0.62   0.20   0.62   0.20
Mean Std est. Boot    0.62   0.20   0.62   0.20

Q-weighting (Equation (12.13)):
True values           1.00   0.30   0.50   0.50
Mean estimate         0.99   0.31   0.51   0.51
Empirical Std         0.67   0.20   0.68   0.20
Mean Std est.         0.63   0.19   0.63   0.20
Mean Std est. Boot    0.66   0.22   0.66   0.22

W-weighting (Equation (12.15)):
True values           1.00   0.30   0.50   0.50
Mean estimate         1.00   0.31   0.52   0.51
Empirical Std         0.71   0.22   0.72   0.22
Mean Std est.         0.65   0.21   0.66   0.21
Mean Std est. Boot    0.68   0.22   0.69   0.22

Ordinary MLE (Equation (12.16)):
True values           1.00   0.30   0.50   0.50
Mean estimate         0.99   0.31   0.50   0.51
Empirical Std         0.63   0.19   0.63   0.19
Mean Std est.         0.60   0.19   0.61   0.19
Mean Std est. Boot    0.62   0.20   0.63   0.20

1. The three sets of estimating equations defined by (12.11), (12.13), and (12.15) perform well in eliminating the sampling effects under informative sampling. On the other hand, ignoring the sampling process and using the ordinary MLE (Equation (12.16)) yields highly biased estimators.

2. The use of Equations (12.11) and (12.13) produces very similar results. This outcome, found also in PS for ordinary regression analysis, is important since the use of (12.13) is much simpler and requires fewer assumptions than the use of the full equations defined by (12.11); see also Section 12.3.2. The use of simple weighting (Equation (12.15)) that corresponds to the application of the pseudo-likelihood approach again performs well in eliminating the bias, but except for Table 12.5 the variances under this approach are consistently larger than under the first two approaches, illustrating the discussion in Section 12.3.1. On the other hand, the ordinary MLE that does not involve any weighting has in most cases the smallest standard deviation. The last outcome is known also from other studies.


Table 12.3 Means, standard deviations (Std) and mean Std estimates of logistic regression coefficients. Stratified sampling, informative scheme Cb(1).

Coefficients          β10    β11    β20    β21

MLE (sample pdf, Equation (12.11)):
True values           1.00   0.30   0.50   0.50
Mean estimate         1.05   0.30   0.53   0.51
Empirical Std         0.56   0.17   0.59   0.17
Mean Std est.         0.56   0.17   0.56   0.17
Mean Std est. Boot    0.50   0.17   0.50   0.17

Q-weighting (Equation (12.13)):
True values           1.00   0.30   0.50   0.50
Mean estimate         1.05   0.30   0.53   0.50
Empirical Std         0.55   0.16   0.58   0.17
Mean Std est.         0.58   0.18   0.58   0.18
Mean Std est. Boot    0.66   0.21   0.64   0.20

W-weighting (Equation (12.15)):
True values           1.00   0.30   0.50   0.50
Mean estimate         1.07   0.29   0.55   0.50
Empirical Std         0.58   0.18   0.61   0.18
Mean Std est.         0.61   0.19   0.61   0.19
Mean Std est. Boot    0.63   0.21   0.63   0.21

Ordinary MLE (Equation (12.16)):
True values           1.00   0.30   0.50   0.50
Mean estimate         0.37   0.41   0.16   0.56
Empirical Std         0.52   0.15   0.55   0.16
Mean Std est.         0.57   0.18   0.57   0.18
Mean Std est. Boot    0.52   0.18   0.52   0.18

3. The MLE variance estimator (12.22) and the semi-parametric estimators (12.24) and (12.25) underestimate in most cases the true (empirical) variance, with an underestimation of less than 8%. As anticipated in Section 12.4, the use of the bootstrap variance estimators corrects for this underestimation by better accounting for all the sources of variation, but this only occurs with the estimating equations (12.13) and (12.15). We have no clear explanation for why the bootstrap variance estimators perform less satisfactorily for the MLE equations defined by (12.11) and (12.16), but we emphasize again that we have only used 100 bootstrap samples for the variance estimation, which in view of the complexity of the estimating equations is clearly not sufficient. It should be mentioned also in this respect that the standard deviation estimators (12.22), (12.24), and (12.25) are more stable (in terms of their standard deviation) than the bootstrap standard deviation estimators, which again can possibly be attributed to the relatively small number of bootstrap samples. (The standard deviations of the standard deviation estimators are not shown.)


Table 12.4 Means, standard deviations (Std) and mean Std estimates of logistic regression coefficients. Stratified sampling, noninformative scheme Cb(2).

Coefficients          β10    β11    β20    β21

MLE (sample pdf, Equation (12.11)):
True values           1.00   0.30   0.50   0.50
Mean estimate         1.06   0.29   0.53   0.50
Empirical Std         0.63   0.19   0.64   0.20
Mean Std est.         0.60   0.19   0.61   0.19
Mean Std est. Boot    0.56   0.20   0.57   0.20

Q-weighting (Equation (12.13)):
True values           1.00   0.30   0.50   0.50
Mean estimate         1.04   0.30   0.51   0.51
Empirical Std         0.63   0.20   0.64   0.20
Mean Std est.         0.60   0.19   0.61   0.19
Mean Std est. Boot    0.62   0.22   0.63   0.22

W-weighting (Equation (12.15)):
True values           1.00   0.30   0.50   0.50
Mean estimate         1.05   0.30   0.52   0.51
Empirical Std         0.66   0.21   0.67   0.21
Mean Std est.         0.62   0.20   0.63   0.20
Mean Std est. Boot    0.63   0.22   0.65   0.23

Ordinary MLE (Equation (12.16)):
True values           1.00   0.30   0.50   0.50
Mean estimate         1.03   0.30   0.49   0.52
Empirical Std         0.60   0.19   0.61   0.20
Mean Std est.         0.58   0.19   0.59   0.19
Mean Std est. Boot    0.56   0.20   0.57   0.20

4. Our last comment refers to Table 12.5, which relates to the sampling scheme Cc by which the stratification is based only on the y-values. For this case the first three sets of parameter estimators and the two semi-parametric variance estimators perform equally well. Perhaps the most notable outcome from this table is that ignoring the sampling process in this case and using Equation (12.16) yields similar mean estimates (with smaller standard deviations) for the two slope coefficients as the use of the other equations. Note, however, the very large biases of the intercept estimators. As mentioned before, PKR show that the use of (12.16) yields the correct MLE for the slope coefficients under Poisson sampling, which is close to the stratified sampling scheme underlying this table.

Table 12.6 compares the empirical distribution of the Hotelling test statistic (12.21) over the 1000 samples with the corresponding nominal levels of the theoretical distribution for the two noninformative sampling schemes Ca(2) and Cb(2). As can be seen, the empirical distribution matches almost perfectly the theoretical distribution. We computed the test statistic also under the


Table 12.5 Means, standard deviations (Std) and mean Std estimates of logistic regression coefficients. Stratified sampling, informative scheme Cc.

Coefficients          β10    β11    β20    β21

MLE (sample pdf, Equation (12.11)):
True values           1.00   0.30   0.50   0.50
Mean estimate         1.00   0.32   0.48   0.53
Empirical Std         0.42   0.16   0.42   0.15
Mean Std est.         0.40   0.14   0.38   0.14
Mean Std est. Boot    0.39   0.15   0.37   0.14

Q-weighting (Equation (12.13)):
True values           1.00   0.30   0.50   0.50
Mean estimate         0.99   0.31   0.48   0.52
Empirical Std         0.42   0.16   0.42   0.15
Mean Std est.         0.43   0.15   0.42   0.15
Mean Std est. Boot    0.44   0.16   0.43   0.16

W-weighting (Equation (12.15)):
True values           1.00   0.30   0.50   0.50
Mean estimate         1.00   0.31   0.48   0.52
Empirical Std         0.42   0.16   0.42   0.15
Mean Std est.         0.43   0.15   0.42   0.15
Mean Std est. Boot    0.44   0.16   0.43   0.16

Ordinary MLE (Equation (12.16)):
True values           1.00   0.30   0.50   0.50
Mean estimate         0.01   0.31   0.04   0.52
Empirical Std         0.36   0.14   0.36   0.13
Mean Std est.         0.40   0.14   0.38   0.14
Mean Std est. Boot    0.39   0.15   0.37   0.14

Table 12.6 Nominal levels and empirical distribution of Hotelling test statistic under H0 for noninformative sampling schemes Ca(2) and Cb(2).

Nominal levels      0.01  0.025  0.05  0.10  0.90  0.95  0.975  0.99
Emp. dist (Ca(2))   0.01  0.02   0.04  0.09  0.89  0.95  0.98   0.99
Emp. dist (Cb(2))   0.01  0.03   0.05  0.08  0.91  0.96  0.98   0.99

three informative sampling schemes, and in all the 3 × 1000 samples the null hypothesis of noninformativeness of the sampling scheme was rejected at the 1% significance level, indicating very high power.

12.6. SUMMARY AND EXTENSIONS

This chapter considers three alternative approaches for the fitting of GLMs under informative sampling. All three approaches utilize the relationships


between the population distribution and the distribution of the sample observations as defined by (12.1), (12.3), and (12.4), and they are shown to perform well in eliminating the bias of point estimators that ignore the sampling process. The use of the pseudo-likelihood approach, derived here under the framework of the sample distribution, is the simplest, but it is shown to be somewhat inferior to the other two approaches. These two approaches require the modeling and estimation of the expectation of the sampling weights, either as a function of the outcome and the explanatory variables, or as a function of only the explanatory variables. This additional modeling is not always trivial, but general guidelines are given in Section 12.3.2. It is important to emphasize in this respect that the use of the sample distribution as the basis for inference permits the application of standard model diagnostic tools, so that the goodness of fit of the model to the sample data can be tested.

The estimating equations developed under the three approaches allow the construction of variance estimators based on these equations. These estimators have a small negative bias since they fail to account for the estimation of the expectations of the sampling weights. The use of the bootstrap method, which is well founded under the sample distribution, overcomes this problem for two of the three approaches, but the bootstrap estimators seem to be less stable. Finally, a new test statistic for the informativeness of the sampling process that compares the estimating equations that account for the sampling process with estimating equations that ignore it is developed and shown to perform extremely well in the simulation study.

An important use of the sample distribution not considered in this chapter is for prediction problems. Notice first that if the sampling process is informative, the model holding for the outcome variable for units outside the sample is again different from the population model. This implies that even if the population model is known with all its parameters, it cannot be used directly for the prediction of outcome values corresponding to nonsampled units. We mention in this respect that the familiar 'model-dependent estimators' of finite population totals assume noninformative sampling. On the other hand, it is possible to derive the distribution of the outcome variable for units outside the sample, similarly to the derivation of the sample distribution, and then obtain the optimal predictors under this distribution. See Sverchkov and Pfeffermann (2000) for application of this approach.

Another important extension is to two-level models with application to small-area estimation. Here again, if the second-level units (schools, geographic areas) are selected with probabilities that are related to the outcome values, the model holding for the second-level random effects might be different from the model holding in the population. Failure to account for the informativeness of the sampling process may yield biased estimates for the model parameters and biased predictions for the small-area means. Appropriate weighting may eliminate the first bias but not the second. Work on the use of the sample distribution for small-area estimation is in progress.


PART D

Longitudinal Data


CHAPTER 13

Introduction to Part D

C. J. Skinner

13.1. INTRODUCTION

The next three chapters bring the time dimension into survey analysis. Whereas Parts B and C have been primarily concerned with analysing variation in a response variable Y across a finite population fixed in time, the next three chapters consider methods for modelling variation in Y not only across the population but also across time.

In this part of the book, we shall use t to denote time, since this use is so natural and widespread in the literature on longitudinal data, and use i to denote a population unit, usually an individual person. As a result, the basic values of the response variable of interest will now be denoted yit with two subscripts, i for unit and t for time. This represents a departure from the notation used so far, where t has denoted unit.

Models will be considered in which time may be either discrete or continuous. In the former case, t takes a sequence of possible values t_1, t_2, t_3, …, usually equally spaced. In the latter case, t may take any value in a given interval. The response variable Y may also be either discrete or continuous, corresponding to the distinction between Parts B and C of this book.

The case of continuous Y will be discussed first, in Section 13.2 and in Chapter 14 (Skinner and Holmes). Only discrete time will be considered in this case, with the basic values to be modelled consisting of {y_it; i = 1, …, N; t = 1, …, T} for the N units in the population and T equally spaced time points. The case of discrete Y will then be discussed in Section 13.3 and in Chapters 15 (Lawless) and 16 (Mealli and Pudney). The emphasis in these chapters will be on continuous time.

A basic longitudinal survey design involves a series of waves of data collection, often equally spaced, for all units in a fixed sample, s. At a given wave, either the current value of a variable or the value for a recent reference period may be measured. In this case, the sample data are recorded in discrete time and will take the form {y_it; i ∈ s, t = 1, …, T} for variables that are measured at all waves, in the absence of nonresponse. In the simplest case, for discrete time

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. 2003 John Wiley & Sons, Ltd



models of the kind considered by Skinner and Holmes (Chapter 14), the times 1, …, T at which data are collected correspond to the time points for which the model of interest is specified. In this case, longitudinal analysis may be considered as a form of multivariate analysis of the vectors of T responses (y_i1, …, y_iT), and the question of how to handle complex sampling schemes in the selection of s is a natural generalisation of this question for cross-section surveys.

More generally, however, sampling needs to be considered in the broader context of the 'observation plan' of the longitudinal survey. Thus, even if the same sample of units is retained for all time points, it is necessary to consider not only how the sample was selected, but also for what values of t Y will be measured for sample units. Particular care may be required when models are specified in continuous time, as by Lawless (Chapter 15) and by Mealli and Pudney (Chapter 16). Sometimes, continuous time data are collected retrospectively, over some time 'window'. Sometimes, only the current value of Y may be collected at each wave in a longitudinal survey. Methods of making inference about continuous time models will need to take account of these observation plans. For example, if y_it is binary, indicating whether unit i has experienced an event (first marriage, for example) by time t, then Lawless (1982, section 7.3) shows how data on y_it at a finite set of time points t can still be used to fit continuous time models, such as a survival model of age at which the event first occurs.

There are several further ways in which the sampling and observation plan may be more complicated (Kalton and Citro, 1993). There will usually be attrition and other forms of nonresponse, as discussed by Skinner and Holmes (Chapter 14). The sample may be subject to panel rotation and may be updated between waves to allow for new entrants to the population. Problems with tracking and tracing may lead to additional sources of incomplete data. The time windows used in data collection may also create complex censoring and truncation patterns for model fitting, as discussed further in Section 15.3.

In addition to incomplete data issues, there are potentially new problems of measurement error in longitudinal surveys. Recall error is common in retrospective measurement and there may be difficulties in 'piecing together' different fragments of an event history from multiple waves of a longitudinal survey (Skinner, 2000). Lawless (Chapter 15) refers to some of these problems.

13.2. CONTINUOUS RESPONSE DATA

In this section we discuss the basic discrete time continuous response case, where values y_it and x_it are measured on a continuous response variable Y and a vector of covariates X respectively, at equally spaced time points t = 1, …, T for units i in a sample s. A basic regression model relating Y to X is then given by

y_it = x_it′ β + u_i + v_it,  (13.1)


where β is a vector of regression coefficients and u_i and v_it are disturbance terms. The term u_i allows for a 'permanent' effect of the ith unit, above or below that predicted by the covariates X.

The term v_it will usually be assumed to be the outcome of a random variable, which has mean 0 (conditional on the covariates). This is analogous to the standard disturbance term in a regression model.

The term u_i may sometimes be assumed to be a fixed unknown parameter, a fixed effect, or sometimes, like v_it, the outcome of a random variable, a random effect. The two approaches may be linked by viewing the fixed effects specification as a conditional version of the random effects specification, i.e. with inference being conditional on the effects in the sample (Hsiao, 1986, p. 42). From a survey sampling perspective, this fixed effects approach seems somewhat unnatural. The actual units included in the sample will be arbitrary and a function of the sampling scheme; it seems preferable that the interpretation of the parameters should not be dependent upon the actual sample drawn. It is more natural therefore to specify the model in terms of random effects, where the distribution of the random effects may be interpreted in terms of a superpopulation from which the finite population has been drawn. This is the approach adopted by Skinner and Holmes (Chapter 14).

The potential disadvantage of the random effects specification is that it involves stronger modelling assumptions than the fixed effects model; in particular it requires that the u_i have mean 0, conditional on the x_it, whereas the fixed effects specification allows for dependence between the u_i and the x_it. Thus, if such dependence exists, estimation of β based upon the random effects model may be biased. See Hsiao (1986, section 3.4) and Baltagi (2001, section 2.3.1) for further discussion.
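This bias can be seen in a small simulation (an illustrative sketch with made-up parameter values, not an analysis from this chapter): when u_i is correlated with x_it, a pooled estimator that ignores the correlation is biased, while the fixed effects 'within' estimator is not.

```python
import numpy as np

rng = np.random.default_rng(42)
N, T, beta = 2000, 5, 1.0

# Permanent effects u_i, deliberately correlated with the covariate.
u = rng.normal(0, 1, N)
x = u[:, None] + rng.normal(0, 1, (N, T))   # x_it depends on u_i
v = rng.normal(0, 1, (N, T))
y = beta * x + u[:, None] + v               # model (13.1), no intercept

# Pooled estimator (consistent only if u_i is uncorrelated with x_it).
b_pooled = (x * y).sum() / (x * x).sum()

# Fixed effects ("within") estimator: demean within each unit first.
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
b_within = (xd * yd).sum() / (xd * xd).sum()

print(round(b_pooled, 2), round(b_within, 2))  # pooled ≈ 1.5 (biased), within ≈ 1.0
```

Here the pooled slope converges to 1.5 rather than the true value 1.0, because under these simulated values cov(x_it, u_i)/var(x_it) = 0.5.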

There are many possible ways of extending the basic random effects model in (13.1) (Goldstein, 1995, Ch. 6). The coefficients β may vary randomly across units, giving a random coefficient model in contrast to (13.1), which may be viewed as a random intercept model. The v_it terms may be serially correlated, as considered by Skinner and Holmes (Chapter 14), or the model may be dynamic in the sense that lagged values of the dependent variable, y_i,t−1, y_i,t−2, …, may be included amongst the covariates. There are also alternative approaches to the analysis of continuous response discrete time longitudinal data, such as the use of 'marginal' regression models (Diggle et al., 2002), which avoid specifying random effects.

A traditional approach to inference about random effects models is through maximum likelihood. The likelihood is determined by making parametric assumptions about the distribution of the random effects, for example by assuming that the u_i and v_it in (13.1) are normally distributed. It will usually not be possible to maximise the likelihood analytically and, instead, iterative numerical methods may be used. One approach is to use extensions of estimation methods for regression models, such as iterative generalised least squares (Goldstein, 1995). Another approach is to view the model as a 'covariance structure model' for the covariance matrix of the y_it, treated as a multivariate outcome for the different time points. Both these approaches are described by




Skinner and Holmes (Chapter 14), together with their extensions to allow for complex sampling schemes.

13.3. DISCRETE RESPONSE DATA

We now turn to the case when the response variable Y is discrete. One approach to the analysis of such data is to use methods based upon discrete response extensions of the models discussed in the previous section. These extensions may be obtained in the same way that continuous response regression models are commonly generalised to discrete response models. For example, logistic or probit regression models might be used for the case of binary Y (Hsiao, 1986, Ch. 7; Baltagi, 2001, Ch. 11).

An alternative approach is to use the methods of event history analysis, which focus on the 'spells' of time which units spend in the different categories or 'states' of Y. The transitions between these states define certain 'events'. For example, if Y is marital status then the transition from the state 'single' to the state 'married' arises at the event of marriage. Event history models provide a framework for representing spell durations and transitions and their dependence over time and on covariates.

The choice between the two approaches is not a matter of which model is correct. Both kinds of model may be valid at the same time; they may simply be alternative representations of the distribution of the observed outcomes. The choice depends rather on the objectives of the analysis. In simplified terms, the first approach is appropriate when the aim is, as in Section 13.2, to model the dependence of the response variable Y on the covariates and when the temporal nature of the variables is usually not central to interpretation. The event history approach is appropriate when the aim is to model the durations in states and/or to model the transitions between states or, equivalently, the timing of events.

The first approach is usually based upon discrete time data, as in Section 13.2, with the observations usually corresponding to the waves of the longitudinal survey. We shall not consider this approach further in this chapter, but refer the reader to discussions of discrete response longitudinal data in Hsiao (1986, Ch. 7) and Diggle et al. (2002).

The alternative event history approach will be considered by Lawless (Chapter 15) and Mealli and Pudney (Chapter 16) and is introduced further in this section. This approach focuses on temporal social phenomena (events, transitions, durations), which are usually not tied to the timing of the waves of the longitudinal survey. For example, it is unlikely that the timing of births, marriages or job starts will coincide with the waves of a survey. Since such timing is central to the aims of event history analysis, it is natural to define event history models in a continuous time framework, unconstrained by the timing of the survey data collection, and this is the framework adopted in Chapters 15 and 16 and assumed henceforth in this section. Discrete time analogues are, nevertheless, sometimes useful; see the comments by Lawless (Chapter 15), Allison (1982) and Yamaguchi (1991).



The period of time during which a unit is in a given state is referred to as an episode or spell. The transition between two states defines an event. For example, an individual may move between three states: employed, unemployed and not in the labour force. The individual may thus experience a spell of employment or unemployment. The event of finding employment corresponds to the transition between the state of unemployment (or not in the labour force) and the state of employment. While changes in states define events, events may conversely be taken to define states. For example, the event of marriage defines the two states 'never-married' and 'ever-married', which an individual may move between. A discrete response process in continuous time may thus be represented either as a sequence of states or as a sequence of events. This is formalised by Lawless (Chapter 15), who distinguishes two mathematically equivalent frameworks which may be adopted for event history analysis. The multi-state framework defines the event history in terms of Y_i(t), the state occupied by unit i at time t. In our discussion above, Y_i(t) is just the same as y_it, the discrete response of unit i at time t. This is also the basic framework adopted by Mealli and Pudney (Chapter 16). The alternative framework considered by Lawless is the event occurrence framework, in which the event history is defined in terms of the number of occurrences of events of different types at different times. The latter framework may be more convenient when, for example, there is interest in the frequency of certain repeated events, such as consumer purchases. The remaining discussion in this section focuses on the multi-state framework.

In order to understand the general models and methods of event history analysis considered in this part of the book, it is sensible first to understand the models and methods of survival analysis (e.g. Lawless, 1982; Cox and Oakes, 1984). As discussed by Lawless (Chapter 15), survival analysis deals with the special case when there are just two states and there is only one allowable transition, from state 1 to state 2. The focus then is on the time spent in state 1 before the transition to state 2 occurs. This time is called the survival time or duration. Basic survival analysis models treat this survival time as a random variable. The probability distribution of this random variable may be represented in terms of the hazard function λ(t), which specifies the instantaneous rate of transition to state 2 given that the unit is still in state 1 at time t (see Equation (15.7)). This hazard function is also called the transition intensity. It may depend on covariates x and otherwise on the unit.

The basic survival analysis model is extended to general event history models by defining a series of transition intensity functions λ_kl(t) for each pair of distinct states k and l. The function is given in Equations (15.4) and (16.2) and represents the instantaneous probability of a transition from state k into state l at time t. This function may depend on the past history of the Y_i(t) process, in particular on the duration in the current state k; it may depend on covariates x; and it may depend otherwise on the unit i.

In the simple survival analysis case, knowledge of the hazard function is equivalent to knowledge of the distribution of the duration (Equations (15.8) and (15.9) show how the latter distribution may be derived from the hazard



function). In the more general event history model, the intensity functions capture two separate features of the process, which may each be of interest. First, the intensity functions imply the relative probabilities of transition to alternative states, given that a transition occurs. Second, as for survival analysis, the intensity functions imply the distribution of the duration in any given state (Equation (15.6)).
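The equivalence between the hazard function and the duration distribution can be checked numerically via S(t) = exp(−∫₀ᵗ λ(u) du); the Weibull hazard below is an illustrative choice of mine, not the parameterisation used in Chapter 15.

```python
import numpy as np

# Illustrative Weibull hazard with shape k and scale s:
#   lambda(t) = (k/s) * (t/s)**(k-1),  S(t) = exp(-(t/s)**k).
k, s = 1.5, 2.0
hazard = lambda t: (k / s) * (t / s) ** (k - 1)
survivor = lambda t: np.exp(-((t / s) ** k))    # closed-form survivor function

# Recover S(t) from the hazard by numerically accumulating the integral.
t_grid = np.linspace(1e-9, 5.0, 200001)
step = t_grid[1] - t_grid[0]
cum_hazard = np.cumsum(hazard(t_grid)) * step
S_numeric = np.exp(-cum_hazard)

t_check = 3.0
i = np.searchsorted(t_grid, t_check)
print(abs(S_numeric[i] - survivor(t_check)) < 1e-3)
```

The same recipe applies to any hazard for which the cumulative integral can be evaluated, which is why specifying λ(t) is enough to specify the whole duration distribution.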

The corresponding distinction in the data between states and durations is emphasised in Mealli and Pudney's notation (Chapter 16). Consider the data {Y_i(t); t_1 ≤ t ≤ t_2} generated by observing a unit i over a finite interval of time [t_1, t_2]. Mealli and Pudney summarise these data in terms of τ_1, τ_2, …, τ_m and r_0, r_1, …, r_m−1, where m is the number of observed episodes, τ_j is the duration of the jth episode and r_j is the destination state of this episode. The initial state is r_0 = Y_i(t_1). The joint distribution of the τ_j and the r_j may then be represented in terms of the intensity functions in Equation (16.3). In Lawless' notation (Chapter 15) the times of the events for unit i are denoted t_i1 < t_i2 < t_i3 < …, so the durations τ_j are given by t_ij − t_i,j−1.
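As a concrete sketch, the summary of an observed history into durations τ_j and destination states r_j might be computed as follows (a hypothetical helper; the function name and the labour-market state labels are invented for illustration):

```python
def episodes(t1, t2, initial_state, events):
    """Summarise a state history over [t1, t2] as durations and destinations.

    `events` is a time-sorted list of (time, new_state) transitions with
    t1 < time < t2. The final episode is cut off ("censored") at t2, so it
    contributes a duration but no destination state.
    """
    times = [t1] + [t for t, _ in events] + [t2]
    durations = [b - a for a, b in zip(times, times[1:])]   # tau_1, ..., tau_m
    destinations = [state for _, state in events]           # r_1, ..., r_m-1
    return durations, destinations

# Example: unemployed ("U") at t1 = 0, finds a job ("E") at t = 1.5,
# loses it at t = 4.0, still unemployed when observation ends at t2 = 5.0.
taus, dests = episodes(0.0, 5.0, "U", [(1.5, "E"), (4.0, "U")])
print(taus, dests)   # [1.5, 2.5, 1.0] ['E', 'U']
```

The initial state r_0 is carried separately ("U" here), matching the convention above that r_0 = Y_i(t_1).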

Event history models may be represented parametrically, semi-parametrically or non-parametrically. A simple example of a parametric model is the Weibull model for the duration of breast feeding in Section 15.8. A more complex example is given by Mealli and Pudney (Chapter 16), who discuss a four-state application using a parametric model which is primarily based upon the Burr form for the transition intensities in Equation (16.8). The most well-known example of a semi-parametric model is the proportional hazards model, defined in Equation (15.10) and illustrated also on the breast feeding data. Some discussion of non-parametric methods, such as the Kaplan–Meier estimation of the survivor function, is given in Section 15.5.

Right-censoring is a well-known issue in survival analysis, arising as the result of incomplete observation of survival times. A wide range of such observation issues may arise in sample surveys and these are discussed in Section 15.3.

Beyond the issues of inference about models in the presence of sample selection, incomplete data and measurement error, there is the further question of how to use the fitted models to address policy questions. Mealli and Pudney (Chapter 16) consider the question of how to measure the impact of a youth training programme and describe a simulation approach based upon a fitted event history model.




CHAPTER 14

Random Effects Models for Longitudinal Survey Data

C. J. Skinner and D. J. Holmes

14.1. INTRODUCTION

Random effects models have a number of important uses in the analysis of longitudinal survey data. The main use, which we shall focus on in this chapter, is in the study of individual-level dynamics. Random effects models enable variation in individual responses to be decomposed into variation between the 'permanent' characteristics of individuals and temporal 'transitory' variation within individuals.

Another important use of random effects models in the analysis of longitudinal data is in allowing for the effects of time-constant unobserved covariates in regression models (e.g. Solon, 1989; Hsiao, 1986; Baltagi, 2001). Failure to allow for these unobserved covariates in regression analysis of cross-sectional survey data may lead to inconsistent estimation of regression coefficients. Consistent estimation may, however, be achievable with the use of random effects models and longitudinal data.

A 'typical' random effects model may be conceived of as follows. It is supposed that a response variable Y is measured at each of a number of successive waves of the survey. The measurement for individual i at wave t is denoted y_it and this value is assumed to be generated in two stages. First, 'permanent' random effects θ_i are generated from some distribution for each individual i. Then, at each wave, y_it is generated from θ_i. In the simplest case this generation follows the same process independently at each wave. For example, we may have

θ_i ∼ N(μ, σ_1²),  y_it | θ_i ∼ N(θ_i, σ_2²).  (14.1)

Under this model, longitudinal data enable the 'cross-sectional' variance σ_1² + σ_2² of y_it to be decomposed into the variance σ_1² of the 'permanent' component θ_i and the variance σ_2² of the 'transitory' component at each wave.


This may aid understanding of the mobility of individuals over time in terms of their place in the distribution of the response variable. An example, which we shall focus on, is where the response variable is earnings, subject to a log transformation, and a model of the form (14.1) enables us to study the degree of mobility of an individual's place in the earnings distribution (e.g. Lillard and Willis, 1978).

Rich classes of random effects models for longitudinal data have been developed for the above purposes. A number of different terms have been used to describe these models, including variance component models, error component models, mixed effects models, multilevel models and hierarchical models (Baltagi, 2001; Hsiao, 1986; Diggle, Liang and Zeger, 1994; Goldstein, 1995).

The general aim of this chapter is to consider how to take account of complex sampling designs in the fitting of such random effects models. We shall suppose that there is a known probability sampling scheme employed to select the sample of individuals followed over the waves of the survey. Two additional complications will be that there may be wave nonresponse, so that not all sampled individuals will respond at each wave, and that the target population of individuals may not be fixed over time.

To provide a specific focus, we will consider data on earnings of male employees over the first five waves of the British Household Panel Survey (BHPS), that is over the period 1991–5. As a basic model for the log earnings y_it of individual i at wave t = 1, …, T we shall suppose that

y_it = β_t + u_i + ν_it,  t = 1, …, T,  (14.2)

where the random effect u_i is the 'permanent' random effect, referred to earlier, and the ν_it are transitory random effects, whose effects on the response variable may last beyond the current wave t via the first-order autoregressive (AR(1)) model:

ν_it = ρ ν_i,t−1 + ε_it,  t = 1, …, T.  (14.3)

Both u_i and ν_it may include the effects of measurement errors (Abowd and Card, 1989). The random variables u_i and ε_it are assumed to be mutually independent with

E(u_i) = E(ε_it) = 0,  var(u_i) = σ_u²,  var(ε_it) = σ_ε².

The unknown fixed parameters β_t (t = 1, …, T) represent annual (inflation) effects. Lillard and Willis (1978) considered this model (amongst others) for log-earnings for seven years (1967–73) of data from the US Panel Study of Income Dynamics. Letting σ_ν² = var(ν_it) and assuming the ε_it are mutually independent and the ν_it stationary, we obtain

σ_ν² = σ_ε² / (1 − ρ²).  (14.4)

We refer to the above model as Model B and to the more restricted 'variance components' model in which ρ = 0 as Model A. See Goldstein, Healy and Rasbash (1994) for further discussion of such models.
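Relation (14.4) is easy to verify by simulation (a sketch with arbitrary parameter values; the annual effects β_t are set to zero for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 200000, 60
rho, s_eps = 0.5, 0.3          # arbitrary illustrative values

# Start nu_i0 in the stationary distribution, then iterate the AR(1) in (14.3).
s_nu = s_eps / np.sqrt(1 - rho**2)
nu = rng.normal(0, s_nu, N)
for t in range(T):
    nu = rho * nu + rng.normal(0, s_eps, N)

# The cross-sectional variance of nu_it stays at s_eps**2 / (1 - rho**2).
print(round(nu.var(), 3), round(s_nu**2, 3))
```

Starting from any other initial variance, the iteration would converge geometrically to the same stationary value, which is the content of (14.4).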


We shall consider two broad approaches to fitting these models under a complex sample design. The first is a covariance structure approach, following Chamberlain (1982) and Skinner, Holt and Smith (1989, section 3.4.5, henceforth referred to as SHS), in which the observations on the T waves are treated as a multivariate outcome with individuals as 'single-level' units. This approach is set out in Section 14.2. The second approach treats the data as two-level (Goldstein, 1995), with the level 1 units as the waves t = 1, …, T and the level 2 units as the individuals i. The aim is to apply the methods developed by Pfeffermann et al. (1998). This approach is set out in Section 14.3. A related approach is developed by Feder, Nathan and Pfeffermann (2000) for a model with time-varying random effects. The application of both our approaches to earnings data from the British Household Panel Survey will be considered in Section 14.4.

14.2. A COVARIANCE STRUCTURE APPROACH

Following the notation in Section 14.1, let y_i = (y_i1, …, y_iT)′ be the T × 1 vector representing the profile of values of individual i over the T waves of the survey. Under the model defined by (14.2)–(14.4), these multivariate outcomes are independent with mean vector and covariance matrix given respectively by

E(y_i) = β = (β_1, …, β_T)′,  (14.5)

var(y_i) = σ_u² J_T + σ_ν² V_T(ρ),  (14.6)

where J_T is the T × T matrix of ones and V_T(ρ) is the T × T matrix with the (t, t′)th element given by ρ^(t′−t) (1 ≤ t ≤ t′ ≤ T).

These equations define a 'covariance structure' model in which the mean vector β is unconstrained but the k = T(T + 1)/2 distinct elements of the covariance matrix are constrained to be functions of the parameter vector θ = (σ_u², σ_ε², ρ)′. Inference about these parameters may follow the approach outlined in SHS (section 3.4.5).

Assuming first no nonresponse, let the data consist of the values y_i for units i in a sample s. The usual survey estimator of the finite population covariance matrix S is given by

Ŝ = Σ_s w_i (y_i − ȳ)(y_i − ȳ)′ / Σ_s w_i,  (14.7)

where

ȳ = Σ_s w_i y_i / Σ_s w_i,

and where w_i is the survey weight for individual i. Let Â = vech(Ŝ) denote the k × 1 vector of distinct elements of Ŝ (the 'vector half' of Ŝ: see Fuller, 1987, p. 382) and let A(θ) = vech[var(y_i)] denote the corresponding vector of elements of var(y_i) from (14.6).
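In code, the structured covariance matrix (14.6), the weighted estimator (14.7) and the vech operator might be sketched as follows (the helper names are mine; the AR(1) element ρ^|t′−t| for V_T(ρ) follows the definition above):

```python
import numpy as np

def model_cov(T, s2_u, s2_nu, rho):
    """var(y_i) = s2_u * J_T + s2_nu * V_T(rho), with V_T(rho)[t, t'] = rho**|t'-t|."""
    idx = np.arange(T)
    V = rho ** np.abs(idx[:, None] - idx[None, :])
    return s2_u * np.ones((T, T)) + s2_nu * V

def weighted_cov(y, w):
    """Survey estimator (14.7): weighted covariance of the wave profiles y_i."""
    ybar = (w[:, None] * y).sum(axis=0) / w.sum()
    d = y - ybar
    return (w[:, None, None] * d[:, :, None] * d[:, None, :]).sum(axis=0) / w.sum()

def vech(M):
    """Stack the distinct (lower-triangle) elements of a symmetric matrix."""
    return M[np.tril_indices(M.shape[0])]

T = 4
Sigma = model_cov(T, s2_u=0.5, s2_nu=0.2, rho=0.6)
print(vech(Sigma).shape)   # k = T(T+1)/2 = 10 distinct elements
```

With all weights equal, `weighted_cov` reduces to the ordinary (divide-by-n) sample covariance matrix, so the design enters only through the w_i.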


Following Chamberlain (1982), Fuller (1984) and SHS (section 3.4.5), a general class of estimators of θ is obtained by minimising

[Â − A(θ)]′ V⁻¹ [Â − A(θ)],  (14.8)

where V is a given k × k non-singular matrix. A generalised least squares (GLS) estimator θ̂_GLS is obtained by taking V to be a consistent estimator of the covariance matrix of Â. One choice, V_c, is obtained from the linearisation method (Wolter, 1985) by approximating the covariance matrix of the elements of Â by the covariance matrix of the corresponding elements of the linear statistic

Σ_s z_i,  (14.9)

where z_i = w_i [(y_i − ȳ)(y_i − ȳ)′ − Ŝ] / Σ_s w_i and w_i is treated as a fixed variable. The estimator V_c may allow in the usual way for the complex design (Wolter, 1985).

Since A(θ) is a non-linear function of θ, iterative minimisation of (14.8) is required. It may be noted that, for a given value of ρ under Model B, A(θ) is linear in (σ_u², σ_ν²) and so closed form expressions may be determined for the values σ̂_u²(ρ) and σ̂_ν²(ρ) which minimise (14.8) for given ρ. The iterative minimisation may thus be reduced to a scalar problem. A consistent estimator of the covariance matrix of θ̂_GLS is given by (Fuller, 1984)

V_L(θ̂_GLS) = [Ȧ(θ̂_GLS)′ V_c⁻¹ Ȧ(θ̂_GLS)]⁻¹,  (14.10)

where Ȧ(θ) = ∂A(θ)/∂θ.

An advantage of the GLS approach is that it provides a ready-made goodness-of-fit test as the minimised value of the criterion in (14.8), namely the Wald statistic:

X²_W = [Â − A(θ̂_GLS)]′ V_c⁻¹ [Â − A(θ̂_GLS)].  (14.11)

If the model is correct and if the sample is large enough for V_c to be a good approximation to the covariance matrix of Â, then X²_W should be distributed approximately as chi-squared with k − q degrees of freedom, where q = 2 and 3 for Models A and B respectively.

One potential problem with the GLS estimator is that the covariance matrix estimator may be unstable if it is based on a relatively small number of degrees of freedom. This may lead to departures from the null distribution of X²_W assumed above. In this case, it may be preferable to consider alternative choices of V. One approach is to let V be an estimator of the covariance matrix of Â based upon the (false) assumption that observations are independent and identically distributed. Thus, if we write

Â = Σ_s a_i,  (14.12)
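The scalar reduction described above, with closed-form estimates of (σ_u², σ_ν²) for each candidate ρ, can be sketched as a grid search using the identity-matrix (OLS) variant of criterion (14.8). Function and variable names are mine; a real application would use Â estimated from survey data and V = V_c or V_iid rather than the identity matrix.

```python
import numpy as np

def vech(M):
    return M[np.tril_indices(M.shape[0])]

def fit_covariance_structure(S_hat, T, rho_grid=np.linspace(-0.95, 0.95, 381)):
    """Minimise (14.8) with V = I: for fixed rho, A(theta) is linear in
    (s2_u, s2_nu), so those two are solved by least squares and only the
    scalar rho is searched over a grid."""
    a_hat = vech(S_hat)
    idx = np.arange(T)
    best = None
    for rho in rho_grid:
        X = np.column_stack([
            vech(np.ones((T, T))),                              # dA/d s2_u
            vech(rho ** np.abs(idx[:, None] - idx[None, :])),   # dA/d s2_nu
        ])
        coef, *_ = np.linalg.lstsq(X, a_hat, rcond=None)
        crit = ((a_hat - X @ coef) ** 2).sum()                  # criterion (14.8), V = I
        if best is None or crit < best[0]:
            best = (crit, coef[0], coef[1], rho)
    _, s2_u, s2_nu, rho = best
    return s2_u, s2_nu, rho

# Check on an exactly structured input: recovery should be near-perfect.
T = 5
idx = np.arange(T)
S_true = 0.5 * np.ones((T, T)) + 0.2 * 0.6 ** np.abs(idx[:, None] - idx[None, :])
print(fit_covariance_structure(S_true, T))
```

Replacing the plain residual sum of squares by a quadratic form in V⁻¹ turns this into the GLS variant, at the cost of the distributional caveats for the Wald statistic discussed in the text.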


where a_i = vech(z_i) denotes the k × 1 vector of distinct elements of z_i, then we may set V equal to

V_iid = n Σ_s (a_i − ā)(a_i − ā)′ / (n − 1),  (14.13)

where ā = Σ_s a_i / n and n denotes the sample size. Although V_iid may be more stable than a variance estimator which allows for the complex design, this choice of V is still correlated with Â and, as discussed by Altonji and Segal (1996), may lead to serious bias in the estimation of θ. To avoid this problem, an even simpler approach is to set V equal to the identity matrix, when the estimator of θ obtained by minimising (14.8) may be viewed as an ordinary least squares (OLS) estimator. In both the cases when V = V_iid and when V is the identity matrix, the resulting estimator θ̂ will still be consistent for θ but the Wald statistic X²_W will no longer follow a chi-squared distribution if the model is true. The large-sample distribution will instead be a mixture of chi-squared distributions and this may be approximated by a chi-squared distribution using one or two moment Rao–Scott approximations (SHS, Ch. 4). It is also no longer appropriate to use expression (14.10) to obtain standard errors for the elements of θ̂. Instead, as noted in SHS (Ch. 3), a consistent estimator of the covariance matrix of θ̂ is

V(θ̂; V = V_0) = [Ȧ(θ̂)′ V_0⁻¹ Ȧ(θ̂)]⁻¹ [Ȧ(θ̂)′ V_0⁻¹ V_c V_0⁻¹ Ȧ(θ̂)] [Ȧ(θ̂)′ V_0⁻¹ Ȧ(θ̂)]⁻¹,

where V_0 is the specified choice of V (V_iid or the identity matrix) used to determine θ̂ and V_c is a consistent estimator of the covariance matrix of Â under the complex design. Note that this expression reduces to (14.10) when V_0 = V_c.

The approach considered so far in this section is based on the estimated covariance matrix Ŝ in (14.7) and assumes no nonresponse. This is an unrealistic assumption. The simplest way of handling nonresponse is to consider only those individuals who respond on all T waves, the so-called 'attrition sample', s_T, at wave T. For longitudinal surveys, designed for multipurpose longitudinal analyses, it is common to construct longitudinal weights w_it at each wave t, which are appropriate for longitudinal analysis based upon data for the attrition sample s_t of individuals who respond up to wave t (Lepkowski, 1989). Thus, the simplest approach is to use only data from the attrition sample s_T and to replace the weights w_i, e.g. in (14.7), by the weights w_iT.

A more sophisticated approach, aimed at producing more efficient estimates, uses data from all attrition samples s_1, …, s_T. A recursive approach to the estimation of the covariance matrix of y_i may then be developed. Let y_i^(t) = (y_i1, …, y_it)′ and let Ŝ^(t) denote the estimated t × t covariance matrix of y_i^(t). Begin the recursion by setting

Ŝ^(1) = Σ_{s_1} w_i1 (y_i1 − ȳ_1)² / Σ_{s_1} w_i1,


where

as in (14.7). At the tth step of the recursion (t 2, . . . , T ) set the(t 1) (t 1) submatrix of (t) corresponding to y(t 1)

i equal to . Letb(t) be the vector of weighted regression coefficients of yit on y(t 1)

i given by

where

Then set the (tt)th element of (t), corresponding to the variance of yit , equal to

where

Finally, let (t)t, t 1 denote the 1 (t 1) vector of remaining elements of S (t)

corresponding to the covariances between yit and y(t 1)i and let

The recursive process is repeated for t 2, . . . , T . If yi is multivariate normaland there are no weights the resulting (t) is a maximum likelihood estimator(Holt, Smith and Winter, 1980) for data from the set of attrition samples. Ingeneral, the estimator may be viewed as a form of pseudo-likelihood estimator(see Chapter 2). If the weights do not vary greatly, if yi is approximatelymultivariate normal and the observations for most individuals fall into one ofthe attrition samples, the estimator (t) may be expected to be fairly efficient.

Weighting can become unwieldy if one attempts to adjust for all possible wave nonresponse patterns in addition to the attrition samples. See, for example, Lepkowski (1989) for further discussion. For a more general discussion of inference in the presence of nonresponse see Chapter 18. We return in Section 14.4 to the application of the methods discussed in this section.

14.3. A MULTILEVEL MODELLING APPROACH

A second approach to handling complex survey designs in the fitting of the models defined in Section 14.1 is by adapting standard approaches, such as iterative generalised least squares (IGLS), used for fitting random effects models (Goldstein, 1995). Pfeffermann et al. (1998) have considered modifying


�y1 �Xs1

wi1yi1=Xs1

wi1

�ÿ � ÿ S(t) ÿ

S(tÿ1)ÿ

b(t) �X

st

wit( y(tÿ1)i ÿ �y(tÿ1))( y

(tÿ1)i ÿ �y(tÿ1))0

" #ÿ Xst

wit( y(tÿ1)i ÿ �y(tÿ1))yit

�y(tÿ1) �X

st

wity(tÿ1)i =

Xst

wit:

S

s2et � b(t)

0S(tÿ1)b(t),

1

s2et �

Xst

wit(eit ÿ �et)2=X

st

wit, eit � yit ÿ y(tÿ1)0i b(t), �et �

Xst

witeit=X

st

wit:

S ÿ � ÿÿ

S(t)t, tÿ1 � b(t)0 S(tÿ1):

�S

^

S


IGLS estimation using an approach analogous to the pseudo-likelihood method (see Chapter 2) for a model of the form (14.2), where the v_it are not serially correlated. Here we consider the extension of their approach to a longitudinal context, allowing for serial correlation. A potential advantage of this approach is that covariates may be handled more directly in the model. A potential disadvantage is that goodness-of-fit tests are not generated so directly.
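Before turning to the weighted IGLS machinery, the role of the variance components can be illustrated with a deliberately simple probability-weighted method-of-moments fit of Model A (random intercept u_i plus white-noise v_it) on a balanced attrition sample. This is a hypothetical sketch for intuition only, not the Pfeffermann et al. (1998) algorithm.

```python
import numpy as np

def moment_vc(Y, w):
    """Weighted moment estimates of (sigma^2_u, sigma^2_v) under Model A:
    y_it = beta_t + u_i + v_it (a sketch, not PWIGLS).

    Y : (n, T) responses for individuals observed at all T waves.
    w : (n,) longitudinal weights, e.g. w_iT.
    """
    n, T = Y.shape
    beta = (w[:, None] * Y).sum(axis=0) / w.sum()   # weighted wave means beta_t
    E = Y - beta                                     # residuals, approx. u_i + v_it
    ebar = E.mean(axis=1)                            # person means: u_i + vbar_i
    # within-person variation estimates sigma^2_v
    s2v = (w * ((E - ebar[:, None]) ** 2).sum(axis=1) / (T - 1)).sum() / w.sum()
    # E[ebar_i^2] = sigma^2_u + sigma^2_v / T
    s2u = (w * ebar ** 2).sum() / w.sum() - s2v / T
    return s2u, s2v
```

The design choice mirrors the weighting principle of this section: sums over individuals carry the survey weights, while the within-person moments use the balanced wave structure directly.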

In multilevel modelling terminology (Goldstein, 1995), the individuals are the level 2 units and the repeated measurements at the different waves represent level 1 units. Pfeffermann et al. (1998) allow for a two-stage sampling scheme, whereby the level 2 units i are selected with inclusion probabilities π_i and the level 1 units t with inclusion probabilities π_{t|i} conditional on level 2 unit i being selected. Weights w_i and w_{t|i} are then constructed equal to the reciprocals of these respective probabilities, which are assumed known. To adapt this approach to our context of longitudinal surveys subject to wave nonresponse, it seems natural to let π_i denote the probability that individual i is sampled and π_{t|i} the probability that this individual responds at wave t. While we may reasonably suppose that the π_i are known, it is not straightforward to estimate the π_{t|i} for general patterns of wave nonresponse (as noted in the covariance structure approach of Section 14.2). We therefore restrict attention to estimation using only the data derived from the attrition samples s_t. As noted in Section 14.2, it is common for longitudinal weights w_it to be available for use with these attrition samples and we shall suppose here that these approximate (π_i π_{t|i})⁻¹. We may then set w_i equal to the design weight π_i⁻¹ and w_{t|i} equal to w_it/w_i. Alternatively, given w_i1, ..., w_iT, we may set w_i = w_i1 and w_{t|i} = w_it/w_i1 (t = 1, ..., T). Note, in particular, that in this case w_{1|i} = 1 for all i. This approach treats the sample selection and the response process at the first wave as a common selection process. In the approach of Pfeffermann et al. (1998), correction for bias by weighting tends to be more difficult at level 1 than at level 2, because there tends to be more non-linearity in the IGLS estimator as a function of level 1 sums than of level 2 sums. Hence setting w_i = w_i1 may be preferable to setting w_i = π_i⁻¹, because the resulting w_{t|i} may be less variable and closer to one.

Having then constructed the weights w_i and w_{t|i}, the approach of Pfeffermann et al. (1998) may be applied to fit a model of form (14.2) where the v_it are not serially correlated. This is Model A. The basic approach is to modify the IGLS estimation procedure by weighting all sums over i by the weights w_i and weighting all sums over t by the weights w_{t|i}.

Often survey weights are only available in a scaled form; for example, so that they sum to the sample size. For inference about many regression-type models, as in Parts B and C of this book, estimation procedures for the model parameters are invariant to such scaling. Although this is also true for multilevel modelling if the w_i are scaled, it is not true if the weights w_{t|i} are scaled. Pfeffermann et al. (1998) took advantage of this fact to choose a scaling to minimise small-sample estimation bias. In our context we consider scaling the weights w_{t|i} to construct the scaled weights w*_{t|i} as

w*_{t|i} = t*(i) w_{t|i} / Σ_{t=1}^{t*(i)} w_{t|i},

where t*(i) is the last wave at which individual i responds (1 ≤ t*(i) ≤ T). Hence the average scaled weight w*_{t|i} for individual i across waves 1, ..., t*(i) is equal to one.

We now consider the question of how to adapt the approach of Pfeffermann et al. (1998) to allow for possible serial correlation of the v_it in Model B. We follow an approach similar to that in Hsiao (1986, section 3.7), which is based on observing that if we know ρ then Model B may be transformed to the form of Model A by

y_it − ρy_{i,t−1} = (β_t − ρβ_{t−1}) + (1 − ρ)u_i + ε_it.   (14.14)

The estimation procedure involves two steps:

Step 1. Eliminate the random effect u_i by differencing the responses y_it:

D_it = y_it − y_{i,t−1},  i ∈ s_t,  t = 2, ..., T,

and estimate the linear regression model

D_it = δ_t + γD_{i,t−1} + η_it

by OLS weighted by the weights w_it for observations i in the attrition samples s_t (t = 2, ..., T), where the parameters δ_t are unconstrained. Under Model B, the least squares estimator of γ is consistent for

γ = cov(D_it, D_{i,t−1})/var(D_{i,t−1}) = [−(1 − ρ)σ²_ε/(1 + ρ)]/[2σ²_ε/(1 + ρ)] = −(1 − ρ)/2.

Set ρ̂ = 1 + 2γ̂.

Step 2. Let ỹ_it = y_it − ρ̂y_{i,t−1} and fit the model obtained from (14.14) for the transformed data:

ỹ_it = β̃_t + ũ_i + ε̃_it,   (14.15)

using the approach of Pfeffermann et al. (1998) with the assumptions of Model A applying to the model in (14.15). The estimated variance of ũ_i is then divided by (1 − ρ̂)² to obtain the estimate σ̂²_u.

This two-step approach produces consistent estimators of the parameters of Model B but the resulting standard errors of σ̂²_u and σ̂²_ε will not allow for uncertainty in the estimation of ρ.

Finally, we note that Pfeffermann et al. (1998) only allowed for the sample to be clustered into level 2 units. In the application in Section 14.4 the sampling design will also lead to geographical clustering of the sample individuals into

primary sampling units. The procedure for standard error estimation proposed by Pfeffermann et al. (1998) therefore needs to be extended to handle this case. We shall not, however, consider this extension here, presenting only point estimates for the multilevel modelling approach in the next section.
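Step 1 of the two-step procedure lends itself to a compact sketch. The helper below is hypothetical (monotone attrition is assumed, and the weight w_it is applied to each wave-t difference pair); it absorbs the unconstrained intercepts δ_t by within-wave centring and returns ρ̂ = 1 + 2γ̂. Step 2 would then pass the transformed responses ỹ_it = y_it − ρ̂y_{i,t−1} to a weighted random effects fit.

```python
import numpy as np

def estimate_rho(Y, W):
    """Step 1 of the two-step method (a sketch): pooled weighted regression of
    D_it on D_{i,t-1} with free wave intercepts delta_t, then rho = 1 + 2*gamma.

    Y : (n, T) responses, np.nan after dropout (monotone attrition assumed).
    W : (n, T) longitudinal weights w_it.
    """
    D = Y[:, 1:] - Y[:, :-1]              # D_it = y_it - y_{i,t-1}, columns t = 2..T
    num = den = 0.0
    for t in range(1, D.shape[1]):        # lagged pairs exist for t = 3, ..., T
        keep = ~np.isnan(D[:, t]) & ~np.isnan(D[:, t - 1])
        w = W[keep, t + 1]                # weight of that wave (0-based column t+1)
        x, y = D[keep, t - 1], D[keep, t]
        xbar = (w * x).sum() / w.sum()    # within-wave centring absorbs delta_t
        ybar = (w * y).sum() / w.sum()
        num += (w * (x - xbar) * (y - ybar)).sum()
        den += (w * (x - xbar) ** 2).sum()
    gamma = num / den                     # consistent for -(1 - rho)/2 under Model B
    return 1.0 + 2.0 * gamma
```

Since γ̂ is consistent for −(1 − ρ)/2 under Model B, the returned value is a consistent estimator of ρ.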

14.4. AN APPLICATION: EARNINGS OF MALE EMPLOYEES IN GREAT BRITAIN

In this section we apply the approaches set out earlier to fit random effects models to longitudinal data on the monthly earnings of male full-time employees in Great Britain for the period 1991-5, using data from the British Household Panel Study (BHPS). The BHPS is a household panel survey, based on a sample of around 10 000 individuals. Data were first collected in 1991 and successive waves have taken place annually (Berthoud and Gershuny, 2000).

We base our analysis on the work of Ramos (1999). Like him, we consider only men over the first five waves of the BHPS and divide the men into four age cohorts in order to control for life cycle effects. These cohorts consist of men (i) born before 1941, (ii) born between 1941 and 1950, (iii) born between 1951 and 1960 and (iv) born after 1960.

The variable y is taken as the logarithm of earnings, with earnings being defined as the usual monthly earnings or salary payment before tax, for a reference period determined in the survey. We avoid the problem of zero earnings by defining the target population at wave t to consist of those men in the age cohorts who have positive earnings. It is thus possible for individuals to move in and out of the target population between waves. It is clearly plausible that the earnings behaviour of those moving in and out of the target population will differ systematically from those remaining in the target population. For simplicity, we shall, however, assume that the models defined in Section 14.1 apply to all individuals when they have positive earnings.

The panel sample was selected by stratified multistage sampling, with postal sectors as primary sampling units (PSUs). We use the standard linearisation approach to variance estimation for stratified multistage samples (e.g. SHS, p. 50). The BHPS involves a stratified sample of 250 PSUs. For the purpose of variance estimation, we approximate this stratified design as being defined by 75 strata, obtained by first breaking down each of 18 regional strata into 2 or 3 'major strata', defined according to the proportion of 'heads of household' in professional/managerial positions, and then by breaking down each of these major strata into 2 'minor strata', defined according to the proportion of the population of pensionable age.

We first assess the fit of Models A and B (defined in Section 14.1) for each of the four cohorts. The results are presented in Table 14.1. We use goodness-of-fit tests based on the covariance structure approach of Section 14.2, with three choices of the matrix V in (14.8):


Table 14.1 Goodness-of-fit test statistics for Models A and B for four cohorts and three estimation methods.

                         Model A                         Model B
Cohort             OLS     GLS      GLS           OLS     GLS      GLS
(when born)                (iid)    (complex)             (iid)    (complex)
Before 1941        11.3    13.0     15.1          9.2     8.7      10.0
1941-50            41.2b   39.0b    39.9b         28.4b   27.0b    29.5b
1951-60            17.2    39.0b    43.3b         6.5     15.5     16.5
After 1960         29.1b   37.4b    35.5b         15.8    16.7     17.7

Notes: 1. Test statistics are weighted and are referred to the chi-squared distribution with 13 df for Model A and 12 df for Model B. 2. a = significant at 5% level; b = significant at 1% level. 3. OLS and GLS (iid) test statistics involve a Rao-Scott first-order correction.

OLS: V = I, the identity matrix;
GLS (iid): V = V_iid, as defined in (14.13);
GLS (complex): V = V_c, the linearisation estimator of the covariance matrix of â, based upon (14.9), allowing for the complex design.

For V = V_c, the test statistic is given by X²_W in (14.11) with the null distribution indicated in Section 14.2. For V = I or V_iid, the values of θ̂_GLS and V_c in (14.11) are replaced by the corresponding values of θ̂ and V, and a first-order Rao-Scott adjustment is applied to the test statistic (SHS, Ch. 4). The same null distributions as for V_c are used. Test statistics based upon second-order Rao-Scott approximations were also calculated and led to similar results. All of the test statistics are based on data from the attrition sample s_5 at wave 5, for individuals who gave full interviews at each of the five waves. Longitudinal weights w_i5 were used, which allow both for unequal sampling probabilities and for differential attrition from nonresponse over the five waves. To allow for the changing population, the expression for the estimated covariance matrix in (14.7) was modified by including only those who reported positive earnings at each wave in the estimation of the covariance between the log earnings at two waves.

The values of the test statistics in Table 14.1 are referred to a chi-squared null distribution with 13 degrees of freedom in the case of Model A and with 12 degrees of freedom in the case of Model B. The results suggest that Model A provides an adequate fit for the cohort born before 1941 but not for the other cohorts and that Model B provides an adequate fit for all cohorts, except the one consisting of those born between 1941 and 1950.
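The significance markers in Table 14.1 can be checked directly against the stated chi-squared reference distributions; for instance, using SciPy:

```python
from scipy.stats import chi2

# Model A statistics are referred to chi-squared with 13 df, Model B with 12 df.
# Cohort born 1941-50, Model A, OLS: the statistic 41.2 is marked 'b' in Table 14.1.
assert chi2.sf(41.2, df=13) < 0.01      # significant at the 1% level
# Cohort born before 1941, Model A, OLS: 11.3 carries no marker.
assert chi2.sf(11.3, df=13) > 0.05      # fit not rejected
# Model B, cohort born 1941-50, OLS: 28.4 against 12 df.
assert chi2.sf(28.4, df=12) < 0.01      # significant at the 1% level
```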

The values of the test statistics vary according to the three choices of V. The differences between the values of the test statistics for the GLS (iid) and GLS (complex) choices of V are not large, reflecting the fact that there is a large number of degrees of freedom for estimating the covariance matrix of â (relative to the dimension of the matrix) and that the pairs of V matrices tend not to be dramatically disproportionate. The value of the test statistic with V as

the identity matrix suggests a much better fit of both Models A and B for the 1951-60 cohort and a somewhat better fit for the cohort born after 1960. This may be because this test statistic tends to be sensitive to different deviations from the null hypothesis than the GLS test statistics. The 1951-60 cohort is distinctive in having less variation among the estimated variances of log earnings over the five waves and, more generally, displays the least evidence of non-stationarity. Because of the high positive correlation between the elements of â, the test statistic with V as the identity matrix may be expected to attach greater 'weight' to such departures from Model A than the GLS test statistics and this may lead to the noticeable difference in values for the 1951-60 cohort. Strong graphical evidence against Model A for this cohort is provided by Figure 14.1. This figure plots the elements Ŝ_{tt′} of Ŝ in (14.3) against |t − t′| and there is a clear tendency for the covariances to decline as the number of years between waves increases. This suggests that the insignificant value of the test statistic for Model A, with V as the identity matrix, reflects lack of power.

Estimates of the parameters in Model B are presented in Table 14.2 for the three cohorts for which Model B shows no significant lack of fit in Table 14.1. Estimates are presented for the same three choices of V matrix as in Table 14.1. While the estimates based on the two GLS choices of V are fairly similar, the OLS estimates, with V as the identity matrix, can be noticeably different, especially for the 1951-60 cohort. The effect of the differences for the cohort born after 1960 is illustrated in Figure 14.2, in which the estimated variances and covariances from (14.7) are presented together with fitted lines, joining the variances and covariances under Model B, implied by the parameter estimates in Table 14.2. The lines for the GLS choices of V are surprisingly low, unlike the OLS line, which passes through the middle of the points. Similar underfitting of the variances and covariances occurs for the other cohorts and this finding may reflect downward bias in such estimates employing

Figure 14.1 Estimated variances and covariances for cohort born 1951-60. [Scatter plot: variances and covariances (0 to 0.3) against years apart (0 to 4).]


Table 14.2 Parameter estimates for Model B for three cohorts using covariance structure approach.

Cohort                              Parameter
(when born)    Estimator       ρ             σ²_u            σ²_ε
Before 1941    OLS             0.37 (0.16)   0.165 (0.028)   0.049 (0.018)
               GLS (iid)       0.35 (0.16)   0.150 (0.024)   0.034 (0.011)
               GLS (complex)   0.32 (0.13)   0.143 (0.022)   0.034 (0.009)
1951-60        OLS             0.56 (0.11)   0.146 (0.021)   0.048 (0.015)
               GLS (iid)       0.85 (0.09)   0.109 (0.047)   0.026 (0.047)
               GLS (complex)   0.85 (0.09)   0.106 (0.044)   0.026 (0.045)
After 1960     OLS             0.49 (0.08)   0.155 (0.018)   0.071 (0.014)
               GLS (iid)       0.41 (0.07)   0.154 (0.016)   0.063 (0.010)
               GLS (complex)   0.40 (0.07)   0.150 (0.016)   0.061 (0.009)

Notes: 1. Standard errors in parentheses. 2. Estimates are weighted and based only on data for the attrition sample at wave 5. 3. The 1941-50 cohort is excluded because of lack of fit of Model B in Table 14.1.

Figure 14.2 Estimated variances and covariances for cohort born after 1960 with values fitted under Model B. [Points with fitted lines for OLS, GLS (iid) and GLS (complex); variances and covariances (0.1 to 0.3) against years apart (0 to 4).]

sample-based V matrices, as discussed, for example, by Altonji and Segal (1996) and Browne (1984). The inversion of V implies that the lowest variances tend to receive most 'weight', leading to the fitted line following the lower envelope of the points more than the centre of them. The potential presence of non-negligible


bias suggests that choosing V as the identity matrix may be preferable here for the purpose of parameter estimation, as concluded by Altonji and Segal (1996).
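The covariance structure fits compared here can be mimicked with a small minimum-distance routine: stack the distinct variances and covariances in â and minimise (â − A(θ))′V⁻¹(â − A(θ)). The sketch below is hypothetical; in particular it assumes the working structure cov(y_it, y_it′) = σ²_u + σ²_e ρ^{|t−t′|}, which stands in for the chapter's exact Model B parameterisation. Setting V = I gives the OLS fit preferred above.

```python
import numpy as np
from scipy.optimize import minimize

def fit_cov_structure(a_hat, V, T):
    """Minimum-distance fit of (rho, s2u, s2e) to stacked variances/covariances.

    a_hat : estimated variances and covariances, ordered (t, t') with t <= t'.
    V     : weighting matrix (identity for OLS; V_iid or V_c for GLS).
    """
    pairs = [(t, s) for t in range(T) for s in range(t, T)]

    def A(theta):                      # assumed model covariance structure
        rho, s2u, s2e = theta
        return np.array([s2u + s2e * rho ** (s - t) for t, s in pairs])

    Vinv = np.linalg.inv(V)

    def q(theta):                      # quadratic-form distance
        d = a_hat - A(theta)
        return d @ Vinv @ d

    res = minimize(q, x0=[0.5, 0.1, 0.1],
                   bounds=[(-0.99, 0.99), (1e-8, None), (1e-8, None)])
    return res.x
```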

Table 14.3 shows for one cohort the effects of weighting, of the use of data from all attrition samples and of the use of the multilevel modelling approach of Section 14.3.

For the covariance structure approach, the impact of weighting is similar for all three choices of the matrix V. The fairly modest impact of weighting is expected here, since the BHPS weights do not vary greatly and are not strongly related to earnings.

The impact of using data from all attrition samples s_1, ..., s_5, not just from s_5, appears to be a little more marked than the impact of weighting. This may reflect the fact that the earnings behaviour of those men who leave the sample before 1995 may be different from that of those who remain in the sample for all five waves. In particular, this behaviour may be less stable, leading to a reduction in the estimated correlation ρ. Control for possible informative attrition might be attempted by including covariates in the model.

Table 14.3 Parameter estimates for Model B for cohort born after 1960.

                                             Parameter
Estimator                               ρ        σ²_u      σ²_ε

Covariance structure approach
Using attrition sample at wave 5 only
  Weighted
    OLS                                 0.49     0.155     0.071
    GLS (iid)                           0.41     0.154     0.063
    GLS (complex)                       0.40     0.150     0.061
  Unweighted
    OLS                                 0.45     0.166     0.078
    GLS (iid)                           0.37     0.161     0.068
    GLS (complex)                       0.35     0.156     0.066
Using all five attrition samples (weighted)
    OLS                                 0.36     0.169     0.052
    GLS (iid)                           0.38     0.158     0.048
    GLS (complex)                       0.30     0.155     0.047

Multilevel modelling approach
Using attrition sample at wave 5 only
    Weighted unscaled                   0.41     0.169     0.041
    Weighted scaled                     0.41     0.167     0.042
    Unweighted                          0.41     0.170     0.045
Using all five attrition samples
    Weighted unscaled                   0.43     0.167     0.043
    Weighted scaled                     0.43     0.163     0.045
    Unweighted                          0.43     0.165     0.047



The results for the multilevel modelling approach in Table 14.3 are based upon the two-step method described in Section 14.3. The estimated value of ρ is first determined and then estimates of σ²_u and σ²_ε are obtained by the method of Pfeffermann et al. (1998), either with or without weights and, in the former case, with the weights scaled or not.

The impact of weighting on the multilevel approach is again modest, indeed somewhat more modest than for the covariance structure approach. This may be because a common estimate of ρ is used here. Scaling the weights also has little effect. This may be because all the weights w_{t|i} are fairly close to one in this application and thus scaling has less of an impact than in the two-stage sampling application in Pfeffermann et al. (1998).

The differences between the estimates from the covariance structure approach and the corresponding multilevel modelling approaches are not especially large in Table 14.3 relative to the standard errors in Table 14.2. Nevertheless, across all four cohorts and both models, the main differences in the estimates between methods were between the three choices of V matrix for the covariance structure approach and between the covariance structure and the multilevel approaches. The impact of weighting and the scaling of weights tended to be less important.

14.5. CONCLUDING REMARKS

It is often useful to include random effects in the specification of models for longitudinal survey data. In this chapter we have considered two approaches to allowing for complex survey designs and sample attrition when fitting such models. The covariance structure approach is particularly natural with survey data. The complex survey design and attrition are allowed for when making inference about the covariance matrix of the longitudinal responses. Modelling of the structure of this matrix may then proceed in a standard way. The second approach is to adapt standard multilevel modelling procedures, extending the approach of Pfeffermann et al. (1998).

The two approaches may be compared in a number of ways:

- The multilevel approach incorporates the different attrition samples more directly, although the possible creation of bias with unequal w_{t|i} (for given i) with small numbers of level 1 units (i.e. small T), as discussed by Pfeffermann et al. (1998), may be a problem.
- The multilevel approach incorporates covariates more naturally, although the extension of the covariance structure approach to include covariates using LISREL models is well established.
- The covariance structure approach handles serial correlation more easily.
- The covariance structure approach generates goodness-of-fit tests and residuals at the level of variances and covariances. The multilevel approach generates unit level residuals.


Page 242: Analysis of Survey Data (Wiley Series in Survey Methodology) · PDF fileEmail (for orders and ... List of Contributors xviii Chapter 1 Introduction 1 R. L. Chambers and C. J. Skinner

Finally, our application of the covariance structure approach to the BHPS data showed evidence of bias in the estimation of the variance components when using GLS with a covariance matrix V estimated from the data. This accords with the findings of Altonji and Segal (1996). This evidence suggests that it is safer to specify V as the identity matrix and use Rao-Scott adjustments for testing.


CHAPTER 15

Event History Analysis and Longitudinal Surveys

J. F. Lawless

15.1. INTRODUCTION

Event history analysis as discussed here deals with events that occur over the lifetimes of individuals in some population. For example, it can be used to examine educational attainment, employment, entry into marriage or parenthood, and other matters. In epidemiology and public health it can be used to study the relationship between the incidence of diseases and environmental, dietary, or economic factors. The main objectives of analysis are to model and understand event history processes of individuals. The timing, frequency and pattern of events are of interest, along with factors associated with them. 'Time' is often the age of an individual or the elapsed time from some event other than birth: for example, the time since a person married or the time since a disease was diagnosed. Occasionally, 'time' may refer to some scale other than calendar time.

Two closely related frameworks are used to describe and analyze event histories: the multi-state and event occurrence frameworks. In the former a finite set of states {1, 2, ..., K} is defined such that at any time an individual occupies a unique state, for example employed, unemployed, or not in the labour force. In the latter the occurrences of specific types of events are emphasized. The two frameworks are equivalent since changes of state can be considered as types of events, and vice versa. This allows a unified statistical treatment but for description and interpretation we usually select one point of view or the other. Event history analysis includes as a special case the area of survival analysis. In particular, times to the occurrence of specific events (from a well-defined time origin), or the durations of sojourns in specific states, are often referred to as survival or duration times. This area is well developed (e.g. Kalbfleisch and Prentice, 2002; Lawless, 2002; Cox and Oakes, 1984).

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.


Event history data typically consist of information about events and covariates over some time period, for a group of individuals. Ideally the individuals are randomly selected from a population and followed over time. Methods of modelling and analysis for such cohorts of closely monitored individuals are also well developed (e.g. Andersen et al., 1993; Blossfeld, Hamerle and Mayer, 1989). However, in the case of longitudinal surveys there may be substantial departures from this ideal situation.

Large-scale longitudinal surveys collect data on matters such as health, fertility, educational attainment, employment, and economic status at successive interview or follow-up times, often spread over several years. For example, Statistics Canada's Survey of Labour and Income Dynamics (SLID) selects panels of individuals and interviews them once a year for six years, and its National Longitudinal Survey of Children and Youth (NLSCY) follows a sample of children aged 0-11 selected in 1994 with interviews every second year. The fact that individuals are followed longitudinally affords the possibility of studying individual event history processes. Problems of analysis can arise, however, because of the complexity of the populations and processes being studied, the use of complex sampling designs, and limitations in the frequency and length of follow-up. Missing data and measurement error may also occur, for example in obtaining information about individuals prior to their time of enrolment in the study or between widely spaced interviews. Attrition or losses to follow-up may be nonignorable if they are associated with the process under study.

This chapter reviews event history analysis and considers issues associated with longitudinal survey data. The emphasis is on individual-level explanatory analysis, so the conceptual framework is the process that generates individuals and their life histories in the populations on which surveys are based. Section 15.2 reviews event history models, and Section 15.3 discusses longitudinal observational schemes and conventional event history analysis. Section 15.4 discusses analytic inference from survey data. Sections 15.5, 15.6, and 15.7 deal with survival analysis, the analysis of event occurrences, and the analysis of transitions. Section 15.8 considers survival data from a survey and Section 15.9 concludes with a summary and list of areas needing further development.

15.2. EVENT HISTORY MODELS

The event occurrence and multi-state frameworks are mathematically equivalent, but for descriptive or explanatory purposes we usually adopt one framework or the other. For the former, we suppose J types of events are defined and for individual i let

Y_ij(t) = number of occurrences of event type j up to time t.   (15.1)

Covariates may be fixed or vary over time and so we let x_i(t) denote the vector of all (fixed or time-varying) covariates associated with individual i at time t.


In the multi-state framework we define

Y_i(t) = state occupied by individual i at time t,   (15.2)

where Y_i(t) takes on values in {1, 2, ..., K}. Both (15.1) and (15.2) keep track of the occurrence and timing of events; in practice the data for an individual would include the times at which events occur, say t_i1 < t_i2 < t_i3 < ..., and the type of each event, say A_i1, A_i2, A_i3, .... The multi-state framework is useful when transitions between states or the duration of spells in a state are of interest. For example, models for studying labour force dynamics often use states defined as: 1 = employed, 2 = unemployed but in the labour force, 3 = out of the labour force. The event framework is convenient when patterns or numbers of events over a period of time are of interest. For example, in health-related surveys we may consider occurrences such as the use of hospital emergency or outpatient facilities, incidents of disease, and days of work missed due to illness.
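The equivalence of the two frameworks is easy to see in code: given the recorded event times t_i1 < t_i2 < ... and their types, both the counting process Y_ij(t) of (15.1) and the state process Y_i(t) of (15.2) can be derived from the same record. A small illustrative sketch (the labels and data below are invented):

```python
# One individual's labour force history: (event time in years, new state),
# with states 1 = employed, 2 = unemployed, 3 = out of the labour force.
history = [(1.5, 2), (2.0, 1), (4.2, 3)]
initial_state = 1

def state_at(t, initial, transitions):
    """Y_i(t) of (15.2): the state occupied at time t."""
    state = initial
    for time, new_state in transitions:
        if time <= t:
            state = new_state
    return state

def n_events(t, transitions, entered_state):
    """Y_ij(t) of (15.1), taking 'entry into a given state' as the event type."""
    return sum(1 for time, new_state in transitions
               if time <= t and new_state == entered_state)

print(state_at(3.0, initial_state, history))     # 1: re-employed at t = 2.0
print(n_events(5.0, history, entered_state=2))   # 1: one unemployment spell began
```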

Stochastic models for either setting may be specified in terms of event intensity functions (e.g. Andersen et al., 1993). Let H_i(t) denote the history of all events and covariates relevant to individual i, up to but not including time t. We shall treat time as a continuous variable, but discrete versions of the results below can also be given. The intensity function for a type j event (j = 1, ..., J) is then defined as

λ_ij(t | x_i(t), H_i(t)) = lim_{Δt→0} Pr{Y_ij[t, t + Δt) = 1 | x_i(t), H_i(t)}/Δt,   (15.3)

where Y_ij[s, t) = Y_ij(t⁻) − Y_ij(s⁻) is the number of type j events in the interval [s, t). That is, the conditional probability of a type j event occurring in [t, t + Δt), given covariates and the prior event history, is approximately λ_ij(t | x_i(t), H_i(t))Δt for small Δt.

For multi-state models there are correspondingly transition intensity functions,

λ_ikl(t | x_i(t), H_i(t)) = lim_{Δt→0} Pr{Y_i((t + Δt)⁻) = l | Y_i(t⁻) = k, x_i(t), H_i(t)}/Δt,   (15.4)

where k ≠ l and both k and l range over {1, ..., K}.

If covariates are 'external' and it is assumed that no two events can occur simultaneously then the intensities specify the full event history process, conditional on the covariate histories. External covariates are ones whose values are determined independently from the event processes under study (Kalbfleisch and Prentice, 2002, Ch. 6). Fixed covariates are automatically external. 'Internal' covariates are more difficult to handle and are not considered in this chapter; a joint model for event occurrence and covariate evolution is generally required to study them.

Characteristics of the event history processes can be obtained from the intensity functions. In particular, for models based on (15.3) we have (e.g. Andersen et al., 1993) that


Pr{no events over [s, t) | x_i, H_i(s)} = exp{−∫_s^t Σ_{j=1}^J λ_ij(u | x_i(u), H_i(u)) du}.   (15.5)

Similarly, for multi-state models based on (15.4) we have

Pr{no exit from state k over [s, t) | Y_i(s) = k, x_i, H_i(s)} = exp{−∫_s^t Σ_{l≠k} λ_ikl(u | x_i(u), H_i(u)) du}.   (15.6)

The intensity function formulation is very flexible. For example, we may specify that the intensities depend on features such as the times since previous events or previous numbers of events, as well as covariates. However, it is obvious from (15.5) and (15.6) that even the calculation of simple features such as state sojourn probabilities may be complicated. In practice, we often restrict attention to simple models, in particular, Markov models, for which (15.3) or (15.4) depend only on x(t) and t, and semi-Markov models, for which (15.3) or (15.4) depend on H(t) only through the elapsed time since the most recent event or transition, and on x(t).

Survival models are important in their own right and as building blocks for more detailed analysis. They deal with the time T from some starting point to the occurrence of a specific event, for example an individual's length of life, the duration of their first marriage, or the age at which they first enter the labour force. The terms failure time, duration, and lifetime are common synonyms for survival time. A survival model can be considered as a transitional model with two states, where the only allowable transition is from state 1 to state 2. The transition intensity (15.4) from state 1 to state 2 can then be written as

λ_i(t | x_i(t), H_i(t)) = lim_{Δt→0} Pr{t ≤ T_i < t + Δt | T_i ≥ t, x_i(t)}/Δt,   (15.7)

where T_i represents the duration of individual i's sojourn in state 1. For survival models, (15.7) is called the hazard function. From (15.6),

Pr(T_i ≥ t | x_i, H_i) = exp{−∫_0^t λ_i(u | x_i(u), H_i(u)) du}.   (15.8)

When covariates x_i are all fixed, (15.7) becomes λ_i(t | x_i) and (15.8) is the survivor function

S_i(t | x_i) = exp{−∫_0^t λ_i(u | x_i) du}.   (15.9)

Multiplicative regression models based on the hazard function are often used, following Cox (1972):

λ_i(t | x_i) = λ_0(t) exp(x_i′β)   (15.10)

Page 248: Analysis of Survey Data (Wiley Series in Survey Methodology) · PDF fileEmail (for orders and ... List of Contributors xviii Chapter 1 Introduction 1 R. L. Chambers and C. J. Skinner

is common, where 0(t) is a positive function and is a vector of regressioncoefficients of the same length as x .
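As a numerical illustration of (15.9), the survivor function can be obtained by integrating any assumed hazard; a minimal sketch (the constant hazard and all values here are illustrative assumptions, not from the chapter):

```python
import math

def survivor(hazard, t, steps=10000):
    # S(t) = exp{-integral_0^t hazard(u) du}, as in (15.9),
    # with the integral computed by the midpoint rule.
    h = t / steps
    integral = sum(hazard((k + 0.5) * h) for k in range(steps)) * h
    return math.exp(-integral)

# Illustrative constant hazard 0.5, so S(t) = exp(-0.5 t).
print(round(survivor(lambda u: 0.5, 2.0), 4))  # ≈ exp(-1) ≈ 0.3679
```

A parametric hazard such as (15.10) with fixed covariates can be plugged in the same way.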

Models for repeated occurrences of the same event are also important; they correspond to (15.3) with J = 1. Poisson (Markov) and renewal (semi-Markov) processes are often useful. Models for which the event intensity function is of the form

$$\lambda_i(t \mid x_i(t), H_i(t)) = \lambda_0(t)\, g(x_i(t)) \qquad (15.11)$$

are called modulated Poisson processes. Models for which

$$\lambda_i(t \mid x_i(t), H_i(t)) = \lambda_0(u_i(t))\, g(x_i(t)), \qquad (15.12)$$

where u_i(t) is the elapsed time since the last event (or since t = 0 if no event has yet occurred), are called modulated renewal processes.

Detailed treatments of the models above are given in books on event history analysis (e.g. Andersen et al., 1993; Blossfeld, Hamerle and Mayer, 1989), survival analysis (e.g. Kalbfleisch and Prentice, 2002; Lawless, 2002; Cox and Oakes, 1984), and stochastic processes (e.g. Cox and Isham, 1980; Ross, 1983). Sections 15.5 to 15.8 outline a few basic methods of analysis.

The intensity functions fully specify a process and allow, for example, prediction of future events or the simulation of individual processes. If the data collected are not sufficient to identify or fit such models, we may consider a partial specification of the process. For example, for recurrent events the mean function is M(t) = E{Y(t)}; this can be considered without specifying a full model (Lawless and Nadeau, 1995).

In many populations the event processes for individuals in a certain group or cluster may not be mutually independent. For example, members of the same household or individuals living in a specific region may exhibit association, even after conditioning on covariates. The literature on multivariate models or association between processes is rather limited, except for the case of multivariate survival distributions (e.g. Joe, 1997). A common approach is to base specification of covariate effects and estimation on separate working models for different components of a process, but to allow for association in the computation of confidence regions or tests (e.g. Lee, Wei and Amato, 1992; Lin, 1994; Ng and Cook, 1999). This approach is discussed in Sections 15.5 and 15.6.
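As noted above, the mean function M(t) = E{Y(t)} of a recurrent-event process is determined by integrating the intensity, without a full model being specified; a minimal sketch for a modulated-Poisson intensity of the form (15.11) (the baseline, covariate effect, and all values are illustrative assumptions):

```python
import math

def mean_function(intensity, t, steps=10000):
    # M(t) = E{Y(t)} = integral_0^t intensity(u) du for a Poisson process,
    # computed by the midpoint rule.
    h = t / steps
    return sum(intensity((k + 0.5) * h) for k in range(steps)) * h

# Illustrative intensity (15.11): baseline lam0 = 2.0, g(x) = exp(0.5 * x), fixed x = 1.
lam = lambda u: 2.0 * math.exp(0.5 * 1.0)
print(round(mean_function(lam, 3.0), 3))  # ≈ 6 * exp(0.5) ≈ 9.892
```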

15.3. GENERAL OBSERVATIONAL ISSUES

The analysis of event history data is dependent on two key points: How were individuals selected for the study? What information was collected about individuals, and how was this done? In longitudinal surveys panels are usually selected according to a complex survey design; we discuss this and its implications in Section 15.4. In this section we consider observational issues associated with a generic individual, whose life history we wish to follow.


We consider studies which follow a group or panel of individuals longitudinally over time, recording events and covariates of interest; this is referred to as prospective follow-up. Limitations on data collection are generally imposed by time, cost, and other factors. Individuals are often observed over a time period which is shorter than needed to obtain a complete picture of the process in question, and they may be seen or interviewed only sporadically, for example annually. We assume for now that event history variables Y(t) and covariates x(t) for an individual over the time interval [τ_0, τ_1] can be determined from the available data. The time scale could be calendar time or something specific to the individual, such as age. In any case, τ_0 will not in general correspond to the natural or physical origin of the process {Y(t)}, and we denote relevant history about events and covariates up to time τ_0 by H(τ_0). (Here, 'relevant' will depend on what is needed to model or analyze the event history process over the time interval [τ_0, τ_1]; see (15.13) below.) The times τ_0 or τ_1 may be random. For example, an individual may be lost to follow-up during a study, say if they move and cannot be traced, or if they refuse to participate further. We sometimes say that the individual's event history {Y(t)} is (right-)censored at time τ_1 and refer to τ_1 as a censoring time. The time τ_0 is often random as well; for example, we may wish to focus on a person's history following the random occurrence of some event such as entry to parenthood.

The distribution of {Y(t), τ_0 ≤ t ≤ τ_1}, conditional on H(τ_0) and relevant covariate information X = {x(t), τ_0 ≤ t ≤ τ_1}, gives a likelihood function on which inferences can be based. If τ_0 and τ_1 are fixed by the study design (i.e. are non-random) then for an event history process specified by (15.3), we have (e.g. Andersen et al., 1993, Ch. 2)

$$\Pr\{r \text{ events in } [\tau_0, \tau_1] \text{ at times } t_1 < \cdots < t_r, \text{ of types } j_1, \ldots, j_r \mid H(\tau_0)\} = \prod_{l=1}^{r} \lambda_{j_l}(t_l \mid H(t_l)) \cdot \exp\bigg\{-\int_{\tau_0}^{\tau_1} \sum_{j=1}^{J} \lambda_j(u \mid H(u))\, du\bigg\}, \qquad (15.13)$$

where 'Pr' denotes the probability density. For simplicity we have omitted covariates in (15.13); their inclusion merely involves replacing H(u) with H(u), x(u).

If τ_0 or τ_1 is random then under certain conditions (15.13) is still valid for inference purposes; in particular, this allows τ_0 or τ_1 to depend upon past but not future events. In such cases (15.13) is not necessarily the probability density of {Y(t), τ_0 ≤ t ≤ τ_1} conditional on τ_0, τ_1, and H(τ_0), but it is a partial likelihood. Andersen et al. (1993, Ch. 2) give a rigorous discussion.

Example 1. Survival times

Suppose that T ≥ 0 represents a survival time and that an individual is randomly selected at time τ_0 and followed until time τ_1 ≥ τ_0, where τ_0 and τ_1 are measured from the same time origin as T. An illustration concerning the duration of breast feeding of first-born children is discussed in Section 15.8, and duration of marital unions is considered later in this section. Assuming that


T ≥ τ_0, we observe T = t if t ≤ τ_1, but otherwise it is right-censored at τ_1. Let y = min(t, τ_1) and δ = I(y = t) indicate whether t was observed. If λ(t) denotes the hazard function (15.7) for T (for simplicity we assume no covariates are present) then the right hand side of (15.13) with J = 1 reduces to

$$L = \lambda(y)^{\delta} \exp\bigg\{-\int_{\tau_0}^{y} \lambda(u)\, du\bigg\}. \qquad (15.14)$$

The likelihood (15.14) is often written in terms of the density and survival functions for T:

$$L = \left(\frac{f(y)}{S(\tau_0)}\right)^{\delta} \left(\frac{S(y)}{S(\tau_0)}\right)^{1 - \delta}, \qquad (15.15)$$

where S(t) = exp{−∫_0^t λ(u)du} as in (15.9), and f(t) = λ(t)S(t). When τ_0 = 0 we have S(τ_0) = 1 and (15.15) is the familiar censored data likelihood (see e.g. Lawless, 2002, section 2.2). If τ_0 > 0 then (15.15) indicates that the relevant distribution is that of T, given that T ≥ τ_0; this is referred to as left-truncation. This is a consequence of the implicit fact that we are following an individual for whom 'failure' has not occurred before the time of selection τ_0. Failure to recognize this can severely bias results.
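For a concrete case of the left-truncated, right-censored likelihood (15.15), an exponential model λ(t) = θ gives a closed-form maximum likelihood estimate; a minimal sketch (the data records are invented for illustration):

```python
import math

def exp_loglik(theta, data):
    # Log of (15.15) for an exponential hazard theta: each record (tau0, y, delta)
    # contributes delta*log f(y) + (1-delta)*log S(y) - log S(tau0)
    #            = delta*log(theta) - theta*(y - tau0).
    return sum(d * math.log(theta) - theta * (y - t0) for t0, y, d in data)

def exp_mle(data):
    # Closed-form MLE: events divided by exposure observed beyond tau0.
    events = sum(d for _, _, d in data)
    exposure = sum(y - t0 for t0, y, _ in data)
    return events / exposure

data = [(0.0, 2.0, 1), (1.0, 3.0, 0), (0.5, 1.5, 1)]  # (tau0, y, delta)
print(exp_mle(data))  # 2 events / 5.0 exposure = 0.4
```

Ignoring τ_0 (setting it to 0 for all records) would overstate exposure and bias the estimate downward, which is the left-truncation bias noted above.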

Example 2. A state duration problem

Many life history processes can be studied as a sequence of durations in specified states. As a concrete example we consider the entry of a person into their first marital union (event E1) and the dissolution of that union by divorce or death (event E2). In practice we would usually want to separate dissolutions by divorce from those by death, but for simplicity we ignore this; see Trussell, Rodriguez and Vaughan (1992) and Hoem and Hoem (1992) for more detailed treatments. Figure 15.1 portrays the process.

We might wish to examine the occurrence of marriage and the length of marriage. We consider just the duration S of marriage, for which important covariates might include the calendar time of the marriage, ages of the partners at marriage, and time-varying factors such as the births of children. Suppose that the transition intensity from state 2 to 3 as defined in (15.4) is of the form

$$\lambda_{23}(t \mid x(t), H(t)) = h(t - t_1 \mid x(t)), \qquad (15.16)$$

where t_1 is the time (age) of marriage and x(t) represents fixed and time-varying covariates. The function h(s | x) is thus the hazard function for S.

Figure 15.1 A model for first marriage: state 1 (never married) → state 2 at first marriage (E1) → state 3 at dissolution of first marriage (E2).


Suppose that individuals are randomly selected and that an individual is followed prospectively over the time interval [tS, tF]. Figure 15.2 shows four different possibilities according to whether each of the events E1 and E2 occurs within [tS, tF] or not. There may also be individuals for whom both E1 and E2 occurred before tS and ones for whom E1 does not occur by time tF, but they contribute no information on the duration of marriage. By (15.13), the portion of the event history likelihood depending on (15.16) for any of cases A to D is

$$h(y - t_1 \mid x(y))^{\delta} \exp\bigg\{-\int_{\tau_0}^{y} h(u - t_1 \mid x(u))\, du\bigg\}, \qquad (15.17)$$

where t_j is the time of event Ej (j = 1, 2), δ = I(event E2 is observed), τ_0 = max(t_1, tS), and y = min(t_2, tF). For all cases (15.17) reduces to the censored data likelihood (15.14) if we write s = t_2 − t_1 as the marriage duration and let h(u) depend on covariates. For cases C and D, we need to know the time t_1 < tS at which E1 occurred. In some applications (but not usually in the case of marriage) the time t_1 might be unknown. If so an alternative to (15.17) must be sought, for example by considering Pr{E2 occurs at t_2 | E1 occurs before tS} instead of Pr{E2 occurs at t_2 | H(tS)}, upon which (15.17) is based. This requires information about the intensity for events E1, in addition to h(s | x). An alternative is to discard data for cases of type C and D. This is permissible and does not bias estimation for the model (15.16) (e.g. Aalen and Husebye, 1991; Guo, 1993) but often reduces the amount of information greatly.

Finally, we note that individuals could be selected differentially according to what state they are in at time tS; this does not pose any problem as long as the probability of selection depends only on information contained in H(tS). For example, one might select only persons who are married, giving only data types C and D.

The density (15.13) and thus the likelihood function factors into a product over j = 1, . . . , J and so if intensity functions do not share common parameters,

Figure 15.2 Observation of an individual re E1 and E2: cases A–D show the four possibilities according to whether E1 and E2 each occur inside or outside the observation interval [tS, tF].


the events can be analyzed one type at a time. In the analysis of both multiple event data and survival data it has become customary to use the notation (τ_i0, y_i, δ_i) introduced in Example 1. Therneau and Grambsch (2000) describe its use in connection with S-Plus and SAS procedures. The notation indicates that an individual is observed at risk for some specific event over the period [τ_i0, y_i]; δ_i indicates whether the event occurred at y_i (δ_i = 1) or whether no event was observed (δ_i = 0).

Frequently individuals are seen only at periodic interviews or follow-up visits which are as much as one or two years apart. If it is possible to identify accurately the times of events and values of covariates through records or recall, the likelihoods (15.13) and (15.14) can be used. If information about the timing of events is unknown, however, then (15.13) or (15.14) must be replaced with expressions giving the joint probability of outcomes Y(t) at the discrete time points at which the individual was seen; for certain models this is difficult (e.g. Kalbfleisch and Lawless, 1989). An important intermediate situation which has received little study is when information about events or covariates between follow-up visits is available, but subject to measurement error (e.g. Holt, McDonald and Skinner, 1991).

Right-censoring of event histories (at τ_1) is not a problem provided that the censoring process depends only on observable covariates or events in the past. However, if censoring depends on the current or future event history then observation is response selective and (15.13) is no longer the correct distribution of the observed data. For example, suppose that individuals are interviewed every year, at which time events over the past year are recorded. If an individual's nonresponse, refusal to be interviewed, or loss to follow-up is related to events during that year, then censoring of the event history at the previous year would depend on future events and thus violate the requirements for (15.13).

More generally, event or covariate information may be missing at certain follow-up times because of nonresponse. If nonresponse at a time point is independent of current and future events, given the past events and covariates, then standard missing data methods (e.g. Little and Rubin, 1987) may in principle be used. However, computation may be complicated, and modelling assumptions regarding covariates may be needed (e.g. Lipsitz and Ibrahim, 1996). Little (1992, 1995) and Carroll, Ruppert and Stefanski (1995) discuss general methodology, but this is an area where further work is needed.

We conclude this section with a remark about the retrospective ascertainment of information. There may in some studies be a desire to utilize portions of an individual's life history prior to their time of inclusion in the study (e.g. prior to tS in Example 2) as responses, rather than simply as conditioning events, as in (15.13) or (15.17). This is especially tempting in settings where the typical duration of a state sojourn is long compared to the length of follow-up for individuals in the study. Treating past events as responses can generate selection effects, and care is needed to avoid bias; see Hoem (1985, 1989).


15.4. ANALYTIC INFERENCE FROM LONGITUDINAL SURVEY DATA

Panels in many longitudinal surveys are selected via a sample design that involves stratification and clustering. In addition, the surveys have numerous objectives, many of which are descriptive (e.g. Kalton and Citro, 1993; Binder, 1998). Because of their generality they may yield limited information about explanatory or causal mechanisms, but analytic inference about the life history processes of individuals is nevertheless an important goal.

Some aspects of analytic inference are controversial in survey sampling, and in particular, the use of weights (see e.g. Chapters 6 and 9); we consider this briefly. Let us drop for now the dependence upon time and write Y_i and x_i for response variables and covariates, respectively. It is assumed that there is a 'superpopulation' model or process that generates individuals and their (y_i, x_i) values. At any given time there is a finite population of individuals from which a sample could be drawn, but in a process which evolves over time the numbers and make-up of the individuals and covariates in the population are constantly changing. Marginal or individual-specific models for the superpopulation process consider the distribution f(y_i | x_i) of responses given covariates. Responses for individuals may not be (conditionally) independent, but for a complex population the specification of a joint model for different individuals is daunting, so association between individuals is often not modelled explicitly. For the survey, we assume for simplicity that a sample s is selected at a single time point at which there are N individuals in the finite population, and let I_i = I(i ∈ s) indicate whether individual i is included in the sample. Let the vector z_i denote design-related factors such as stratum or cluster information, and assume that the sample inclusion probabilities

$$\pi_i = \Pr(I_i = 1 \mid y_i, x_i, z_i), \qquad i = 1, \ldots, N \qquad (15.18)$$

depend only on the z_i.

The objective is inference about the marginal distributions f(y_i | x_i) or joint distributions f(y_1, y_2, . . . | x_1, x_2, . . . ) based on the sample data (x_i, y_i, i ∈ s; s). For convenience we use f to denote various density functions, with the distribution represented being clear from the arguments of the function. As discussed by Hoem (1985, 1989) and others the key issue is whether sampling is response selective or not. Suppose first that Y_i and z_i are independent, given x_i:

$$f(y_i \mid x_i, z_i) = f(y_i \mid x_i). \qquad (15.19)$$

Then

$$\Pr(y_i \mid x_i, I_i = 1) = f(y_i \mid x_i), \qquad (15.20)$$

and if we are also willing to assume independence of the Y_i given s and the x_i (i ∈ s), inference about f(y_i | x_i) can be based on the likelihood

$$L = \prod_{i \in s} f(y_i \mid x_i), \qquad (15.21)$$


for either parametric or semi-parametric model specifications. Independence may be a viable assumption when x_i includes sufficient information, but if it is not then an alternative to (15.21) must be sought. One option is to develop multivariate models that specify dependence. This is often difficult, and another approach is to base estimation of the marginal individual-level models on (15.21), with an adjustment made for variance estimation to recognize the possibility of dependence; we discuss this for survival analysis in Section 15.5.

If (15.20) does not hold then (15.21) is incorrect and leads to biased estimation of f(y_i | x_i). Sometimes (e.g. see papers in Kasprzyk et al., 1989) this is referred to as model misspecification, since (15.19) is violated, but that is not really the issue. The distribution f(y_i | x_i) is well defined and, for example, if we use a non-parametric approach no strong assumptions about specification are made. The key issue is as identified by Hoem (1985, 1989): when (15.20) does not hold, sampling is response selective, and thus nonignorable, and (15.21) is not valid.

When (15.19) does not hold, one might question the usefulness of f(y_i | x_i) for analytic inference. If we wish to consider it we might try to model the distribution f(y_i | x_i, z_i) and obtain Pr(y_i | x_i, I_i = 1) by marginalization. This is usually difficult, and a second approach is a pseudo-likelihood method that utilizes the known sample inclusion probabilities (see Chapter 2). If (15.20) and thus (15.21) are valid, the score function for a parameter vector θ specifying f(y_i | x_i; θ) is

$$U_L(\theta) = \sum_{i=1}^{N} I_i \, \frac{\partial \log f(y_i \mid x_i; \theta)}{\partial \theta}. \qquad (15.22)$$

If (15.20) does not hold we consider instead the pseudo-score function (e.g. Thompson, 1997, section 6.2)

$$U_W(\theta) = \sum_{i=1}^{N} \frac{I_i}{\pi_i} \, \frac{\partial \log f(y_i \mid x_i; \theta)}{\partial \theta}. \qquad (15.23)$$

If E{∂ log f(y_i | x_i; θ)/∂θ | x_i} = 0 then E{U_W(θ)} = 0, where the expectation is now over the I_i, Y_i pairs in the population, given the x_i, and estimation of θ can be based on the equation U_W(θ) = 0. Nothing is assumed here about independence of the terms in (15.23); this must be addressed when developing a distribution theory for estimation or testing purposes.
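The pseudo-score (15.23) can be illustrated with the simplest possible model, estimation of a marginal mean: solving U_W(μ) = Σ_i (I_i/π_i)(y_i − μ) = 0 gives an inclusion-probability-weighted mean. A minimal sketch (the data and probabilities are invented for illustration):

```python
def pseudo_mle_mean(sample):
    # Root of U_W(mu) = sum_i (1/pi_i) * (y_i - mu) over sampled units,
    # a special case of (15.23); sample holds (y_i, pi_i) pairs.
    num = sum(y / pi for y, pi in sample)
    den = sum(1.0 / pi for y, pi in sample)
    return num / den

sample = [(1.0, 0.5), (3.0, 0.25)]
print(pseudo_mle_mean(sample))  # (2 + 12) / (2 + 4) = 14/6 ≈ 2.333
```

Units with small π_i are up-weighted; with equal π_i the estimate reduces to the unweighted solution of the score equation from (15.22).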

Estimation based on (15.23) is sometimes suggested as a general preference with the argument that it is 'robust' to superpopulation model misspecification. But as noted, when (15.19) fails the utility of f(y_i | x_i) is questionable; to study individual-level processes, every attempt should be made to obtain covariate information which makes (15.20) plausible. Skinner, Holt and Smith (1989) and Thompson (1997, chapter 6) provide general discussions of analytic inference from surveys.

The pseudo-score (15.23) can be useful when π_i in (15.18) depends on y_i. Hoem (1985, 1989) and Kalbfleisch and Lawless (1988) discuss examples


of response-selective sampling in event history analysis. This is an important topic, for example when data are collected retrospectively. Lawless, Kalbfleisch and Wild (1999) consider settings where auxiliary design information is available on individuals not included in the sample; in that case more efficient alternatives to (15.23) can sometimes be developed.

15.5. DURATION OR SURVIVAL ANALYSIS

Primary objectives of survival analysis are to study the distribution of a survival time T given covariates x, perhaps in some subgroup of the population. The hazard function for individual i is given by (15.7) for continuous time models. Discrete time models are often advantageous; in that case we denote the possible values of T as 1, 2, . . . and specify discrete hazard functions

$$\lambda_i(t \mid x_i) = \Pr(T_i = t \mid T_i \ge t,\; x_i). \qquad (15.24)$$

Then (15.9) is replaced with (e.g. Kalbfleisch and Prentice, 2002, section 1.2)

$$S(t \mid x_i) = \prod_{u=1}^{t-1} \{1 - \lambda_i(u \mid x_i)\}. \qquad (15.25)$$

Following the discussion in Section 15.4, key questions are whether the sampling scheme is response selective, and whether survival times can be assumed independent. We consider these issues for non-parametric and parametric methodology.

15.5.1. Non-parametric marginal survivor function estimation

Suppose that a survival distribution S(t) is to be estimated, with no conditioning on covariates. Let the vector z include sample design information and denote S_z(t) = Pr(T_i > t | Z_i = z). Assume for simplicity that Z is discrete and let P_z = Pr(Z_i = z) and π(z) = Pr(I_i = 1 | Z_i = z); note that P_z is part of the superpopulation model. Then

$$S(t) = \Pr(T_i > t) = \sum_{z} P_z S_z(t) \qquad (15.26)$$

and

$$\Pr(T_i > t \mid I_i = 1) = \frac{\sum_z \pi(z) P_z S_z(t)}{\sum_z \pi(z) P_z}. \qquad (15.27)$$

It is clear that (15.26) and (15.27) are the same if π(z) is constant, i.e. the design is self-weighting. In this case the sampling design is ignorable in the sense that (15.20) holds. The design is also ignorable if S_z(t) = S(t) for all z. In that case the scientific relevance of S(t) seems clear. More generally, with S(t) given by (15.26), its relevance is less obvious; although S(t) is well defined as a population average, it may be of limited interest for analytic purposes.
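The gap between (15.26) and (15.27) under a non-self-weighting design can be seen numerically; a minimal sketch with two hypothetical strata (all values invented):

```python
def pop_survivor(strata, t):
    # S(t) = sum_z P_z * S_z(t), as in (15.26).
    return sum(P * S(t) for P, pi, S in strata)

def sample_survivor(strata, t):
    # Pr(T_i > t | I_i = 1) = sum_z pi(z) P_z S_z(t) / sum_z pi(z) P_z, as in (15.27).
    num = sum(pi * P * S(t) for P, pi, S in strata)
    den = sum(pi * P for P, pi, S in strata)
    return num / den

# Two strata (P_z, pi(z), S_z): unequal inclusion rates and different survival.
strata = [(0.5, 0.1, lambda t: 0.8), (0.5, 0.4, lambda t: 0.4)]
print(pop_survivor(strata, 1.0), sample_survivor(strata, 1.0))  # ≈ 0.6 vs 0.48
```

Because the lower-survival stratum is oversampled, the naive sample-based survivor probability understates the population value, which is the response-selective bias discussed in Section 15.4.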


Estimates of S(t) are valuable for homogeneous subgroups in a population, and we consider non-parametric estimation by using discrete time. Following the set-up in Example 1, for individual i ∈ s let τ_i0 denote the start of observation, let y_i = min(t_i, τ_i1) denote either a failure time or censoring time, and let δ_i = I(y_i = t_i) indicate a failure. If the sample design is ignorable then the log-likelihood corresponding to (15.14) can be written as

$$\ell(\lambda) = \sum_{i \in s} \sum_{t \ge 1} n_i(t)\,\{d_i(t) \log \lambda(t) + (1 - d_i(t)) \log(1 - \lambda(t))\},$$

where λ = (λ(1), λ(2), . . . ) denotes the vector of unknown λ(t), n_i(t) = I(τ_i0 < t ≤ y_i) indicates an individual is at risk of failure, and d_i(t) = I(T_i = t, n_i(t) = 1) indicates that individual i was observed to fail at time t. If observations from different sampled individuals are independent then the score function has components

$$U_t(\lambda) = \sum_{i \in s} \frac{d_i(t) - n_i(t)\lambda(t)}{\lambda(t)(1 - \lambda(t))}, \qquad t = 1, 2, \ldots. \qquad (15.28)$$

Solving U(λ) = 0, we get estimates

$$\hat\lambda(t) = \frac{\sum_{i \in s} d_i(t)}{\sum_{i \in s} n_i(t)} = \frac{d(t)}{n(t)}, \qquad (15.29)$$

where d(t) and n(t) are the number of failures and number of individuals at risk at time t, respectively. By (15.25) the estimate of S(t) is then the Kaplan–Meier estimate

$$\hat S(t) = \prod_{u=1}^{t-1} (1 - \hat\lambda(u)). \qquad (15.30)$$

The estimating equation U(λ) = 0 also provides consistent estimates when the sample design is ignorable but the observations for individuals i ∈ s are not mutually independent. When the design is nonignorable, using design weights in (15.28) has been proposed (e.g. Folsom, LaVange and Williams, 1989). However, an important point about censoring or losses to follow-up should be noted. It is assumed that censoring is independent of failure times in the sense that E{d_i(t) | n_i(t) = 1} = λ(t). If S_z(t) ≠ S(t) and losses to follow-up are related to z (which is plausible in many settings), then this condition is violated and even weighted estimation is inconsistent.

Variance estimates for the λ̂(t) or Ŝ(t) must take any association among sample individuals into account. If individuals are independent then standard maximum likelihood methods apply to (15.28), and yield asymptotic variance estimates (e.g. Cox and Oakes, 1984)

$$\widehat V(\hat\lambda) = \mathrm{diag}\{\hat\lambda(t)(1 - \hat\lambda(t))/n(t)\} \qquad (15.31)$$

and

$$\widehat V(\hat S(t)) = \hat S(t)^2 \sum_{u=1}^{t-1} \frac{\hat\lambda(u)}{n(u)(1 - \hat\lambda(u))}. \qquad (15.32)$$
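The discrete-time estimates (15.29) and (15.30) can be computed directly from (τ_i0, y_i, δ_i) records; a minimal sketch (the four records are invented):

```python
def km_discrete(records, t_max):
    # Hazard estimates (15.29) and Kaplan-Meier survivor estimates (15.30)
    # at time points t = 1, ..., t_max; records hold (tau0, y, delta).
    lam = {}
    for t in range(1, t_max + 1):
        n_t = sum(1 for tau0, y, d in records if tau0 < t <= y)      # at risk: n(t)
        d_t = sum(1 for tau0, y, d in records if y == t and d == 1)  # failures: d(t)
        lam[t] = d_t / n_t if n_t > 0 else 0.0
    S = {1: 1.0}
    for t in range(2, t_max + 2):
        S[t] = S[t - 1] * (1.0 - lam[t - 1])  # S(t) = prod_{u < t} (1 - lam(u))
    return lam, S

records = [(0, 1, 1), (0, 2, 1), (0, 2, 0), (0, 3, 1)]  # (tau0, y, delta)
lam, S = km_discrete(records, 3)
print(lam[1], S[3])  # 0.25 and 0.75 * (1 - 1/3) = 0.5
```

Left-truncated records (τ_i0 > 0) simply enter the risk sets later, which is exactly how (15.29) handles delayed entry.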


If there is association among individuals various methods for computing design-based variance estimates could be utilized (e.g. Wolter, 1985). We consider ignorable designs and the following simple random groups approach. Assume that individuals i ∈ s can be partitioned into C groups c = 1, . . . , C such that observations for individuals in different groups are independent, and the marginal distribution for T_i is the same across groups. Let s_c denote the sample individuals in group c and define

$$n_c(t) = \sum_{i \in s_c} n_i(t), \quad d_c(t) = \sum_{i \in s_c} d_i(t), \quad \hat\lambda_c(t) = \frac{d_c(t)}{n_c(t)}, \qquad t = 1, 2, \ldots;\; c = 1, \ldots, C.$$

Then (see the Appendix) under some mild conditions a robust variance estimate for λ̂ has entries (for t, u = 1, . . . , τ)

$$\widehat V_R(\hat\lambda)_{t,u} = \sum_{c=1}^{C} \frac{n_c(t)\, n_c(u)}{n(t)\, n(u)}\, (\hat\lambda_c(t) - \hat\lambda(t))(\hat\lambda_c(u) - \hat\lambda(u)), \qquad (15.33)$$

where we take τ as an upper limit on T. The corresponding variance estimate for Ŝ(t) is

$$\widehat V_R(\hat S(t)) = \hat S(t)^2 \sum_{c=1}^{C} \bigg\{ \sum_{u=1}^{t-1} \frac{n_c(u)(\hat\lambda_c(u) - \hat\lambda(u))}{n(u)(1 - \hat\lambda(u))} \bigg\}^2. \qquad (15.34)$$

The estimates λ̂(t) and variance estimates apply to the case of continuous times as well, by identifying t = 1, 2, . . . with measured values of t and noting that λ̂(t) = 0 if d(t) = 0. Williams (1995) derives a similar estimator by a linearization approach.
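The random-groups estimate (15.33) is easy to compute at a single time point; a sketch of the diagonal entry t = u (the group counts are invented):

```python
def robust_var_hazard(group_stats):
    # Diagonal (t = u) entry of (15.33) at one time point:
    # sum_c (n_c/n)^2 (lam_c - lam)^2, with lam_c = d_c/n_c and lam = d/n.
    n = sum(nc for nc, dc in group_stats)
    d = sum(dc for nc, dc in group_stats)
    lam = d / n
    return sum((nc / n) ** 2 * (dc / nc - lam) ** 2 for nc, dc in group_stats)

# Two groups at risk at time t, each given as (n_c(t), d_c(t)).
print(robust_var_hazard([(10, 2), (10, 4)]))  # ≈ 0.005
```

When the groups behave identically the group hazards agree with the pooled hazard and the estimate collapses toward zero, so (15.33) reflects between-group variability rather than the binomial variance in (15.31).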

15.5.2. Parametric models

Marginal parametric models for T_i given x_i may be handled along similar lines. Consider a continuous time model with hazard and survivor functions of the form λ(t | x_i; θ) and S(t | x_i; θ), and assume that the sample design is ignorable. The contribution to the likelihood score function from data (τ_i0, y_i, δ_i) for individual i is, from (15.14),

$$U_i(\theta) = \delta_i \frac{\partial \log \lambda(y_i \mid x_i; \theta)}{\partial \theta} - \int_{\tau_{i0}}^{y_i} \frac{\partial \lambda(u \mid x_i; \theta)}{\partial \theta}\, du. \qquad (15.35)$$

We again consider association among observations via independent clusters c = 1, . . . , C within which association may be present. The estimating equation


$$U(\theta) = \sum_{c=1}^{C} \sum_{i \in s_c} U_i(\theta) = 0 \qquad (15.36)$$

is unbiased, and under mild conditions the estimator θ̂ obtained from (15.36) is asymptotically normal with covariance matrix estimated consistently by

$$\widehat V(\hat\theta) = I(\hat\theta)^{-1}\, \widehat{\mathrm{Var}}\{U(\hat\theta)\}\, I(\hat\theta)^{-1}, \qquad (15.37)$$

where I(θ) = −∂U(θ)/∂θ′ and

$$\widehat{\mathrm{Var}}\{U(\hat\theta)\} = \sum_{c=1}^{C} \bigg(\sum_{i \in s_c} U_i(\hat\theta)\bigg)\bigg(\sum_{i \in s_c} U_i(\hat\theta)\bigg)'. \qquad (15.38)$$

The case where observations are mutually independent is covered by (15.38) with all clusters of size 1. An alternative estimator for V̂(θ̂) in this case is I(θ̂)^{−1}. Existing software can be used to fit parametric models, with some extra coding to evaluate (15.38). Huster, Brookmeyer and Self (1989) provide an example of this approach for clusters of size 2, and Lee, Wei and Amato (1992) use it with Cox's (1972) semi-parametric proportional hazards model.
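For a scalar exponential model the pieces of (15.36)–(15.38) have closed forms (U_i(θ) = δ_i/θ − (y_i − τ_i0) and I(θ) = Σδ_i/θ²), so the cluster-robust variance can be sketched directly (the clusters and values are invented):

```python
def sandwich_var_exp(clusters, theta):
    # Robust variance (15.37) with Var{U} from (15.38), for an exponential
    # hazard theta; each cluster is a list of (tau0, y, delta) records.
    info = sum(d for cl in clusters for _, _, d in cl) / theta ** 2      # I(theta)
    var_u = sum(sum(d / theta - (y - t0) for t0, y, d in cl) ** 2
                for cl in clusters)                                      # (15.38)
    return var_u / info ** 2                                             # (15.37)

clusters = [[(0.0, 2.0, 1)], [(0.0, 3.0, 1)]]  # two singleton clusters
theta_hat = 0.4  # MLE here: 2 events / 5.0 exposure
print(sandwich_var_exp(clusters, theta_hat))  # ≈ 0.0032
```

With all clusters of size 1 this is the usual robust (sandwich) variance; grouping correlated individuals into common clusters lets within-cluster score terms cancel or reinforce before squaring.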

An alternative approach for clustered data is to formulate multivariate models S(t_1, . . . , t_k) for the failure times associated with a cluster of size k. This is important when association among individuals is of substantive interest. The primary approaches are through the use of cluster-specific random effects (e.g. Clayton, 1978; Xue and Brookmeyer, 1996) or so-called copula models (e.g. Joe, 1997, Ch. 5). Hougaard (2000) discusses multivariate models, particularly of random effects type, in detail.

A model which has been studied by Clayton (1978) and others has joint survivor function for T_1, . . . , T_k of the form

$$S(t_1, \ldots, t_k) = \bigg\{ \sum_{j=1}^{k} S_j(t_j)^{-\phi} - (k - 1) \bigg\}^{-1/\phi}, \qquad (15.39)$$

where the S_j(t_j) are the marginal survivor functions and φ is an 'association' parameter. Specifications of S_j(t_j) in terms of covariates can be based on standard parametric survival models (e.g. Lawless, 2002, Ch. 6). Maximum likelihood estimation for (15.39) is in principle straightforward for a sample of independent clusters, not necessarily equal in size.
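The joint survivor function (15.39) is straightforward to evaluate given marginal survivor values; a minimal sketch (the values are invented, and as φ → 0 the expression approaches the independence product):

```python
def clayton_joint_survivor(marginals, phi):
    # S(t_1, ..., t_k) = {sum_j S_j(t_j)^(-phi) - (k - 1)}^(-1/phi), as in (15.39);
    # marginals holds the marginal survivor values S_j(t_j), with phi > 0.
    k = len(marginals)
    return (sum(s ** (-phi) for s in marginals) - (k - 1)) ** (-1.0 / phi)

# With phi = 1 and S_1 = S_2 = 0.5: (2 + 2 - 1)^(-1) = 1/3,
# above the independence value 0.25, reflecting positive association.
print(clayton_joint_survivor([0.5, 0.5], 1.0))  # ≈ 0.3333
```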

15.5.3. Semi-parametric methods

Semi-parametric models are widely used in survival analysis, the most popular being the Cox (1972) proportional hazards model, where T_i has a hazard function of the form

$$\lambda(t \mid x_i) = \lambda_0(t) \exp(\beta' x_i), \qquad (15.40)$$


where λ_0(t) > 0 is an arbitrary 'baseline' hazard function. In the case of independent observations, partial likelihood analysis, which yields estimates of β and non-parametric estimates of Λ_0(t) = ∫_0^t λ_0(u)du, is standard and well known (e.g. Kalbfleisch and Prentice, 2002). The case where x_i in (15.40) varies with t is also easily handled. For clustered data, Lee, Wei and Amato (1992) have proposed marginal methods analogous to those in Section 15.5.2. That is, the marginal distributions of clustered failure times T_1, . . . , T_k are modelled using (15.40), estimates are obtained under the assumption of independence, but a robust variance estimate is obtained for β̂. Lin (1994) provides further discussion and software. This methodology is extended further to include stratification as well as estimation of baseline cumulative hazard functions and survival probabilities by Spiekerman and Lin (1998) and Boudreau and Lawless (2001). Binder (1992) has discussed design-based variance estimation for β in marginal Cox models, for complex survey designs; his procedures utilize weighted pseudo-score functions. Lin (2000) extends these results and considers related model-based estimation. Software packages such as SUDAAN implement such analyses; Korn, Graubard and Midthune (1997) and Korn and Graubard (1999) illustrate this methodology. Boudreau and Lawless (2001) describe the use of general software like S-Plus and SAS for model-based analysis. Semi-parametric methods based on random effects or copula models have also been proposed, but investigated only in special settings (e.g. Klein and Moeschberger, 1997, Ch. 13).
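The log partial likelihood under (15.40) depends on β only through ratios over risk sets; a minimal sketch for a scalar covariate with no tied failure times (the three records are invented):

```python
import math

def cox_partial_loglik(beta, data):
    # Log partial likelihood for the Cox model (15.40), scalar covariate,
    # no ties; data holds (time, delta, x) per individual.
    ll = 0.0
    for t_i, d_i, x_i in data:
        if d_i == 1:
            # Risk set: all individuals still under observation at t_i.
            risk = sum(math.exp(beta * x_j) for t_j, _, x_j in data if t_j >= t_i)
            ll += beta * x_i - math.log(risk)
    return ll

data = [(1.0, 1, 0.0), (2.0, 1, 1.0), (3.0, 0, 0.0)]  # (time, delta, x)
print(round(cox_partial_loglik(0.0, data), 4))  # -log(6) ≈ -1.7918
```

Maximizing over β gives the usual partial likelihood estimate; the cluster-robust variance described above replaces the information-based variance when observations are associated.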

15.6. ANALYSIS OF EVENT OCCURRENCES

Many processes involve several types of event which may occur repeatedly or in a certain order. For interesting examples involving cohabitation, marriage, and marriage dissolution see Trussell, Rodriguez and Vaughan (1992) and Hoem and Hoem (1992). It is not possible to give a detailed discussion, but we consider briefly the analysis of recurrent events and then methods for more complex processes.

15.6.1. Analysis of recurrent events

Objectives in analyzing recurrent or repeated events include the study of patterns of occurrence and of the relationship of fixed or time-varying covariates to event occurrence. If the exact times at which the events occur are available then individual-level models based on intensity function specifications such as (15.11) or (15.12) may be employed, with inference based on likelihood functions of the form (15.13). Berman and Turner (1992) discuss a convenient format for parametric maximum likelihood computations, utilizing a discretized version of (15.13). Adjustments to variance estimation to account for cluster samples can be accommodated as in Section 15.5.2.
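When exact event times are available, non-parametric cumulative intensity estimates of the Nelson-Aalen type (which appears as (15.43) below) are simple to compute directly. The following is a minimal pure-Python sketch; the event times and risk-set function are invented for illustration, not data from the chapter.

```python
# Sketch of a Nelson-Aalen-type estimate of the cumulative intensity:
# Lambda_hat(u) = sum over event times t_ij <= u of 1 / n(t_ij),
# where n(t) is the number of individuals under observation at time t.

def nelson_aalen(event_times, at_risk, u):
    """event_times: observed event times t_ij; at_risk(t): size of risk set at t."""
    return sum(1.0 / at_risk(t) for t in event_times if t <= u)

# Toy data: four events among five individuals, all observed throughout [0, 10].
events = [2.0, 3.0, 3.0, 7.0]
print(nelson_aalen(events, at_risk=lambda t: 5, u=5.0))
```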

Semi-parametric methods may be based on the same partial likelihood ideas as for survival analysis (Andersen et al., 1993, Chs 4 and 6; Therneau and Hamilton, 1997). The book by Therneau and Grambsch (2000) provides many practical details and illustrations of this methodology. For example, under model (15.11) we may specify g(x_i(t)) parametrically, say as g(x_i(t); β) = exp{β' x_i(t)}, and leave the baseline intensity function λ_0(t) unspecified. A partial likelihood for β based on n independent individuals is then
\[
L_P(\beta) = \prod_{i=1}^{n} \prod_{j} \frac{\exp\{\beta' x_i(t_{ij})\}}{\sum_{l} r_l(t_{ij}) \exp\{\beta' x_l(t_{ij})\}}, \qquad (15.41)
\]
where r_l(t) = 1 if and only if individual l is under observation at time t. Therneau and Hamilton (1997) and Therneau and Grambsch (2000) discuss how Cox survival analysis software in S-Plus or SAS may be used to estimate β from (15.41), and model extensions to allow stratification on the number of prior events. The baseline cumulative intensity function Λ_0(u) = ∫_0^u λ_0(t) dt can be estimated by
\[
\hat\Lambda_0(u) = \sum_{(i,j):\, t_{ij} \le u} \frac{1}{\sum_{l} r_l(t_{ij}) \exp\{\hat\beta' x_l(t_{ij})\}}, \qquad (15.42)
\]
where β̂ is obtained by maximizing (15.41). In the case where there are no covariates, (15.42) becomes the Nelson-Aalen estimator (Andersen et al., 1993)
\[
\hat\Lambda_0(u) = \sum_{(i,j):\, t_{ij} \le u} \frac{1}{n(t_{ij})}, \qquad (15.43)
\]
where n(t_ij) is the total number of individuals observed at time t_ij.

Lawless and Nadeau (1995) and Lawless (1995) show that this approach also gives robust procedures for estimating the mean functions for the number of events over [0, t]. They provide robust variance estimates, and estimates for β only may also be obtained from S-Plus or SAS programs for Cox regression (Therneau and Grambsch, 2000). Random effects can be incorporated in models for recurrent events (Lawless, 1987; Andersen et al., 1993; Hougaard, 2000; Therneau and Grambsch, 2000).

In some settings exact event times are not provided and we instead have event counts for intervals such as weeks, months, or years. Discrete time versions of the methods above are easily developed. For example, suppose that t = 1, 2, ... indexes time periods and let y_i(t) and x_i(t) represent the number of events and covariate vector for individual i in period t. Conditional models may be based on specifications such as
\[
y_i(t) \mid \tilde{x}_i(t) \sim \mathrm{Poisson}\{\lambda_0(t) \exp(\beta' \tilde{x}_i(t))\},
\]
where x̃_i(t) may include both the information in x_i(t) and information about prior events. In some cases y_i(t) can equal only 0 or 1, and then analogous binary response models (e.g. Diggle, Liang and Zeger, 1994) may be used. Unobservable heterogeneity or overdispersion may be handled by the incorporation of


random effects (e.g. Lawless, 1987). Robust methods in which the focus is not on conditional models but on marginal mean functions such as
\[
E\{y_i(t) \mid x_i(t)\} = \lambda_0(t) \exp\{\beta' x_i(t)\} \qquad (15.44)
\]
are simple modifications of methods described above (Lawless and Nadeau, 1995).

15.6.2. Multiple event types

If several types of events are of interest, the intensity-based frameworks of Sections 15.2 and 15.3 may be used. In most situations the intensity functions for different types of events do not involve any common parameters. The likelihood functions based on terms of the form (15.13) then factor into separate pieces for each event type, meaning that models for each type can be fitted separately. Often the methodology for survival times or recurrent events serves as a convenient building block. Therneau and Grambsch (2000) discuss how software for Cox models can be used for a variety of settings. Lawless (2002, Ch. 11) discusses the application of survival methods and software to more general event history processes. Rather than attempt any general discussion, we give two short examples which represent common types of circumstances. In the first, methods based on ideas in Section 15.6.1 would be useful; in the second, survival analysis methods as in Section 15.5 can be exploited.

Example 3. Use of medical facilities

Consider factors which affect a person's decision to use a hospital emergency department, clinic, or family physician for certain types of medical treatment or consultation. Simple forms of analysis might look at the numbers of times various facilities are used, against explanatory variables for an individual. Patterns of usage can also be examined: for example, do persons tend to follow emergency room visits with a visit to a family physician? Association between the uses of different facilities can be considered through the methods of Section 15.6.1, by using time-dependent covariates that reflect prior usage. Ng and Cook (1999) provide robust methods for marginal means analysis of several types of events.

Example 4. Cohabitation and marriage

Trussell, Rodriguez and Vaughan (1992) discuss data on cohabitation and marriage from a Swedish survey of women aged 20-44. The events of main interest are unions (cohabitation or marriage) and the dissolution of unions, but the process is complicated by the fact that cohabitations or marriages may be first, second, third, and so on; and by the possibility that a given couple may cohabit, then marry. If we consider the durations of the sequence of 'states', survival analysis methodology may be applied. For example, we can consider the time to first marriage as a function of an individual's age and explanatory


factors such as birth cohort, education history, cohabitation history, and employment, using survival analysis with time-varying covariates. The duration of first marriage can be similarly considered, with baseline covariates for the marriage partners, and post-marriage covariates such as the birth of children. After dissolution of a first marriage, subsequent marital unions and dissolutions may be considered likewise.

15.7. ANALYSIS OF MULTI-STATE DATA

Space does not permit a detailed discussion of analysis for multi-state models; we mention only a few important points.

If intensity-based models in continuous time (see (15.4)) are used then, provided complete information about changes of state and covariates is available, (15.13) is the basis for the construction of likelihood functions. At any point in time, an individual is at risk for only certain types of events, namely those which correspond to transitions out of the state currently occupied and, as in Section 15.6.2, likelihood functions typically factor into separate pieces for each distinct type of transition. Andersen et al. (1993) extensively treat Markov models and describe non-parametric and parametric methods. For the semi-Markov case where the transitions out of a state depend on the time since the state was entered, survival analysis methods are useful. If transitions to more than one state are possible from a given state, then the so-called 'competing risks' extension of survival analysis (e.g. Kalbfleisch and Prentice, 2002, Ch. 8; Lawless, 2002, Chs 10 and 11) is needed. Example 4 describes a setting of this type.

Discrete time models are widely used for analysis, especially when individuals are observed at discrete times t = 0, 1, 2, ... that have the same meaning for different individuals. The basic model is a Markov chain with time-dependent transition probabilities
\[
P_{ijk}(t;\, x_i(t)) = P\{Y_i(t+1) = k \mid Y_i(t) = j,\ x_i(t)\},
\]
where x_i may include information on covariates and on prior event history. If an individual is observed at time points t = a, a + 1, ..., b then likelihood methods are essentially those for multinomial response regression models (e.g. Fahrmeir and Tutz, 1994; Lindsey, 1993). Analysis in continuous time is preferred if individuals are observed at unequally spaced time points or if there tend to be several transitions between observation points. If it is not possible to determine the exact times of transitions, then analysis is more difficult (e.g. Kalbfleisch and Lawless, 1989).
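In the time-homogeneous special case with no covariates, the transition probabilities are estimated by the observed fractions of j-to-k moves. A small sketch with hypothetical state labels and histories (not data from the chapter):

```python
# Estimate discrete-time Markov transition probabilities from observed
# state sequences: P_hat(j -> k) = (# of j -> k moves) / (# of moves out of j).
from collections import Counter

def transition_probs(sequences):
    moves = Counter((seq[t], seq[t + 1]) for seq in sequences
                    for t in range(len(seq) - 1))
    totals = Counter()
    for (j, _k), n in moves.items():
        totals[j] += n
    return {(j, k): n / totals[j] for (j, k), n in moves.items()}

# Two invented histories over states U (unemployment) and E (employment):
histories = [["U", "E", "E", "U"], ["U", "U", "E"]]
p = transition_probs(histories)
print(p[("U", "E")])
```

With time-dependence or covariates, the same counts would be replaced by a multinomial regression for each origin state, as the text notes.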

15.8. ILLUSTRATION

The US National Longitudinal Survey of Youth (NLSY) interviews persons annually. In this illustration we consider information concerning mothers who




chose to breast-feed their first-born child and, in particular, the duration T of breast feeding (see e.g. Klein and Moeschberger, 1997, Section 1.14).

In the NLSY, females aged 14 to 21 in 1979 were interviewed yearly until 1988. They were asked about births and breast feeding which occurred since the last interview, starting in 1983; in 1983 information about births as far back as 1978 was also collected. The data considered here are for 927 first-born children whose mothers chose to breast-feed them; duration times of breast feeding are measured in weeks. Covariates included

Race of mother (Black, White, Other)
Mother in poverty (Yes, No)
Mother smoked at birth of child (Yes, No)
Mother used alcohol at birth of child (Yes, No)
Age of mother at birth of child (Years)
Year of birth (1978-88)
Education level of mother (Years of school)
Prenatal care after third month (Yes, No).

A potential problem with these types of data is the presence of measurement error in observed duration times, due to recall errors in the date at which breast feeding concluded. We consider this below, but first report the results of duration analysis which assumes no errors of measurement. Given the presence of covariates related to design factors and the unavailability of cluster information, a standard (unweighted) analysis assuming independence of individual responses was carried out. Analysis using semi-parametric proportional hazards methods (Cox, 1972) and accelerated failure time models (Lawless, 2002, Ch. 6) suggests an exponential or Weibull regression model with covariates for education, smoking, race, poverty, and year of birth. Education in total years is treated as a continuous covariate below. Table 15.1 shows estimates and standard errors for the semi-parametric Cox model with hazard function of the form (15.40) and for a Weibull model with hazard function
\[
\lambda(t \mid x_i) = \delta \exp(\beta' x_i)\{t \exp(\beta' x_i)\}^{\delta - 1}. \qquad (15.45)
\]

For the model (15.45) the estimate of δ was almost exactly one, indicating an exponential regression model and that the β in (15.40) and in (15.45) have the same meaning. In Table 15.1, binary covariates are coded so that 1 = Yes, 0 = No; race is coded by two variables (Black = 1, not = 0 and White = 1, not = 0); year of birth is coded as -4 (1978) to 4 (1986); and education level of mother is centred to have mean 0 in the sample.

The analysis indicates longer duration for earlier years of birth, for white mothers than for non-white, for mothers with more education, and for mothers in poverty. There was a mild indication of an education-poverty interaction (not shown), with more educated mothers living in poverty tending to have shorter durations.


Table 15.1 Analysis of breast feeding duration.

                    Cox model            Weibull model
Covariate       estimate      se      estimate      se
Intercept          -           -        2.587    0.087
Black            0.106      0.128       0.122    0.127
White            0.279      0.097       0.325    0.097
Education        0.058      0.020       0.063    0.020
Smoking          0.250      0.079       0.280    0.078
Poverty          0.184      0.093       0.200    0.092
Year of birth    0.068      0.018       0.078    0.018

As noted, measurement error may be an issue in the recording of duration times, if the dates breast feeding terminated are in error; this will affect only observed duration times, and not censoring times. In some settings where recall error is potentially very severe, a decision may be made to use only current status data and not the measured durations (e.g. Holt, McDonald and Skinner, 1991). That is, since dates of birth are known from records, one could at each annual interview merely record whether or not breast feeding of an infant born in the past year had terminated. The 'observation' would then be that either T_i ≤ C_i or T_i > C_i, where C_i is the time between the date of birth and the interview. Diamond and McDonald (1992) discuss the analysis of such data.

If observed duration times are used, an attempt may be made to incorporate measurement error or to assess its possible effect. For the data considered here, the dates of births and interviews are not given, so nothing very realistic can be done. As a result we consider only a very crude sensitivity analysis based on randomly perturbing observed duration times t_i to t_i^new = t_i + u_i. Two scenarios were considered by way of illustration: (i) u_i has probability mass 0.2 at each of the values -4, -2, 0, 2, and 4 weeks, and (ii) u_i has probability mass 0.3, 0.3, 0.2, 0.1, 0.1 at the values -4, -2, 0, 2, 4, respectively. These represent cases where measurement errors in t_i are symmetric about zero and biased downward (i.e. dates of cessation of breast feeding are biased downward), respectively. For each scenario, 10 sets of values for t_i^new were obtained (censored duration times were left unchanged) and new models fitted for each dataset. The variation in estimates and standard errors across the simulated datasets was negligible for practical purposes and indicates no concerns about the substantive conclusions from the analysis; in each case the standard error of an estimate across the simulations was less than 20 % of its model-based standard error (which was stable across the simulated datasets).
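The perturbation scheme of scenario (ii) can be sketched as follows. The helper `perturb` and the example durations are hypothetical; censored times are left unchanged, as in the text.

```python
# Sketch of the crude sensitivity analysis: uncensored durations t_i are
# perturbed to t_i + u_i, where u_i has probability mass 0.3, 0.3, 0.2, 0.1, 0.1
# on -4, -2, 0, 2, 4 weeks (scenario (ii)). Durations below are invented.
import random

def perturb(durations, censored, values, weights, rng):
    """Leave censored times unchanged; add a random u_i to observed ones."""
    return [t if c else t + rng.choices(values, weights)[0]
            for t, c in zip(durations, censored)]

rng = random.Random(1)
print(perturb([12, 30, 8, 52], [False, False, True, False],
              values=[-4, -2, 0, 2, 4], weights=[0.3, 0.3, 0.2, 0.1, 0.1],
              rng=rng))
```

Repeating this 10 times and refitting the model to each perturbed dataset gives the across-simulation variability reported in the text.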

15.9. CONCLUDING REMARKS

For individual-level modelling and explanatory analysis of event history processes, sufficient information about events and covariates must be obtained.




This supports unweighted analysis based on conventional assumptions of non-selective sampling, but cluster effects in survey populations and samples may necessitate multivariate models or adjustments for association among sampled individuals. An approach based on the identification of clusters was outlined in Section 15.5 for the case of survival analysis, but in general, variance estimation methods used with surveys have received little study for event history data.

Many issues have not been discussed. There are difficult measurement problems associated with longitudinal surveys, including factors related to the behaviour of panel members. The edited volume of Kasprzyk et al. (1989) contains a number of excellent discussions. Tracking individuals closely enough, and designing data collection and validation so that one can obtain relevant information on individual life histories prior to enrolment and between follow-up times, are crucial. The following is a short list of areas where methodological development is needed.

Methods for handling missing data (e.g. Little, 1992).
Methods for handling measurement errors in responses (e.g. Hoem, 1985; Trivellato and Torelli, 1989; Holt, McDonald and Skinner, 1991; Skinner and Humphreys, 1999) and in covariates (e.g. Carroll, Ruppert and Stefanski, 1995).
Studies of response-related losses to follow-up and nonresponse, and the need for auxiliary data.
Methods for handling response-selective sampling induced by retrospective collection of data (e.g. Hoem, 1985, 1989).
Fitting multivariate and hierarchical models with incomplete data.
Model checking and the assessment of robustness with incomplete data.

Finally, the design of any longitudinal survey requires careful consideration, with a prioritization of analytic vs. descriptive objectives and analysis of the interplay between survey data, individual-level event history modelling, and explanatory analysis. For investigation of explanatory or causal mechanisms, trade-offs between large surveys and smaller, more controlled longitudinal studies directed at specific areas deserve attention. Smith and Holt (1989) provide some remarks on these issues, and they have often been discussed in the social sciences (e.g. Fienberg and Tanur, 1986).

APPENDIX: ROBUST VARIANCE ESTIMATION FOR THE KAPLAN-MEIER ESTIMATE

We write (15.28) as
\[
U\{\lambda(t)\} = \sum_{c=1}^{C} \sum_{i \in s_c} \frac{\delta_i(t)\{d_i(t) - \lambda(t)\}}{\lambda(t)\{1 - \lambda(t)\}} = 0, \qquad t = t_1, \ldots, t_k.
\]
The theory of estimating equations (e.g. Godambe, 1991; White, 1994) gives
\[
\widehat{\mathrm{var}}_R\{\hat\lambda(t)\} = I(\hat\lambda)^{-1}\,\widehat{\mathrm{var}}\{U(\hat\lambda)\}\,I(\hat\lambda)^{-1} \qquad (A.1)
\]
as an estimate of the asymptotic variance of λ̂(t), where I(λ) = −∂U(λ)/∂λ′ and \(\widehat{\mathrm{var}}\{U(\hat\lambda)\}\) is an estimate of the covariance matrix for U(λ). Now
\[
\mathrm{cov}\{U(\lambda(t)),\, U(\lambda(s))\} = \sum_{c=1}^{C} \sum_{i \in s_c} \sum_{j \in s_c} \frac{\delta_i(t)\delta_j(s)\{d_i(t) - \lambda(t)\}\{d_j(s) - \lambda(s)\}}{\lambda(t)\{1 - \lambda(t)\}\,\lambda(s)\{1 - \lambda(s)\}},
\]
where we use the fact that E{d_i(t) | δ_i(t) = 1, i ∈ s_c} = λ(t). Under mild conditions this may be estimated consistently (as C → ∞) by
\[
\sum_{c=1}^{C} \sum_{i \in s_c} \sum_{j \in s_c} \frac{\delta_i(t)\delta_j(s)\{d_i(t) - \hat\lambda(t)\}\{d_j(s) - \hat\lambda(s)\}}{\hat\lambda(t)\{1 - \hat\lambda(t)\}\,\hat\lambda(s)\{1 - \hat\lambda(s)\}},
\]
which after rearrangement using (A.1) gives (15.33).

ACKNOWLEDGEMENTS

The author would like to thank Christian Boudreau, Mary Thompson, Chris Skinner, and David Binder for helpful discussion. Research was supported in part by a grant from the Natural Sciences and Engineering Research Council of Canada.


CHAPTER 16

Applying Heterogeneous Transition Models in Labour Economics: the Role of Youth Training in Labour Market Transitions

Fabrizia Mealli and Stephen Pudney

16.1. INTRODUCTION

Measuring the impact of youth training programmes on the labour market continues to be a major focus of microeconometric research and debate. In countries such as the UK, where experimental evaluation of training programmes is infeasible, research is more reliant on tools developed in the literature on multi-state transitions, using models which predict simultaneously the timing and destination state of a transition. Applications include Ridder (1987), Gritz (1993), Dolton, Makepeace and Treble (1994) and Mealli and Pudney (1999).

In this chapter we describe an application of multi-state event history analysis, based not on a sample survey but rather on a 'census' of 1988 male school-leavers in Lancashire. Despite the fact that we are working with an extreme form of survey, there are several methodological issues, common in the analysis of survey data, that must be addressed. There are a number of important model specification difficulties facing the applied researcher in this area. One is the problem of scale and complexity that besets any realistic model. Active labour market programmes like the British Youth Training Scheme (YTS) and its successors (see Main and Shelly, 1998; Dolton, 1993) are embedded in the youth labour market, which involves individual transitions between several different states: employment, unemployment and various forms of education

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.


or training. In principle, every possible type of transition introduces an additional set of parameters to be estimated, so the dimension of the parameter space rises with the square of the number of separate states, generating both computational and identification difficulties. This is the curse of dimensionality that afflicts many different applications of statistical modelling of survey data in economics, including demand analysis (Pudney, 1981) and discrete response modelling (Weeks, 1995). A second problem is generated by the institutional features of the training and education system. YTS places are normally limited in duration and college courses are also normally of standard lengths. Conventional duration modelling is not appropriate for such episodes, but flexible semi-parametric approaches (such as that of Meyer, 1990) may introduce far too many additional parameters in a multi-state context. A third important issue is the persistence that is generally found in observed sequences of individual transitions. There are clearly very strong forces tending to hold many individuals in a particular state, once that state has been entered. This is a consideration that motivates the widely used Goodman (1961) mover-stayer model, which captures an extreme form of persistence. A fourth problem is sample attrition. This is important in any longitudinal study, but particularly so for the youth labour market, where many of the individuals involved may have weak attachment to training and employment, and some resistance to monitoring by survey agencies.

Given the scale of the modelling problem, there is no approach which offers an ideal solution to all these problems simultaneously. In practice we are seeking a model specification and an estimation approach which gives a reasonable compromise between the competing demands of generality and flexibility on the one hand and tractability on the other.

An important economic focus of the analysis is the selection problem. It is well known that selection mechanisms may play an important role in this context: people who are (self-)assigned to training may differ in terms of their unobservable characteristics, and these unobservables may also affect their subsequent labour market experience. If the fundamental role of training per se is to be isolated from the effects of the current pattern of (self-)selection, then it is important to account for the presence of persistent unobserved heterogeneity in the process of model estimation. Once suitable estimates are available, it then becomes possible to assess the impact of YTS using simulations which hold constant the unobservables that generate individual heterogeneity.

16.2. YTS AND THE LCS DATASET

With the rise of unemployment, and especially youth unemployment, in the UK during the 1980s, government provision of training took on an important and evolving role in the labour market. The one-year YTS programme, introduced in 1983, was extended in 1986 to two years; this was modified and renamed Youth Training in 1989. The system was subsequently decentralised with the introduction of local Training and Enterprise Councils. Later versions of the



scheme were intended to be more flexible, but the system remains essentially one of two-year support for young trainees. Our data relate to the 1988 cohort of school-leavers and thus to the two-year version of YTS. In exceptional circumstances, YTS could last longer than two years, but for a large majority of participants in the programme, the limit was potentially binding. YTS participants may have had special status as trainees (receiving only a standard YTS allowance), or be regarded as employees in the normal sense, being paid the normal rate for the job. Thus YTS had aspects of both training and employment subsidy schemes. Additional funds were available under YTS for payment to the training providers (usually firms, local authorities and organisations in the voluntary sector) to modify the training programme for people with special training needs who needed additional support.

The data used for the analysis are drawn from a database held on the computer system of the Lancashire Careers Service (LCS), whose duties are those of delivering vocational guidance and a placement service of young people into jobs, training schemes or further education. It generates a wide range of information on all young people who leave school in Lancashire and on the jobs and training programmes undertaken. Andrews and Bradley (1997) give a more detailed description of the dataset.

The sample used in this study comprises 3791 males who entered the labour market with the 1988 cohort of school-leavers, and for whom all necessary information (including postcode, school identifier, etc.) was available. This group was observed continuously from the time they left school, aged 16, in the spring or summer of their fifth year of secondary school, up to the final threshold of 30 June 1992, and every change of participation state was recorded. We identify four principal states: continuation in formal education, which we refer to as college (C); employment (E); unemployment (U); and participation in one of the variants of the government youth training scheme (all referred to here as YTS). Note that 16- and 17-year-olds are not eligible for unemployment-related benefits, so unemployment in this context does not refer to registered unemployment, but is the result of a classification decision of the careers advisor. We have also classified a very few short, unspecified non-employment episodes as state U. There is a fifth state which we refer to as 'out of sample' (O). This is a catch-all classification referring to any situation in which either the youth concerned is out of the labour force for some reason, or the LCS has lost touch with him or her. Once state O is encountered in the record of any individual, the record is truncated at that point, so that it is an absorbing state in the sense that there can be no subsequent recorded transition out of state O. In the great majority of cases, a transition to O signifies the permanent loss of contact between the LCS and the individual, so that it is, in effect, the end of the observation period and represents the usual phenomenon of sample attrition. However, it is important that we deal with the potential endogeneity of attrition, so transitions into state O are modelled together with other transition types.

Many of the individual histories begin with a first spell corresponding to a summer 'waiting' period before starting a job, training or other education. We



have excluded all such initial spells recorded by the LCS as waiting periods, and started the work history instead from the succeeding spell. After this adjustment is made, we observe for each individual a sequence of episodes, the final one uncompleted, and for each episode we have an observation on two endogenous variables: its duration and also (for all but the last) the destination state of the transition that terminates it. The data also include observations on explanatory variables such as age, educational attainment and a degree of detail on occupational category of the YTS place and its status (trainee, employee, special funding). Summary statistics for the sample are given in Table A1 in the appendix.

There are two obvious features revealed by inspection of the data, which give rise to non-standard elements of the model we estimate. The first of these is shown in Figure 16.1, which plots the smoothed empirical cdf of YTS spell durations. The cdf clearly shows the importance of the two-year limit on the length of YTS spells and the common occurrence of early termination. Nearly 30 % of YTS spells finish within a year and nearly 50 % before the two-year limit. Conventional transition model specifications cannot capture this feature, and we use below an extension of the limited competing risks (LCR) model introduced by Mealli, Pudney and Thomas (1996). Figure 16.2 plots the smoothed empirical hazard function of durations of college spells, and reveals another non-standard feature in the form of peaks at durations around 0.9 and 1.9 years (corresponding to educational courses lasting one and two

Figure 16.1 Empirical cdf of YTS durations. [x-axis: YTS duration (days), 0-900; y-axis: empirical cdf, 0.0-1.0]



Figure 16.2 Smoothed empirical hazard function for transitions from college. [x-axis: College duration (days), 0-900; y-axis: empirical hazard]

academic years). Again, we make an appropriate adaptation to our model to cope with this.
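The empirical cdf plotted in Figure 16.1 is simply the fraction of YTS spells with duration not exceeding d; the chapter reads off that nearly 30 % of spells end within a year and nearly 50 % before the two-year limit. A sketch with invented spell lengths:

```python
# Empirical cdf of spell durations: F_hat(d) = #{spells with t <= d} / #spells.
# The durations below are made up for illustration, not the LCS data.

def ecdf(durations, d):
    return sum(1 for t in durations if t <= d) / len(durations)

spells = [200, 340, 365, 700, 720, 725, 728, 729, 730, 730]  # days, invented
print(ecdf(spells, 365))   # fraction of spells finished within one year
```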

16.3. A CORRELATED RANDOM EFFECTS TRANSITION MODEL

Longitudinal data of the type described above cover each individual from the beginning of his or her work history to an exogenously determined date at which the observation period ends. This generates for each sampled individual a set of m observed episodes (note that m is a random variable). Each episode has two important attributes: its duration; and the type of episode that succeeds it (the destination state). In our case there are four possible states that we might observe. We write the observed endogenous variables r_0, τ_1, r_1, ..., τ_m, where τ_j is the duration of the jth episode, and r_j is the destination state for the transition that brings it to an end. Thus the model can be regarded as a specification for the joint distribution of a set of m continuous variables (the τ_j) and m discrete variables (the r_j). For each episode there is a vector of observed explanatory variables, x_j, which may vary across episodes but which is assumed constant over time within episodes.

The model we estimate in this study is a modified form of the conventional heterogeneous multi-spell multi-state transition model (see Pudney (1989), Lancaster (1990), and Chapter 15 by Lawless for surveys). Such models proceed by partitioning the observed work history into a sequence of episodes.




For the first spell of the sequence, there is a discrete distribution of the state variable r_0 with conditional probability mass function P(r_0 | x_0, v). Conditional on past history, each successive episode for j = 1, ..., m − 1 is characterised by a joint density/mass function f(τ_j, r_j | x_j, v), where x_j may include functions of earlier state and duration variables, to allow for lagged state dependence. The term v is a vector of unobserved random effects, each element normalised to have unit variance; v is constant over time, and can thus generate strong serial dependence in the sequence of episodes (see Chapter 14 by Skinner and Holmes). Under our sampling scheme, the final observed spell is usually an episode of C, E, U or YTS, which is still in progress at the end of the observation period. For this last incomplete episode, the eventual destination state is unobserved, and its distribution is characterised by a survivor function S(τ_m | x_m, v) which gives the conditional probability of the mth spell lasting at least as long as τ_m. Conditional on the observed covariates X = {x_0, ..., x_m} and the unobserved effects v, the joint distribution of r_0, τ_1, r_1, ..., τ_m is then

For the smaller number of cases where the sample record ends with a transition to state O (in other words, attrition), there is no duration for state O and the last component of (16.1) is S(τ_m | x_m, v) = 1. There is a further complication for the still fewer cases where the record ends with a YTS → O transition, and this is discussed below.

The transition components of the model (the probability density/mass function f and the survivor function S) are based on the notion of a set of origin- and destination-specific transition intensity functions for each spell. These give the instantaneous probability of exit to a given destination at a particular time, conditional on no previous exit having occurred. Thus, for any given episode, spent in state k, the lth transition intensity function θ_kl(t | x, v) is given by

where x and v are respectively vectors of observed and unobserved covariates which are specific to the individual but may vary across episodes for a given individual. Our data are constructed in such a way that an episode can never be followed by another episode of the same type, so the k, kth transition intensity θ_kk does not exist. The joint probability density/mass function of exit route, r,

and realised duration, τ, is then constructed as

where I_kl(τ | x, v) is the k, lth integrated hazard:


P(r_0, \tau_1, r_1, \ldots, \tau_m \mid X, v)
  = P(r_0 \mid x_0, v) \left[ \prod_{j=1}^{m-1} f(\tau_j, r_j \mid x_j, v) \right] S(\tau_m \mid x_m, v)                (16.1)

\theta_{kl}(t \mid x, v) = \lim_{dt \to 0^+} \Pr\bigl( \text{exit to } l \text{ in } [t, t+dt) \bigm| \text{no exit before } t;\ x, v \bigr) / dt                (16.2)

f(r, \tau \mid x, v) = \theta_{kr}(\tau \mid x, v) \exp\left[ -\sum_{l \ne k} I_{kl}(\tau \mid x, v) \right]                (16.3)

I_{kl}(\tau \mid x, v) = \int_0^{\tau} \theta_{kl}(t \mid x, v) \, dt                (16.4)


Since the random effects are unobserved, (16.1) cannot be used directly as the basis of an estimated model. However, if we assume a specific joint distribution function, G(v), for the random effects, they can be removed by integration and estimation can then proceed by maximising the following log-likelihood based on (16.1) with respect to the model parameters:

where the suffix i = 1, ..., n indexes the individuals in the sample.

It is important to realise that, for estimation purposes, the definition (16.2) of the transition intensity function is applicable to any form of continuous time multi-state transition process. It is also possible to think of such a process in terms of a competing risks structure, involving independently distributed latent durations for transition to each possible destination, with the observed duration and transition corresponding to the shortest of the latent durations. These two interpretations are observationally equivalent in the sense that it is always possible to construct a set of independent latent durations consistent with any given set of transition intensities. This aspect of the interpretation of the model therefore has no impact on estimation. However, when we come to simulating the model under assumptions of changed policy, or abstracting from the biasing effects of sample attrition, the interpretation of the structure becomes important. For simulation purposes the competing risks interpretation has considerable analytical power, but at the cost of a very strong assumption about the structural invariance of the transition process. We return to these issues in Section 16.5 below.

The specifications we use for the various components of the model are described in the following sections.

16.3.1. Heterogeneity

We now turn to the specification of the persistent random effects. First, note that there has been some debate about the practical importance of heterogeneity in applied modelling. Ridder (1987) has shown that neglecting unobserved heterogeneity results in biases that are negligible, provided a sufficiently flexible baseline hazard is specified. However, his results apply only to the simple case of single-spell data with no censoring. In the multi-spell context, where random effects capture persistence over time as well as inter-individual variation, and where there is a non-negligible censoring frequency, heterogeneity cannot be assumed to be innocuous. We opt instead for a model in which there is a reasonable degree of flexibility in both the transition intensity functions and the heterogeneity distribution.

The same problem of dimensionality is found here as in the observable part of the model. Rather than the general case of 16 unobservables, each specific to a distinct origin–destination combination, we simplify the structure by using persistent heterogeneity to represent those unobservable factors which predispose


\ln L = \sum_{i=1}^{n} \ln \int P(r_{i0} \mid x_{i0}, v) \left[ \prod_{j=1}^{m_i - 1} f(\tau_{ij}, r_{ij} \mid x_{ij}, v) \right] S(\tau_{i m_i} \mid x_{i m_i}, v) \, dG(v)                (16.5)


individuals towards long or short stays in particular states. Our view is that this sort of state-dependent 'stickiness' is likely to be the main dimension of unobservable persistence, an assumption similar to, but less extreme than, the assumption underlying the familiar mover–stayer model (Goodman, 1961). Thus we use a four-factor specification, where each of the four random effects is constant over time and linked to a particular state of origin rather than destination. We assume that the observed covariates and these random effects enter the transition intensities in an exponential form, so that in general θ_kl can be expressed as

where λ_k is a scale parameter. There is again a conflict between flexibility and tractability, in terms of the functional form of the distribution of the unobservables v_k. One might follow Heckman and Singer (1984) and Gritz (1993) by using a semi-parametric mass-point distribution, where the location of the mass points and the associated probabilities are treated as parameters to be estimated. Van den Berg (1997) has shown in the context of a two-state competing risks model that this specification has an advantage over other distributional forms (including the normal) in that it permits a wider range of possible correlations between the two underlying latent durations. However, in our four-dimensional setting, this would entail another great expansion of the parameter space. Since there is a fair amount of informal empirical experience suggesting that distributional form is relatively unimportant provided the transition intensities are specified sufficiently flexibly, we are content to assume a normal distribution for the v_k.

We introduce correlation across states in the persistent heterogeneity terms in a simple way which nevertheless encompasses the two most common forms used in practice. This is done by constructing the v_k from an underlying vector ε of independent standard normal variates as follows:

where α is a single parameter controlling the general degree of cross-correlation. Under this specification, the correlation between any pair of heterogeneity terms, λ_k v_k and λ_h v_h, is α² sgn(λ_k λ_h)/[(1 − α)² + α²]. Note that one of the scale parameters should be normalised with respect to its sign, since (with the ε_k symmetrically distributed) the sample distribution induced by the model is invariant to multiplication of all the λ_k by −1. There are two important special cases of (16.6): α = 0 corresponds to the assumption of independence across (origin) states; α = 1 yields the one-factor specification discussed by Lindeboom and van den Berg (1994).
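A quick Monte Carlo check of this one-parameter construction (with invented values of α and the λ_k) confirms the cross-correlation formula α² sgn(λ_k λ_h)/[(1 − α)² + α²]:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.4
lam = {"C": -0.8, "E": 0.5}            # hypothetical scale parameters

n = 200_000
eps0 = rng.standard_normal(n)          # common factor
eps = {k: rng.standard_normal(n) for k in lam}

# v_k = (1 - alpha) * eps_k + alpha * eps_0, as in the construction (16.6)
v = {k: (1 - alpha) * eps[k] + alpha * eps0 for k in lam}

emp = np.corrcoef(lam["C"] * v["C"], lam["E"] * v["E"])[0, 1]
theo = np.sign(lam["C"] * lam["E"]) * alpha**2 / ((1 - alpha)**2 + alpha**2)

assert abs(emp - theo) < 0.02          # theo = -0.16/0.52 here, about -0.308
```

Setting `alpha = 0.0` makes the two scaled terms independent, and `alpha = 1.0` collapses them onto the single common factor, matching the two special cases in the text.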

16.3.2. The initial state

Our model for the initial state indicator r_0 is a four-outcome multinomial logit (MNL) structure of the following form:


\theta_{kl}(t \mid x, v) = \theta_{kl}\bigl( t;\; z_k \gamma_l + \lambda_k v_k \bigr)

v_k = (1 - \alpha)\,\varepsilon_k + \alpha\,\varepsilon_0                (16.6)

\mathrm{corr}(\lambda_k v_k, \lambda_h v_h) = \mathrm{sgn}(\lambda_k \lambda_h)\, \frac{\alpha^2}{(1 - \alpha)^2 + \alpha^2}

P(r_0 = j \mid x_0, v) = \frac{\exp(x_0 \beta_j + \mu_j v_j)}{\sum_h \exp(x_0 \beta_h + \mu_h v_h)}                (16.7)


where β_1 is normalised at 0. The parameters μ_j are scale parameters that also control the correlation between the initial state and successive episodes.
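A minimal sketch of such an MNL probability function, with the first category's parameters normalised to zero (the scores here are hypothetical, not the chapter's estimates):

```python
import math

def mnl_probs(scores):
    # Multinomial logit with the first category normalised: its linear
    # index is fixed at 0, so its exponent contributes exp(0) = 1.
    exps = [1.0] + [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Scores for states 2..4 relative to the normalised state 1.
p = mnl_probs([0.2, -1.0, 0.5])
assert abs(sum(p) - 1.0) < 1e-12
assert all(0.0 < q < 1.0 for q in p)
```

Shifting every linear index by a constant leaves the probabilities unchanged, which is why one category's parameters can be set to zero without loss of generality.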

16.3.3. The transition model

For a completely general model, 16 transition intensities should be specified, a practical impossibility, since this would lead to an enormously large parameter set. This dimensionality problem is very serious in applied work, and there are two obvious solutions, neither of which is ideal. The most common approach is to reduce the number of states, either by combining states (for example, college and training) into a single category, or by deleting individuals who make certain transitions (such as those who return to college or who leave the sample by attrition). The consequences of this type of simplification are potentially serious and obscure, since it is impossible to test the implicit underlying assumptions without maintaining the original degree of detail. In these two examples, the implicit assumptions are respectively: that transition rates to and from college and training are identical; and that the processes of transition to college or attrition are independent of all other transitions. Neither assumption is very appealing, so we prefer to retain the original level of detail in the data, and to simplify the model structure in (arguably) less restrictive and (definitely) more transparent ways.

We adopt as our basic model a simplified specification with separate intensity functions only for each exit route, letting the effect of the state of origin be captured by dummy variables included with the other regressors, together with a few variables which are specific to the state of origin (specifically SPECIAL and YTCHOICE, describing the nature of the training placement in YTS spells, and YTMATCH, CLERK and TECH, describing occupation and training match for E → U transitions).

The best choice for the functional form is not obvious. Meyer (1990) has proposed a flexible semi-parametric approach which is now widely used in simpler contexts (see also Narendranathan and Stewart, 1993). It entails estimating the dependence of θ on t as a flexible step function, which introduces a separate parameter for each step. In our case, with 16 different transition types, this would entail a huge expansion in the dimension of the parameter space. Instead, we adopt a different approach. We specify a generic parametric functional form for θ_kl which is chosen to be reasonably flexible (in particular, not necessarily monotonic). However, we also exploit our a priori knowledge of the institutional features of the education and training systems to modify these functional forms to allow for the occurrence of 'standard' spell durations for many episodes of YTS and college. These modifications to the basic model are described below. The basic specification we use is the Burr form:

where z_k is a row vector of explanatory variables constructed from x in some way that may be specific to the state of origin, k. Note that the absence of a


\theta_{kl}(t \mid x, v) = \frac{\exp(z_k \gamma_l + \lambda_k v_k)\, \alpha_{kl}\, t^{\alpha_{kl} - 1}}{1 + \sigma_{kl}^2 \exp(z_k \gamma_l + \lambda_k v_k)\, t^{\alpha_{kl}}}                (16.8)


subscript k on γ_l is not restrictive: any form can be rewritten by defining z_k appropriately, using origin-specific dummy variables in additive and multiplicative form. The form (16.8) is non-proportional and not necessarily monotonic, but it has the Weibull form as the special case σ_kl = 0. The parameters α_kl and σ_kl are specific to the origin–destination combination k, l, and this gives the specification considerable flexibility. The Burr form has the following survivor function:

Note that the Burr model can be derived as a Weibull–gamma mixture, with the gamma heterogeneity spell-specific and independent across spells, but such an interpretation is not necessary and is not in any case testable without further a priori restrictions, such as proportionality.
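The Burr hazard and its survivor function can be checked against each other in a few lines. Parameter values below are invented; the closed-form integrated hazard used in the check follows directly from the Burr hazard:

```python
import math

def burr_hazard(t, lam, a, s2):
    # theta(t) = lam*a*t**(a-1) / (1 + s2*lam*t**a); s2 = 0 gives Weibull.
    return lam * a * t ** (a - 1.0) / (1.0 + s2 * lam * t ** a)

def burr_survivor(t, lam, a, s2):
    # S(t) = (1 + s2*lam*t**a) ** (-1/s2)
    return (1.0 + s2 * lam * t ** a) ** (-1.0 / s2)

lam, a, s2, T = 0.003, 1.2, 0.5, 400.0

# Integrated hazard in closed form: (1/s2)*log(1 + s2*lam*T**a),
# so S(T) = exp(-integrated hazard) exactly.
ih = math.log(1.0 + s2 * lam * T ** a) / s2
assert abs(burr_survivor(T, lam, a, s2) - math.exp(-ih)) < 1e-12

# As s2 -> 0 the survivor approaches the Weibull exp(-lam * T**a).
assert abs(burr_survivor(T, lam, a, 1e-8) - math.exp(-lam * T ** a)) < 1e-4
```

The second assertion illustrates the nesting noted in the text: the restricted Weibull form sits inside the Burr family as a boundary case.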

16.3.4. YTS spells

There are two special features of YTS episodes that call for some modification of the standard transition model outlined above. One relates to attrition (YTS → O transitions). Given the monitoring function of the LCS for YTS trainees, it is essentially impossible for the LCS to lose contact with an individual while he or she remains in a YTS place. Thus a YTS → O transition must coincide with a transition from YTS to C, E or U, where the destination state is unobserved by the LCS. Thus a transition of this kind is a case where the observed duration in the (m − 1)th spell (YTS) is indeed the true completed YTS duration, but the destination state r_{m−1} is unobserved. For the small number of episodes of this kind, the distribution (16.3) is

A second special feature of YTS episodes is the exogenous two-year limit imposed on them by the rules of the system. Mealli, Pudney and Thomas (1996) proposed a simple model for handling this complication, which Figure 16.1 shows to be important in our data. The method involves making allowance for a discontinuity in the destination state probabilities conditional on YTS duration at the two-year limit. The transition structure operates normally until the limit is reached, at which point a separate MNL structure comes into play. Thus, for a YTS episode,

where w is a vector of relevant covariates. In the sample there are no cases at all of a college spell following a full-term YTS episode; consequently the summation in the denominator of (16.11) runs over only two alternatives, E and U. The δ and μ parameters are normalised to zero for the latter.


S_{kl}(t \mid x, v) = \left[ 1 + \sigma_{kl}^2 \exp(z_k \gamma_l + \lambda_k v_k)\, t^{\alpha_{kl}} \right]^{-1/\sigma_{kl}^2}                (16.9)

f(\tau_{m-1} \mid x, v) = \left[ \sum_{l} \theta_{\mathrm{YTS},l}(\tau_{m-1} \mid x, v) \right] \exp\left[ -\sum_{l} I_{\mathrm{YTS},l}(\tau_{m-1} \mid x, v) \right]                (16.10)

(with the sums running over l = C, E, U)

\Pr(r = l \mid \tau = 2, w, v) = \frac{\exp(w \delta_l + \mu_l v_{\mathrm{YTS}})}{\sum_{l'} \exp(w \delta_{l'} + \mu_{l'} v_{\mathrm{YTS}})}                (16.11)
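The two-regime YTS mechanism can be sketched as follows: latent competing-risk durations govern exits before the 730-day limit, and a two-outcome logit (as in (16.11), with the U branch normalised to zero) allocates the destination at the limit. All names and values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
LIMIT = 730.0                       # two-year YTS limit, in days

def yts_spell(latents, logit_scores):
    # Before the limit, the earliest latent exit wins; at the limit,
    # a separate two-outcome logit (E vs U) picks the destination.
    dest = min(latents, key=latents.get)
    if latents[dest] < LIMIT:
        return latents[dest], dest
    z = np.exp([logit_scores.get("E", 0.0), 0.0])   # U normalised to 0
    p_e = z[0] / z.sum()
    return LIMIT, "E" if rng.random() < p_e else "U"

# All latent exits beyond the limit: the end-of-scheme logit decides.
tau, r = yts_spell({"E": 900.0, "U": 1200.0, "C": 800.0}, {"E": 1.0})
assert tau == LIMIT and r in ("E", "U")

# An early latent exit: the ordinary transition structure applies.
tau2, r2 = yts_spell({"E": 300.0, "U": 500.0, "C": 800.0}, {"E": 1.0})
assert (tau2, r2) == (300.0, "E")
```

This reproduces the discontinuity in destination probabilities at the limit that the modification is designed to capture.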


16.3.5. Bunching of college durations

To capture the two peaks in the empirical hazard function for college spells, we superimpose two spikes uniformly across all transition intensity functions for college spells. Thus, for origin C and destinations l = E, U, O, YTS, the modified transition intensities are

where A_1(t) and A_2(t) are indicator functions of the intervals 0.8 < t ≤ 1 and 1.8 < t ≤ 2 (in years) respectively.

16.3.6. Simulated maximum likelihood

The major computational problem involved in maximising the log-likelihood function (16.5) is the computation of the four-dimensional integral which defines each of the n likelihood components. The approach we take is to approximate the integral by an average over a set of Q pseudo-random deviates generated from the assumed joint normal distribution for v. We also make use of the antithetic variates technique to improve the efficiency of simulation, with all the underlying pseudo-random normal deviates reused with reversed sign to reduce simulation variance. Thus, in practice we maximise numerically the following approximate log-likelihood:

where L_i(v) is the likelihood component for individual i, evaluated at a simulated value of the unobservables.

This simulated ML estimator is consistent and asymptotically efficient as Q → ∞ and n → ∞ with √n/Q → 0 (see Gourieroux and Monfort, 1996). Practical experience suggests that it generally works well even with small values of Q (see Mealli and Pudney (1996) for evidence on this). In this study, we use Q = 40. The asymptotic approximation to the covariance matrix of the estimated parameter vector is computed via the conventional OPG (outer product of gradient) formula, which gives a consistent estimate in the usual sense, under the same conditions on Q and n.
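A toy version of the antithetic-variates device (not the authors' code): each draw v is paired with its sign-reversed partner −v, and for any likelihood component linear in v the pair averages out the simulation noise exactly.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulated_loglik(lik, n_draws):
    # Average the likelihood over Q draws v and their antithetic
    # partners -v, then take the log (one individual, for simplicity).
    v = rng.standard_normal(n_draws)
    contrib = 0.5 * (lik(v) + lik(-v))
    return np.log(contrib.mean())

# For a component linear in v, 0.5*[(a + b*v) + (a - b*v)] = a exactly,
# so the simulated value hits log(a) regardless of the draws.
a, b = 0.7, 2.0
ll = simulated_loglik(lambda v: a + b * v, n_draws=5)
assert abs(ll - np.log(a)) < 1e-12
```

Real likelihood components are of course nonlinear in v, so the cancellation is only partial there; the pairing then reduces, rather than eliminates, the simulation variance.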

16.4. ESTIMATION RESULTS

16.4.1. Model selection

Our preferred set of estimates is given in Tables A4–A7 in the appendix. These estimates are the outcome of a process of exploration which of necessity could not follow the 'general-to-specific' strategy that is usually favoured, since the most general specification within our framework would have approximately 450 parameters, with 16 separate random effects, and


\theta^*_{Cl}(t \mid x, v) = \theta_{Cl}(t \mid x, v)\, \exp\left[ \pi_1 A_1(t) + \pi_2 A_2(t) \right]

\ln L^* = \sum_{i=1}^{n} \ln \left\{ \frac{1}{2Q} \sum_{q=1}^{Q} \left[ L_i(v^{(q)}) + L_i(-v^{(q)}) \right] \right\}


would certainly not be possible to estimate with available computing and data resources. Even after the considerable simplifications we have made, there remain 117 parameters in the model. Apart from the constraints imposed on us by this dimensionality problem, we have adopted throughout a conservative criterion, and retained in the model all variables with coefficient t-ratios in excess of 1.0. Some further explanatory variables were tried in earlier specifications, but found to be insignificant everywhere. These were all dummy variables, distinguishing those who: took a non-academic subject mix at school; were in a technical/craft occupation when in work; and had trainee rather than employee status when in YTS. Thus the sparse degree of occupational and training detail in the final model is consistent with the available sample information.

The relatively low frequencies of certain transition types have made it necessary to impose further restrictions to achieve adequate estimation precision. In particular, the α and σ parameters for destination l = O (sample attrition) could not be separately estimated, so we have imposed the restrictions α_{kO} = α_{hO} and σ_{kO} = σ_{hO} for all k, h.

16.4.2. The heterogeneity distribution

Table A7 gives details of the parameters underlying the joint distribution of the persistent heterogeneity terms appearing in the initial state and transition structures. Heterogeneity appears strongly significant in the initial state logit only for the exponents associated with employment and YTS. The transition intensities have significant origin-specific persistent heterogeneity linked to C, U and YTS. The evidence for heterogeneity associated with employment is weak, although a very conservative significance level would imply a significant role for it, and also in the logit that comes into force at the two-year YTS limit (the E branch in Table A6).

The estimated value of α implies a correlation of 0.30 between any pair of scaled heterogeneity terms, λ_k v_k, in the transition part of the model. There is a positive correlation between the heterogeneity terms associated with the E, U and YTS states, but these are all negatively correlated with the heterogeneity term associated with college. This implies a distinction between those who are predisposed towards long college spells and those with a tendency towards long U and YTS spells. Note that the estimate of α is significantly different from both 0 and 1 at any reasonable significance level, so both the one-factor and independent stochastic structures are rejected by these results, although independence is clearly preferable to the single-factor assumption.

Wherever significant, there is a negative correlation between the random effect appearing in a branch of the initial state logit, μ_j v_j, and the corresponding random effect in the transition structure, λ_j v_j. This implies, as one might expect, that a high probability of starting in a particular state tends to be associated with long durations in that state.

The logit structure which determines exit route probabilities once the two-year YTS limit is reached involves a random effect which is correlated with therandom effect in the transition intensities for YTS spells. However, this is ofdoubtful significance.


\alpha_{kO} = \alpha_{hO}, \qquad \sigma_{kO} = \sigma_{hO} \qquad \text{for all } k, h


16.4.3. Duration dependence

The functional forms of the destination-specific transition intensities are plotted in Figures 16.3–16.6, conditional on different states of origin. In constructing these plots, we have fixed the elements of the vector of observed covariates x at the representative values listed in Table 16.1 below. The persistent origin-specific random effects, v_k, are fixed at their median values, 0. The relative diversity of

[Figure: transition intensity plotted against duration in college (0 to 1000 days); curves for exits to E, U, O and YTS.]

Figure 16.3 Estimated transition intensities for exit from education (C).

[Figure: transition intensity plotted against duration in employment (0 to 1000 days); curves for exits to U, O and YTS.]

Figure 16.4 Estimated transition intensities for exit from employment (E).



[Figure: transition intensity plotted against duration in unemployment (0 to 1000 days); curves for exits to E, O and YTS.]

Figure 16.5 Estimated transition intensities for exit from unemployment (U).

[Figure: transition intensity plotted against duration in YTS (0 to 1000 days); curves for exits to E and U.]

Figure 16.6 Estimated transition intensities for exit from YTS.

the functional forms across states of origin gives an indication of the degree of flexibility inherent in the structure we have estimated. There are several points to note here.

First, the estimated values of the 13 σ parameters are significant in all but four cases, implying that the restricted Weibull form would be rejected against the Burr model that we have estimated; thus, if we were prepared to assume proportional Weibull competing risks, this would imply a significant role for




destination-specific Gamma heterogeneity uncorrelated across spells. Second, the transition intensities are not generally monotonic; an increasing then falling pattern is found for the transitions C → YTS, E → YTS and YTS → U. The aggregate hazard rate is non-monotonic for exits from college, employment and unemployment. Third, transition intensities for exit to college are very small for all states of origin except unemployment, where there is a sizeable intensity of transition into education at short unemployment durations. The generally low degree of transition into state C reflects the fact that, for most people, formal post-16 education is a state entered as first destination after leaving school, or not at all. However, the fact that there are unobservables common to both the initial state and transition parts of the model implies that the decision to enter college after school is endogenous and cannot be modelled separately from the transitions among the other three states. The discontinuity of exit probabilities at the two-year YTS limit is very marked (see Figure 16.7).

Figure 16.8 shows the aggregated hazard rates, θ_k(t | x, v), governing exits from each state of origin, k. The typically short unemployment durations show up as a high hazard rate for exits from unemployment, but one declining strongly with duration, implying a heavy right-hand tail for the distribution of unemployment durations. For the other three states of origin, the hazard rates are rather flatter, except for the one- and two-year peaks for college spells. Note that we cannot distinguish unambiguously between true duration dependence and the effects of non-persistent heterogeneity here, at least not without imposing restrictions such as proportionality of hazards.

16.4.4. Simulation strategy

The model structure is sufficiently complex that it is difficult to interpret theparameter estimates directly. Instead we use simple illustrative simulations tobring out the economic implications of the estimated parameter values. The

[Figure: Pr(exit route) plotted against duration in YTS (0 to 1000 days); curve for exit to E shown.]

Figure 16.7 Probabilities of exit routes from YTS conditional on duration.


\theta_k(t \mid x, v) = \sum_{l \ne k} \theta_{kl}(t \mid x, v)


'base case' simulations are performed for a hypothetical individual who is average with respect to quantitative attributes and modal with respect to most qualitative ones. An exception to this is educational attainment, which we fix at the next-to-lowest category (GCSE2), to represent the group for whom YTS is potentially most important. Thus our representative individual has the characteristics listed in Table 16.1.

The treatment of state O (nonignorable attrition) is critical. We assume that, conditional on the persistent heterogeneity terms (v_C, v_E, v_U, v_YTS), the labour market transition process is the outcome of a set of independent competing risks represented by the hazards θ_kl. Superimposed on this process is a fifth

[Figure: hazard rate plotted against duration by state of origin (0 to 1000 days); curves for C, E, U and YTS.]

Figure 16.8 Aggregated hazard rates.

Table 16.1 Attributes of illustrative individual.

Attribute                 Assumption used for simulations
Date of birth             28 February 1972
Ethnic origin             White
Educational attainment    One or more GCSE passes, none above grade D
Subject mix               Academic mix of school subjects
Health                    No major health problem
School quality            Attended a school where 38.4 % of pupils achieved five or more GCSE passes
Area quality              Lives in a ward where 77.9 % of homes are owner-occupied
Local unemployment        Unemployment rate in ward of residence is 10.3 %
Date of episode           Current episode began on 10 March 1989
Previous YTS              No previous experience of YTS
Occupation                When employed, is neither clerical nor craft/technical
Special needs             Has no special training needs when in YTS




independent risk of attrition which is relevant to the observation process but irrelevant to the labour market itself. In this sense, attrition is conditionally ignorable, by which we mean that it is independent of the labour market transition process after conditioning on the persistent unobservables. It is not unconditionally ignorable, since the unobservables generate correlation between observed attrition and labour market outcomes, which would cause bias if ignored during estimation. However, conditional ignorability means that, once we have estimated the model and attrition process jointly (as we have done above), we can nevertheless ignore attrition in any simulation that holds fixed the values of the persistent unobservables. In these circumstances, ignoring attrition means marginalising with respect to the attrition process. Given the assumption of conditional independence of the competing risks, this is equivalent to deleting all hazards for exit to state O and simulating the remaining four-risk model. This procedure simulates the distribution of labour market event histories that would, if unhampered by attrition, be observed in the subpopulation of individuals with heterogeneity terms equal to the given fixed values v_k.

The simulation algorithm works as follows. For the representative individual defined in Table 16.1, 500 five-year work histories are generated via stochastic simulation of the estimated model.1 These are summarised by calculating the average proportion of time spent in each of the four states and the average frequency of each spell type. To control for endogenous selection and attrition, we keep all the random effects fixed at their median values of zero, and reset all transition intensities into state O to zero. We then explore the effects of the covariates by considering a set of hypothetical individuals with slightly different characteristics from the representative individual. These explore the effects of ethnicity, educational attainment and the nature of the locality. For the last of these, we change the SCHOOL, AREA and URATE variables to values of 10 %, 25 % and 20 % respectively.
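The inverse-cdf sampling described in footnote 1 can be sketched as follows (all parameter values invented): a Weibull latent duration is drawn via its inverse survivor function, and a discrete choice via type I extreme value (Gumbel) draws, whose argmax reproduces MNL probabilities.

```python
import numpy as np

rng = np.random.default_rng(4)

def draw_weibull(lam, a):
    # Inverse survivor: S(t) = exp(-lam*t**a) => t = (-log(u)/lam)**(1/a)
    # for u ~ Uniform(0, 1).
    return (-np.log(rng.random()) / lam) ** (1.0 / a)

def draw_choice(utils):
    # Add an independent Gumbel (type I extreme value) draw to each
    # utility; the argmax has multinomial logit choice probabilities.
    g = -np.log(-np.log(rng.random(len(utils))))   # inverse Gumbel cdf
    return int(np.argmax(np.asarray(utils) + g))

utils = [0.0, 0.5, -0.3]
draws = np.array([draw_choice(utils) for _ in range(20000)])
p_hat = (draws == 1).mean()
p_logit = np.exp(0.5) / np.exp(utils).sum()        # about 0.486
assert abs(p_hat - p_logit) < 0.02
assert draw_weibull(0.003, 1.1) > 0.0
```

Both steps use only uniform pseudo-random numbers pushed through an inverse cdf, which matches the mechanism the footnote describes.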

Table 16.2 reveals a large impact for the variables representing ethnicity and educational attainment, in comparison with the variables used to capture the influence of social background. An individual identical to the base case, but from a non-white ethnic group (typically South Asian in practice), is predicted to have a much higher probability of remaining in full-time education (59 % of the five-year period on average, compared with 20 % for the reference white individual). However, for ethnic minority individuals who are not in education, the picture is gloomy. Non-whites have a much higher proportion of their non-college time (22 % compared with 9 %) spent unemployed, with a roughly comparable proportion spent in YTS.

The effect of increasing educational attainment at GCSE is to increase the proportion of time spent in post-16 education from 20 % to 31 % and 66 % for the three GCSE performance classes used in the analysis. Improving GCSE

1 The simulation process involves sampling from the type I extreme value distribution for the logit parts of the model, and from the distribution of each latent duration for the transition part. In both cases, the inverse of the relevant cdf was evaluated using uniform pseudo-random numbers.

ESTIMATION RESULTS 261

Page 285: Analysis of Survey Data (Wiley Series in Survey Methodology) · PDF fileEmail (for orders and ... List of Contributors xviii Chapter 1 Introduction 1 R. L. Chambers and C. J. Skinner

Table 16.2 Simulated effects of the covariates for a hypothetical individual.

Simulated              Spell  Proportion   Proportion of      Frequency       Mean no.
individual             type   of time (%)  non-college time   of spells (%)   of spells

Base case              C      19.7         —                  18.2
(see Table 16.1)       E      56.9         70.9               39.0            2.59
                       U       7.2          9.0               24.1
                       YTS    16.2         20.2               18.7

Non-white              C      58.7         —                  55.3
                       E      24.2         58.6               17.8            2.01
                       U       9.3         22.5               18.8
                       YTS     7.8         18.9                8.0

1–3 GCSEs at           C      30.6         —                  29.9
grade C or better      E      47.7         68.7               33.1            2.16
                       U       9.3         13.4               23.2
                       YTS    12.4         17.9               13.8

More than 3 GCSEs      C      65.6         —                  64.8
at grade C or better   E      24.3         70.6               17.2            1.55
                       U       4.5         13.1               12.0
                       YTS     5.6         16.3                6.0

Major health           C      22.2         —                  21.3
problem                E      52.4         67.4               33.6            2.57
                       U       4.8          6.2               22.3
                       YTS    20.6         26.5               22.8

Poor school and        C      18.4         —                  17.0
area quality           E      52.8         64.7               35.9            2.75
                       U       9.1         11.2               24.4
                       YTS    19.6         24.0               22.7

Note: 500 replications over a five-year period; random effects fixed at 0.

performance has relatively little impact on the amount of time predicted to be spent in unemployment and its main effect is to generate a substitution of formal education for employment and YTS training.

There is a moderate estimated effect of physical and social disadvantage. Individuals identified as having some sort of (subjectively defined) major health problem are predicted to spend a greater proportion of their first five post-school years in college or YTS (43% rather than 36%) compared with the otherwise similar base case. This displaces employment (52% rather than 57%), but also reduces the time spent unemployed by about two and a half percentage points. In this sense, there is evidence that the youth employment system was managing to provide effective support for the physically disadvantaged, if only temporarily. After controlling for other personal characteristics, there is a significant role for


local social influences as captured by the occupational, educational and housing characteristics of the local area, and the quality of the individual's school. Poor school and neighbourhood characteristics are associated with a slightly reduced prediction of time spent in college and employment, with a corresponding increase in unemployment and YTS tenure. Nevertheless, compared with race and education effects, these are minor influences.

16.4.5. The effects of unobserved heterogeneity

To analyse the effects of persistent heterogeneity specific to each state of origin, we conduct simulations similar to those presented in the previous subsection. The results are shown in Figures 16.9–16.12. We consider the representative individual and then conduct the following sequence of stochastic simulations. For each state k = C, E, U, YTS, set all the heterogeneity terms to zero except for one, whose value is varied over a grid of values in the range [−2, 2] (covering approximately four standard deviations). At each point in the grid, 500 five-year work histories are simulated stochastically and the average proportion of time spent in each state is recorded. This is done for each of the four state-specific terms, and the results plotted. The plots in Figures 16.9–16.12 show the effect of varying each of the heterogeneity terms on the proportion of time spent respectively in college, employment, unemployment and YTS.
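The grid experiment just described can be sketched schematically. The `simulate_history` function below is a toy stand-in whose mechanics are invented for illustration, but the looping structure mirrors the text: fix all heterogeneity terms at zero except one, sweep that one over [−2, 2], and average 500 simulated histories at each grid point:

```python
import math
import random

STATES = ["C", "E", "U", "YTS"]

def simulate_history(effects, rng, horizon=5.0):
    # Toy stand-in for the estimated transition model: the share of the
    # five-year horizon spent in each state is driven by its heterogeneity
    # term plus noise. Purely illustrative mechanics.
    w = {s: math.exp(effects[s] + rng.gauss(0.0, 0.5)) for s in STATES}
    total = sum(w.values())
    return {s: horizon * w[s] / total for s in STATES}

def grid_experiment(varied_state, grid, replications=500, seed=1):
    rng = random.Random(seed)
    averages = {}
    for value in grid:
        effects = dict.fromkeys(STATES, 0.0)
        effects[varied_state] = value          # all other terms stay at zero
        sims = [simulate_history(effects, rng) for _ in range(replications)]
        averages[value] = {
            s: sum(h[s] for h in sims) / replications for s in STATES
        }
    return averages

grid = [x / 2 for x in range(-4, 5)]           # -2.0, -1.5, ..., 2.0
result = grid_experiment("C", grid)
```

Plotting `result[value][state]` against `value`, one curve per state, reproduces the structure of Figures 16.9–16.12.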

Figure 16.9 The effect of each state-specific random effect on the proportions of time spent in college.



Figure 16.10 The effect of each state-specific random effect on the proportions of time spent in employment.

Figure 16.11 The effect of each state-specific random effect on the proportions of time spent in unemployment.

The striking feature of these plots is the large impact of these persistent unobservable factors on the average proportions of the five-year simulation period spent in each of the four states. This is particularly true for college,



Figure 16.12 The effect of each state-specific random effect on the proportions of time spent in YTS.

where the proportion of time spent in education falls from over 20% at the median value of 0 to almost zero at one extreme of the [−2, 2] range, with a corresponding rise in the time spent in employment and unemployment. The proportion of time spent unemployed (essentially the unemployment rate among individuals of the representative type) is strongly influenced by all four state-specific random effects, with a 6 percentage point variation in the unemployment rate.

16.5. SIMULATIONS OF THE EFFECTS OF YTS

We now bring out the policy implications of the model by estimating the average impact of YTS for different types of individual, again using stochastic simulation as a basis (Robins, Greenland and Hu (1999) use similar tools to estimate the magnitude of a causal effect of a time-varying exposure). A formal policy simulation can be conducted by comparing the model's predictions in two hypothetical worlds in which the YTS system does and does not exist. The latter (the 'counterfactual') requires the estimated model to be modified in such a way that YTS spells can no longer occur. The results and the interpretational problems associated with this exercise are presented in Section 16.5.2 below. However, first we consider the effects of YTS participation and of early dropout from YTS, by comparing the simulated labour market experience of YTS participants and non-participants within a YTS world. For this we use the model as estimated, except that the 'risk' of attrition (transition to state O) is deleted.



16.5.1. The effects of YTS participation

We work with the same set of reference individuals as in Sections 16.4.4–16.4.5 above. Again, the state-specific random effects are fixed at their median values of 0, so that the simulations avoid the problems of endogenous selection arising from persistent unobservable characteristics. This time the 500 replications are divided into two groups: the first contains histories with no YTS spell and the second histories with at least one YTS spell. We then have two groups of fictional individuals, identical except that the first happen by chance to have avoided entry into YTS, while the second have been through YTS: thus we compare two potential types of work history, only one of which could be observed for a single individual (Holland, 1986).

To make the comparison as fair as possible, we take the last three years of the simulated five-year history for the non-YTS group and the post-YTS period (which is of random length) for the YTS group. We exclude from each group those individuals for whom there is a college spell in the reference period, thus focusing attention solely on labour market participants.
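A schematic version of this grouping and alignment step is sketched below; the toy history generator stands in for draws from the fitted model, and its transition mechanics are invented for the example:

```python
import random

def simulate_history(rng, horizon=5.0):
    # Toy five-year history: a list of (state, duration) spells.
    spells, t = [], 0.0
    while t < horizon:
        state = rng.choice(["C", "E", "U", "YTS"])
        d = min(rng.expovariate(1.0), horizon - t)
        spells.append((state, d))
        t += d
    return spells

def reference_period(spells, has_yts, horizon=5.0):
    # YTS group: the (random-length) post-YTS period, i.e. everything after
    # the end of the last YTS spell. Non-YTS group: the last three years.
    start, pos = horizon - 3.0, 0.0
    if has_yts:
        for state, d in spells:
            pos += d
            if state == "YTS":
                start = pos
    out, pos = [], 0.0
    for state, d in spells:
        lo, hi = pos, pos + d
        pos = hi
        overlap = min(hi, horizon) - max(lo, start)
        if overlap > 0:
            out.append((state, overlap))
    return out

rng = random.Random(0)
histories = [simulate_history(rng) for _ in range(500)]
yts_group = [h for h in histories if any(s == "YTS" for s, _ in h)]
no_yts_group = [h for h in histories if all(s != "YTS" for s, _ in h)]
# exclude histories with a college spell in the reference period
yts_group = [h for h in yts_group
             if all(s != "C" for s, _ in reference_period(h, True))]
no_yts_group = [h for h in no_yts_group
                if all(s != "C" for s, _ in reference_period(h, False))]
```

Comparing time spent in state U across the two groups' reference periods then gives the kind of contrast plotted in Figure 16.13.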

Figure 16.13 shows, for the base case individual, the difference in simulated unemployment incidence for the two groups. At the median value of the random effects, the difference amounts to approximately 5 percentage points, so that YTS experience produces a substantially reduced unemployment risk. We have investigated the impact of unobservable persistent heterogeneity by repeating the simulations for a range of fixed values of each state-specific random effect. Figure 16.13 shows

Figure 16.13 The proportions of labour market time spent unemployed for individuals with and without YTS episodes.



the plot for the unemployment-specific effect; broadly similar patterns are found for the other state-specific effects, suggesting that the beneficial effect of YTS participation is more or less constant across individuals with differing unobservable characteristics.

Table 16.3 shows the influence of observable characteristics, summarising the results of simulations for the base case and perturbations with respect to ethnicity, education and area/school quality. The beneficial effects of YTS participation are evident in all cases, but are particularly strong for members of ethnic minorities and for those with better levels of school examination achievement. Note that these are the groups with the highest probabilities of full-term YTS spells.

16.5.2. Simulating a world without YTS

The ultimate aim of this type of modelling exercise is to say something about the economic effects of implementing a training/employment subsidy scheme such as YTS. The obvious way to attempt this is to compare simulations of the model in two alternative settings: one (the 'actual') corresponding to the YTS scheme as it existed during the observation period; and the other (the 'counterfactual') corresponding to an otherwise identical hypothetical world in which YTS does not exist. There are well-known and obvious limits on what can be concluded from this type of comparison, since we have no direct way of knowing how the counterfactual should be designed. Note that this is not a

Table 16.3 Simulated effects of YTS participation on employment frequency and duration for hypothetical individuals.

Replications with no YTS spell

Simulated individual           % period in work   % spells in work
Base case                      89.9               86.0
Non-white                      65.3               60.5
1–3 GCSEs at grade C           85.1               83.0
> 3 GCSEs at grade C           88.3               85.4
Low school and area quality    86.5               84.0

Replications containing a YTS spell

Simulated individual           % post-YTS        % post-YTS        Mean YTS    % YTS spells
                               period in work    spells in work    duration    full term
Base case                      95.1              89.2              1.47        51.0
Non-white                      86.6              80.2              1.56        59.5
1–3 GCSEs at grade C           96.5              91.8              1.62        63.6
> 3 GCSEs at grade C           98.6              96.8              1.75        73.1
Low school and area quality    90.1              81.1              1.47        51.8

Note: 500 replications over a five-year period; random effects fixed at 0.


problem specific to the simulations presented in this chapter; any attempt to give a policy-oriented interpretation of survey-based results is implicitly subject to the same uncertainties.

The design of a counterfactual case requires assumptions about three major sources of interpretative error, usually referred to, rather loosely, as deadweight loss, displacement and scale effects. Deadweight loss refers to the possibility that YTS (whose objective is employment promotion) may direct some resources to those who would have found employment even without YTS. Since YTS has some of the characteristics of an employment subsidy, this is a strong possibility. It seems likely that if YTS had not existed during our observation period, then some of those who were in fact observed to participate in YTS would have been offered conventional employment instead, possibly on old-style private apprenticeships. Displacement refers to a second possibility: that a net increase in employment for the YTS target group might be achieved at the expense of a reduction in the employment rate for some other group, presumably older, poorly qualified workers. Note, however, that displacement effects can also work in the other direction. For example, Johnson and Layard (1986) showed, in the context of a segmented labour market with persistent unsatisfied demand for skilled labour and unemployment amongst unskilled workers, that training programmes can simultaneously produce an earnings increase and reduced unemployment probability for the trainee (which might be detected by an evaluation study) and also make available a job for one of the current pool of unemployed. A third interpretative problem is that the aggregate net effect of a training programme may be non-linear in its scale, so that extrapolation of a micro-level analysis gives a misleading prediction of the effect of a general expansion of the scheme. This mechanism may work, for instance, through the effect of the system on the relative wages of skilled and unskilled labour (see Blau and Robins, 1987).

The evidence on these effects is patchy. Deakin and Pratten (1987) give results from a survey of British employers which suggests that roughly a half of YTS places may have either gone to those who would have been employed by the training provider anyway or substituted for other types of worker (with deadweight loss accounting for the greater part of this inefficiency). However, other authors have found much smaller effects (see Jones, 1988), and the issue remains largely unresolved. Blau and Robins (1987) found some empirical evidence of a non-linear scale effect, by estimating a significant interaction between programme size and its effects. The need for caution in interpreting the estimated effects of YTS participation is evident, but there exists no clear and simple method for adjusting for deadweight, displacement and scale effects.

The economic assumptions we make about the counterfactual have a direct parallel with the interpretation of the statistical transition model (see also Greenland, Robins and Pearl (1999) for a discussion of this). To say anything about the effects of removing the YTS programme from the youth labour market requires some assumption about how the statistical structure would change if we were to remove one of the possible states. The simulations we present in Table 16.4 correspond to the very simplest counterfactual case and,


Table 16.4 Simulated work histories for hypothetical individuals with and without the YTS scheme in existence.

Simulated              Spell   Proportion of time in   Increase compared with
individual             type    non-YTS world (%)       YTS world (% points)

Base case              C       24.0                    +4.3
                       E       58.1                    +1.2
                       U       17.9                    +10.7

Non-white              C       65.3                    +6.6
                       E       21.7                    −2.5
                       U       13.0                    +3.7

1–3 GCSEs at           C       38.8                    +8.2
grade C                E       43.1                    −4.6
                       U       18.1                    +8.8

> 3 GCSEs at           C       70.2                    +4.6
grade C                E       22.3                    −2.0
                       U        7.4                    +2.9

Major health           C       31.5                    +9.3
problem                E       49.5                    −2.9
                       U       19.0                    +14.2

Poor school and        C       25.7                    +7.3
area quality           E       52.1                    −0.7
                       U       22.3                    +13.2

Note: 500 replications over a five-year period; random effects fixed at 0.

equivalently, to the simplest competing risks interpretation. In the non-YTS world, we simply force the transition intensities for movements from any state into YTS, and the probability of YTS as a first destination, to be zero. The remainder of the estimated model is left unchanged, so that it generates transitions between the remaining three states. In other words, we interpret the model as a competing risks structure, in which the YTS 'risk' can be removed without altering the levels of 'hazard' associated with the other possible destination states. This is, of course, a strong assumption and avoids the issue of the macro-level effects which might occur if there really were an abolition of the whole state training programme.

As before, we work with a set of hypothetical individuals, and remove the effect of inter-individual random variation by fixing the persistent individual-specific random effects at zero. Table 16.4 then summarises the outcome of 500 replications of a stochastic simulation of the model. The sequence of pseudo-random numbers used for each replication is generated using a randomly selected seed specific to that replication; within replications, the same pseudo-random sequence is used for the actual and counterfactual cases. Note that the results are not directly comparable with those presented in Section 16.5.1, which compared YTS participants and non-participants, since we are considering here
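The common-random-numbers device described above can be sketched as follows. The transition mechanics inside `simulate` are invented for illustration, but the seeding pattern (a replication-specific seed reused for the actual and counterfactual runs) matches the text:

```python
import random

def simulate(seed, yts_available, horizon=5.0):
    # Toy one-replication simulator. The same seed drives the 'actual'
    # (YTS exists) and 'counterfactual' (YTS removed) worlds, so the two
    # runs consume identical pseudo-random sequences.
    rng = random.Random(seed)
    time_in = {"C": 0.0, "E": 0.0, "U": 0.0, "YTS": 0.0}
    t = 0.0
    while t < horizon:
        u = rng.random()
        if yts_available:
            state = ("YTS" if u < 0.20 else "E" if u < 0.70
                     else "U" if u < 0.90 else "C")
        else:
            # competing-risks counterfactual: the YTS 'risk' is removed
            # and the draw is reallocated over the remaining states
            state = "E" if u < 0.625 else "U" if u < 0.875 else "C"
        d = min(rng.expovariate(2.0), horizon - t)
        time_in[state] += d
        t += d
    return time_in

def abolition_effect(replications=500):
    diffs = {"C": 0.0, "E": 0.0, "U": 0.0}
    for rep in range(replications):
        seed = random.Random(rep).randrange(2**31)  # replication-specific seed
        actual = simulate(seed, yts_available=True)
        counterfactual = simulate(seed, yts_available=False)
        for s in diffs:
            diffs[s] += (counterfactual[s] - actual[s]) / replications
    return diffs
```

Reusing the seed across the two worlds means the reported differences reflect only the removal of YTS, not Monte Carlo noise.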


the whole five-year simulation period rather than the later part of it. We are also not focusing exclusively on the labour market, since we retain in the analysis individuals who are predicted by the simulations to remain in education. A third major difference is that the analysis of Section 16.5.1 did not consider the effects of differences in YTS participation frequency together with the effects of YTS per participant, whereas the simulations reported here will necessarily show bigger impacts of abolition for groups with high YTS participation rates.

On the basis of the results in Table 16.4, the effect of the YTS programme on employment frequencies is important but moderate: a fall of no more than 5 percentage points in the proportion of time spent in employment. Instead, the major impact of abolition is on time spent in education and in unemployment. With YTS abolished, the proportion of time spent in unemployment rises for most cases by between 6 and 14 percentage points, although the rise is necessarily much smaller for those with low probabilities of YTS participation (notably non-whites and those with good GCSE results). The simulated degree of substitution between continuing education and YTS is substantial, with the proportion of time in education rising by 4–9 percentage points in every case. The rise is largest for individuals disadvantaged by ethnicity, health or social/educational background, but also for those with a modestly increased level of school examination achievement relative to the base case.

16.6. CONCLUDING REMARKS

We have estimated a large and highly complex transition model designed to address the formidable problems of understanding the role played by government training schemes in the labour market experience of school-leavers. The question 'what is the effect of YTS?' is a remarkably complex one, and we have looked at its various dimensions using stochastic simulation of the estimated model. Abstracting from endogenous (self-)selection into YTS, we have found evidence suggesting a significant improvement in subsequent employment prospects for those who do go through YTS, particularly in the case of YTS 'stayers'. This is a rather more encouraging conclusion than that of Dolton, Makepeace and Treble (1994), and is roughly in line with the earlier applied literature, based on less sophisticated statistical models. Our results suggest that, for the first five years after reaching school-leaving age, YTS appears mainly to have absorbed individuals who would otherwise have gone into unemployment or stayed on in the educational system. The employment promotion effect of YTS among 16–21-year-olds might in contrast be judged worthwhile but modest. Our estimated model is not intended to have any direct application to a period longer than the five-year simulation period we have used. However, arguably, these results do give us some grounds for claiming the existence of a positive longer term effect for YTS. The increased employment probabilities induced by YTS naturally occur in the late post-YTS part of the


five-year history we have simulated. As a result, we can conclude that, conditional on observables and persistent unobservable characteristics, a greater proportion of individuals can be expected to reach age 21 in employment, if YTS has been available during the previous five years, than would otherwise be the case. On the reasonable assumption of a relatively high degree of employment stability after age 21, this suggests a strong, positive, long-term effect of YTS on employment probabilities.

APPENDIX

Table A1 Variables used in the models.

Variable     Definition                                                      Mean

Time-invariant characteristics (mean over all individuals):
DOB          Date of birth (years after 1 January 1960)                      12.16
WHITE        Dummy = 1 if white; 0 if other ethnic origin                    0.942
GCSE2        Dummy for at least 1 General Certificate of Secondary
             Education (GCSE) pass at grade D or E                           0.263
GCSE3        Dummy for 1–3 GCSE passes at grade C or better                  0.185
GCSE4        Dummy for at least 4 GCSE passes at grade C or better           0.413
ILL          Dummy for the existence of a major health problem               0.012
SCHOOL       Measure of school quality: proportion of pupils with at
             least 5 GCSE passes in first published school league table      0.384
AREA         Measure of social background: proportion of homes in
             ward of residence that are owner-occupied                       0.779

Spell-specific variables (mean over all episodes):
DATE         Date of the start of spell (years since 1 January 1988)         1.11
YTSYET       Dummy for existence of a spell of YTS prior to the
             current spell                                                   0.229
YTSDUR       Total length of time spent on YTS prior to the current
             spell (years)                                                   0.300
YTSLIM       Dummy = 1 if two-year limit on YTS was reached prior
             to the current spell                                            0.094
YTSMATCH     Dummy = 1 if current spell is in employment and there
             was a previous YTS spell in the same industrial sector          0.121
CLERICAL     Dummy = 1 if current spell is in clerical employment            0.036
TECH         Dummy = 1 if current spell is in craft/technical employment     0.135
STN          Dummy = 1 for YTS spell and trainee with special
             training needs                                                  0.013
CHOICE       Dummy = 1 if last or current YTS spell in desired
             industrial sector                                               0.136
URATE        Local rate of unemployment (at ward level)                      0.103


Table A2 Sample transition frequencies (%).

(a) Initial spell

State of     Destination state
origin       C      E      U      YTS    Attrition   Incomplete   Marginal
C            —       7.7   12.3    4.8    4.9        70.3          47.4
E            3.0    —      17.8   15.4    0.9        62.9          10.6
U            9.1    27.2   —      57.6    5.5         0.7          28.2
YTS          1.6    73.9   11.9   —      10.1         2.5          13.8
Marginal     3.1    21.5    9.7   20.2    5.1        40.5         100

(b) All spells

State of     Destination state
origin       C      E      U      YTS    Attrition   Incomplete   Marginal
C            —       8.1   13.4    4.9    4.8        68.8          25.1
E            0.7    —      13.2    5.8    0.4        79.9          30.5
U            5.3    37.8   —      41.2   13.3         2.4          24.5
YTS          1.3    70.6   16.5   —       8.2         3.4          19.9
Marginal     1.8    25.3   10.7   13.1    6.2        42.9         100

Table A3 Mean durations (years).

                                            C      E      U      YTS
Mean duration for completed spells          1.00   0.57   0.27   1.48
Mean elapsed duration for both
complete and incomplete spells              2.95   2.17   0.31   1.52

Table A4 Estimates: initial state logit component (standard errors in parentheses).

             Destination state (relative to YTS)
Covariate    C              E              U

Constant     1.575 (0.36)   0.321 (0.38)   3.835 (0.47)
WHITE        2.580 (0.34)   —              0.867 (0.37)
GCSE2        0.530 (0.16)   —              —
GCSE3        1.284 (0.20)   0.212 (0.22)   0.307 (0.16)
GCSE4        2.985 (0.20)   0.473 (0.24)   0.522 (0.17)
ILL          —              2.311 (1.10)   —
SCHOOL       1.855 (0.31)   —              1.236 (0.35)
AREA         —              —              1.818 (0.33)
URATE        —              8.607 (2.57)   12.071 (1.95)


Table A5(a) Estimates: transition component (standard errors in parentheses).

Coefficient      Destination-specific transition intensities
                 C              E              U              O              YTS

Constant         1.377 (1.85)   5.924 (0.57)   1.820 (0.71)   6.926 (0.78)   4.481 (1.11)
DATE             8.090 (2.99)   —              —              0.795 (0.11)   1.884 (0.14)
YTSYET           —              —              1.461 (0.43)   —              —
YTSDUR           —              0.762 (0.18)   1.328 (0.46)   0.198 (0.17)   —
YTSLIMIT         —              2.568 (0.71)   3.234 (0.75)   —              —
YTSMATCH         —              —              0.610 (0.50)   —              —
CLERICAL         —              —              0.865 (0.53)   —              —
STN              —              —              1.158 (0.41)   —              —
CHOICE           —              0.335 (0.15)   —              —              —
WHITE            1.919 (0.77)   1.433 (0.28)   0.751 (0.32)   —              1.007 (0.29)
GCSE2            2.150 (0.61)   —              0.666 (0.20)   —              0.437 (0.18)
GCSE3            2.369 (0.88)   0.700 (0.17)   1.233 (0.24)   1.115 (0.33)   1.036 (0.45)
GCSE4            3.406 (0.94)   0.939 (0.18)   2.046 (0.26)   2.221 (0.32)   1.642 (0.45)
ILL              —              —              0.642 (0.39)   —              0.964 (0.75)
E                —              —              4.782 (0.59)   0.962 (0.61)   3.469 (1.05)
U                5.530 (0.73)   6.654 (0.54)   —              5.079 (0.58)   6.066 (1.04)
YTS              —              3.558 (0.49)   2.853 (0.41)   —              —
U* (GCSE3/4)     0.447 (0.72)   0.927 (0.22)   —              1.197 (0.36)   1.635 (0.45)
SCHOOL           —              0.233 (0.32)   0.690 (0.47)   1.389 (0.45)   1.451 (0.50)
AREA             1.512 (1.23)   —              1.628 (0.51)   —              —
URATE            —              3.231 (1.99)   2.630 (2.62)   8.488 (3.32)   5.724 (3.20)

College          0.817 (0.16)
College          1.516 (0.16)

Table A5(b) Estimates: Burr shape parameters (standard errors in parentheses).

             Destination
Origin       C              E              U              O              YTS

C            —              1.341 (0.15)   1.852 (0.16)   1.636 (0.13)   1.167 (0.22)
E            0.356 (0.45)   —              1.528 (0.21)   1.636 (0.13)   1.190 (0.22)
U            2.667 (0.25)   1.601 (0.10)   —              1.636 (0.13)   1.722 (0.11)
YTS          0.592 (0.58)   1.427 (0.12)   1.100 (0.13)   1.636 (0.13)   —

C            —              2.494 (0.62)   1.171 (0.47)   0.555 (0.26)   5.547 (1.48)
E            0.414 (1.70)   —              4.083 (0.43)   0.555 (0.26)   5.465 (0.75)
U            2.429 (0.41)   1.652 (0.13)   —              0.555 (0.26)   1.508 (0.12)
YTS          5.569 (4.45)   1.018 (0.36)   1.315 (0.40)   0.555 (0.26)   —


Table A6 Estimates: YTS limit logit (standard errors in parentheses).

Parameter           Coefficients for state E
Constant            5.205 (1.25)
Heterogeneity       1.493 (0.81)

Table A7 Estimates: coefficients of random effects and correlation parameter (standard errors in parentheses).

                            State
                            C              E              U              YTS
Initial state logit         0.075 (0.10)   1.106 (0.39)   0.160 (0.23)   0.521 (0.23)
Transition model            3.832 (0.33)   0.248 (0.21)   0.335 (0.13)   0.586 (0.20)
Correlation parameter       0.224 (0.04)

ACKNOWLEDGEMENTS

We are grateful to the Economic and Social Research Council for financial support under grants R000234072 and H519255003. We would also like to thank Martyn Andrews, Steve Bradley and Richard Upward for their advice and efforts in constructing the database used in this study. Pinuccia Calia provided valuable research assistance. We received helpful comments from seminar participants at the Universities of Florence and Southampton.


PART E

Incomplete Data


CHAPTER 17

Introduction to Part E

R. L. Chambers

17.1. INTRODUCTION

The standard theoretical scenario for survey data analysis is one where a sample is selected from a target population, data are obtained from the sampled population units and these data, together with auxiliary information about the population, are used to make an inference about either a characteristic of the population or a parameter of a statistical model for the population. An implicit assumption is that the survey data are 'complete' as far as analysis is concerned and so can be represented by a rectangular matrix, with rows corresponding to sample units and columns corresponding to sample variables, and with all cells in the matrix containing a valid entry. In practice, however, complex survey design and sample nonresponse lead to survey data structures that are more complicated than this, and different amounts of information relevant to the parameters of interest are often available from different variables. In many cases this is because these variables have different amounts of 'missingness'. Skinner and Holmes provide an illustration of this in Chapter 14, where variables in different waves of a longitudinal survey are recorded for different numbers of units because of attrition.

The three chapters making up Part E of this book explore survey data analysis where relevant data are, in one way or another, and to different extents, missing. The first, by Little (Chapter 18), develops the Bayesian approach to inference in general settings where the missingness arises as a consequence of different forms of survey nonresponse. In contrast, Fuller (Chapter 19) focuses on estimation of a finite population total in the specific context of two-phase sampling. This is an example of 'missingness by design' in the sense that an initial sample (the first-phase sample) is selected and a variable X observed. A subsample of these units is then selected (the second-phase sample) and another variable Y observed for the subsampled units. Values of Y are then missing for units in the first-phase sample who are not in the second-phase sample. As we show in the following section, this chapter is also relevant to the case of item nonresponse on a variable Y where imputation may make use of information on another survey variable X. Finally,

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.

Page 301: Analysis of Survey Data (Wiley Series in Survey Methodology) · PDF fileEmail (for orders and ... List of Contributors xviii Chapter 1 Introduction 1 R. L. Chambers and C. J. Skinner

Steel, Tranmer and Holt (Chapter 20) consider the situation where either the unit-level data of interest are missing completely, but detailed aggregates corresponding to these data are available, or a limited amount of survey information is available, but not sufficient to allow linkage of analysis and design variables in the modelling process.

In Chapter 18 Little builds on the Bayesian theory introduced in Chapter 4 to develop survey inference methods when data are missing because of nonresponse. Following standard practice he differentiates between unit nonresponse, where no data are available for survey non-respondents, and item nonresponse, where partial data are available for survey non-respondents. In the first case (unit nonresponse), the standard methodology is to compensate for this nonresponse by appropriately reweighting the responding sample units, while in the second case (item nonresponse) it is more usual to pursue an imputation strategy, and to fill in the 'holes' in the rectangular complete data matrix by best guesses about the missing variable values. From a theoretical perspective, this distinction is unnecessary. Unit nonresponse is just an extreme form of item nonresponse, and both weighting (or more generally, estimation) and imputation approaches can be developed for this case. In Section 17.3 below we briefly sketch the arguments for either approach and relate them to the Bayesian approach advocated by Little.

Fuller's development in Chapter 19 adopts a model-assisted approach to develop estimation under two-phase sampling, using a linear regression model for Y in terms of X to improve estimation of the population mean of Y. Since two-phase sampling is a special case of item nonresponse, the same methods used to deal with item nonresponse can be applied here. In Section 17.4 below we describe the approach to estimation in this case. Since imputation-based methods are a standard way of dealing with item nonresponse, a natural extension is where the second-phase sample data provide information for imputing the "missing" Y-values in the first-phase sample.

The development by Steel, Tranmer and Holt (Chapter 20) is very different in its orientation. Here the focus is on estimation of regression parameters, and missingness arises through aggregation, in the sense that the survey data, to a greater or lesser extent, consist not of unit-level measurements but of aggregate measurements for identified groups in the target population. If these group-level measurements are treated as equivalent to individual measurements then analysis is subject to the well-known ecological fallacy (Cleave, Brown and Payne, 1995; King, 1997; Chambers and Steel, 2001). Steel, Tranmer and Holt in fact discuss several survey data scenarios, corresponding to different amounts of aggregated data and different sources for the survey data, including situations where disaggregated data are available from conventional surveys and aggregated data are available from censuses or registers. Integration of survey data from different sources and with different measurement characteristics (in this case aggregated and individual) is fast becoming a realistic practical option for survey data analysts, and the theory outlined in Chapter 20 is a guide to some of the statistical issues that need to be faced when such an analysis is attempted. Section 17.5 provides an introduction to these issues.



17.2. AN EXAMPLE

To illustrate the impact of missingness on survey-data-based inference, we return to the maximum likelihood approach outlined in Section 2.2, and extend the normal theory application described in Example 1 there to take account of survey nonresponse.

Consider the situation where the joint population distribution of the survey variables Y and X and the auxiliary variable Z is multivariate normal, and we wish to estimate the unknown mean vector μ of this distribution. As in Example 1 of Chapter 2 we assume that the covariance matrix of these variables is known and that sampling is noninformative given the population values of Z, so the population vector i_U of sample inclusion indicators {i_t; t = 1, ..., N} can be ignored in inference. In what follows we use the same notation as in that example.

Now suppose that there is complete response on X, but only partial response on Y. Further, this nonresponse is noninformative given Z, so the respondent values of Y, the sample values of X and the population values of Z define the survey data. Let the original sample size be n, with n_obs respondents on Y and n_mis = n − n_obs non-respondents. Sample means for these two groups will be denoted by subscripts of obs and mis below, with the vector of respondent values of Y denoted by y_obs. Let E_obs denote expectation conditional on the survey data (y_obs, x_s, z_U) and let E_s denote expectation conditional on the (unobserved) "complete" survey data (y_s, x_s, z_U). The score function for μ is then

sc_obs(μ) = E_obs{sc_s(μ)},

where z̄_{U−s} denotes the average value of Z over the N − n non-sampled population units. We use the properties of the normal distribution to show that

E_obs(ȳ_mis) = ȳ_obs + β_X(x̄_mis − x̄_obs) + β_Z(z̄_mis − z̄_obs)

where β_X and β_Z are the coefficients of X and Z in the regression of Y on both X and Z. That is,

β_X = (σ_YX σ_ZZ − σ_YZ σ_XZ)/(σ_XX σ_ZZ − σ²_XZ) and β_Z = (σ_YZ σ_XX − σ_YX σ_XZ)/(σ_XX σ_ZZ − σ²_XZ).




The MLE for μ_Y is defined by setting the first component of sc_obs(μ) to zero and solving for μ_Y. This leads to an MLE for this parameter of regression type, adjusting the respondent mean ȳ_obs using the fitted regression of Y on X and Z.

The MLEs for μ_X and μ_Z are clearly unchanged from the regression estimators defined in (2.6). The sample information for μ is obtained by applying (2.5). This leads to an expression for info_obs(μ) involving a matrix D defined in terms of the known covariance matrix, where σ²_{Y·XZ} denotes the conditional variance of Y given X and Z.

Comparing the above expression for info_obs(μ) with that obtained for info_s(μ) in Example 1 in Chapter 2, we can see that the loss in information due to missingness in Y is equal to n_mis⁻¹D⁻¹. In particular, this implies that there is little loss of information if X and Z are good "predictors" of Y (so σ²_{Y·XZ} is close to zero).

17.3. SURVEY INFERENCE UNDER NON-RESPONSE

A key issue that must be confronted when dealing with missing data is the relationship between the missing data indicator, R, the sample selection indicator, I, the survey variable(s), Y, and the auxiliary population variable, Z. The concept of missing at random or MAR is now well established in the statistical literature (Rubin, 1976; Little and Rubin, 1987), and is strongly related to the concept of noninformative nonresponse that was defined in Chapter 2. Essentially, MAR requires that the conditional distribution, given the matrix of population covariates z_U and the (unobserved) "complete" sample data matrix y_s, of the matrix of sample nonresponse indicators r_s generated by the survey nonresponse mechanism, be the same as the conditional distribution of this outcome given z_U and the observed sample data y_obs. Like the noninformative nonresponse assumption, it allows a useful factorisation of the joint distribution of the survey data that allows one to condition on just y_obs, i_U and z_U in inference.

In Chapter 18 Little extends this idea to the operation of the sample selection mechanism, defining selection at random or SAR, where the conditional distribution of the values in i_U given z_U and the population vector y_U is the same as its distribution given z_U and the observed values y_obs. The relationship of this idea to noninformative sampling (see Chapters 1 and 2) is clear, and, as with that definition, we see that SAR essentially allows one to ignore the sample selection mechanism in inference about either the parameters defining the distribution of y_U, or population quantities defined by this matrix. In particular, probability sampling, where i_U is completely determined by z_U, corresponds to SAR, and in this case the selection mechanism can be ignored in inference provided inference conditions on z_U. On the other hand, as Little points out, the MAR assumption is less easy to justify. However, it is very popular, and often serves as a convenient baseline assumption for inference under nonresponse. Under both MAR and SAR it is easy to see that inference depends only on the joint distribution of y_U and z_U, as the example in the previous section demonstrates. Following the development in this example, we therefore now assume that Y in fact is made up of two variables, Y and X, where Y has missingness, but X is complete. Obviously, under unit nonresponse X is "empty".

The prime focus of the discussion in Chapter 18 is prediction of the finite population mean ȳ_U. The minimum mean squared error predictor of this mean is its conditional expectation, given the sample data. Under SAR and MAR, these data correspond to y_obs, x_s and z_U. Let μ(x_t, z_t; θ) denote the conditional expectation of Y given X = x_t and Z = z_t, with the corresponding expectation of Y given Z = z_t denoted ν(z_t; φ). Here θ and φ are unknown parameters. For simplicity we assume distinct population units are mutually independent. This optimal predictor then takes the form

ȳ̂_U = N⁻¹ { Σ_{t∈obs} y_t + Σ_{t∈mis} μ(x_t, z_t; θ) + Σ_{t∈U−s} ν(z_t; φ) }    (17.1)

where obs and mis denote the respondents and non-respondents in the sample and U − s denotes the non-sampled population units. A standard approach is where X is categorical, defining so-called response homogeneity classes. When Z is also categorical, reflecting sampling strata, this predictor is equivalent to a weighted combination of response cell means for Y, where these cells are defined by the joint distribution of X and Z. These means can be calculated from the respondent values of Y, and substituted into (17.1) above, leading to a weighted estimator of ȳ_U, with weights reflecting variation in the distribution of the sample respondents across these cells. This estimator is often referred to as the poststratified estimator, and its use is an example of the weighting approach to dealing with survey nonresponse. Of course, more sophisticated models linking Y, X and Z can be developed, and (17.1) will change as a consequence. In Chapter 18 Little describes a random effects specification that allows "borrowing strength" across response cells when estimating the conditional expectations in (17.1). This decreases the variability of the cell mean estimates but at the potential cost of incurring model misspecification bias due to non-exchangeability of the cell means.
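As an illustration of this weighting approach, the poststratified estimator can be sketched in a few lines of Python; the function name, cells and counts below are hypothetical and purely illustrative:

```python
def poststratified_mean(cells):
    """Poststratified estimate of the population mean of Y.

    `cells` maps each response cell (defined by X and Z) to a pair
    (N_k, y_obs_k): the cell's population count and the list of
    respondent Y-values observed in it.
    """
    N = sum(Nk for Nk, _ in cells.values())
    estimate = 0.0
    for Nk, y_obs in cells.values():
        cell_mean = sum(y_obs) / len(y_obs)  # respondent mean in the cell
        estimate += (Nk / N) * cell_mean     # weight by the population share
    return estimate

# Hypothetical example: two cells with different response rates.
cells = {("x1", "z1"): (600, [10.0, 12.0]),
         ("x2", "z1"): (400, [20.0, 22.0, 24.0])}
print(poststratified_mean(cells))  # population-share weighted mean: 0.6*11 + 0.4*22
```

Note that under MAR within cells only the respondent cell means and the population shares matter; the respondent counts themselves drop out.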

The alternative approach to dealing with nonresponse is imputation. The form of the optimal predictor (17.1) above shows us that substituting the unknown values in y_mis by our best estimate of their conditional expectations given the observed sample values in x_s and the population values in z_U will result in an estimated mean that is identical to the optimal complete data estimator of ȳ_U. One disadvantage with this method of imputation is that the resulting sample distribution of Y (including both real and imputed values) then bears no relation to the actual distribution of such values that would have been observed had we actually measured Y for the entire sample. Consequently other analyses that treat the imputed dataset as if it were actually measured in this way are generally invalid. One way to get around this problem is to adopt what is sometimes called a "regression hot deck" imputation procedure. Here each imputed Y-value is defined to be its estimated conditional expectation given its observed values of X and Z plus an imputed residual drawn via simple random sampling with replacement from the residuals y_t − μ(x_t, z_t; θ̂), t ∈ obs. The imputed Y-values obtained in this way provide a better estimate of the unobserved true distribution of this variable over the sample.
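A minimal Python sketch of the regression hot deck idea, assuming the fitted conditional expectations have already been computed (all names and data here are illustrative):

```python
import random

def regression_hot_deck(y_obs, yhat_obs, yhat_mis, rng):
    """Impute each missing Y as its fitted value plus a residual drawn by
    simple random sampling with replacement from the observed residuals."""
    residuals = [y - f for y, f in zip(y_obs, yhat_obs)]
    return [f + rng.choice(residuals) for f in yhat_mis]

rng = random.Random(1)
y_obs = [10.0, 14.0, 11.0]      # respondent Y-values
yhat_obs = [11.0, 13.0, 11.0]   # fitted values for respondents
yhat_mis = [12.0, 15.0]         # fitted values for nonrespondents
imputed = regression_hot_deck(y_obs, yhat_obs, yhat_mis, rng)
```

Drawing the residual with replacement preserves the spread of the observed residual distribution in the imputed values, which is exactly what fixed-value imputation fails to do.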

By using a standard iterated expectation argument, it is easy to see that this type of single-value imputation introduces an extra component of variability that is typically referred to as imputation variance. Furthermore, the resulting estimator (which treats the observed and imputed data values on an equal footing) is not optimal in the sense that it does not coincide with the minimum mean squared error predictor (17.1) above. This can be overcome by adopting a multiple imputation approach (Rubin, 1987), as Little demonstrates in Chapter 18. Properly implemented, with imputations drawn from the posterior distribution of y_mis, multiple imputation can also lead to consistent variance estimates, and hence valid confidence intervals. However, such a proper implementation is non-trivial, since most commonly used methods of imputation do not correspond to draws from a posterior distribution. An alternative approach that is the subject of current research is so-called controlled or calibrated imputation, where the hot deck sampling method is modified so that the imputation-based estimator is identical to the optimal estimator (17.1). In this case the variance of the imputation estimator is identical to that of this estimator.

17.4. A MODEL-BASED APPROACH TO ESTIMATION UNDER TWO-PHASE SAMPLING

In Chapter 19 Fuller considers estimation of ȳ_U under another type of missingness. This is where values of Y are only available for a "second-phase" probability subsample s₂ of size n₂ drawn from the original "first-phase" probability sample s₁ of size n₁ that is itself drawn from the target population of size N. Clearly, this is equivalent to Y-values being missing for sample units in the probability subsample s₁ − s₂ of size n₁ − n₂ drawn from s₁. As above, we assume that values of another variable X are available for all units in s₁. However, following Fuller's development, we ignore the variable Z, assuming that all probabilistic quantities are conditioned upon it.




Consider the situation where X is bivariate, given by X = (1, D), so x_t = (1, d_t), with population mean E_U(d_t) = δ, in which case E_U(y_t) = β₀ + βδ = μ(δ), where μ(d) = β₀ + βd denotes the expected value of y_t given d_t = d. In this case (17.1) can be expressed

ȳ̂_U = N⁻¹ { Σ_{t∈s₂} y_t + n₁μ(d̄₁) − n₂μ(d̄₂) + (N − n₁)μ(δ) }.

Here μ(δ) and μ(d̄₁) denote the population and first-phase sample averages of the μ(d_t) respectively.

In practice the parameters β₀, β and δ are unknown, and must be estimated from the sample data. Optimal estimation of these parameters can be carried out using, for example, the maximum likelihood methods outlined in Chapter 2. More simply, however, suppose sampling at both phases is noninformative and we assume a model for Y and D where these variables have first- and second-order population moments given by E_U(Y|D) = β₀ + βD, E_U(D) = δ, var_U(Y|D = d) = σ² and var_U(D) = σ²_d. Under this model the least squares estimators of β₀, β and δ are optimal. Note that the least squares estimators of β₀ and β are based on the phase 2 sample data, while that of δ is the phase 1 sample mean of D. When these are substituted into the expression for ȳ̂_U in the preceding paragraph it is easy to see that it reduces to the standard regression estimator

ȳ_reg = ȳ₂ + β̂(d̄₁ − d̄₂)

of the population mean of Y. In fact, the regression estimator corresponds to the optimal predictor (17.1) when Y and D are assumed to have a joint bivariate normal distribution over the population and maximum likelihood estimates of the parameters of this distribution are substituted for their unknown values in this predictor. In contrast, in Chapter 19 Fuller uses sample inclusion probability-weighted estimators for β₀, β and δ, which is equivalent to substituting Horvitz–Thompson pseudo-likelihood estimators of these parameters into the optimal predictor.
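The regression estimator ȳ_reg = ȳ₂ + β̂(d̄₁ − d̄₂) is simple to compute; a small Python sketch with hypothetical phase 1 and phase 2 data (the function name is ours, for illustration only):

```python
def two_phase_regression_estimator(y2, d2, d1):
    """Regression estimator: ybar2 + beta_hat * (dbar1 - dbar2), where
    beta_hat is the least squares slope of Y on D in the phase 2 sample."""
    n2 = len(y2)
    ybar2 = sum(y2) / n2
    dbar2 = sum(d2) / n2
    dbar1 = sum(d1) / len(d1)
    beta = (sum((d - dbar2) * (y - ybar2) for d, y in zip(d2, y2))
            / sum((d - dbar2) ** 2 for d in d2))
    return ybar2 + beta * (dbar1 - dbar2)

# Hypothetical data: phase 1 measures D everywhere; phase 2 also measures Y.
d1 = [1.0, 2.0, 3.0, 4.0]   # phase 1 sample (D observed)
d2 = [1.0, 2.0, 3.0]        # phase 2 subsample (D and Y observed)
y2 = [2.0, 4.0, 6.0]
print(two_phase_regression_estimator(y2, d2, d1))  # 4.0 + 2.0 * 0.5 = 5.0
```

Because Y is exactly 2D in this toy example, the estimator simply moves the phase 2 mean to match the better-estimated phase 1 mean of D.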

Variance estimation for this regression estimator can be carried out in a number of ways. In Chapter 19 Fuller describes a design-based approach to this problem. Here we sketch a corresponding model-based development. From this perspective, we aim to estimate the prediction variance var_U(ȳ_reg − ȳ_U) of this estimator. When the first-phase sample size n₁ is small relative to the population size N, this prediction variance is approximately the model-based variance of the regression estimator itself. Noting that the regression estimator can be written ȳ_reg = β̂₀ + β̂d̄₁, its model-based variance can be estimated using the Taylor series approximation

var_U(ȳ_reg) ≈ (1, d̄₁, β̂) info_s⁻¹(β₀, β, δ) (1, d̄₁, β̂)′

where info_s⁻¹(β₀, β, δ) is the inverse of the observed sample information matrix for the parameters β₀, β and δ. This information matrix can be calculated using




the theory outlined in Section 2.2. Equivalently, since ȳ_reg is in fact the maximum likelihood estimator of the marginal expected value of Y, we can use the equivalence of two-phase sampling and item nonresponse to apply a simplified version of the analysis described in the example in Section 17.2 (dropping the variable Z) to obtain, for θ the vector consisting of the marginal mean of Y and δ,

info_s(θ) = n₂V⁻¹ + (n₁ − n₂)W

where V denotes the variance–covariance matrix of Y and D,

V = [ β²σ²_d + σ²   βσ²_d ; βσ²_d   σ²_d ]

and

W = [ 0   0 ; 0   σ_d⁻² ].

The parameters σ² and σ²_d can be estimated from the phase 2 residuals y_t − β̂₀ − β̂d_t and phase 1 residuals d_t − d̄₁ respectively, and the model-based variance of the regression estimator then estimated by the [1,1] component of the inverse of the sample information matrix info_s(θ) above, with unknown parameters substituted by estimates.

Given the equivalence of two-phase estimation and estimation under item nonresponse, it is not surprising that a more general imputation-based approach to two-phase estimation can be developed. In particular, substituting the missing values of Y by estimates of their conditional expectations given their observed values of D will result in a phase 1 mean of both observed and imputed Y-values that is identical to the regression estimator.

As noted in Section 17.3, a disadvantage with this "fixed value" imputation approach is that the phase 1 sample distribution of Y-values (both real and imputed) in our data will then bear no relation to the actual distribution of such values that would have been observed had we actually measured Y for the entire phase 1 sample. This suggests we adapt the regression hot deck imputation idea to two-phase estimation.

To be more specific, for t ∈ s₁, put ỹ_t equal to y_t if t ∈ s₂ and equal to ŷ_t + r*_t otherwise. Here r*_t is drawn at random, and with replacement, from the phase 2 residuals r_u = y_u − β̂₀ − β̂d_u, u ∈ s₂. The imputed mean of Y over the phase 1 sample is then

ȳ_imp = n₁⁻¹ Σ_{t∈s₁} ỹ_t = ȳ_reg + n₁⁻¹ Σ_{t∈s₂} ω_t r_t    (17.2)

where ω_t denotes the number of times unit t ∈ s₂ is a "donor" in this process. Since the residuals are selected using simple random sampling it is easy to see that the values {ω_t, t ∈ s₂} correspond to a realisation of a multinomial random variable with n₁ − n₂ trials and success probability vector with all components equal to 1/n₂. Consequently, when we use a subscript of s to denote conditioning on the sample data,

E_s(ȳ_imp) = ȳ_reg + n₁⁻¹(n₁ − n₂)n₂⁻¹ Σ_{t∈s₂} r_t

and

var_s(ȳ_imp) = n₁⁻²(n₁ − n₂)n₂⁻¹ Σ_{t∈s₂} r_t².

By construction, the phase 2 residuals sum to zero over s₂, and so the imputed phase 1 mean and the regression estimator have the same expectation. However, the variance of ȳ_imp contains an extra component of variability proportional to the expected value of the sum of the squared phase 2 residuals. This is the imputation variance component of the "total" variability of ȳ_imp. Note that the model-based variance of this imputed mean is not estimated by calculating the variance of the ỹ_t over s₁ and dividing by n₁. From the above expression for the model-based variance of ȳ_imp we can see that the appropriate estimator of this variance is

v̂(ȳ_imp) = v̂(ȳ_reg) + n₁⁻²(n₁ − n₂)σ̂²

where v̂(ȳ_reg) denotes the estimator of the model-based variance of the regression estimator that was developed earlier, and σ̂² is the usual estimator of the error variance for the linear regression of Y on D in the phase 2 sample.
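The extra imputation variability is easy to exhibit by simulation. In the Python sketch below (hypothetical data, with residuals summing to zero by construction), repeatedly redrawing the imputed residuals shows the hot deck mean varying around the fixed-value imputed mean rather than reproducing it:

```python
import random
import statistics

def imputed_mean(y2, yhat_mis, residuals, rng):
    """Phase 1 mean after regression hot deck imputation: observed Y-values
    plus (fitted value + randomly drawn phase 2 residual) for the rest."""
    draws = [rng.choice(residuals) for _ in yhat_mis]
    n1 = len(y2) + len(yhat_mis)
    return (sum(y2) + sum(f + e for f, e in zip(yhat_mis, draws))) / n1

rng = random.Random(0)
y2 = [10.0, 12.0, 14.0]        # phase 2 units (Y observed)
yhat_mis = [11.0, 13.0]        # fitted values for phase 1 units lacking Y
residuals = [-1.0, 0.0, 1.0]   # phase 2 residuals; they sum to zero
fixed = (sum(y2) + sum(yhat_mis)) / 5.0   # fixed-value imputation result
means = [imputed_mean(y2, yhat_mis, residuals, rng) for _ in range(20000)]
# The hot deck mean is unbiased for the fixed-value (regression-type) mean,
# but carries positive imputation variance around it.
```

Averaging the simulated means recovers the fixed-value result, while their spread is the imputation variance component discussed above.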

Controlled or calibrated imputation is where the imputations are still random, but now are also subject to the restriction that the estimator based on the imputed data recovers the efficient estimator based on the observed sample data. In Chapter 19 Fuller applies this idea in the context of two-phase sampling. Here the efficient estimator is the regression estimator, and the regression hot deck imputation scheme, rather than being based on simple random sampling, is such that ȳ_imp and the regression estimator ȳ_reg coincide. From (17.2) one can easily see that a sufficient condition for this to be the case is where the sum of the imputed residuals is zero. Obviously, under this type of imputation the variance of the imputed mean and the variance of the regression estimator are the same, and no efficiency is lost by adopting an imputation-based approach to estimation.
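One naive way to impose this sufficient condition, recentring the drawn residuals so that they sum to zero, can be sketched as follows (Fuller's scheme in Chapter 19 is more refined; this fragment only shows the effect of the constraint, and the function name is ours):

```python
import random

def calibrated_residual_draws(residuals, m, rng):
    """Draw m residuals by SRS with replacement, then shift them so the
    drawn set sums to zero (the calibration constraint)."""
    draws = [rng.choice(residuals) for _ in range(m)]
    shift = sum(draws) / m
    return [d - shift for d in draws]

rng = random.Random(0)
draws = calibrated_residual_draws([-1.5, 0.5, 1.0], 4, rng)
# With the drawn residuals summing to zero, (17.2) gives ybar_imp = ybar_reg
# exactly, so no imputation variance is added.
```

The recentring changes the marginal distribution of the drawn residuals slightly, which is why more careful calibrated schemes are of interest.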




17.5. COMBINING SURVEY DATA AND AGGREGATE DATA IN ANALYSIS

Many populations that are of interest in the social sciences have a natural "group" structure, reflecting social, political, demographic, geographic and economic differences between different population units in the groups. Probably the most common example is hierarchical grouping, with individuals grouped into households that are in turn grouped into neighbourhoods and so on. A key characteristic of the units making up such groups is that they tend to be more like other units from the same group than units from other groups. This intra-group correlation is typically reflected in the data collected in multi-stage social surveys that use the groups as sampling units. Analysis of these data is then multilevel in character, incorporating the group structure into the models fitted to these data. However, when the groups themselves are defined for administrative or economic reasons, as when dwellings are grouped into census enumeration districts in order to facilitate data collection, it seems inappropriate to define target parameters that effectively condition on these rather arbitrary groupings. In these cases there is a strong argument for "averaging" in some way over the groups when defining these parameters.

In Chapter 20, Steel, Tranmer and Holt adopt this approach. More formally, they define a multivariate survey variable W for each unit in the population, and an indicator variable C that indicates group membership for a unit. As usual, auxiliary information is defined by the vector-valued variable Z. Their population model then conditions on the group structure defined by C, in the sense that they use a linear mixed model specification to characterise group membership, with realised value w_t for unit t in group g in the population satisfying

w_t = μ_w + υ_g + ε_t.    (17.3)

Here υ_g is a random "group effect" shared by all units in group g and ε_t is a unit-specific random effect. Both these random effects are assumed to have zero expectations, constant covariance matrices Σ⁽²⁾_ww and Σ⁽¹⁾_ww respectively, and to be uncorrelated between different groups and different units. The aim is to estimate the unconditional covariance matrix of W in the population, given by

Σ_ww = Σ⁽¹⁾_ww + Σ⁽²⁾_ww.    (17.4)

If W corresponds to a bivariate variable with components Y and X then the marginal regression of Y on X in the population is defined by the components of Σ_ww, so this regression can be estimated by substitution once an estimate of Σ_ww is available.
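The decomposition in (17.4) is the covariance analogue of the elementary one-way analysis of variance identity, under which total variation splits exactly into between-group and within-group components. A small Python sketch with hypothetical data (the function is ours, purely illustrative):

```python
from collections import defaultdict

def anova_decomposition(values, groups):
    """Split the total sum of squares of one variable into between-group
    and within-group components (one-way decomposition)."""
    by_g = defaultdict(list)
    for v, g in zip(values, groups):
        by_g[g].append(v)
    mean = sum(values) / len(values)
    total = sum((v - mean) ** 2 for v in values)
    between = sum(len(vs) * (sum(vs) / len(vs) - mean) ** 2
                  for vs in by_g.values())
    within = sum(sum((v - sum(vs) / len(vs)) ** 2 for v in vs)
                 for vs in by_g.values())
    return total, between, within

vals = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
grps = ["a", "a", "a", "b", "b", "b"]
total, between, within = anova_decomposition(vals, grps)
print(total, between + within)  # 58.0 58.0
```

The multilevel model (17.3) estimates the covariance-matrix version of these two components directly, rather than via sums of squares.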

Steel, Tranmer and Holt describe a number of sample data scenarios, ranging from the rather straightforward one where unit-level data on W and Z, along with group identifiers, are available, to the more challenging situation where the survey data just consist of group-level sample averages of W and Z. In the former case Σ_ww can be estimated via substitution of estimates for the parameters of the right hand side of (17.4) obtained by using a multilevel modelling package to fit (17.3) to these data, while in the latter case, estimation of Σ_ww from these group averages generally leads to aggregation bias (i.e. the ecological fallacy).

In an effort to minimise this bias, these authors consider an "augmented" data situation where a limited amount of unit-level values of Z are available in addition to the group averages. This situation, for example, occurs where a unit-level sample file, without group indicators, is released for selected census variables (including Z) and at the same time comprehensive group-level average information is available for a much wider set of census outputs (covering both W and Z). In Section 20.5 they note that an estimate Σ̂_zz of the population unit-level covariance of Z obtained from the unit-level data can be combined, using (17.4), with estimates of the remaining terms calculated from the group-level average data set to obtain an adjusted estimate of Σ_ww. However, since both these group-level estimators are biased for their unit-level values, with, for example, the bias of the group-level estimator of the covariance of Z essentially increasing with the product of the average group size and the corresponding group-level component, it is clear that unless Z substantially "explains" the intra-group correlations in the population (i.e. the group-level component of the residual covariation is effectively zero) or the sample sizes underpinning the group averages are very small, this adjusted estimator of Σ_ww will still be biased, though its bias will be somewhat smaller than the naive group-level estimator of Σ_ww that ignores the information in Z.

Before closing, it is interesting to point out that the entire analysis so far has been under what Steel (1985) calls a "group dependence" model, in the sense that Σ_ww, the parameter of interest, has been defined as the sum of corresponding group-level and unit-level parameters. Recollect that our target was the unit-level regression of Y on X in the population, and this in turn has been defined in terms of the components of Σ_ww. Consequently this regression parameter depends on the grouping in the population. However, as mentioned earlier, such grouping structure may well be of no interest at all, in the sense that if population data on Y and X were available, we would define this parameter and estimate its value without paying any attention to the grouping structure of the population. This would be the case if the groups corresponded to arbitrary divisions of the population. Modifications to the theory described in Chapter 20 that start from this rather different viewpoint remain a topic of active research.




CHAPTER 18

Bayesian Methods for Unit and Item Nonresponse

Roderick J. Little

18.1. INTRODUCTION AND MODELING FRAMEWORK

Chapter 4 discussed Bayesian methods of inference for sample surveys. In this chapter I apply this perspective to the analysis of survey data subject to unit and item nonresponse. I first extend the Bayesian modeling framework of Chapter 4 to allow for missing data, by adding a model for the missing-data mechanism. In Section 18.2, I apply this framework to unit nonresponse, yielding Bayesian extensions of weighting methods. In Section 18.3, I discuss item nonresponse, which can be handled by the technique of multiple imputation. Other topics are described in the concluding section.

I begin by recapping the notation of Chapter 4:

y_U = (y_tj), y_tj = value of survey variable j for unit t, j = 1, ..., J; t ∈ U = {1, ..., N}
Q = Q(y_U) = finite population quantity
i_U = (i_1, ..., i_N) = sample inclusion indicators; i_t = 1 if unit t included, 0 otherwise
y_U = (y_inc, y_exc), y_inc = included part of y_U, y_exc = excluded part of y_U
z_U = fully observed covariates, design variables.

Without loss of generality the sampled units are labeled t = 1, ..., n, so that y_inc is an (n × J) matrix. Now suppose that a subset of the values in y_inc is missing because of unit or item nonresponse. Write y_inc = (y_obs, y_mis), y_obs = observed part, y_mis = missing part, and, as in Sections 2.2 and 17.3, define the (n × J) missing-data indicator matrix r_s = (r_tj), where

r_tj = 1 if y_tj observed; r_tj = 0 if y_tj missing.




Following the superpopulation model of Little (1982) and the Bayesian formulation in Rubin (1987) and Gelman et al. (1995), the joint distribution of y_U, i_U and r_s can then be modeled as

f(y_U, i_U, r_s | z_U, θ, ψ, φ) = f(y_U | z_U, θ) f(i_U | z_U, y_U, ψ) f(r_s | z_U, y_inc, i_U, φ).

The likelihood based on the observed data (z_U, y_obs, i_U, r_s) is

L(θ, ψ, φ | z_U, y_obs, i_U, r_s) ∝ f(y_obs, i_U, r_s | z_U, θ, ψ, φ) = ∫∫ f(y_U, i_U, r_s | z_U, θ, ψ, φ) dy_exc dy_mis.

Adding a prior distribution p(θ, ψ, φ | z_U) for the parameters, the posterior distribution of the parameters given the data is

p(θ, ψ, φ | z_U, y_obs, i_U, r_s) ∝ p(θ, ψ, φ | z_U) L(θ, ψ, φ | z_U, y_obs, i_U, r_s).

Conditions for ignoring the selection mechanism in the presence of nonresponse are:

Selection at random (SAR): f(i_U | z_U, y_U, ψ) = f(i_U | z_U, y_obs, ψ) for all y_exc, y_mis.

Bayesian distinctness: p(θ, φ, ψ | z_U) = p(θ, φ | z_U) p(ψ | z_U),

which are again satisfied for probability sampling designs. If the selection mechanism is ignorable, conditions for ignoring the missing-data mechanism are:

Missing at random (MAR): f(r_s | z_U, y_inc, i_U, φ) = f(r_s | z_U, y_obs, i_U, φ) for all y_mis.

Bayesian distinctness: p(θ, φ | z_U) = p(θ | z_U) p(φ | z_U).

If the selection and missing-data mechanisms are ignorable, then

p(θ | z_U, y_obs, i_U, r_s) = p(θ | z_U, y_obs) and p(Q(y_U) | z_U, y_obs, i_U, r_s) = p(Q(y_U) | z_U, y_obs),

so the models for the selection and missing-data mechanisms do not affect inferences about θ and Q(y_U).

The selection mechanism can be assumed ignorable for probability sample designs, provided the relevant design information is recorded and included in the analysis. In contrast, the assumption that the missing-data mechanism is ignorable is much stronger in many applications, since the mechanism is not in the control of the sampler, and nonresponse may well be related to unobserved values y_mis and hence not MAR. This fact notwithstanding, methods for handling nonresponse in surveys nearly all assume that the mechanism is ignorable, and the methods in the next two sections make this assumption. Methods for nonignorable mechanisms are discussed in Section 18.4.




18.2. ADJUSTMENT-CELL MODELS FOR UNIT NONRESPONSE

18.2.1. Weighting methods

The term 'unit nonresponse' is used to describe the case where all the items in the survey are missing, because of noncontact, refusal to respond, or some other reason. The standard survey approach is to discard the nonrespondents and analyze the respondents weighted by the inverse of the estimated probability of response. A common form of weighting is to form adjustment cells based on variables X recorded for respondents and nonrespondents, and then weight respondents in adjustment cell k by the inverse of the response rate in that cell; more generally, the inverse of the response rate is multiplied by the sampling weight to obtain a weight that allows for differential probabilities of sampling and nonresponse. This weighting approach is simple and corrects for nonresponse bias if the nonresponse mechanism is MAR; that is, if respondents are a random sample of sampled cases within each adjustment cell.

Weighting for unit nonresponse is generally viewed as a design-based approach, but it can also be derived from the Bayesian perspective via fixed effects models for the survey outcomes within adjustment cells defined by X. The following example provides the main idea.

Example 1. Unit nonresponse adjustments for the mean from a simple random sample

Suppose a simple random sample of size n is selected from a population of size N, and r units respond. Let X be a categorical variable defining K adjustment cells within which the nonresponse mechanism is assumed MAR. In cell k, let $N_k$ be the population size and $P = (P_1, \ldots, P_K)$, where $P_k = N_k/N$ is the population proportion. Also let $n_k$ and $p_k = n_k/n$ denote the number and proportion of sampled units, and $r_k$ the number of responding units, in cell k. Let Y denote a survey variable with population mean $\bar{y}_{Uk}$ in cell k, and suppose interest concerns the overall mean
$$\bar{y}_U = \sum_{k=1}^{K} P_k \bar{y}_{Uk}.$$

Suppose initially that the adjustment-cell proportions P are known, as when X defines population post-strata. Let $M_Y$ denote a full probability model for the values of Y. Then $\bar{y}_U$ can be estimated by the posterior mean given P, the survey data, and any known covariates, denoted collectively as D:
$$E(\bar{y}_U \mid P, D, M_Y) = \sum_{k=1}^{K} P_k E(\bar{y}_{Uk} \mid P, D, M_Y) = \sum_{k=1}^{K} P_k \left\{ f_k \bar{y}_k + (1 - f_k)\, E(\tilde{y}_k \mid P, D, M_Y) \right\}, \qquad (18.1)$$



where $\bar{y}_k$ is the respondent mean in cell k, $f_k = r_k/N_k$, and $\tilde{y}_k$ is the mean of the nonsampled and nonresponding units in cell k; $E(\tilde{y}_k \mid P, D, M_Y)$ provides a prediction of the mean of the nonincluded values. A measure of the precision of the estimate is the posterior variance
$$\mathrm{Var}(\bar{y}_U \mid P, D, M_Y) = \sum_{k=1}^{K} P_k^2 (1-f_k)^2\, \mathrm{Var}(\tilde{y}_k \mid P, D, M_Y) + \sum_{k \neq l} P_k P_l (1-f_k)(1-f_l)\, \mathrm{Cov}(\tilde{y}_k, \tilde{y}_l \mid P, D, M_Y). \qquad (18.2)$$

Now consider a fixed effects model for Y with noninformative constant prior for the means within the adjustment cells:
$$[\, y_{tk} \mid z_U, \{\mu_k, \sigma_k^2\} \,] \sim_{\rm ind} N(\mu_k, \sigma_k^2), \qquad [\{\mu_k, \log \sigma_k^2\}] \propto \mathrm{const.} \qquad (18.3)$$

The posterior mean for this model given P is
$$E(\bar{y}_U \mid P, D) = \bar{y}_{ps} = \sum_{k=1}^{K} P_k \bar{y}_k, \qquad (18.4)$$
the post-stratified respondent mean, which effectively weights the respondents in cell k by $r P_k / r_k$. The posterior variance given $\{\sigma_k^2\}$ is
$$\mathrm{Var}(\bar{y}_U \mid \{\sigma_k^2\}, P, D) = \sum_{k=1}^{K} P_k^2 (1-f_k)\, \sigma_k^2 / r_k,$$
which is equivalent to the version of the design-based variance that conditions on the respondent counts $\{r_k\}$ (Holt and Smith, 1979). When the variances $\{\sigma_k^2\}$ are not known, Bayes' inference under a noninformative prior yields t-type corrections that provide better calibrated inferences in small samples (Little, 1993a).

If P is unknown, let $M_P$ denote a model for the cell counts and write $M = (M_Y, M_P)$. Then $\bar{y}_U$ is estimated by the posterior mean
$$E(\bar{y}_U \mid D, M) = E(\bar{y}_{ps} \mid D, M), \qquad (18.5)$$
and the precision by the posterior variance
$$\mathrm{Var}(\bar{y}_U \mid D, M) = E\big(\mathrm{Var}(\bar{y}_U \mid P, D, M)\big) + \mathrm{Var}(\bar{y}_{ps} \mid D, M), \qquad (18.6)$$
where the second term represents the increase in variance from estimation of the cell proportions. The Bayesian formalism decomposes the problem into two elements that are mixed in the design-based approach, namely predicting the nonincluded values of Y in each cell, and estimating the cell proportions when they are not known. Random sampling implies the following model $M_P$ for the cell counts conditional on n:
$$\{n_k\} \sim \mathrm{Mnml}(n;\, P), \qquad (18.7)$$



where $\mathrm{Mnml}(n;\, P)$ denotes the multinomial distribution with index n and probabilities P. With a flat prior on $\log P$, the posterior mean of $P_k$ is $n_k/n$ and the posterior mean of $\bar{y}_U$ is
$$E(\bar{y}_U \mid D, M) = \bar{y}_{wc} = \sum_{k=1}^{K} p_k \bar{y}_k, \qquad (18.8)$$
which is the standard weighting-class estimator, effectively weighting respondents in cell k by $(r/n)(n_k/r_k)$, where $n_k/r_k$ is the inverse of the response rate in that cell (Oh and Scheuren, 1983). The posterior variance is
$$\mathrm{Var}(\bar{y}_U \mid D, M) = \sum_{k=1}^{K} E(p_k^2 \mid D, M)(1-f_k)\,\sigma_k^2/r_k + \mathrm{Var}\Big(\sum_{k=1}^{K} p_k \bar{y}_k \,\Big|\, D, M\Big) \approx \sum_{k=1}^{K} \big[\, p_k^2 + p_k(1-p_k)/(n+1) \,\big](1-f_k)\,\sigma_k^2/r_k + \sum_{k=1}^{K} p_k (\bar{y}_k - \bar{y}_{wc})^2/(n+1), \qquad (18.9)$$
which can be compared to the design-based estimate of mean squared error in Oh and Scheuren (1983).
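To make the two estimators concrete, here is a minimal numerical sketch (all counts and means hypothetical; `P` plays the role of the known post-stratum proportions in (18.4), and its posterior mean n_k/n replaces it in the weighting-class form (18.8)):

```python
import numpy as np

# Hypothetical adjustment-cell summaries for K = 3 cells.
P = np.array([0.5, 0.3, 0.2])          # known post-stratum proportions P_k
n_k = np.array([40, 35, 25])           # sampled counts per cell (n = 100)
r_k = np.array([30, 20, 15])           # respondent counts per cell
ybar_k = np.array([10.0, 12.0, 9.0])   # respondent means per cell

# Post-stratified respondent mean (18.4): sum_k P_k * ybar_k.
y_ps = np.sum(P * ybar_k)

# Weighting-class estimator (18.8): P_k replaced by its posterior mean n_k/n.
p_k = n_k / n_k.sum()
y_wc = np.sum(p_k * ybar_k)

# The same estimate, written as a weighted respondent mean with weight
# n_k / r_k (the inverse response rate) attached to each respondent in cell k.
w_k = n_k / r_k
y_wc_alt = np.sum(r_k * w_k * ybar_k) / np.sum(r_k * w_k)

print(y_ps, y_wc, y_wc_alt)
```

The two forms of the weighting-class estimator agree exactly; the post-stratified and weighting-class estimators differ only through the use of n_k/n in place of P_k.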

Example 2. Unit nonresponse adjustments for a stratified mean

For a more complex example that includes both sampling and nonresponse weights, let Z denote a stratifying variable, and suppose a stratified random sample of size $n_j$ is selected from stratum $Z = j$, with selection probability $\pi_j = n_j/N_j$. As before, let X be a categorical variable defining K nonresponse adjustment cells, and in stratum j, cell k, let $N_{jk}$ be the population size and $P = (P_{11}, \ldots, P_{JK})$, where $P_{jk} = N_{jk}/N$ is the population proportion. Also, let $n_{jk}$ and $p_{jk} = n_{jk}/n$ denote the number and proportion of sampled units and $r_{jk}$ the number of responding units in cell jk. Let Y denote a survey variable with population mean $\bar{y}_{Ujk}$ in cell jk, and suppose interest concerns the overall mean
$$\bar{y}_U = \sum_{j=1}^{J} \sum_{k=1}^{K} P_{jk} \bar{y}_{Ujk}.$$

As in the previous example, we consider a fixed effects model for Y with noninformative constant prior for the means within the adjustment cells:
$$[\, y_{tjk} \mid z_U, \{\mu_{jk}, \sigma_{jk}^2\} \,] \sim_{\rm ind} N(\mu_{jk}, \sigma_{jk}^2), \qquad [\{\mu_{jk}, \log \sigma_{jk}^2\}] \propto \mathrm{const.} \qquad (18.10)$$

The posterior mean for this model given P is
$$E(\bar{y}_U \mid P, D, M_Y) = \bar{y}_{ps} = \sum_{j=1}^{J} \sum_{k=1}^{K} P_{jk} \bar{y}_{jk}, \qquad (18.11)$$



the post-stratified respondent mean, which effectively weights the respondents in cell jk by $r P_{jk}/r_{jk}$. If X is based on sample information, random sampling within the strata implies the following model $M_P$ for the adjustment-cell counts conditional on $n_j$:
$$\{n_{jk}\} \sim \mathrm{Mnml}(n_j;\, Q_j), \qquad Q_j = (Q_{j1}, \ldots, Q_{jK}), \quad Q_{jk} = P_{jk}/P_{j+}, \qquad (18.12)$$
where $P_{j+} = N_j/N$. With a flat prior on $\log Q_j$, the posterior mean of $Q_{jk}$ is $n_{jk}/n_j$ and the posterior mean of $\bar{y}_U$ is
$$E(\bar{y}_U \mid D, M) = \sum_{j=1}^{J} \sum_{k=1}^{K} P_{j+}\,(n_{jk}/n_j)\, \bar{y}_{jk} = \sum_{j=1}^{J} \sum_{k=1}^{K} w_{jk}\, r_{jk}\, \bar{y}_{jk} \Big/ \sum_{j=1}^{J} \sum_{k=1}^{K} w_{jk}\, r_{jk}, \qquad (18.13)$$
where respondents in cell jk are weighted by $w_{jk} = (N_j/n_j)(n_{jk}/r_{jk})(r/N)$, which is proportional to the inverse of the product of the sampling rate in stratum j and the response rate in cell jk.

Note that (18.13) results from cross-classifying the data by the strata as well as by the adjustment cells when forming nonresponse weights. An alternative weighting-class estimator assumes that nonresponse is independent of the stratifying variable Z within adjustment cells, and estimates the response rate in cell jk as $r_{+k}/n_{+k}$, where $r_{+k} = \sum_j r_{jk}$ and $n_{+k} = \sum_j n_{jk}$. This yields
$$\bar{y}_{wc2} = \sum_{j=1}^{J} \sum_{k=1}^{K} \tilde{w}_{jk}\, r_{jk}\, \bar{y}_{jk} \Big/ \sum_{j=1}^{J} \sum_{k=1}^{K} \tilde{w}_{jk}\, r_{jk}, \qquad \tilde{w}_{jk} = (N_j/n_j)(n_{+k}/r_{+k})(r/N). \qquad (18.14)$$

This estimator is approximately unbiased because the expected value of $\tilde{w}_{jk}$ equals $w_{jk}$. Some authors advocate a variant of (18.14) that includes sample weights in the estimate of the response rate, that is,
$$\hat{\phi}_k = \sum_{j=1}^{J} (N_j/n_j)\, r_{jk} \Big/ \sum_{j=1}^{J} (N_j/n_j)\, n_{jk}.$$

Both these estimates have the practical advantage of assigning zero weight to cells where there are no respondents ($r_{jk} = 0$) and hence where the respondent mean is not defined. However, assigning zero weight seems unjustified when there are sample cases in these cells, that is, $n_{jk} > 0$, and it seems hard to justify the implied predictions for the means $\bar{y}_{Ujk}$ in those cells; a modeling approach to this problem is discussed in the next section.
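A sketch of the cross-classified weights and the resulting estimator (18.13), with hypothetical stratum-by-cell counts (J = 2 strata, K = 2 cells):

```python
import numpy as np

# Hypothetical stratum population and sample sizes.
N_j = np.array([1000.0, 500.0])              # stratum population sizes N_j
n_j = np.array([100.0, 100.0])               # stratum sample sizes n_j
n_jk = np.array([[60.0, 40.0], [30.0, 70.0]])    # sampled counts in cell (j, k)
r_jk = np.array([[45.0, 30.0], [24.0, 42.0]])    # respondent counts in cell (j, k)
ybar_jk = np.array([[10.0, 14.0], [9.0, 12.0]])  # respondent means in cell (j, k)

N, r = N_j.sum(), r_jk.sum()

# w_jk = (N_j / n_j) * (n_jk / r_jk) * (r / N): inverse sampling rate times
# inverse response rate, scaled by r / N, as in (18.13).
w_jk = (N_j / n_j)[:, None] * (n_jk / r_jk) * (r / N)

# Normalized weighted respondent mean (18.13).
y_hat = np.sum(w_jk * r_jk * ybar_jk) / np.sum(w_jk * r_jk)
print(y_hat)
```

Note that `w_jk * r_jk` reduces to `(N_j/n_j) * n_jk * (r/N)`, so the estimate is identical to the form $\sum_j\sum_k P_{j+}(n_{jk}/n_j)\bar{y}_{jk}$.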

18.2.2. Random effects models for the weight strata

When the respondent sample size in a cell is zero, the weighted estimates (18.4), (18.8), (18.11), or (18.13) are undefined, and when it is small the weight assigned to respondents can be very large, yielding estimates with poor precision. The classical randomization approach to this problem is to modify the



weights, by pooling cells or by trimming (Potter, 1990). From the Bayesian perspective, the problem lies in the fixed effects model specification, which results in the mean in each cell being estimated by the cell respondent mean, with no borrowing of strength across cells. The solution lies in replacing the flat prior for the cell means by a proper prior that allows borrowing of strength across cells.

In the simple random sampling setting, suppose the fixed effects model is replaced by
$$\begin{aligned}
&[\, y_{tk} \mid \{\mu_k, \sigma_k^2\} \,] \sim_{\rm ind} N(\mu_k, a_{1k}\sigma_k^2) \\
&[\, \mu_k \mid P, C, \beta, \tau^2 \,] \sim_{\rm ind} N(g_k, a_{2k}\tau^2), \quad g_k = g(P_k, C_k;\, \beta) \\
&[\, \log \sigma_k^2 \mid \sigma^2, \phi^2 \,] \sim_{\rm ind} N(\log \sigma^2, \phi^2) \\
&[\, \beta, \sigma^2, \phi^2, \tau^2 \,] \sim \pi,
\end{aligned} \qquad (18.15)$$

where:

(a) $g(P_k, C_k;\, \beta)$ is a mean function of the proportion $P_k$ and covariates $C_k$, indexed by unknown parameters $\beta$;

(b) $\pi$ is a prior distribution for the hyperparameters $\beta$, $\sigma^2$, $\tau^2$, $\phi^2$, typically flat or weakly informative; and

(c) heterogeneous variance is modeled via known constants $(a_{1k}, a_{2k})$, $k = 1, \ldots, K$, and the hyperparameter $\phi^2$, which is potentially unknown. The full variance structure combines models from various settings as special cases. Homogeneous variance is obtained by setting $a_{1k} = 1$ for all k and $\phi^2 = 0$. In previous work I and others have set $a_{2k}$ equal to one, but alternative choices such as $a_{2k} = 1/P_k$ might also be considered.

Letting $\tau^2 \to \infty$ results in the fixed effects model for the cell means:
$$\begin{aligned}
&[\, y_{tk} \mid \{\mu_k, \sigma_k^2\} \,] \sim_{\rm ind} N(\mu_k, a_{1k}\sigma_k^2) \\
&[\, \log \sigma_k^2 \mid \sigma^2, \phi^2 \,] \sim_{\rm ind} N(\log \sigma^2, \phi^2) \\
&[\, \{\mu_k\}, \sigma^2, \phi^2 \,] = [\{\mu_k\}]\,[\sigma^2, \phi^2], \quad [\{\mu_k\}] \propto \mathrm{const}, \quad [\sigma^2, \phi^2] \sim \pi,
\end{aligned} \qquad (18.16)$$
which leads to the weighted estimators (18.4) or (18.8), depending on whether or not P is known. At the other extreme, setting $\tau^2 = 0$ in (18.15) yields the direct regression form of the model:
$$[\, y_{tk} \mid P_k, C_k, \beta, \sigma_k^2 \,] \sim_{\rm ind} N(g_k, a_{1k}\sigma_k^2), \quad g_k = g(P_k, C_k;\, \beta), \qquad [\, \beta, \log \sigma^2 \,] \propto \mathrm{const.} \qquad (18.17)$$

The posterior mean for this model for known P is the regression estimator:
$$E(\bar{y}_U \mid P, C, D, M) = \sum_{k=1}^{K} P_k \left\{ f_k \bar{y}_k + (1 - f_k)\, \hat{g}_k \right\}, \qquad (18.18)$$
where $\hat{g}_k$ is the posterior mean of $g_k$,



which is potentially more efficient than (18.4) but is not necessarily design consistent, and is potentially vulnerable to bias from misspecification of the model (18.17).

For model (18.15) with $0 < \tau^2 < \infty$, the posterior mean conditional on $\{P, D, \beta, \{\sigma_k^2\}, \tau^2\}$ is
$$E(\bar{y}_U \mid P, D, \beta, \{\sigma_k^2\}, \tau^2) = \sum_{k=1}^{K} P_k \big( f_k \bar{y}_k + (1 - f_k)(\lambda_k \bar{y}_k + (1 - \lambda_k)\hat{g}_k) \big), \quad \text{where } \lambda_k = \frac{a_{2k}\tau^2}{a_{2k}\tau^2 + \sigma_k^2/r_k}. \qquad (18.19)$$

The estimator (18.19) is not generally design unbiased, but it is design consistent, since as $r_k \to \infty$, $\lambda_k \to 1$ and (18.19) tends to the design-unbiased estimator (18.4). Design consistency is retained when the hyperparameters are replaced by consistent empirical Bayes estimates. Alternatively, fully Bayes inference can be applied based on a prior distribution for the hyperparameters, yielding "Bayes empirical Bayes" inferences that are also design consistent. These inferences take into account uncertainty in the hyperparameters, at the expense of an increase in computational complexity. For a comparison of these approaches in the setting of small-area estimation from surveys, see Singh, Stukel and Pfeffermann (1998).

The estimator (18.19) can be viewed as 'model assisted' in survey research terminology (for example, Cassel, Särndal and Wretman, 1977), since it represents a compromise between the 'design-based' estimator $\sum_k P_k \bar{y}_k$ and the 'model-based' estimator $\sum_k P_k \hat{g}_k$. However, rather than using a model to modify a design-based estimator, I advocate inference based directly on models that have robust design-based properties (Little, 1983b, 1991). That is, design information is used to improve the model, rather than using model information to improve design-based estimation. Conceptually, I prefer this approach since it lies within the model-based paradigm for statistical inference from surveys, while admitting a role for frequentist concepts in model selection (see, for example, Box, 1980; Rubin, 1984). A more pragmatic reason is that the assumptions of the estimation method are explicit in the form of the model.

Some possible forms of g for borrowing strength are
$$g(P_k;\, \beta) = \beta_0, \qquad (18.20)$$
which shrinks the post-stratum means to a constant value (Little, 1983b; Ghosh and Meeden, 1986);
$$g(P_k;\, \beta) = \beta_0 + \beta_1 P_k, \qquad (18.21)$$
which assumes a linear relationship between the mean and the post-stratum size; or
$$g(C_k;\, \beta) = \beta_0 + \beta_1 C_k, \qquad (18.22)$$
where $C_k$ is an ordinal variable indexing the post-strata (Lazzeroni and Little, 1998); or models that express the mean as a linear function of both $P_k$ and $C_k$.



Elliott and Little (2000) study inferences based on (18.15) when the mean function g is given by (18.20), (18.22), or a smoothing spline. They show that these inferences compare favorably with weight trimming, where trimming points are either specified constants or estimated from the data.
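As a sketch of how the shrinkage works, the composite estimator (18.19) can be computed under the constant-mean form (18.20), taking a_1k = a_2k = 1 and treating the hyperparameters as known (all values hypothetical; conditional on the hyperparameters, the posterior mean of g_k = beta_0 is the precision-weighted grand mean of the cell means):

```python
import numpy as np

# Hypothetical cell summaries; the last cell is sparse (only 4 respondents).
P = np.array([0.5, 0.3, 0.2])          # population cell proportions P_k
N_k = np.array([500.0, 300.0, 200.0])  # cell population sizes
r_k = np.array([40.0, 15.0, 4.0])      # respondent counts
ybar_k = np.array([10.0, 12.0, 20.0])  # respondent cell means
sigma2_k = np.array([4.0, 4.0, 4.0])   # within-cell variances sigma_k^2
tau2 = 1.0                             # between-cell variance tau^2

f_k = r_k / N_k
# Shrinkage factor lambda_k = tau^2 / (tau^2 + sigma_k^2 / r_k), as in (18.19).
lam = tau2 / (tau2 + sigma2_k / r_k)

# Precision-weighted grand mean: plug-in posterior mean of beta_0 under (18.20).
prec = 1.0 / (tau2 + sigma2_k / r_k)
g_hat = np.sum(prec * ybar_k) / np.sum(prec)

# Composite estimator (18.19): each cell mean shrunk toward g_hat.
mu_k = lam * ybar_k + (1 - lam) * g_hat
y_hat = np.sum(P * (f_k * ybar_k + (1 - f_k) * mu_k))
print(y_hat)
```

The sparse cell (r_k = 4) has lambda = 0.5, so its extreme mean of 20 is pulled halfway toward the grand mean, and the composite estimate falls below the fixed-effects post-stratified value of 12.6.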

18.3. ITEM NONRESPONSE

18.3.1. Introduction

Item nonresponse occurs when particular items in the survey are missing, because they were missed by the interview, or the respondent declined to answer particular questions. For item nonresponse the pattern of missing values is generally complex and multivariate, and substantial covariate information is available to predict the missing values in the form of observed items. These characteristics make weighting adjustments unattractive, since weighting methods are difficult to generalize to general patterns of missing data (Little, 1988) and make limited use of the incomplete cases.

A common practical approach to item-missing data is imputation, where missing values are filled in by estimates and the resulting data are analyzed by complete-data methods. In this approach incomplete cases are retained in the analysis. Imputation methods until the late 1970s lacked an underlying theoretical rationale. Pragmatic estimates of the missing values were substituted, such as unconditional or conditional means, and inferences based on the filled-in data. A serious defect with the method is that it 'invents data.' More specifically, a single imputed value cannot represent all of the uncertainty about which value to impute, so analyses that treat imputed values just like observed values generally underestimate uncertainty, even if nonresponse is modeled correctly. Large-sample results (Rubin and Schenker, 1986) show that for simple situations with 30 % of the data missing, single imputation under the correct model results in nominal 90 % confidence intervals having actual coverages below 80 %. The inaccuracy of nominal levels is even more extreme in multiparameter testing problems.

Rubin's (1987) theory of multiple imputation (MI) put imputation on a firm theoretical footing, and also provided simple ways of incorporating imputation uncertainty into the inference. Instead of imputing a single set of draws for the missing values, a set of Q (say Q = 5) datasets is created, each containing different sets of draws of the missing values from their predictive distribution given the observed data. The analysis of interest is then applied to each of the Q datasets and results are combined in simple ways, as discussed in the next section.

18.3.2. MI based on the predictive distribution of the missing variables

MI theory is founded on a simulation approximation of the Bayesian posterior distribution of the parameters given the observed data. Specifically, suppose


the sampling mechanism is ignorable and let $y_s = (y_{tj})$ represent an $n \times J$ data matrix and $r_s$ an $n \times J$ missing-data indicator matrix with $r_{tj} = 1$ if $y_{tj}$ is observed. Let $y_{\rm obs}$ be the observed data and $y_{\rm mis}$ the missing components of $y_s$. Let $p(y_s, r_s \mid z_U, \theta)$ be the distribution of $y_s$ and $r_s$ given design variables $z_U$, indexed by model parameters $\theta$, and let $p(\theta \mid z_U)$ be a prior distribution for $\theta$. The posterior distribution for $\theta$ given the observed data $(z_U, y_{\rm obs}, r_s)$ is related to the posterior distribution given hypothetical complete data $(z_U, y_U, r_s)$ by the expression
$$p(\theta \mid z_U, y_{\rm obs}, r_s) = \int p(\theta \mid z_U, y_s, r_s)\, p(y_{\rm mis} \mid z_U, y_{\rm obs}, r_s)\, dy_{\rm mis}.$$

MI approximates (or simulates) this expression as
$$p(\theta \mid z_U, y_{\rm obs}, r_s) \approx \frac{1}{Q} \sum_{q=1}^{Q} p(\theta \mid z_U, y_s^{(q)}, r_s), \qquad (18.23)$$
where $y_s^{(q)} = (y_{\rm obs}, y_{\rm mis}^{(q)})$ is an imputed dataset with missing values filled in by a draw $y_{\rm mis}^{(q)}$ from the posterior predictive distribution of the missing data $y_{\rm mis}$ given the observed data $z_U$, $y_{\rm obs}$, and $r_s$:
$$y_{\rm mis}^{(q)} \sim p(y_{\rm mis} \mid z_U, y_{\rm obs}, r_s), \qquad (18.24)$$
and $p(\theta \mid z_U, y_s^{(q)}, r_s)$ is the posterior for $\theta$ based on the filled-in dataset $z_U$ and $y_s^{(q)}$. The posterior mean and covariance matrix of $\theta$ can be approximated similarly as
$$E(\theta \mid z_U, y_{\rm obs}, r_s) \approx \bar{\theta} = \frac{1}{Q}\sum_{q=1}^{Q} \hat{\theta}^{(q)}, \quad \text{where } \hat{\theta}^{(q)} = E(\theta \mid z_U, y_s^{(q)}, r_s); \qquad (18.25)$$
$$\mathrm{Var}(\theta \mid z_U, y_{\rm obs}, r_s) \approx \frac{1}{Q}\sum_{q=1}^{Q} \mathrm{Var}(\theta \mid z_U, y_s^{(q)}, r_s) + \frac{Q+1}{Q}\left(\frac{1}{Q-1}\sum_{q=1}^{Q}(\hat{\theta}^{(q)} - \bar{\theta})(\hat{\theta}^{(q)} - \bar{\theta})^{T}\right). \qquad (18.26)$$

These expressions form the basis for MI inference from the filled-in datasets. Equation (18.25) indicates that the combined estimate of a parameter is obtained by averaging the estimates from each of the filled-in datasets. Equation (18.26) indicates that the covariance matrix of the estimate is obtained by averaging the covariance matrices from the filled-in datasets and adding the sample covariance matrix of the estimates $\hat{\theta}^{(q)}$ from each of the filled-in datasets; the $(Q+1)/Q$ factor is a small-sample correction that improves the approximation. The second component captures the added uncertainty from imputation that is missed by single-imputation methods.

Although the above MI theory is Bayesian, simulations show that inferences based on (18.23)–(18.26) have good frequentist properties, at least if the model



and prior are reasonable. Another useful feature of MI is that the complete-data analysis does not necessarily need to be based on the model used to impute the missing values. In particular, the complete-data analysis might consist of computing estimates and standard errors using more conventional design-based methods, in which case the potential effects of model misspecification are confined to the imputation of the missing values. For more discussion of the properties of MI under model misspecification, see for example Rubin (1996), Fay (1996), Rao (1996), and the associated discussions.
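The combining rules (18.25) and (18.26) are simple to apply in code; here is a sketch for a scalar parameter, with hypothetical estimates and variances from Q = 5 imputed datasets:

```python
import numpy as np

# Hypothetical per-dataset estimates theta^(q) and within-imputation variances.
theta_q = np.array([4.8, 5.1, 5.0, 5.3, 4.9])
var_q = np.array([0.20, 0.22, 0.21, 0.19, 0.23])

Q = len(theta_q)
theta_bar = theta_q.mean()        # combined estimate, eq. (18.25)
W = var_q.mean()                  # average within-imputation variance
B = theta_q.var(ddof=1)           # between-imputation variance
T = W + (Q + 1) / Q * B           # total variance, eq. (18.26)
print(theta_bar, T)
```

The between-imputation component B is exactly the "added uncertainty from imputation" that single imputation misses.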

An important feature of MI is that draws of the missing values are imputed rather than means. Means would be preferable if the objective were to obtain the best estimates of the missing values, but they have drawbacks when the objective is to make inferences about parameters. Imputation of draws entails some loss of efficiency for point estimation, but the averaging over the Q multiply-imputed datasets in (18.25) considerably reduces this loss. The gain from imputing draws is that it yields valid inferences for a wide range of estimands, including nonlinear functions such as percentiles and variances (Little, 1988).

The difficulty in implementing MI is in obtaining draws from the posterior distribution of $y_{\rm mis}$ given $z_U$, $y_{\rm obs}$ and $r_s$, which typically has an intractable form. Since draws from the posterior distribution of $y_{\rm mis}$ given $z_U$, $y_{\rm obs}$, $r_s$, and $\theta$ are often easy to implement, a simpler scheme is to draw from the posterior distribution of $y_{\rm mis}$ given $z_U$, $y_{\rm obs}$, $r_s$ and $\hat{\theta}$, where $\hat{\theta}$ is an easily computed estimate of $\theta$ such as that obtained from the complete cases. This approach ignores uncertainty in estimating $\theta$, and is termed improper in Rubin (1987). It yields acceptable approximations when the fraction of missing data is modest, but leads to overstatement of precision with large amounts of missing data. In the latter situations one option is to draw $\theta^{(q)}$ from its asymptotic distribution and then impute $y_{\rm mis}$ from its posterior distribution given $z_U$, $y_{\rm obs}$, $r_s$, and $\theta^{(q)}$. A better but more computationally intensive approach is to cycle between draws
$$y_{\rm mis}^{(i)} \sim p(y_{\rm mis} \mid z_U, y_{\rm obs}, r_s, \theta^{(i-1)}) \quad \text{and} \quad \theta^{(i)} \sim p(\theta \mid z_U, y_{\rm obs}, y_{\rm mis}^{(i)}, r_s),$$
an application of the Gibbs sampler (Tanner and Wong, 1987; Tanner, 1996).
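The cycling just described can be illustrated in a deliberately minimal setting: a single normal variable with known variance, a flat prior on the mean, and values assumed MAR (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y ~ N(mu, sigma2) with sigma2 known and a flat prior on mu;
# the last m values are missing, assumed MAR.
y_obs = np.array([4.9, 5.2, 4.7, 5.5, 5.1, 4.8, 5.3, 5.0])
m = 4                       # number of missing values to impute
sigma2 = 0.25
n = len(y_obs) + m

mu = y_obs.mean()           # initialize at the complete-case estimate
for i in range(1000):       # Gibbs: alternate y_mis | mu and mu | y_mis
    y_mis = rng.normal(mu, np.sqrt(sigma2), size=m)        # impute missing values
    y_full = np.concatenate([y_obs, y_mis])
    mu = rng.normal(y_full.mean(), np.sqrt(sigma2 / n))    # draw the parameter

# A final draw of y_mis is one 'proper' imputation; repeating the chain
# (or thinning it) Q times gives the multiple imputations.
print(y_mis.round(2))
```

Because mu is redrawn at every cycle rather than fixed at an estimate, the imputed values reflect parameter uncertainty, which is exactly what distinguishes proper from improper imputation here.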

The formulation provided above requires specification of the joint distribution of $y_s$ and $r_s$ given $z_U$. As discussed in Section 18.1, if the missing data are missing at random (MAR), in that the distribution of $r_s$ given $y_s$ and $z_U$ does not depend on $y_{\rm mis}$, then inference can be based on a model for $y_s$ alone rather than on a model for the joint distribution of $y_s$ and $r_s$ (Rubin, 1976; Little and Rubin, 2002). Specifically, (18.23)–(18.26) can be replaced by the following:
$$p(\theta \mid z_U, y_{\rm obs}) \approx \frac{1}{Q}\sum_{q=1}^{Q} p(\theta \mid z_U, y_s^{(q)}), \qquad y_s^{(q)} = (y_{\rm obs}, y_{\rm mis}^{(q)}), \qquad y_{\rm mis}^{(q)} \sim p(y_{\rm mis} \mid z_U, y_{\rm obs});$$
$$E(\theta \mid z_U, y_{\rm obs}) \approx \bar{\theta} = \frac{1}{Q}\sum_{q=1}^{Q} \hat{\theta}^{(q)}, \qquad \hat{\theta}^{(q)} = E(\theta \mid z_U, y_s^{(q)});$$
$$\mathrm{Var}(\theta \mid z_U, y_{\rm obs}) \approx \frac{1}{Q}\sum_{q=1}^{Q} \mathrm{Var}(\theta \mid z_U, y_s^{(q)}) + \frac{Q+1}{Q}\left(\frac{1}{Q-1}\sum_{q=1}^{Q}(\hat{\theta}^{(q)} - \bar{\theta})(\hat{\theta}^{(q)} - \bar{\theta})^{T}\right),$$



where the conditioning on $r_s$ has been dropped. This approach is called inference ignoring the missing-data mechanism. Since modeling the missing-data mechanism is difficult in many applications, and results are vulnerable to model misspecification, ignorable models are attractive. Whenever possible, surveys should be designed to make this assumption plausible by measuring information on at least a subsample of nonrespondents.

Example 3. MI for the Third National Health and Nutrition Examination Survey

The Third National Health and Nutrition Examination Survey (NHANES-3) was the third in a series of periodic surveys conducted by the National Center for Health Statistics to assess the health and nutritional status of the US population. The NHANES-3 survey began in 1988 and was conducted in two phases, the first in 1988–91 and the second in 1991–94. It involved data collection on a national probability sample of 39 695 individuals in the US population. The medical examination component of the survey dictated that it was carried out in a relatively small number (89) of localities of the country known as stands; stands thus form the primary sampling units. The survey was also stratified, with oversampling of particular population subgroups.

This survey was subject to nonnegligible levels of unit and item nonresponse, in both its interview and its examination components. In previous surveys, nonresponse was handled primarily using weighting adjustments. Increasing levels of nonresponse in NHANES, and inconsistencies in analyses of NHANES data attributable to differing treatments of the missing values, led to the desire to develop imputation methods for NHANES-3 and subsequent NHANES surveys that yield valid inferences.

Variables in NHANES-3 can be usefully classified into three groups:

1. Sample frame/household screening variables
2. Interview variables (family and health history variables)
3. Mobile Examination Center (MEC) variables.

The sample frame/household screening variables can be treated essentially as fully observed. Of all sampled individuals, 14.6 % were unit nonrespondents who had only the sampling frame and household screening variables measured. The interview data consist of family questionnaire variables and health variables obtained for sampled individuals. These variables were subject to unit nonresponse and modest rates of item nonresponse. For example, self-rating of



health status (for individuals aged 17 or over) was subject to an overall nonresponse rate (including unit nonresponse) of 18.8 %, and family income had an overall nonresponse rate of 21.1 %.

Missing data in the MEC variables are referred to here as examination nonresponse. Since about 8 % of the sample individuals answered the interview questions but failed to attend the examination, rates of examination nonresponse were generally higher than rates of interview nonresponse. For example, body weight at examination had an overall nonresponse rate of 21.6 %, systolic blood pressure an overall nonresponse rate of 28.1 %, and serum cholesterol an overall nonresponse rate of 29.4 %.

The three blocks of variables (screening, interview, examination) had an approximately monotone structure, with screening variables basically fully observed, questionnaire variables missing when the interview is not conducted, and examination variables missing when either (i) the interview is not conducted or (ii) the interview is conducted but the MEC examination does not take place. However, item nonresponse for interview data, and component and item-within-component nonresponse for MEC data, spoil this monotone structure.

A combined weighting and multiple imputation strategy was adopted to create a public-use dataset consisting of over 70 of the main NHANES-3 variables (Ezzati and Khare, 1992; Ezzati-Rice et al., 1993, 1995; Khare et al., 1993). The dataset included the following information:

Basic demographics and geography: age, race/ethnicity, sex, household size, design stratum, stand, interview weight.

Other interview variables: alcohol consumption, education, poverty index, self-reported health, activity level, arthritis, cataracts, chest pain, heart attack, back pain, height, weight, optical health measures, dental health measures, first-hand and second-hand smoking variables.

Medical examination variables: blood pressure measures, serum cholesterol measures, serum triglycerides, hemoglobin, hematocrit, bone density measures, size measures, skinfold measures, weight, iron, drusen score, maculopathy, diabetic retinopathy, ferritin, mc measures, blood lead, red cell measures.

Many of the NHANES variables not included in the above set are recodes of included variables and hence easily derived.

As in previous NHANES surveys, unit nonrespondents were dropped from the sample and a nonresponse weight created for respondents to adjust for the fact that they are no longer a random sample of the population. The nonresponse weights were created as inverses of estimated propensity-to-respond scores (Rubin, 1985), as described in Ezzati and Khare (1992). All other missing values were handled by multiple imputation, specifically creating five random draws from the predictive distribution of the missing values, based on a multivariate linear mixed model (Schafer, 1996). The database consists of the following six components:

ITEM NONRESPONSE 301

Page 325: Analysis of Survey Data (Wiley Series in Survey Methodology) · PDF fileEmail (for orders and ... List of Contributors xviii Chapter 1 Introduction 1 R. L. Chambers and C. J. Skinner

1. A core dataset containing variables that are not subject to imputation (id, demographics, sampling weights, imputation flags) in fixed-width, space-delimited ASCII.

2. Five versions of a data file containing the observed data and the imputed values created as one draw from the joint predictive distribution of the missing values.

3. SAS code that will merge the core data with each of the imputed datasets, assign variable names, etc., to produce five SAS datasets of identical size, with identical variable names.

4. Sample analyses using SUDAAN and Wesvar-PC to estimate means, proportions, quantiles, and linear and logistic regression coefficients. Each analysis will have to be run five times.

5. SAS code for combining five sets of estimates and standard errors using Rubin's (1987) methods for multiple imputation inference, as outlined above.

6. Documentation written for a general audience that details (a) the history of the imputation project, (b) an overview of multiple imputation, (c) NHANES imputation models and procedures, (d) a summary of the 1994–95 evaluation study, (e) instructions on how to use the multiply-imputed database, and (f) caveats and limitations.

A separate model was fitted to sample individuals in nine age classes, with sample sizes ranging from 1410 to 8375 individuals. One reason for stratifying in this way is that the set of variables defined for individuals varies somewhat by age, with a restricted set applying to children under 17 years, and some variables restricted to adults aged over 60 years. Also, stratification on age is a simple modeling strategy for reflecting the fact that relationships between NHANES variables are known to vary with age.

I now describe the basic form of the model for a particular age stratum. For individual t in stand c, let $y_{tc}$ be the $(1 \times J)$ vector of the set of items subject to missing data, and let $x_{tc}$ be a fixed $(1 \times p)$ vector of design variables and items fully observed except for unit nonresponse. It is assumed that
$$y_{tc} = x_{tc}\beta + b_c + e_{tc}, \qquad e_{tc} \sim_{\rm ind} N(0, \Sigma), \qquad b_c \sim_{\rm ind} N(0, \Psi), \qquad (18.27)$$
where $\beta$ is a $(p \times J)$ matrix of fixed effects, $b_c$ is a $(1 \times J)$ vector of random stand effects with mean zero and covariance matrix $\Psi = \mathrm{diag}(\psi_1, \ldots, \psi_J)$, and $\Sigma$ is an unstructured covariance matrix; conditioning on $(\beta, \Sigma, \Psi)$ in (18.27) is implicit. It is further assumed that the missing components of $y_{tc}$ are missing at random and the parameters $(\beta, \Sigma, \Psi)$ are distinct from parameters defining the mechanism, so that the missing-data mechanism does not have to be modeled for likelihood inference (Rubin, 1976; Little and Rubin, 2002).

In view of the normality assumption in (18.27), most variables were transformed to approximate normality using standard power transformations. A few variables not amenable to this approach were forced into approximate

302 BAYESIAN METHODS FOR UNIT AND ITEM NONRESPONSE


normality by calculating the empirical cdf and corresponding quantiles of the standard normal distribution. The model (18.27) is a refinement over earlier imputation models in that stand is treated as a random effect rather than a fixed effect. This reduces the dimensionality of the model and allows for greater pooling of information across stands.
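The empirical-cdf device just described can be sketched in a few lines of Python. This is an illustrative reimplementation, not the NHANES production code, and ties are handled crudely:

```python
from statistics import NormalDist

def normal_scores(values):
    """Force a sample to approximate normality by mapping each value's
    empirical-cdf position to the corresponding standard normal quantile."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices in ascending order
    scores = [0.0] * n
    nd = NormalDist()
    for rank, i in enumerate(order, start=1):
        # (rank - 0.5)/n keeps the empirical cdf strictly inside (0, 1)
        scores[i] = nd.inv_cdf((rank - 0.5) / n)
    return scores

z = normal_scores([3.1, 0.2, 5.7, 1.4])
```

Each value is replaced by the standard normal quantile at its (rank − 0.5)/n position, so the transformed sample is approximately standard normal whatever the shape of the original distribution.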

The Gibbs sampler was used to generate draws from the posterior distribution of the parameters and the missing values for the model in Section 18.2.1, with diffuse conjugate prior distributions on (B, Ψ, Σ). S-Plus and Fortran code is available at Joseph Schafer's web site at http://www.psu.edu/~jls. In summary, given values from the tth iteration, the (t + 1)th iteration of Gibbs sampling involves the following five steps: (i) draw the stand effects b_c, c = 1, …, C, from their normal conditional distributions; (ii) draw Ψ from its conditional distribution; (iii) draw Σ from its inverted Wishart conditional distribution; (iv) draw B from its normal conditional distribution; and (v) draw the missing values y_mis,tc from their normal conditional distributions.

Here y_obs,tc consists of the set of observed items in the vector y_tc and y_mis,tc the set of missing items. More details of the forms of these distributions are given in Schafer (1996).

The Gibbs sampler for each age stratum was run as a single chain and converged rapidly, reflecting the fact that the model parameters and random effects were well estimated. After an initial run-in period, draws of the missing values were taken at fixed intervals in the chain, and these were transformed back to their original scales and rounded to produce the five sets of imputations.
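The run-in-then-subsample scheme is easy to illustrate on a toy problem. The sketch below runs a deliberately simplified Gibbs sampler for one normal variable with a single missing value, alternating a draw of the mean with a draw of the missing value, and keeps every 100th draw after a burn-in; all names and tuning values are hypothetical, and the residual variance is held fixed rather than drawn:

```python
import random
import statistics

def gibbs_impute(y_obs, n_burn=200, n_between=100, m=5, seed=1):
    """Toy Gibbs sampler for a univariate normal model with one missing
    value: alternately draw (mu | data) and (y_mis | mu), retaining every
    n_between-th draw of the missing value after a burn-in period."""
    rng = random.Random(seed)
    sigma = statistics.stdev(y_obs)       # residual sd fixed for simplicity
    n = len(y_obs)
    y_mis = statistics.mean(y_obs)        # initialize the missing value
    imputations = []
    for it in range(1, n_burn + n_between * m + 1):
        # draw mu given the completed data (flat prior on mu)
        ybar = (sum(y_obs) + y_mis) / (n + 1)
        mu = rng.gauss(ybar, sigma / (n + 1) ** 0.5)
        # draw the missing value given mu
        y_mis = rng.gauss(mu, sigma)
        if it > n_burn and (it - n_burn) % n_between == 0:
            imputations.append(y_mis)
    return imputations

imps = gibbs_impute([9.6, 10.2, 10.4, 9.8, 10.0])
```

The five retained draws play the role of the five sets of imputations; spacing them out along the chain reduces the autocorrelation between imputed values.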

18.4. NONIGNORABLE MISSING DATA

The models discussed in the previous two sections assume the missing data are MAR. Nonignorable, non-MAR models are needed when missingness depends on the missing values. For example, suppose a participant in an income survey refused to report an income amount because the amount itself is high (or low). If missingness of the income amount is associated with the amount, after controlling for observed covariates (such as age, education, or occupation), then the mechanism is not MAR, and methods for imputing income based on MAR models are subject to bias. A correct analysis must be based on the full likelihood from a model for the joint distribution of y_s and r_s. The standard likelihood asymptotics apply to nonignorable models provided the parameters are identified, and computational tools such as the Gibbs sampler also apply to this more general class of models.


Suppose the missing-data mechanism is nonignorable, but the selection mechanism is ignorable, so that a model is not required for the inclusion indicators i_U. There are two broad classes of models for the joint distribution of y_s and r_s (Little and Rubin, 2002, Ch. 11; Little, 1993b). Selection models model the joint distribution as

$f(y_s, r_s \mid z_U, \theta, \psi) = f(y_s \mid z_U, \theta)\, f(r_s \mid z_U, y_s, \psi),$  (18.28)

as in Section 18.1. Pattern-mixture models specify

$f(y_s, r_s \mid z_U, \phi, \pi) = f(y_s \mid z_U, r_s, \phi)\, f(r_s \mid z_U, \pi),$  (18.29)

where φ and π are unknown parameters, and the distribution of y_s is conditioned on the missing-data pattern r_s. Equations (18.28) and (18.29) are simply two different ways of factoring the joint distribution of y_s and r_s. When r_s is independent of y_s, the two specifications are equivalent with θ = φ and ψ = π. Otherwise (18.28) and (18.29) generally yield different models.

Pattern-mixture models (18.29) seem more natural when missingness defines a distinct stratum of the population of intrinsic interest, such as individuals reporting "don't know" in an opinion survey. However, pattern-mixture models can also provide inferences for parameters of the complete-data distribution, by expressing the parameters of interest as functions of the pattern-mixture model parameters φ and π (Little, 1993b). An advantage of the pattern-mixture modeling approach over selection models is that assumptions about the form of the missing-data mechanism are sometimes less specific in their parametric form, since they are incorporated in the model via parameter restrictions. This point is explained for specific normal pattern-mixture models in Little (1994) and Little and Wang (1996).
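Since (18.28) and (18.29) are two factorizations of one joint distribution, a tiny numerical check makes the point concrete. The binary joint probabilities below are hypothetical:

```python
# Hypothetical joint distribution of a binary outcome y and response
# indicator r (r = 1 means y is observed); the four cells sum to 1.
joint = {(0, 0): 0.12, (0, 1): 0.28, (1, 0): 0.24, (1, 1): 0.36}

# Selection-model factorization, as in (18.28): f(y, r) = f(y) f(r | y)
f_y = {y: joint[(y, 0)] + joint[(y, 1)] for y in (0, 1)}
f_r_given_y = {(r, y): joint[(y, r)] / f_y[y] for y in (0, 1) for r in (0, 1)}

# Pattern-mixture factorization, as in (18.29): f(y, r) = f(y | r) f(r)
f_r = {r: joint[(0, r)] + joint[(1, r)] for r in (0, 1)}
f_y_given_r = {(y, r): joint[(y, r)] / f_r[r] for y in (0, 1) for r in (0, 1)}

# Both factorizations reproduce the same joint distribution exactly.
for (y, r), p in joint.items():
    assert abs(f_y[y] * f_r_given_y[(r, y)] - p) < 1e-12
    assert abs(f_y_given_r[(y, r)] * f_r[r] - p) < 1e-12
```

Here missingness depends on y (response rate 0.7 when y = 0 versus 0.6 when y = 1), so the mechanism is nonignorable, yet the two factorizations describe the identical joint law.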

Most of the literature on nonignorable missing data has concerned selection models of the form (18.28), for univariate nonresponse. An early example is the probit selection model.

Example 4. Probit selection model

Suppose Y is scalar and incompletely observed, X_1, …, X_p represent design variables and fully observed survey variables, and interest concerns the parameters of the regression of Y on X_1, …, X_p. A normal linear model is assumed for this regression, that is

$(y_t \mid x_{t1}, \ldots, x_{tp}) \sim N\!\left(\beta_0 + \textstyle\sum_{j=1}^{p} \beta_j x_{tj},\; \sigma^2\right).$  (18.30)

The probability that Y is observed, given Y and X_1, …, X_p, is modeled as a probit regression function:

$\Pr(r_t = 1 \mid y_t, x_{t1}, \ldots, x_{tp}) = \Phi\!\left(\alpha_0 + \textstyle\sum_{j=1}^{p} \alpha_j x_{tj} + \alpha_{p+1}\, y_t\right),$  (18.31)


where Φ denotes the probit (standard normal cumulative distribution) function. When α_{p+1} ≠ 0, this probability is a monotonic function of the values of Y, and the missing-data mechanism is nonignorable. If, on the other hand, α_{p+1} = 0 and (α_0, …, α_p) and (β, σ²) are distinct, then the missing-data mechanism is ignorable, and maximum likelihood estimates of (β, σ²) are obtained by least squares linear regression based on the complete cases.
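The contrast between the ignorable and nonignorable cases of Example 4 can be seen by simulation. The sketch below (all coefficient values are hypothetical choices of mine) generates data from a model of the form (18.30)-(18.31) with one covariate and returns the complete-case mean of Y, whose true mean is 1:

```python
import random
from statistics import NormalDist

def complete_case_mean(alpha_y, n=20000, seed=3):
    """Simulate a probit selection model with one covariate and return
    the respondent (complete-case) mean of Y; alpha_y is the coefficient
    of y in the response equation, so alpha_y = 0 is the ignorable case."""
    rng = random.Random(seed)
    nd = NormalDist()
    observed = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 1.0 + 0.5 * x + rng.gauss(0, 1)     # outcome model, as in (18.30)
        p_resp = nd.cdf(0.2 + alpha_y * y)      # probit response model, as in (18.31)
        if rng.random() < p_resp:
            observed.append(y)
    return sum(observed) / len(observed)

ignorable = complete_case_mean(0.0)      # respondent mean close to E[Y] = 1
nonignorable = complete_case_mean(-1.0)  # large y much less likely observed
```

With alpha_y = 0 the respondent mean is close to 1; with alpha_y = -1 low values of y are far more likely to be observed and the complete-case mean collapses well below the truth, illustrating why MAR-based methods are biased in this setting.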

Amemiya (1984) calls (18.31) a Type II Tobit model, and it was first introduced to describe selection of women into the labor force (Heckman, 1976). It is closely related to the logit selection model of Greenlees, Reece and Zieschang (1982), which is extended to repeated-measures data in Diggle and Kenward (1994). This model is substantively appealing, but problematic in practice, since information to simultaneously estimate the parameters of the missing-data mechanism and the parameters of the complete-data model is usually very limited, and estimates are very sensitive to misspecification of the model (Little, 1985; Stolzenberg and Relles, 1990). The following example illustrates the problem.

Example 5. Income nonresponse in the Current Population Survey

Lillard, Smith and Welch (1982, 1986) applied the probit selection model of Example 4 to income nonresponse in four rounds of the Current Population Survey Income Supplement, conducted in 1970, 1975, 1976, and 1980. In 1980 their sample consisted of 32 879 employed white civilian males aged 16-65 who reported receipt (but not necessarily amount) of wages and salary earnings W and who were not self-employed. Of these individuals, 27 909 reported the value of W and 4970 did not. In the notation of Example 4, Y is defined to equal (W^λ − 1)/λ, where λ is a power transformation of the kind proposed in Box and Cox (1964). The predictors X were chosen as education (five dummy variables), years of market experience (four linear splines), probability of being in first year of market experience, region (south or other), child of household head (yes, no), other relative of household head or member of secondary family (yes, no), personal interview (yes, no), and year in survey (1 or 2). The last four variables were omitted from the earnings equation (18.30); that is, their coefficients in the vector β were set equal to zero. The variables education, years of market experience, and region were omitted from the response equation (18.31); that is, their coefficients in the vector α were set to zero.

Lillard, Smith and Welch (1982) fit the probit selection model (18.30) and (18.31) for a variety of choices of λ. Their best-fitting model, with λ = 0.45, predicted large income amounts for nonrespondents, in fact 73% larger on average than imputations supplied by the Census Bureau, which used a hot-deck method that assumes ignorable nonresponse. However, this large adjustment is founded on the normality assumption for the population residuals from the λ = 0.45 model, and on the specific choice of covariates in (18.30) and (18.31). It is quite plausible that nonresponse is ignorable and the unrestricted residuals follow the same (skewed) distribution as that in the respondent sample. Indeed, comparisons of Census Bureau imputations with IRS income


amounts from matched CPS/IRS files do not indicate substantial underestimation (David et al., 1986).

Rather than attempting to simultaneously estimate the parameters of the model for Y and the model for the missing-data mechanism, it seems preferable to conduct a sensitivity analysis to see how much the answers change for various assumptions about the missing-data mechanism. Examples of this approach for pattern-mixture models are given in Rubin (1977), Little (1994), Little and Wang (1996), and Scharfstein, Robins and Rotnitsky (1999). An alternative to simply accepting high rates of potentially nonignorable missing data for financial variables such as income is to use special questionnaire formats that are designed to collect a bracketed observation whenever a respondent is unable or unwilling to provide an exact response to a financial amount question. Heeringa, Little and Raghunathan (2002) describe a Bayesian MI method for multivariate bracketed data on household assets in the Health and Retirement Survey. The theoretical underpinning of these methods involves the extension of the formulation of missing-data problems via the joint distribution of y_U, i_U and r_s in Section 18.1 to more general incomplete data problems involving coarsened data (Heitjan and Rubin, 1991; Heitjan, 1994). Full and ignorable likelihoods can be defined for this more general setting.

18.5. CONCLUSION

This chapter is intended to provide some indication of the generality and flexibility of the Bayesian approach to surveys subject to unit and item nonresponse. The unified conceptual basis of the Bayesian paradigm is very appealing, and computational tools for implementing the approach are becoming increasingly available in the literature. What is needed to convince practitioners are more applications such as that described in Example 3, more understanding of useful baseline "reference" models for complex multistage survey designs, and more accessible, polished, and well-documented software: for example, SAS (2001) now has procedures (PROC MI and PROC MIANALYZE) for creating and analysing multiply-imputed data. I look forward to further developments in these directions in the future.

ACKNOWLEDGEMENTS

This research is supported by grant DMS-9803720 from the National Science Foundation.


CHAPTER 19

Estimation for Multiple Phase Samples

Wayne A. Fuller

19.1. INTRODUCTION

Two-phase sampling, also called double sampling, is used in surveys of many types, including forest surveys, environmental studies, and official statistics. The procedure is applicable when it is relatively inexpensive to collect information on a vector denoted by x, relatively expensive to collect information on the vector y of primary interest, and x and y are correlated. In the two-phase sample, the vector x is observed on a large sample and (x, y) is observed on a smaller sample. The procedure dates from Neyman (1938). Rao (1973) extended the theory and Cochran (1977, Ch. 12) contains a discussion of the technique. Hidiroglou (2001) presents results for different configurations for the two phases.

We are interested in constructing an estimator for a large vector of characteristics using data from several sources and/or several phases of sampling. We will concentrate on the use of the information at the estimation stage, omitting discussion of use at the design stage.

Two types of two-phase samples can be identified on the basis of sample selection. In one type, a first-phase sample is selected, some characteristics of the sample elements are identified, and a second-phase sample is selected using the characteristics of the first-phase units as controls in the selection process. A second type, and the type of considerable interest to us, is one in which a first-phase sample is selected and a rule for selection of second-phase units is specified as part of the field procedure. Very often the selection of second-phase units is not a function of first-phase characteristics. One example of the second type is a survey of soil properties conducted by selecting a large sample of points. At the same time the large sample is selected, a subset of the points, the second-phase sample, is specified. In the field operation, a small set of data is collected from the first-phase sample and a larger set is collected from the

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.


second-phase sample. A second example of the second type is that of a population census in which most individuals receive a short form, but a subsample receives a long form with more data elements.

The sample probability-of-selection structure of the two-phase sample is sometimes used for a survey containing item nonresponse. Item nonresponse is the situation in which respondents provide information for some, but not all, items on the questionnaire. The use of the two-phase model for this situation has been discussed by Särndal and Swensson (1987) and Rao and Sitter (1995). A very similar situation is a longitudinal survey in which respondents do not respond at every point of the survey. Procedures closely related to multiple phase estimation for these situations have been discussed by Fuller (1990, 1999). See also Little and Rubin (1987).

Our objective is to produce an easy-to-use dataset that meets several criteria. Generally speaking, an easy-to-use dataset is a file of complete records with associated weights such that linear estimators are simple weighted sums. The estimators should incorporate all available information, and should be design consistent for a wide range of population parameters at aggregate levels, such as states. The dataset will be suitable for analytic uses, such as comparison of domains, the computation of regression equations, or the computation of the solutions to estimating equations. We also desire a dataset that produces reasonable estimates for small areas, such as a county. A model for some of the small-area parameters may be required to meet reliability objectives for the small areas.

19.2. REGRESSION ESTIMATION

19.2.1. Introduction

Our discussion proceeds under the model in which the finite population is a sample from an infinite population of (x_t, y_t) vectors. It is assumed that the vectors have finite superpopulation fourth moments. The members of the finite population are indexed with integers U = {1, 2, …, N}. We let (μ_x, μ_y) denote the mean of the superpopulation vector and let (x̄_U, ȳ_U) denote the mean of the finite population.

The set of integers that identify the sample is the set s. In a two-phase sample, we let s_1 be the set of elements in the first phase and s_2 be the set of elements in the second phase. Let there be n_1 units in the first sample and n_2 units in the second sample. When we discuss consistency, we assume that n_1 and n_2 increase at the same rate.

Assume a first-phase sample is selected with selection probabilities π_1t. A second-phase sample is selected by a procedure such that the total probability of selection is π_2t. Thus, the total probability of being selected for the second-phase sample can be written

$\pi_{2t} = \pi_{1t}\, \pi_{2t|1},$  (19.1)


where π_2t|1 is the conditional probability of selecting unit t at the second phase given that it is selected in the first phase.

Recall the two types of two-phase samples discussed in the introduction. In the first type, where the second-phase sample is selected on the basis of the characteristics of the first-phase sample, it is possible that only the first-phase probabilities and the conditional probabilities of the second-phase units given the first-phase sample are known. In such a situation, the total probability of selection for the second phase is

$\pi_{2t} = \sum_{s_1 \ni t} p(s_1)\, \pi_{2t|s_1},$  (19.2)

where π_2t|s1 is the conditional probability of selecting unit t in the second phase given that the first-phase sample is the sample s_1, and the sum is over those first-phase samples containing element t. If π_2t|s1 is known only for the observed s_1, then π_2t cannot be calculated. On the other hand, in the second type of two-phase sample, the probabilities π_1t and π_2t are known.

19.2.2. Regression estimation for two-phase samples

The two-phase regression estimator of the mean of y is

$\bar{y}_{reg} = \bar{x}_1 \hat{\beta},$  (19.3)

where

$\hat{\beta} = \Big( \sum_{t \in s_2} \alpha_t\, x_t' x_t \Big)^{-1} \sum_{t \in s_2} \alpha_t\, x_t'\, y_t,$  (19.4)

and x̄_1 is the first-phase mean of x,

$\bar{x}_1 = \Big( \sum_{t \in s_1} \pi_{1t}^{-1} \Big)^{-1} \sum_{t \in s_1} \pi_{1t}^{-1} x_t,$  (19.5)

and α_t are weights. We assume that a vector that is identically equal to one is in the column space of the x_t. Depending on the design, there are several choices for the weights in (19.4). If the total probability of selection is known, it is possible to set α_t = π_2t^{-1}. If π_2t is not known, α_t = (π_1t π_2t|s1)^{-1} is a possibility. In some cases, such as simple random sampling at both phases, the weights are equivalent.
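A minimal numerical sketch of (19.3)-(19.5) for the scalar-covariate case may help fix ideas; the function name and the toy data are illustrative, not taken from the chapter:

```python
def two_phase_regression_mean(x1, x2, y2, w1, w2):
    """Two-phase regression estimator of the mean of y, in the spirit of
    (19.3)-(19.5), for x_t = (1, x_t): fit a weighted regression of y on x
    in the phase-2 subsample, then predict at the phase-1 weighted mean.
    x1, w1: covariate and weights for the phase-1 sample;
    x2, y2, w2: covariate, outcome, and weights for the phase-2 subsample."""
    # phase-2 weighted least squares of y on (1, x)
    sw = sum(w2)
    sx = sum(w * x for w, x in zip(w2, x2))
    sy = sum(w * y for w, y in zip(w2, y2))
    sxx = sum(w * x * x for w, x in zip(w2, x2))
    sxy = sum(w * x * y for w, x, y in zip(w2, x2, y2))
    slope = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    intercept = (sy - slope * sx) / sw
    # phase-1 weighted mean of x, then predict at that mean
    xbar1 = sum(w * x for w, x in zip(w1, x1)) / sum(w1)
    return intercept + slope * xbar1

# toy check: y = 2 + 3x exactly on the phase-2 sample, so the estimator
# returns 2 + 3 * (phase-1 weighted mean of x) = 2 + 3 * 2 = 8
est = two_phase_regression_mean([0, 1, 2, 3, 4], [0, 1, 2], [2, 5, 8],
                                [1, 1, 1, 1, 1], [1, 1, 1])
```

The design choice mirrors the text: the inexpensive covariate contributes through its phase-1 mean, while the expensive outcome enters only through the phase-2 regression fit.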

If the regression variable that is identically equal to one is isolated and the x-vector written as

$x_t = (1, u_t),$  (19.6)

the regression estimator can be written in the familiar form

$\bar{y}_{reg} = \bar{y}_2 + (\bar{u}_1 - \bar{u}_2)\, \hat{\beta}_u,$  (19.7)


where

$\hat{\beta}_u = \Big( \sum_{t \in s_2} \alpha_t (u_t - \bar{u}_2)'(u_t - \bar{u}_2) \Big)^{-1} \sum_{t \in s_2} \alpha_t (u_t - \bar{u}_2)'\, y_t,$  (19.8)

and $(\bar{y}_2, \bar{u}_2) = \big( \sum_{t \in s_2} \alpha_t \big)^{-1} \sum_{t \in s_2} \alpha_t\, (y_t, u_t).$

In the form (19.7) we see that each of the two choices for α_t produces a consistent estimator of the population mean.

The estimator (19.3) is the regression estimator in which the overall mean (total) of x is used in estimation. Given that the individual x-values for the first phase are known, extensive information, such as the knowledge of which units are the smallest 10%, can be used to construct the x-variables used in the regression estimator.

Using conditional expectations, the variance of the two-phase estimator (19.3) is

$V\{\bar{y}_{reg}\} = V\{E[\bar{y}_{reg} \mid s_1]\} + E\{V[\bar{y}_{reg} \mid s_1]\}.$  (19.9)

If we approximate the variance by the variance of the O_p(n^{-1/2}) terms in the Taylor expansion, we have

(19.10)

and

(19.11)

where

and ȳ_1 is the (unobservable) mean for all units in the first-phase sample.

19.2.3. Three-phase samples

To extend the results to three-phase estimation, assume a third-phase sample of size n_3 is selected from a second-phase sample of size n_2, which is itself a sample of a first-phase sample of size n_1 selected from the finite population. Let s_1, s_2, s_3 be the sets of indices in the phase 1, phase 2, and phase 3 samples, respectively.


A relatively common situation is that in which the first-phase sample is the entire population.

We assume the vector (1, x) is observed in the phase 1 sample, the vector (1, x, y) is observed in the phase 2 sample, and the vector (1, x, y, z) is observed in the phase 3 sample. Let x̄_1 be the weighted mean for the phase 1 sample and let a two-phase estimator of the mean of y be

$\tilde{y} = \bar{y}_2 + (\bar{x}_1 - \bar{x}_2)\, \hat{\beta}_2,$  (19.12)

where $\hat{\beta}_2 = \Big( \sum_{t \in s_2} \alpha_{2t} (x_t - \bar{x}_2)'(x_t - \bar{x}_2) \Big)^{-1} \sum_{t \in s_2} \alpha_{2t} (x_t - \bar{x}_2)'(y_t - \bar{y}_2)$

and α_2t are weights. The estimator ỹ is an estimator of the finite population mean, as well as of the superpopulation mean. Then a three-phase estimator of the mean of z is

$\tilde{z} = \bar{z}_3 + (\bar{x}_1 - \bar{x}_3,\; \tilde{y} - \bar{y}_3)\, \hat{\beta}_3,$  (19.13)

where

and

We assume that the weights α_2t and α_3t are such that

is consistent for (x̄_U, ȳ_U) and

is consistent for (x̄_U, ȳ_U, z̄_U). If all three samples are simple random samples from a normal superpopulation, the estimator (19.13) is the maximum likelihood estimator. See Anderson (1957).

The variance of the three-phase estimator can be obtained by repeated application of conditional expectation arguments:

(19.14)


19.2.4. Variance estimation

Several approaches can be considered for the estimation of the variance of a two-phase sample. One approach is to attempt to estimate the two terms of (19.9) using an estimator of the form

$\hat{V}\{\bar{y}_{reg}\} = \hat{V}\{\bar{y}_1\} + \hat{V}\{\bar{y}_{reg} \mid s_1\}.$  (19.15)

The estimated conditional variance, the second term on the right of (19.15), can be constructed as the standard variance estimator for the regression estimator of ȳ_1 given s_1. If the first-phase sample is a simple random nonreplacement sample and the x-vector defines strata, then a consistent estimator of V{ȳ_1} can be constructed as

where

and

A general estimator of V{ȳ_1} can be constructed by recalling that the Horvitz-Thompson estimator of V{ȳ_1} of (19.10) can be written in the form

$\hat{V}\{\bar{y}_1\} = \sum_{(t,u) \in s_1} \pi_{1tu}^{-1} (\pi_{1tu} - \pi_{1t}\pi_{1u})\, \frac{y_t}{\pi_{1t}}\, \frac{y_u}{\pi_{1u}},$  (19.16)

where π_1tu is the probability that unit t and unit u appear together in the phase 1 sample. Using the observed phase 2 sample, the estimated variance (19.15) can be estimated with

$\hat{V}\{\bar{y}_1\} = \sum_{(t,u) \in s_2} \pi_{1tu}^{-1}\, \pi_{2tu|1}^{-1} (\pi_{1tu} - \pi_{1t}\pi_{1u})\, \frac{y_t}{\pi_{1t}}\, \frac{y_u}{\pi_{1u}},$  (19.17)

provided π_2tu|1 is positive for all (t, u). This variance estimator is discussed by Särndal, Swensson and Wretman (1992, Ch. 9). Expression (19.17) is not always easy to implement. See Kott (1990, 1995), Rao and Shao (1992), Breidt and Fuller (1993), Rao and Sitter (1995), Rao (1996), and Binder (1996) for discussions of variance estimation.

We now consider another approach to variance estimation. We can expand the estimator (19.3) in a first-order Taylor expansion to obtain

where x̄_U is the finite population mean, and β_U is the finite population regression coefficient.


The n without a subscript is used as the index for a sequence of samples in which n_1 and n_2 increase at the same rate. Then the approximating random variable has variance

$V\{\bar{y}_{reg}\} \doteq \beta_U'\, V\{\bar{x}_1\}\, \beta_U + \beta_U'\, C\{\bar{x}_1, \hat{\beta}\}'\, \bar{x}_U' + \bar{x}_U\, C\{\bar{x}_1, \hat{\beta}\}\, \beta_U + \bar{x}_U\, V\{\hat{\beta}\}\, \bar{x}_U',$  (19.18)

where C{x̄_1, β̂} is the covariance matrix of the two vectors. In many designs, it is reasonable to assume that C{x̄_1, β̂} = 0, and variance estimation is much simplified. See Fuller (1998).

The variances in (19.18) are unconditional variances. The variance of x̄_1 can be estimated with a variance formula appropriate for the complete first-phase sample. The component V{β̂} is the more difficult component to estimate in this representation. If the second-phase sampling can be approximated by Poisson sampling, then V{β̂} can be estimated with a two-stage variance formula. In the example of simple random sampling at the first phase and stratified sampling for the second phase, the diagonal elements of V{β̂} are the variances of the stratum means. In this case, use of the common estimators produces estimators (19.15) and (19.18) that are nearly identical.

19.2.5. Alternative representations

The estimator (19.3) can be written in a number of different ways. Because it is a linear estimator, we can write

$\bar{y}_{reg} = \sum_{t \in s_2} w_{2t}\, y_t,$  (19.19)

where $w_{2t} = \bar{x}_1 \Big( \sum_{u \in s_2} \alpha_u\, x_u' x_u \Big)^{-1} \alpha_t\, x_t'.$

We can also write the estimator as a weighted sum of the predicted values. Let the first-phase weights be

$w_{1t} = \Big( \sum_{u \in s_1} \pi_{1u}^{-1} \Big)^{-1} \pi_{1t}^{-1}.$  (19.20)

Then we can write

$\bar{y}_{reg} = \sum_{t \in s_1} w_{1t}\, \hat{y}_t,$  (19.21)


where $\hat{y}_t = x_t \hat{\beta}.$

Under certain conditions we can write the estimator as

$\bar{y}_{reg} = \sum_{t \in s_1} w_{1t}\, \tilde{y}_t,$  (19.22)

where $\tilde{y}_t = y_t$ if $t \in s_2$, $\tilde{y}_t = \hat{y}_t$ if $t \in s_1 \cap s_2^c$,

and s_1 ∩ s_2^c is the set of units in s_1 but not in s_2. The representation (19.22) holds if α_t = π_2t^{-1} and π_1t^{-1} π_2t is in the column space of the matrix X = (x_1', x_2', …, x_{n_2}')'. It will also hold if α_t = (π_1t π_2t|s1)^{-1} and π_2t|s1 is in the column space of the matrix X. The equivalence of (19.3) and (19.22) follows from the fact that $\sum_{t \in s_2} (y_t - x_t\hat{\beta})\, \pi_{1t}^{-1} = 0$ under the conditions on the column space of X.

We have presented estimators for the mean. If total estimation is of interest, all estimators are multiplied by the population size, if it is known, or by $\sum_{t \in s_1} \pi_{1t}^{-1}$ if the population size is unknown.

The estimator (19.19), based on a dataset of n_2 elements, and the estimator (19.22), based on n_1 elements, give the same estimated value for a mean or total of y. However, the two formulations can be used to create different estimators of other parameters, such as estimators for subpopulations.

To construct estimator (19.3) for a domain D, we would define a new second-phase analysis variable $y_{Dt} = y_t$ if $t \in D$ and $y_{Dt} = 0$ otherwise.

Then, one would regress y_Dt on x_t and compute the regression estimator, where the regression estimator is of the form (19.19) with the weights w_2t applied to the variable y_Dt. The design-consistent estimators for some small subpopulations may have very large variances because some of the samples will contain few or no final phase sample elements for the subpopulation.
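The domain computation just described can be sketched directly, with the indicator-modified variable y_Dt and the phase-2 weights applied; the function name and numbers are hypothetical:

```python
def domain_total(y2, w2, in_domain):
    """Design-based domain estimator built from the phase-2 weights:
    replace y_t by y_Dt = y_t * 1{t in D} and take the weighted sum,
    in the spirit of the (19.19) representation."""
    return sum(w * y * int(d) for w, y, d in zip(w2, y2, in_domain))

total_d = domain_total([4.0, 6.0, 10.0], [2.0, 2.0, 3.0], [True, False, True])
```

A domain containing no phase-2 elements receives the estimate zero from this construction, which is exactly the extreme case the text contrasts with the synthetic estimator.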

The estimator for domain D associated with (19.22) is

$\bar{y}_{reg,D} = \sum_{t \in s_1 \cap D} w_{1t}\, \hat{y}_t,$  (19.23)

where the summation is over first-phase units that are in domain D. The estimator (19.23) is sometimes called the synthetic estimator for the domain. An extreme, but realistic, case is a subpopulation containing no phase 2 elements, but containing phase 1 elements. Then, the estimator defined by (19.19) is zero,


but the estimator (19.22) is not. The fact that estimator (19.22) produces nonzero estimates for any subpopulation with a first-phase x is very appealing.

However, the domain estimator (19.23) is a model-dependent estimator. Under the model in which

$y_t = x_t \beta + e_t$  (19.24)

and the e_t are zero-mean random variables, independent of x_u for all t and u, the estimator (19.23) is unbiased for the subpopulation sum. Under randomization theory, the estimator (19.23) is biased and it is possible to construct populations for which the estimator is very biased.

The method of producing a dataset with $(\tilde{y}_t, x_t)$ as data elements will be satisfactory for some subpopulation totals. However, the single set of $\tilde{y}$-values cannot be used to construct estimates for other characteristics of the distribution of y, such as the quantiles or percentiles of y. This is because the estimator of the cumulative distribution function evaluated at a point a is the mean of an analysis variable that is one if y_t is less than a and is zero otherwise. The creation of a dataset that will produce valid quantile estimates has been discussed in the imputation literature and as a special topic by others. See Rubin (1987), Little and Rubin (1987), Kalton and Kasprzyk (1986), Chambers and Dunstan (1986), Nascimento Silva and Skinner (1995), Brick and Kalton (1996), and Rao (1996). The majority of the imputation procedures require a model specification.
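The indicator construction for the distribution function is easy to write out; the helper names below are mine, not the chapter's:

```python
def weighted_cdf(y, w, point):
    """Estimate F(point) as the weighted mean of the indicator 1{y_t <= point}."""
    return sum(wt for yt, wt in zip(y, w) if yt <= point) / sum(w)

def weighted_quantile(y, w, p):
    """Smallest observed value whose estimated cdf reaches probability p."""
    for yt in sorted(set(y)):
        if weighted_cdf(y, w, yt) >= p:
            return yt
    return max(y)

med = weighted_quantile([3, 1, 4, 1, 5], [1, 1, 1, 1, 1], 0.5)
```

The quantile is recovered by scanning the estimated cdf, which is why a dataset must reproduce the whole weighted distribution of y, and not merely its mean, for quantile estimates to be valid.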

19.3. REGRESSION ESTIMATION WITH IMPUTATION

Our objective is to develop an estimation scheme that uses model estimation to improve estimates for small areas, but that retains many of the desirable properties of design-based two-phase estimation.

Our model for the population is

$y_t = x_t \beta + e_t,$  (19.25)

where the e_t are independent zero-mean, finite variance, random variables, independent of x_u for all t and u. We subdivide the population into subunits called cells. Assume that cells exist such that

$e_t \sim \mathrm{ii}(0, \sigma_g^2) \quad \text{for } t \in C_g,$  (19.26)

where C_g is the set of indices in the gth cell, and the notation ii is read "distributed independently and identically." A common situation is that in which the x-vector defines a set of imputation cells. Then, the model (19.25)-(19.26) reduces to the assumption that

$y_t \sim \mathrm{ii}(\mu_g, \sigma_g^2).$  (19.27)

That is, units in the cell, whether observed or not, are independently and identically distributed. The single assumption (19.27) can be obtained


through assumptions on the nature of the parent population and the selection process.

Consider a two-phase sample in which the x-vectors are known for the entire first phase. Let the weighted sample regression be that defined in (19.3) and assume the α_t and the selection probabilities are such that

(19.28)

Assume that a dataset is constructed as

$\tilde{y}_t = y_t$ if $t \in s_2$; $\tilde{y}_t = x_t\hat{\beta} + r_t$ if $t \in s_1 \cap s_2^c,$  (19.29)

where r_t is a deviation chosen at random from the set of regression deviations

êu = yu − xu β̂,  u ∈ s2,   (19.30)

in the same cell as the recipient. The element contributing the rt is called the donor and the element t receiving the rt is called the recipient. Let the x-vector define the set of imputation cells and assume the π2t are constant within every cell. Then the outlined procedure is equivalent to selection of an observation within the cell to use as the donor. The regression estimator of the cell mean is the weighted sample mean for the cell, when the x-vector is a set of indicator variables defining the cells.
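The within-cell donor mechanism just described can be sketched as a random hot-deck. All data below are invented for illustration; the donor is drawn with probability proportional to its weight, one way of pursuing the design consistency discussed next, although the chapter's exact selection probabilities are those of (19.31):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical first-phase sample: imputation cell, observed y (NaN = no
# phase 2 observation) and survey weight for each element
cells = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y     = np.array([3.1, 2.8, np.nan, 3.4, 7.2, np.nan, 6.9, np.nan])
w     = np.array([1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 3.0, 1.0])

y_imp = y.copy()
for g in np.unique(cells):
    in_cell = cells == g
    donors = np.where(in_cell & ~np.isnan(y))[0]
    recips = np.where(in_cell & np.isnan(y))[0]
    # Weighted mean of the observed units: the regression estimator of the
    # cell mean when x is a set of cell indicator variables
    cell_mean = np.average(y[donors], weights=w[donors])
    for t in recips:
        # Draw the donor with probability proportional to its weight
        d = rng.choice(donors, p=w[donors] / w[donors].sum())
        r = y[d] - cell_mean          # regression deviation of the donor
        y_imp[t] = cell_mean + r      # equals the donor's own y-value here
print(y_imp)
```

Because the cell mean and the deviation cancel, hot-deck imputation with a cell-indicator x simply copies the donor's y-value, as the text observes.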

To construct a design-consistent estimator when units within a cell have been selected with unequal probability, we select the donors within a cell with probability proportional to

(19.31)

Then,

(19.32)

Now

(19.33)


Therefore, the expectation in (19.32) is approximately equal to the weighted sum

See Rao and Shao (1992) and Platek and Gray (1983) for discussions of imputation with unequal probabilities.

The estimated total using the dataset defined by (19.29) is

ŶI = Σt∈s1 w1t ỹt,   (19.34)

where rt = êt if t ∈ s2. If the first phase is the entire population,

ŶI = Σt∈U xt β̂ + Σt∈s2 êt + (N − n)ēr,   (19.35)

where ēr is the mean of the imputed regression deviations.

Under random-with-replacement imputation, assuming the original sample is a single-phase nonreplacement sample with πt = nN⁻¹, and ignoring the variance of β̂, the variance of ŶI in (19.35) is

(19.36)

where the term (N − n)σ²e is introduced by the random imputation. With random imputation that restricts the sample sum of the imputed deviations to be (nearly) equal to zero, the imputation variance term is (nearly) removed from (19.36).
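A small Monte Carlo sketch of this point, for a pure mean model with invented sizes: conditional on one observed sample, the spread of the estimated total across repeated imputations is the imputation variance, and centring the drawn deviations so they sum to zero removes it.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented setup: one imputation cell (pure mean model), N population units,
# n of them observed at the second phase
N, n = 1000, 200
y_obs = 5.0 + rng.normal(0.0, 1.0, size=n)   # observed y-values
ybar = y_obs.mean()                          # fitted cell mean
e_obs = y_obs - ybar                         # regression deviations
n_miss = N - n

def imputed_total(balance):
    # Random-with-replacement draw of deviations for the unobserved units
    r = rng.choice(e_obs, size=n_miss, replace=True)
    if balance:
        r = r - r.mean()   # restrict the imputed deviations to sum to zero
    return y_obs.sum() + (ybar + r).sum()

totals_random = np.array([imputed_total(False) for _ in range(2000)])
totals_balanced = np.array([imputed_total(True) for _ in range(2000)])
print(totals_random.var())     # imputation variance, roughly (N - n) * var(e)
print(totals_balanced.var())   # essentially zero
```

The balanced version still uses randomly chosen donors, but the estimated total no longer depends on which deviations happened to be drawn.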

Under the described imputation procedure, the estimator of the total for the small area Du is

ŶIu = Σt∈Du∩s1 w1t ỹt,   (19.37)

where Du ∩ s1 is the set of first-phase indexes that are in small area Du. The estimator ŶIu is the synthetic estimator plus an imputation error,

(19.38)

Under the regression model with simple random sampling, the variance expression (19.36) holds for the small area with (Nu, nu) replacing (N, n), where Nu and nu are the population and sample number of elements, respectively, in small area Du. Note that the variance expression remains valid for nu = 0.

Expression (19.35) for the grand total defines an estimator that is randomization valid and does not require the model assumptions associated with (19.25) and (19.26). That is, for simple random sampling we can write

(19.39)


where σ²e = E[( yt − xt B)²] and B = (E[x′t xt])⁻¹E[x′t yt].

On the other hand, to write the variance of the small-area estimator as

(19.40)

where σ²eu is the variance of the e in the small area, requires the model assumptions. Under certain conditions it is possible to use the imputed values to estimate the variance of the first-phase mean. For example, if the first-phase stratum contains nh1 phase 1 elements and nh2 phase 2 elements, if the subsampling is proportional, and if imputation is balanced so that the variance of a phase 2 imputed mean is equal to the variance of the simple phase 2 mean, then

(19.41)

where yhI1 is the first-phase mean for stratum h based on imputed data, yh1 is the (unobservable) first-phase mean for stratum h, nh1 is the number of first-phase units in stratum h, nh2 is the number of second-phase units in the first-phase stratum h, and V̂[ yhI1 ] is the estimated variance of the phase 1 stratum mean computed treating imputed data as real observations. Thus, if there are a large number of second-phase units in the first-phase stratum, the bias of V̂[ yhI1 ] will be small.

In summary, we have described a two-phase estimation scheme in which estimates of population totals are the standard randomization two-phase estimators subject to imputation variance. The estimates for small areas are synthetic estimators constructed under a model, subject to the consistency-of-total restriction.

19.4. IMPUTATION PROCEDURES

The outlined regression imputation procedure of Section 19.3 yields a design-consistent estimator for the total of the y-variable. Estimates of other parameters, such as quantiles, based on the imputed values (19.29) are not automatically design consistent for general x-variables.

Two conditions are sufficient for the imputation procedure to give design-consistent estimates of the distribution function. First, the vector of x-variables is a vector of indicator variables that defines a set of mutually exclusive and exhaustive subdivisions of the first-phase sample. Second, the original selection probabilities are used in the selection of donors. The subdivisions defined by the x-vectors in this case are sometimes called first-phase strata. If the sampling rates are constant within first-phase strata, simple selection of donors can be used.


In many situations, it is operationally feasible to use ordinary sampling techniques to reduce or eliminate the variance associated with the selection of donors. See Kalton and Kish (1984) and Brick and Kalton (1996). The simplest situation is equal probability sampling, where the first-phase sample is an integer multiple, m, of the second-phase sample. If every second-phase unit is used as a donor exactly m times, then there is zero imputation variance for a total. It is relatively easy, in practical situations, to reduce the imputation variance to that due to the rounding of the number of times a second-phase unit is used as a donor. In general, the number of times an element is used as a donor should be proportional to its weight.

An important procedure that can be used to reduce the imputation variance is to use fractional imputation, described by Kalton and Kish (1984). In this procedure k values obtained from k donors are used to create k imputed values for each first-phase element with no second-phase observation. Each of the k imputed values is given a weight, such that the procedure is unbiased. If the original sample and the second-phase sample are equal probability samples, the weight is the original first-phase weight divided by k. A fully efficient fractional imputation scheme uses every second-phase unit in a cell as a donor for every first-phase unit that is not a second-phase unit. The number of imputations required by this procedure is determined by the number of elements selected and the number not selected in each cell.
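For a single cell under equal-probability sampling, fully efficient fractional imputation can be illustrated directly (the values are hypothetical): each recipient receives every donor value with weight w/k, and the weighted total from the imputed dataset reproduces the two-phase cell-mean estimator exactly, so there is no imputation variance for the total.

```python
import numpy as np

# Hypothetical single cell under equal-probability sampling: three phase 2
# donors and two first-phase recipients, all with base weight w
w = 1.0
donor_y = np.array([4.0, 6.0, 8.0])
n_recipients = 2
k = len(donor_y)   # fully efficient: every donor serves every recipient

# Observed records keep weight w; each recipient gets k imputed records,
# each carrying weight w / k
records = [(y, w) for y in donor_y]
for _ in range(n_recipients):
    records += [(y, w / k) for y in donor_y]

total_fractional = sum(y * wt for y, wt in records)

# Standard two-phase estimator: cell mean of the donors expanded to all units
total_two_phase = w * (len(donor_y) + n_recipients) * donor_y.mean()
print(total_fractional, total_two_phase)   # 30.0 30.0
```

With unequal weights the same construction applies record by record, which is why the number of imputations is driven by the counts of selected and non-selected elements in each cell.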

We call the imputation procedure controlled if it produces a dataset such that all estimates for y-characteristics at the aggregate level are equal to the standard (weighted phase 2) two-phase regression estimator. Controlled imputation can also be used to produce estimates based on imputed data for designated small areas that are nearly equal to the synthetic small-area estimates.

In practice, the first phase can contain many potential control variables and the formation of first-phase strata will require selecting a subset of variables and/or defining categories for continuous variables. Thus, implementation of the imputation step can require a great deal of effort in a large-scale survey. Certainly the imputed dataset must be passed through an edit operation. Given a large dimensional dataset with many control variables, some of the original observations will be internally inconsistent and some created imputed observations will be internally inconsistent. The objective of the imputation and edit operations is to reduce the number of inconsistencies, and to produce a final dataset that closely approximates reality.

19.5. EXAMPLE

To illustrate some of the properties of alternative estimation procedures for two-phase samples, we use data from the National Resources Inventory (NRI), a large national survey of land cover and land use conducted by the Natural Resources Conservation Service of the US Department of Agriculture in cooperation with the Statistical Laboratory of Iowa State University.


The study is a panel study and data have been collected for the years 1982, 1987, 1992, and 1997. The sample covers the entire United States. The primary sampling units are segments of land, where segments range in size from 40 acres to 640 acres. The design is specified by state and the typical design is stratified with area strata and two segments selected in each stratum. There are about 300 000 segments in the sample. In most segments, either two or three sample point locations are specified. In two states there is one sample point in each segment. There are about 800 000 points in the sample. See Nusser and Goebel (1997) for a more complete description of the survey.

In the study, a mutually exclusive and exhaustive land classification called coveruse is used. In each segment two types of information are collected: segment information and point information. Segment information includes a determination of the total number of urban acres in the segment, as well as the acres in roads, water, and farmsteads. Information collected at points is more detailed and includes a determination of coveruse, as well as information on erosion, land use, soil characteristics, and wildlife habitat. The location of the points within the segment was fixed at the time of sample selection. Therefore, for example, a segment may contain urban land but none of the three points fall on urban land. The segment data are the first-phase data and the point data are the second-phase data.

We use some data on urban land from the 1997 survey. The analysis unit for the survey is a state, but we carry out estimation for a group of counties. Because the data have not been released, our data are a collection of counties and do not represent a state or any real area. Our study area is composed of 89 counties and 6757 segments.

Imputation is used to represent the first-phase (segment) information. In the imputation operation an imputed point is created for those segments with acres of urban land but no point falling on the urban land. The characteristics for attributes of the point, such as coveruse in previous years, are imputed by choosing a donor that is geographically close to the recipient. The weight assigned to the imputed point is equal to the inverse of the segment selection probability multiplied by the acres of the coveruse. See Breidt, McVey and Fuller (1996-7) for a description of the imputation procedure. There are 394 real points and 219 imputed points that were urban in both 1992 and 1997 in the study area.

The sample in our group of counties is a stratified sample of segments, where a segment is 160 acres in size. The original stratification is such that each stratum contains two segments. In our calculations, we used collapsed strata that are nearly the size of a county. Calculations were performed using PC CARP.

We grouped the counties into seven groups that we call cells. These cells function as second-phase strata and are used as the cells from which donors are selected. This represents an approximation to the imputation procedure actually used.

In Section 19.3, we discussed using the imputed values in the estimation of the first-phase part of the variance expression. Let V̂1{ y} be the estimated variance of the total of a y-characteristic computed treating the imputed points as real points. This estimator is a biased estimator of the variance of the mean of a complete first-phase sample of y-values because a value from one segment can be used as a donor for a recipient in another segment in the same first-phase stratum. Thus, the segment totals are slightly correlated and the variance of a stratum mean is larger than the segment variance divided by the number of segments in the stratum. Because of the large number of segments per variance estimation stratum, about 70, the bias is judged to be small. See expression (19.41).

The variance of the weighted two-phase estimator of a total for a characteristic y was estimated by a version of (19.15),

(19.42)

where Ŷ is the estimated total of y, x̂1c is the phase 1 estimator of the total of x for cell c, x is the indicator for urban in 1992, and x̄2c is the estimated mean of cell c for the second phase.

The second term on the right side of (19.42) is the estimator of the conditional variance of the regression estimator given the first-phase sample. In this example, the regression estimator of the mean is the stratified estimator using the first-phase estimated stratum (cell) sizes x̂1c. Thus, the regression estimator of the variance is equivalent to the stratified estimator of variance.

Table 19.1 contains estimated coefficients of variation for the NRI sample. This sample is an unusual two-phase sample in the sense that the number of two-phase units is about two-thirds of the number of first-phase units. In the more common two-phase situation, the number of second-phase units is a much smaller fraction of the number of first-phase units.

The variance of total urban acres in a county using imputed data is simply the variance of the first-phase segment data. However, if only the observed points are used to create a dataset, variability is added to estimates for counties. There were 15 counties with no second-phase points, but all counties had some urban segment data. Thus, the imputation procedure produces positive estimates for all counties. Crudely speaking, the imputation procedure makes a 100 % improvement for those counties.

Table 19.1 Estimated properties of alternative estimators for a set of NRI counties.

                          Coefficient of variation
Procedure                 Total urban acres    County urban acres(a)    Fraction of zero county estimates

Two-phase (weighted)      0.066                0.492                    0.169
Two-phase (imputed)       0.066                0.488                    0.000

(a) Average of 12 counties with smallest values.


Because of the number of zero-observation and one-observation counties, it is difficult to make a global efficiency comparison at the county level. The column with heading "County urban acres" contains the average coefficient of variation for the 12 counties with the smallest coefficients of variation. All of these counties have second-phase points. There is a modest difference between the variance of the two procedures because the ratio of the total for the county to the total for the cell is well estimated.

In the production NRI dataset, a single imputation is used. Kim and Fuller (1999) constructed a dataset for one state in which two donors were selected for each recipient. The weight for each imputed observation is then one-half of the original observation. Kim and Fuller (1999) demonstrate how to construct replicates using the fractionally imputed data to compute an estimated variance.


CHAPTER 20

Analysis Combining Survey and Geographically Aggregated Data

D. G. Steel, M. Tranmer and D. Holt

20.1. INTRODUCTION AND OVERVIEW

Surveys are expensive to conduct and are subject to sampling errors and a variety of non-sampling errors (for a review see Groves, 1989). An alternative sometimes used by researchers is to make use of aggregate data available from administrative sources or the population census. The aggregate data will often be in the form of means or totals for one or more sets of groups into which the population has been partitioned. The groups will often be geographic areas or may be institutions, for example schools or hospitals.

Use of aggregate data alone to make inferences about individual-level relationships can introduce serious biases, leading to the ecological fallacy (for a review see Steel and Holt, 1996a). Despite the potential for ecological bias, analysis of aggregate data is an approach employed in social and geographical research. Aggregate data are used because individual-level data cannot be made available due to confidentiality requirements, as in the analysis of electoral data (King, 1997), or because aggregate data are readily available at little or no cost.

Achen and Shively (1995) noted that the use of aggregate data was a popular approach in social research prior to the Second World War but the rapid development of sample surveys led to the use of individual-level survey data as the main approach since the 1950s. More recently researchers have been concerned that the analysis of unit-level survey data traditionally treats the units divorced from their context. Here the concern is with the atomistic fallacy. The context can be the geographic area in which individuals live but can also be other groupings.

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. © 2003 John Wiley & Sons, Ltd.

Survey data and aggregate data may be used together. One approach is contextual analysis in which the characteristics of the group in which an individual is located are included in the analysis of individual-level survey data (Riley, 1964; Boyd and Iversen, 1979). A more recent response to the concern of analysing individuals out of context has been the development of multi-level modelling (Goldstein, 1987, 1995). Here the influence of the context is included through group-level random effects and parameters that may be due to the effect of unmeasured contextual variables, although they may also be acting as proxies for selection effects. Measured contextual or group-level variables may also be included in multi-level modelling. Analytical group-level variables, which are calculated from the values of individuals in the group, can be included and are often the means of unit-level variables, but may also include other statistics calculated from the unit-level variables such as a measure of spread. There may also be global group-level variables included, which cannot be derived from the values of units in the group, for example the party of the local member of parliament (Lazarsfeld and Menzel, 1961).

Standard multi-level analysis requires unit-level data, including the indicators of which group each unit belongs to, to be available to the analyst. If all units in selected groups are included in the sample then analytical group-level variables can be calculated from the individual-level data. However, if a sample is selected within groups then sample contextual variables can be used, but this will introduce measurement error in some explanatory variables. Alternatively the contextual variables can be obtained from external aggregate data, but the group indicators available must allow matching with the aggregate data. Also, if global group-level variables are to be used the indicators must allow appropriate matching. In many applications the data available will not be the complete set required for such analyses.

Consider a finite population U of N units. Associated with the tth unit in the population are the values of variables of interest yt. We will focus on situations in which the primary interest is in the conditional relationship of the variables of interest on a set of explanatory variables, the values of which are xt. Assume that the population is partitioned into M groups and ct indicates the group to which the tth unit belongs. The gth group contains Ng units. For convenience we will assume only one set of groups, although in general the population may be subject to several groupings, which may be nested but need not be. The complete finite population values are given by

( yt , xt , ct ),  t = 1, . . . , N.

In this situation there are several different approaches to the type of analysis and targets of inference.

Consider a basic model for the population values for unit t in group g,

yt = μy + νyg + εyt   (20.1)

xt = μx + νxg + εxt   (20.2)

where g = 1, . . . , M and t = 1, . . . , Ng. Define νg = ( νyg , ν′xg )′ and εt = ( εyt , ε′xt )′. Independence between all random vectors is assumed. The elements of the matrices are as follows:

νg ∼ (0, Σ(2)),  εt ∼ (0, Σ(1)),  Σ = Σ(1) + Σ(2).

Associated population regression coefficients are defined by βyx = Σxx⁻¹Σxy, β(1)yx = Σ(1)xx⁻¹Σ(1)xy and β(2)yx = Σ(2)xx⁻¹Σ(2)xy.

Consideration has to be given as to whether the targets of inference should be parameters that characterize the marginal relationships between variables across the whole population or parameters that characterize relationships within groups, conditional on the group effects. Whether the groups are of interest or not depends on the substantive situation. Holt (1989) distinguished between a disaggregated analysis, aimed at relationships taking into account the groups, and an analysis that averages over the groups, which is referred to as an aggregated analysis. A disaggregated analysis will often be tackled using multi-level modelling. For groups of substantive interest such as schools or other institutions there will be interest in determining the effect of the group and a disaggregated analysis will be appropriate. When the groups are geographically defined both approaches can be considered. The groups may be seen as a nuisance and so interest is in parameters characterizing relationships marginal to the groups. This does not assume that there are not influences operating at the group level but that we have no direct interest in them. These group-level effects will impact the relationships observed at the individual level, but we have no interest in separating purely individual- and group-level effects.

We may be interested in disaggregated parameters such as Σ(1), Σ(2) and functions of them such as β(1)yx, β(2)yx. For these situations we must consider how to estimate the variance and covariance components at individual and group levels. In situations where interest is in marginal relationships averaged over the groups then the target is Σ and functions of it such as βyx. Estimation of the components separately may or may not be necessary depending on whether individual or aggregate data are available and the average number of units per group in the dataset, as well as the relative sizes of the effects at each level. We will consider how estimation can proceed in the different cases of data availability and the bias and variance of various estimators that can be used.

Geographical and other groups can often be used in the sample selection, for example through cluster and multi-stage sampling. This is done to reduce costs or because there is no suitable sampling frame of population units available. An issue is then how to account for the impact of the population structure, that is, group effects, in the statistical analysis. There is a large literature on both the aggregated approach and the disaggregated approach (see Skinner, Holt and Smith (1989), henceforth abbreviated as SHS).

If there is particular interest in group effects, namely elements of Σ(2), then that can also be taken into account in the sample design by making sure that multiple units are selected in each group. Cohen (1995) suggested that when costs depend on the number of groups and the number of individuals in the sample then in many cases an optimal number of units to select per group is


approximately proportional to the square root of the ratio of unit-level to group-level variance components. This is also the optimal choice of sample units to select per primary sampling unit (PSU) in two-stage sampling with a simple linear cost function (Hansen, Hurwitz and Madow, 1953). Longford (1993) noted that for a fixed sample size the optimal choice is one plus the ratio of the unit-level to group-level variance components.
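Under a simple linear cost function of the form noted above, the optimal subsample size per group can be computed directly; the cost and variance figures below are invented for illustration.

```python
import math

def optimal_units_per_group(cost_group, cost_unit, var_unit, var_group):
    """Optimal number of units to select per group under a simple linear cost
    C = cost_group * m + cost_unit * m * n_bar: the subsample size grows with
    the unit-level variance component and the per-group cost, and shrinks with
    the group-level component and the per-unit cost."""
    return math.sqrt((cost_group / cost_unit) * (var_unit / var_group))

# Invented figures: a new group costs 9 times as much as a new unit, and the
# unit-level variance component is 4 times the group-level component
print(optimal_units_per_group(9.0, 1.0, 4.0, 1.0))   # 6.0
```

Note how the answer depends only on the cost ratio and the variance ratio, which is why the text states the result as a proportionality to the square root of the variance ratio.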

The role of group-level variables has to be considered carefully. Many analyses do not include any group-level variables. However, there is often the need to reflect the influence of group effects through measured variables. This can be done to explain random group-level effects and assess the impact of group-level variables that are of direct substantive interest. Such variables can be measures of the context, but they may also reflect within-group interactions and may also act as a proxy for selection effects. Inclusion of group-level variables requires either collection of this information from the respondent or collection of group indicators, for example address or postcode, to enable linking with the aggregate group-level data.

We will consider situations where the interest is in explaining variation in an individual-level variable. There are cases when the main interest is in a group-level response variable. However, even then the fact that groups are composed of individuals can be important, leading to a structural analysis (Riley, 1964).

In this chapter we consider the potential value of using aggregate and survey data together. The survey data available may not contain appropriate group indicators or may not include multiple units in the same group, which will affect the ability to estimate effects at different levels. We consider how a small amount of survey data may be used to eliminate the ecological bias in analysing aggregate data. We also show how aggregate data may be used to improve analysis of multi-level populations using survey data, which may or may not contain group indicators. Section 20.2 considers the situations that can arise involving individual-level survey data and aggregate data. The value of different types of information is derived theoretically in Section 20.3. In Section 20.4 a simulation study is reported that demonstrates the information in each source. Section 20.5 considers the use of individual-level data on auxiliary variables in the analysis of aggregate data.

20.2. AGGREGATE AND SURVEY DATA AVAILABILITY

Consider two data sources, the first based on a sample of units, s0, from which unit-level data are available and the second on the sample, s1, from which aggregate data are available. To simplify the discussion and focus on the role of survey and aggregate data we will suppose that the sample designs used to select s0 and s1 do not involve any auxiliary variables, and depend at most on the group indicators ct. Hence the selection mechanisms are ignorable for model-based inferences about the parameters of interest. We also assume that there is no substantive interest in the effect of measured group-level variables, although the theory can be extended to allow for such variables.


The sample sk contains mk groups and nk units with an average of n̄k = nk/mk sample units per selected group and the sample size in the gth group is nkg. An important case is when the aggregate data are based on the whole population, that is s1 = U, for example aggregate data from the population census or some administrative system, so n1g = Ng. The group means based upon sample sk are denoted ȳkg and x̄kg. The overall sample means are denoted ȳk and x̄k.

We assume that we are concerned with analytical inference so that we can investigate relationships that may apply to populations other than the one actually sampled (Smith, 1984). Thus we will take a purely model-based approach to analysis and suppose that the values in the finite population are generated by some stochastic process or drawn from some superpopulation. We wish to make inferences about parameters of the probability distributions associated with this stochastic process, as discussed in SHS, section 1.6.

We will consider three types of data:

(1) d0 = {( yt , xt , ct ), t ∈ s0},
(2) d0 = {( yt , xt ), t ∈ s0},
(3) d1 = {( ȳg , x̄g , n1g ), g ∈ s1}.

Case (1) corresponds to unit-level survey data with group indicators. If m0 < n0 then there are multiple observations in at least some groups and so it will be possible to separate group- and unit-level effects. Alternatively if an analysis of marginal parameters is required it is possible to estimate the variances of descriptive and analytical statistics taking the population structure into account (see Skinner, 1989). If m0 = n0 then it will not usually be possible to separate group- and unit-level effects. For an analysis of marginal parameters, this is not a problem and use of standard methods, such as ordinary least squares (OLS) regression, will give valid estimates and inferences. In this case the group identifiers are of no use. If n̄0 = n0/m0 is close to one then the estimates of group and individual-level effects will have high variances. In such a situation an analysis of marginal parameters will not be affected much by the clustering and so there will be little error introduced by ignoring it. The issues associated with designing surveys for multi-level modelling are summarized by Kreft and De Leeuw (1998, pp. 124-6).

Case (2) corresponds to unit-level survey data without group indicators. This situation can arise in several ways. For example, a sample may be selected by telephone and no information is recorded about the area in which the selected household is located. Another example is the release of unit record files by statistical agencies for which no group identifiers are released to help protect confidentiality. The Australian Bureau of Statistics (ABS) releases such data for many household surveys in this way. An important example is the release of data for samples of households and individuals from the population census, which is done in several countries, including Australia, the UK and the USA.


Case (3) corresponds to purely aggregate data. As noted before, analysis based solely on aggregate data can be subject to considerable ecological bias.

We will consider what analyses are possible in each of these cases, focusing on the bias and variance of estimators for a model with normally distributed variables. We then consider how aggregate and unit-level data can be used together in the following situations of available data:

(4) d_0 = {(y_t, x_t, c_t), t ∈ s_0} and d_1,
(5) d_0 = {(y_t, x_t), t ∈ s_0} and d_1.

Case (4) corresponds to having unit-level survey data containing group indicators together with aggregate data. Case (5) corresponds to having unit-level survey data without group indicators together with aggregate data. For these cases the impact of different relative values of m_0, n_0, ñ_0 and m_1, n_1, ñ_1 will be investigated.

20.3. BIAS AND VARIANCE OF VARIANCE COMPONENT ESTIMATORS BASED ON AGGREGATE AND SURVEY DATA

We will consider various estimates that can be calculated in the different cases of data availability, focusing on covariances and linear regression coefficients. It is easier to consider the case when the number of sample units per group is the same across groups, so n_0g = ñ_0 for all g ∈ s_0 and n_1g = ñ_1 for all g ∈ s_1. This case is sufficient to illustrate the main conclusions concerning the roles of unit and aggregate data in most cases. Hence we will mainly focus on this case, although results for more general cases are available (Steel, 1999).

Case (1). Data available: d_0 = {(y_t, x_t, c_t), t ∈ s_0}, individual-level data with group indicators. Conventional multi-level modelling is carried out using a unit-level dataset that has group indicators. Maximum likelihood estimation (MLE) can be done using iterated generalized least squares (IGLS) (Goldstein, 1987), a fast Fisher scoring algorithm (Longford, 1987) or the EM algorithm (Bryk and Raudenbush, 1992). These procedures have each been implemented in specialized computer packages. For example, the IGLS procedures are implemented in MLn.

Estimates could also be obtained using moment or ANOVA-type estimators. Let w_t = (y_t, x_t')'. Moment estimators of Σ^(1) and Σ^(2) would be based on the between- and within-group covariance matrices

$$S_0^{(2)} = \frac{1}{m_0 - 1} \sum_{g \in s_0} n_{0g}\,(\bar{w}_{0g} - \bar{w}_0)(\bar{w}_{0g} - \bar{w}_0)'$$

$$S_0^{(1)} = \frac{1}{n_0 - m_0} \sum_{g \in s_0} \sum_{t \in s_{0g}} (w_t - \bar{w}_{0g})(w_t - \bar{w}_{0g})'$$

where w̄_0g is the mean of w_t over the sample units in group g and w̄_0 is the overall sample mean.

For the model given by (20.1) and (20.2), Steel and Holt (1996a) showed that in the balanced case

$$E[S_0^{(1)}] = \Sigma^{(1)}, \qquad E[S_0^{(2)}] = \Sigma^{(1)} + \tilde{n}_0\,\Sigma^{(2)}.$$

Unbiased estimates of Σ^(1) and Σ^(2) can be obtained:

$$\hat{\Sigma}^{(1)} = S_0^{(1)}, \qquad \hat{\Sigma}^{(2)} = \tilde{n}_0^{-1}\,(S_0^{(2)} - S_0^{(1)}).$$

The resulting unbiased estimator of Σ is

$$\hat{\Sigma} = \hat{\Sigma}^{(1)} + \hat{\Sigma}^{(2)} = \tilde{n}_0^{-1} S_0^{(2)} + (1 - \tilde{n}_0^{-1})\,S_0^{(1)}.$$

We will focus on the estimation of the elements of Σ^(1) and Σ^(2), that is σ^(1)_ab and σ^(2)_ab. Define the function θ_ab(A) = A_aa A_bb + A_ab² of a covariance matrix A and the parameter Σ(ñ_0) = Σ^(1) + ñ_0 Σ^(2). The finite sample variances of the estimators are

$$var(\hat{\sigma}^{(1)}_{ab}) = \frac{\theta_{ab}(\Sigma^{(1)})}{n_0 - m_0} \qquad (20.3)$$

$$var(\hat{\sigma}^{(2)}_{ab}) = \tilde{n}_0^{-2} \left[ \frac{\theta_{ab}(\Sigma(\tilde{n}_0))}{m_0 - 1} + \frac{\theta_{ab}(\Sigma^{(1)})}{n_0 - m_0} \right] \qquad (20.4)$$

$$var(\hat{\sigma}_{ab}) = \tilde{n}_0^{-2}\,\frac{\theta_{ab}(\Sigma(\tilde{n}_0))}{m_0 - 1} + (1 - \tilde{n}_0^{-1})^2\,\frac{\theta_{ab}(\Sigma^{(1)})}{n_0 - m_0}. \qquad (20.5)$$

These results can be used to give an idea of the effect of using different values of n_0 and m_0.
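As a rough check, the balanced-case moment (ANOVA) estimators above are easy to simulate. The sketch below uses illustrative values of Σ^(1), Σ^(2) and the design (none taken from the chapter) and verifies that the estimators recover the two components:

```python
import numpy as np

rng = np.random.default_rng(0)
m0, nt0 = 200, 25                               # groups, units per group (balanced)
n0 = m0 * nt0
Sigma1 = np.array([[4.0, 1.0], [1.0, 9.0]])     # assumed unit-level component
Sigma2 = np.array([[1.0, 0.5], [0.5, 2.0]])     # assumed group-level component

# Generate w_t = group effect + unit effect under the two-level model.
nu = rng.multivariate_normal(np.zeros(2), Sigma2, size=m0)
eta = rng.multivariate_normal(np.zeros(2), Sigma1, size=(m0, nt0))
w = nu[:, None, :] + eta                        # shape (m0, nt0, 2)

wbar_g = w.mean(axis=1)                         # group means
wbar = w.reshape(-1, 2).mean(axis=0)            # overall mean

# Between- and within-group covariance matrices S0^(2) and S0^(1).
d_b = wbar_g - wbar
S2 = nt0 * (d_b.T @ d_b) / (m0 - 1)
d_w = (w - wbar_g[:, None, :]).reshape(-1, 2)
S1 = (d_w.T @ d_w) / (n0 - m0)

# Unbiased moment estimators: E[S1] = Sigma1, E[S2] = Sigma1 + nt0*Sigma2.
Sigma1_hat = S1
Sigma2_hat = (S2 - S1) / nt0
print(Sigma1_hat.round(2))
print(Sigma2_hat.round(2))
```

With many groups the group-level estimator is far less precise than the unit-level one, reflecting the m_0 − 1 rather than n_0 − m_0 degrees of freedom.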

Maximum likelihood estimates can be obtained using IGLS. The asymptotic variances of these estimates are the same as those of the moment estimators in the balanced case.

Case (2). Data available: d_0 = {(y_t, x_t), t ∈ s_0}, individual-level data without group indicators. In this situation we can calculate

$$S_0 = \frac{1}{n_0 - 1} \sum_{t \in s_0} (w_t - \bar{w}_0)(w_t - \bar{w}_0)',$$

the covariance matrix calculated from the unit-level data ignoring the groups. Steel and Holt (1996a) show

$$E[S_0] = \Sigma^{(1)} + \frac{n_0 - \tilde{n}_0}{n_0 - 1}\,\Sigma^{(2)}. \qquad (20.6)$$


The elements of S_0 have

$$var(S_{0ab}) = \frac{(n_0 - m_0)\,\theta_{ab}(\Sigma^{(1)}) + (m_0 - 1)\,\theta_{ab}(\Sigma(\tilde{n}_0))}{(n_0 - 1)^2}. \qquad (20.7)$$

The term (ñ_0 − 1)/(n_0 − 1) is O(m_0⁻¹), so provided m_0 is large we can estimate Σ with small bias. However, the variance of the estimates will be underestimated if the clustering is ignored and we assume that the variance of S_0ab is θ_ab(Σ)/(n_0 − 1) instead of (20.7).

Case (3). Data available: d_1 = {(ȳ_g, x̄_g, n_1g), g ∈ s_1}, aggregate data only.

This is the situation that is usually tackled by ecological analysis. There are several approaches that can be attempted. From these data we can calculate S_1^(2), which has the properties

$$E[S_1^{(2)}] = \Sigma^{(1)} + \tilde{n}_1\,\Sigma^{(2)} = \Sigma(\tilde{n}_1) \qquad (20.8)$$

$$var(S^{(2)}_{1ab}) = \frac{\theta_{ab}(\Sigma(\tilde{n}_1))}{m_1 - 1}. \qquad (20.9)$$

As an estimate of Σ, S_1^(2) has an expectation that is a linear combination of Σ^(1) and Σ^(2), with the weights given to the two components leading to a biased estimate of Σ and functions of it, unless ñ_1 = 1. If ñ_1 is large then the estimator will be heavily biased, as an estimator of Σ, towards Σ^(2). To remove the bias requires estimation of Σ^(1) and Σ^(2), even if the marginal parameters are of interest.

As noted by Holt, Steel and Tranmer (1996), as an estimate of Σ^(2), ñ_1⁻¹ S_1^(2) has bias ñ_1⁻¹ Σ^(1), which may be small if the group sample sizes are large. However, even if the parameter of interest is Σ^(2) or functions of it, the fact that there is some contribution from Σ^(1) needs to be recognized and removed.

Consider the special case when y_t and x_t are both scalar. The ecological regression coefficient relating y to x is B^(2)_1yx = S^(2)-1_1xx S^(2)_1xy, and to order m_1⁻¹

$$E[B^{(2)}_{1yx}] = (1 - a)\,B^{(1)}_{yx} + a\,B^{(2)}_{yx},$$

where a = ñ_1 σ^(2)_xx / (σ^(1)_xx + ñ_1 σ^(2)_xx). The effect of aggregation is to shift the weight given to the population regression parameters towards the group level. Even a small group-level component σ^(2)_xx can lead to a considerable shift if ñ_1 is large.
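The shift can be illustrated numerically. The sketch below uses the LLTI/CARO variance components from the simulation study of Section 20.4 (×1000 scale); `ecological_slope` is a hypothetical helper implementing the weight formula above:

```python
# Variance components for LLTI (y) and CARO (x) from Section 20.4 (x1000).
s1_xx, s1_xy = 90.5, 22.8      # sigma^(1)_xx, sigma^(1)_xy
s2_xx, s2_xy = 6.09, 1.84      # sigma^(2)_xx, sigma^(2)_xy

B1 = s1_xy / s1_xx             # unit-level slope, approx. 0.252
B2 = s2_xy / s2_xx             # group-level slope, approx. 0.302

def ecological_slope(n_tilde):
    """Approximate E[B^(2)_1yx] = (1 - a)*B1 + a*B2 with
    a = n*s2_xx / (s1_xx + n*s2_xx)  (hypothetical helper)."""
    a = n_tilde * s2_xx / (s1_xx + n_tilde * s2_xx)
    return (1 - a) * B1 + a * B2

# n_tilde = 1 reproduces the marginal slope; large n_tilde approaches B2.
for n in (1, 10, 100, 1000):
    print(n, round(ecological_slope(n), 3))
```

Even though the group-level share of the variance of x is only about 6 %, moderate group sizes already move the expected ecological slope most of the way from the marginal value 0.255 towards the group-level value 0.302.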

If there are no group-level effects, so that Σ^(2) = 0, which is equivalent to random group formation, then S_1^(2) is unbiased for Σ and θ_ab(Σ(ñ_1)) = θ_ab(Σ). Steel and Holt (1996b) showed that in this case there is no ecological fallacy for linear statistics, and parameters such as means, variances, regression and correlation coefficients can be unbiasedly estimated from group-level data. They also showed that the variances of statistics are mainly determined by the number of groups


in the analysis. A regression or correlation analysis of m groups has the same statistical properties as an analysis based on m randomly selected individuals, irrespective of the sample sizes on which the group means are based.

When only aggregate data are available, estimates of the variance components can be obtained in theory. The key to separating Σ^(1) and Σ^(2) is the form of heteroscedasticity for the group means implied by models (20.1) and (20.2), that is

$$var(\bar{w}_{1g}) = \Sigma^{(2)} + n_{1g}^{-1}\,\Sigma^{(1)}. \qquad (20.10)$$

A moments approach can be developed using group-level covariance matrices with different weightings. An MLE approach can also be developed (Goldstein, 1995). The effectiveness of these approaches depends on the variation in the group sample sizes, n_1g, and the validity of the assumed form of the heteroscedasticity in (20.10). If the covariance matrices of either the unit-level or the group-level random effects are not constant but related in an unknown way to n_1g then this approach will have problems.

Maximum likelihood estimates can be obtained using IGLS. The key steps in the IGLS procedure involve regression of the elements of (w̄_1g − μ̂)(w̄_1g − μ̂)' against the vector of values (1, n_1g⁻¹)', where μ̂ is the current estimate of the mean of w_t. The resulting coefficient of n_1g⁻¹ estimates Σ^(1) and that of 1 estimates Σ^(2), giving the estimates Σ̂_3^(1) and Σ̂_3^(2). If there is no variation in the group sample sizes, n_1g, then the coefficients cannot be estimated; hence we will consider cases in which they do vary. The asymptotic variances of these estimates are derived in Steel (1999).
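A minimal scalar sketch of this separation step, with assumed component values (it is a moment analogue of the IGLS regression, not the MLn implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
m1 = 20000
sig1, sig2 = 9.0, 1.0                         # assumed scalar components
n_g = rng.integers(10, 200, size=m1)          # varying group sample sizes

# Group means under the model: var(wbar_g) = sig2 + sig1/n_g, cf. (20.10).
wbar = rng.normal(0.0, np.sqrt(sig2 + sig1 / n_g))

# Regress squared deviations of the group means on (1, 1/n_g):
# the intercept estimates sig2, the slope estimates sig1.
y = (wbar - wbar.mean()) ** 2
X = np.column_stack([np.ones(m1), 1.0 / n_g])
(sig2_hat, sig1_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
print(round(sig2_hat, 2), round(sig1_hat, 2))
```

If all n_g were equal, the design matrix would be collinear and the two components could not be separated, which is exactly the identification problem noted above.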

An idea of the implications of these formulae can be obtained by considering the case when n_1g σ^(2)_aa is large compared with σ^(1)_aa, so that σ_aa(n_1g) can be approximated by n_1g σ^(2)_aa.


In this case the approximate variances involve C_1², the square of the coefficient of variation of the inverses of the group sample sizes. This case would often apply when the aggregate data are means based on a complete enumeration of the group, so that n_1g = N_g.

The other extreme is when n_1g σ^(2)_aa is small compared with σ^(1)_aa, so that the variation in σ_aa(n_1g) is much less than that in n_1g and it can be approximated by σ_aa(ñ_1). In this case the approximate variances involve C², the square of the coefficient of variation of the group sample sizes.

Unless the group sample sizes vary extremely we would expect both C² and C_1² to be less than one, and often they will be much less than one. In both situations we would expect the variances to be much larger than if we had been able to use the corresponding individual-level data. Also we should expect the estimate of σ^(1)_aa to be particularly poor.

Case (4). Data available: d_0 and d_1, individual-level data with group indicators and aggregate data. As estimates can be produced using only d_0, the interest here is: what does the aggregate data add? Looking at (20.3) and (20.4) we see that the variances of the estimates of the level 1 variance components based solely on d_0 will be high when n_0 − m_0 is small. The variances of the estimates of the level 2 variance components based solely on d_0 will be high when n_0 − m_0 or m_0 − 1 is small. In these conditions the aggregate data based on s_1 will add useful information to help separate the effects at the different levels if it gives data for additional groups, or if it refers to the same groups but provides group means based on larger sample sizes than s_0.

It has been assumed that the group labels on the unit-level data enable the linking of the units with the relevant group mean in d_1. In some cases the group labels may have been anonymized, so that units within the same group can be identified but cannot be matched with the group-level data. This situation can be handled as described in case (5).

Case (5). Data available: d_0 and d_1, individual-level data without group indicators and aggregate data. In practice there are two situations we have in mind. The first is when the main data available are the group means and we are interested in how a small amount of unit-level data can be used to eliminate the ecological bias. So this means that we will be interested in n_0 smaller than m_1. The other case is when the unit-level sample size is large but the lack of group indicators means that the unit- and group-level effects cannot be separated. A small amount of group-level data can then be used to separate the effects. So this means that m_1 is smaller than n_0. In general we can consider the impact of varying m_0, n_0, m_1, n_1.

Define k_0 = (n_0 − ñ_0)/(n_0 − 1), so that E[S_0] = Σ^(1) + k_0 Σ^(2) by (20.6). Moment estimators of Σ^(1) and Σ^(2) can be based on (20.6) and (20.8), giving

$$\hat{\Sigma}^{(2)} = \frac{S_1^{(2)} - S_0}{\tilde{n}_1 - k_0} \qquad (20.11)$$

$$\hat{\Sigma}^{(1)} = S_0 - k_0\,\hat{\Sigma}^{(2)}. \qquad (20.12)$$
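A scalar sketch of combining the two data sources, under assumed component values and a balanced design (none of the numbers are the chapter's):

```python
import numpy as np

rng = np.random.default_rng(2)
sig1, sig2 = 9.0, 1.0                            # assumed scalar components

# Aggregate source: m1 groups with group means over nt1 units each.
m1, nt1 = 400, 50
nu1 = rng.normal(0.0, np.sqrt(sig2), m1)
w1 = nu1[:, None] + rng.normal(0.0, np.sqrt(sig1), (m1, nt1))
S1_2 = nt1 * np.var(w1.mean(axis=1), ddof=1)     # between-group quantity S_1^(2)

# Unit-level source without group indicators: total variance S_0.
m0, nt0 = 100, 5
n0 = m0 * nt0
nu0 = rng.normal(0.0, np.sqrt(sig2), m0)
w0 = (nu0[:, None] + rng.normal(0.0, np.sqrt(sig1), (m0, nt0))).ravel()
S0 = np.var(w0, ddof=1)                          # E[S0] = sig1 + k0*sig2
k0 = (n0 - nt0) / (n0 - 1)

# Moment estimators combining the two sources, as in the displays above.
sig2_hat = (S1_2 - S0) / (nt1 - k0)
sig1_hat = S0 - k0 * sig2_hat
print(round(sig1_hat, 2), round(sig2_hat, 2))
```

Neither source alone identifies both components here: the unit-level data lack group indicators and the aggregate data confound the two levels, but the two moment equations together do.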


The resulting estimator of Σ is

$$\hat{\Sigma} = \hat{\Sigma}^{(1)} + \hat{\Sigma}^{(2)}. \qquad (20.13)$$

These estimators are linear combinations of S_0 and S_1^(2), and so the variances of their components are determined by var(S_0ab), var(S^(2)_1ab) and Cov(S_0ab, S^(2)_1ab). We already have expressions for var(S_0ab) and var(S^(2)_1ab), given by (20.7) and (20.9) respectively. The covariance will depend on the overlap between the samples s_0 and s_1. If the two samples have no groups in common the covariance will be zero. At the other extreme, when s_0 = s_1, the variances given in case (1) apply.

A practical situation in between these two extremes occurs when we have aggregate data based on a census or large sample and we select s_0 to supplement the aggregate data by selecting a subsample of the groups and then a subsample within each selected group, using the same sampling fraction, f_2, so m_0 = f_1 m_1, ñ_0 = f_2 ñ_1 and n_0 = f n_1, where f = f_1 f_2.

For given sample sizes in s_0 and s_1 the variances of the estimators of the variance components are minimized when the covariance is maximized. For this situation this will occur when f_2 is minimized by selecting as many groups as possible. If the same groups are included in the two samples and s_0 is a subsample of s_1 selected using the same sampling fraction, f_2, in each group, then f_1 = 1, f_2 = f, ñ_0 = f ñ_1 and m_0 = m_1.

This situation arises, for example, when s_0 is the Sample of Anonymized Records released from the UK Population Census and s_1 is the complete population, from which group means are made available for Enumeration Districts (EDs).

For the estimation of the level 1 and marginal parameters the variances are mainly a function of n_0⁻¹. For the estimation of the level 2 parameters both n_0⁻¹ and m_1⁻¹ are factors. For the estimation of the group-level variance component of variable a, the contributions of s_0 and s_1 to the variance are approximately equal when n_0 and m_1 are in a balance that depends on ñ_1 and the variance components (Steel, 1999).


Aggregate and unit-level data without group indicators can also be combined in MLn by treating each unit observation as coming from a group with a sample size of one. This assumes that the unit-level data come from groups different from those for which aggregate data are available. In practice this may not be true, in particular if the aggregate data cover all groups in the population. But the covariances will be small if f is small.

20.4. SIMULATION STUDIES

To investigate the potential value of combining survey and aggregate data a simulation study was carried out. The values of a two-level bivariate population were generated using the model given by (20.1) and (20.2). We used the District of Reigate as a guide to the group sizes; it had a population of 188 800 people aged 16 and over at the 1991 census and 371 EDs. The total population size was reduced by a factor of 10 to make the simulations manageable, and so the values M = 371 and N = 18 880 were used. Group sizes were also a tenth of those of the actual EDs, giving an average of N̄ = 50.9. The variation in group population sizes led to C² = 0.051 and C_1² = 0.240.

The values of the parameters (×1000) used were

$$\Sigma^{(1)} = \begin{pmatrix} 78.6 & 22.8 \\ 22.8 & 90.5 \end{pmatrix}, \qquad \Sigma^{(2)} = \begin{pmatrix} 0.979 & 1.84 \\ 1.84 & 6.09 \end{pmatrix},$$

so that

$$\Sigma = \Sigma^{(1)} + \Sigma^{(2)} = \begin{pmatrix} 79.6 & 24.7 \\ 24.7 & 96.6 \end{pmatrix}.$$

These values were based on values obtained from an analysis of the variables y, limiting long-term illness (LLTI), and x, living in a no-car household (CARO), for Reigate using 1991 census aggregate data and individual-level data obtained from the Sample of Anonymized Records released from the census. The method of combining aggregate and individual-level data described in case (5) in Section 20.3 was used to obtain the variance and covariance components. These parameter values imply the following values (×1000) for the main parameters of the linear regression relationships relating the response variable y to the explanatory variable x: B^(1)_yx = 252, B^(2)_yx = 302 and B_yx = 255.
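A minimal sketch of the population-generation step, assuming equal group sizes of 50 rather than the varying ED-based sizes actually used:

```python
import numpy as np

rng = np.random.default_rng(3)
M, Nbar = 371, 50                 # number of groups; equal group size (assumed)
Sigma1 = np.array([[78.6, 22.8], [22.8, 90.5]]) / 1000.0   # unit-level component
Sigma2 = np.array([[0.979, 1.84], [1.84, 6.09]]) / 1000.0  # group-level component

# One finite population of M*Nbar = 18 550 units under the two-level model.
nu = rng.multivariate_normal(np.zeros(2), Sigma2, size=M)
eta = rng.multivariate_normal(np.zeros(2), Sigma1, size=(M, Nbar))
w = (nu[:, None, :] + eta).reshape(-1, 2)

# The population covariance matrix should be close to Sigma1 + Sigma2.
S = np.cov(w, rowvar=False)
print((1000 * S).round(1))
```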


The process of generating the finite population values was repeated 100 times. The following cases and designs in terms of numbers of units and groups were considered:

Case (1): unit-level data with group indicators for the population, n_0 = 18 880, m_0 = 371.
Case (1): unit-level data with group indicators for a 20 % sample, n_0 = 3776, m_0 = 371.
Case (5): aggregate population data, n_1 = 18 880, m_1 = 371, plus sample unit-level data without group indicators; m_0 = 371 and n_0 = 3776, 378, 38, that is, 20 %, 2 % and 0.2 % samples respectively.
Case (3): aggregate population data only, n_1 = 18 880, m_1 = 371.

Estimates of Σ^(1), Σ^(2), Σ, B^(1)_yx, B^(2)_yx and B_yx were obtained. Results are given in Table 20.1, which shows the mean and standard deviation of the various estimates of the variance components, the total variance and the associated regression coefficients at each level using the IGLS-based procedures for each case. Moment estimators were also calculated and gave similar results. Table 20.2 summarizes the properties of the direct estimates based on S^(1) and S^(2).

The results show that for level 2 parameters using aggregate data for the population plus individual-level sample data (case (5)) is practically as good as using all the population data (case (1)) when the individual data are based on a sample much larger than the number of groups. This is because the number of groups is the key factor. As n_0 decreases the variances of the estimates of the level 2 parameters in case (5) increase, but not in direct proportion to n_0⁻¹, because of the influence of the term involving m_1⁻¹. For the level 2 parameters the aggregate data plus 20 % individual-level data without group indicators is a lot better than 20 % individual-level data with group indicators. This shows the benefit of the group means being based on a larger sample, that is, the complete population.

As expected, level 1 parameters are very badly estimated from purely aggregate data (case (3)), and the estimates of the level 2 parameters are poor. For level 1 and level 2 parameters the use of a small amount of individual-level data greatly improves the performance relative to using aggregate data only. For example, case (5) with n_0 = 38 gives estimates of Σ^(1) and Σ^(2) with a little over half the standard deviation of those obtained using solely aggregate data.

For the level 1 parameters the size of the individual-level sample is the key factor for case (1) and case (5), with the variances varying roughly in proportion to n_0⁻¹. Use of aggregate data for the population plus 20 % individual-level sample data without group indicators is as good as using 20 % individual-level sample data with group indicators. The estimates of overall individual-level parameters perform in the same way as the estimates of the pure level 1 parameters. We found that the moment and IGLS estimators had similar efficiency.


Table 20.1 Performance of IGLS variance component estimates (×1000) over 100 replications.

(a) LLTI; σ^(1)_yy = 78.6, σ^(2)_yy = 0.979, σ_yy = 79.6

                               σ^(1)_yy           σ^(2)_yy           σ_yy
Case  Design                   Mean    SD         Mean   SD          Mean   SD
1     m0 = 371, n0 = 18 880    78.6    0.820      0.955  0.185       79.6   0.833
1     m0 = 371, n0 = 3776      78.6    1.856      0.974  0.650       79.6   1.846
5     n0 = 3776, m1 = 371      78.6    1.851      0.955  0.186       79.6   1.838
5     n0 = 378, m1 = 371       78.4    5.952      0.959  0.236       79.3   5.804
5     n0 = 38, m1 = 371        77.3    16.648     0.980  0.374       78.3   16.324
3     m1 = 371                 75.6    32.799     0.995  0.690       76.6   32.142

(b) LLTI/CARO; σ^(1)_yx = 22.8, σ^(2)_yx = 1.84, σ_yx = 24.7

                               σ^(1)_yx           σ^(2)_yx           σ_yx
Case  Design                   Mean    SD         Mean   SD          Mean   SD
1     m0 = 371, n0 = 18 880    22.8    0.637      1.81   0.279       24.6   0.648
1     m0 = 371, n0 = 3776      22.7    1.536      1.72   0.797       24.5   1.494
5     n0 = 3776, m1 = 371      22.7    1.502      1.81   0.280       24.5   1.480
5     n0 = 378, m1 = 371       23.0    4.910      1.81   0.307       24.8   4.777
5     n0 = 38, m1 = 371        21.9    13.132     1.83   0.381       23.7   12.874
3     m1 = 371                 29.2    43.361     1.55   0.962       30.8   42.578

(c) CARO; σ^(1)_xx = 90.5, σ^(2)_xx = 6.09, σ_xx = 96.6

                               σ^(1)_xx           σ^(2)_xx           σ_xx
Case  Design                   Mean    SD         Mean   SD          Mean   SD
1     m0 = 371, n0 = 18 880    90.6    0.996      6.05   0.642       96.6   1.174
1     m0 = 371, n0 = 3776      90.5    2.120      5.97   1.285       96.6   2.274
5     n0 = 3776, m1 = 371      90.5    2.168      6.06   0.639       96.5   2.271
5     n0 = 378, m1 = 371       90.5    7.713      6.05   0.645       96.6   7.642
5     n0 = 38, m1 = 371        86.7    21.381     6.13   0.737       92.8   21.020
3     m1 = 371                 110.8   99.107     5.47   2.082       116.3  97.329

(d) LLTI vs. CARO; B^(1)_yx = 252, B^(2)_yx = 302, B_yx = 255

                               B^(1)_yx           B^(2)_yx           B_yx
Case  Design                   Mean    SD         Mean   SD          Mean   SD
1     m0 = 371, n0 = 18 880    252     6.64       299    34.96       255    6.15
1     m0 = 371, n0 = 3776      251     16.84      284    124.17      254    14.85
5     n0 = 3776, m1 = 371      251     15.85      300    34.97       254    14.74
5     n0 = 378, m1 = 371       254     49.28      298    38.51       257    44.92
5     n0 = 38, m1 = 371        261     157.32     299    54.30       262    142.15
3     m1 = 371                 300     1662.10    294    200.16      254    607.39


Table 20.2 Properties of (a) direct variance estimates, (b) direct covariance estimates, (c) direct variance estimates and (d) direct regression estimates (×1000) over 100 replications.

(a) LLTI; σ^(1)_yy = 78.6, σ^(2)_yy = 0.979, σ_yy = 79.6

                               S^(1)_yy           S^(2)_yy
Data                           Mean    SD         Mean    SD
m0 = 371, n0 = 18 880          79.6    0.834      128     9.365
m0 = 371, n0 = 3776            79.6    1.845      88.5    6.821
m0 = 371, n0 = 378             79.3    5.780
m0 = 371, n0 = 38              77.8    18.673

(b) LLTI/CARO; σ^(1)_yx = 22.8, σ^(2)_yx = 1.84, σ_yx = 24.7

                               S^(1)_yx           S^(2)_yx
Data                           Mean    SD         Mean    SD
m0 = 371, n0 = 18 880          24.7    0.645      115     14.227
m0 = 371, n0 = 3776            24.5    1.480      41.1    6.626
m0 = 371, n0 = 378             24.8    4.856
m0 = 371, n0 = 38              23.2    14.174

(c) CARO; σ^(1)_xx = 90.5, σ^(2)_xx = 6.09, σ_xx = 96.6

                               S^(1)_xx           S^(2)_xx
Data                           Mean    SD         Mean    SD
m0 = 371, n0 = 18 880          96.6    1.167      399     32.554
m0 = 371, n0 = 3776            96.5    2.264      152     12.511
m0 = 371, n0 = 378             96.6    7.546
m0 = 371, n0 = 38              92.0    22.430

(d) LLTI vs. CARO; B^(1)_yx = 252, B^(2)_yx = 302, B_yx = 255

                               B^(1)_yx           B^(2)_yx
Data                           Mean    SD         Mean    SD
m0 = 371, n0 = 18 880          255     6.155      288     26.616
m0 = 371, n0 = 3776            254     14.766     270     36.597
m0 = 371, n0 = 378             256     45.800
m0 = 371, n0 = 38              258     155.517


20.5. USING AUXILIARY VARIABLES TO REDUCE AGGREGATION EFFECTS

In Section 20.3 it was shown how individual-level data on y and x can be used in combination with aggregate data. In some cases individual-level data may not be available for y and x, but may be available for some auxiliary variables. Steel and Holt (1996a) suggested introducing extra variables to account for the correlations within areas. Suppose that there is a set of auxiliary variables, z, that partially characterize the way in which individuals are clustered within the groups and that, conditional on z, the observations for individuals in area g are influenced by random group-level effects. The auxiliary variables in z may only have a small effect on the individual-level relationships and may not be of any direct interest. They are only included because they may be used in the sampling process or may help account for group effects, and we assume that there is no interest in the influence of these variables in their own right. Hence the analysis should focus on relationships averaging out the effects of the auxiliary variables. However, because of their strong homogeneity within areas they may affect the ecological analysis greatly. The matrices z_U = [z_1, . . . , z_N]' and c_U = [c_1, . . . , c_N]' give the values for all units in the population. Both the explanatory and auxiliary variables may contain group-level variables, although there will be identification problems if the mean of an individual-level explanatory variable is used as a group-level variable. This leads to:

Case (6). Data available: d_1 and {z_t, t ∈ s_0}, aggregate data and individual-level data for the auxiliary variables. This case could arise, for example, when we have individual-level data on basic demographic variables from a survey and we have information in aggregate form for geographic areas on health or income obtained from the health or tax systems. The survey data may be a sample released from the census, such as the UK Sample of Anonymized Records (SAR).

Steel and Holt (1996a) considered a multi-level model with auxiliary variables and examined its implications for the ecological analysis of covariance matrices and correlation coefficients obtained from them. They also developed a method for adjusting the analysis of aggregate data to provide less biased estimates of covariance matrices and correlation coefficients. Steel, Holt and Tranmer (1996) evaluated this method and were able to reduce the biases by about 70 % by using limited amounts of individual-level data for a small set of variables that help characterize the differences between groups. Here we consider the implications of this model for ecological linear regression analysis.

The model given in (20.1) and (20.2) is expanded to include z by assuming the following model conditional on z_U and the groups used:

$$w_t = \mu_{w\cdot z} + B'_{wz} z_t + \nu_g + \eta_t, \qquad t \in \text{group } g, \qquad (20.14)$$

where

$$var(\nu_g \mid z_U, c_U) = \Sigma^{(2)}_z, \qquad var(\eta_t \mid z_U, c_U) = \Sigma^{(1)}_z. \qquad (20.15)$$


This model implies

$$E[w_t \mid z_U, c_U] = \mu_{w\cdot z} + B'_{wz} z_t \qquad (20.16)$$

and

$$var(w_t \mid z_U, c_U) = \Sigma^{(1)}_z + \Sigma^{(2)}_z = \Sigma_z. \qquad (20.17)$$

The random effects in (20.14) are different from those in (20.1) and (20.2) and reflect the within-group correlations after conditioning on the auxiliary variables. The matrix Σ^(2)_z has components Σ^(2)_xx·z, Σ^(2)_xy·z and Σ^(2)_yy·z, and B_wz = (B_xz, B_yz). Assuming var(z_t) = Ω_zz, the marginal covariance matrix is

$$\Sigma = \Sigma_z + B'_{wz}\,\Omega_{zz}\,B_{wz}, \qquad (20.18)$$

which has components Σ_xx, Σ_xy and Σ_yy. We assume that the target of inference is Σ, although this approach can be used to estimate regression coefficients at each level.
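The decomposition (20.18) is easy to verify numerically; the sketch below uses assumed illustrative values of Σ_z, Ω_zz and B_wz (none from the chapter), with w = (y, x) and a single auxiliary variable z:

```python
import numpy as np

# Assumed illustrative values; w = (y, x) and a single auxiliary variable z.
Sigma_z = np.array([[4.0, 1.0], [1.0, 9.0]])   # conditional covariance of w given z
Omega_zz = np.array([[2.0]])                   # var(z_t)
Bwz = np.array([[0.5, 1.5]])                   # row vector of coefficients of z

# Marginal covariance matrix, cf. (20.18).
Sigma = Sigma_z + Bwz.T @ Omega_zz @ Bwz
print(Sigma)
```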

Under this model, Steel and Holt (1996a) showed

$$E[S_1^{(2)} \mid z_U, c_U] = \Sigma + B'_{wz}\,(S^{(2)}_{1zz} - \Omega_{zz})\,B_{wz} + (\tilde{n}_1 - 1)\,\Sigma^{(2)}_z. \qquad (20.19)$$

Providing that the variance of S_1^(2) is O(m_1⁻¹) (see (20.9)), the expectation of the ecological regression coefficients, B^(2)_1yx = S^(2)-1_1xx S^(2)_1xy, can be obtained to O(m_1⁻¹) by replacing S^(2)_1xx and S^(2)_1xy by their expectations.

20.5.1. Adjusted aggregate regression

If individual-level data on the auxiliary variables are available, the aggregation bias due to them may be estimated. Under (20.14), E[B^(2)_1wz | z_U, c_U] = B_wz, where B^(2)_1wz = S^(2)-1_1zz S^(2)_1zw. If an estimate of the individual-level population covariance matrix of z was available, possibly from another source such as s_0, Steel and Holt (1996a) proposed the following adjusted estimator of Σ:

$$\hat{\Sigma}_6 = S_1^{(2)} + B^{(2)'}_{1wz}\,(\hat{\Omega}_{zz} - S^{(2)}_{1zz})\,B^{(2)}_{1wz},$$

where Ω̂_zz is the estimate of Ω_zz calculated from individual-level data. This estimator corresponds to a Pearson-type adjustment, which has been proposed as a means of adjusting for the effect of sampling schemes that depend on a set of design variables (Smith, 1989). The estimator removes the aggregation bias due to the auxiliary variables. It can be used to adjust simultaneously for the effects of aggregation and of sample selection involving design variables by including these variables in z_U. For normally distributed data this estimator is the MLE.
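A small numerical sketch of a Pearson-type adjustment of this kind, under an assumed data-generating process in which the within-group homogeneity of w = (y, x) is driven entirely by a single auxiliary variable z, so that the conditional group-level components are zero and the adjustment removes essentially all of the ecological bias:

```python
import numpy as np

rng = np.random.default_rng(4)
m1, nt1 = 500, 100

# Assumed process: z is strongly clustered by group; y and x depend on z
# plus purely unit-level noise (zero conditional group-level components).
z_g = rng.normal(0.0, 1.0, m1)                        # group-level part of z
z = z_g[:, None] + rng.normal(0.0, 0.3, (m1, nt1))
y = 1.0 * z + rng.normal(0.0, 1.0, z.shape)
x = 0.8 * z + rng.normal(0.0, 1.0, z.shape)

# Aggregate-level matrix for (y, x, z): nt1 times the covariance of group means.
means = np.stack([y.mean(1), x.mean(1), z.mean(1)], axis=1)
S = nt1 * np.cov(means, rowvar=False)
S_ww, S_wz, S_zz = S[:2, :2], S[:2, 2:], S[2:, 2:]

# Individual-level estimate of Omega_zz = var(z_t), e.g. from a sample s0.
Omega_zz = np.atleast_2d(np.cov(z.ravel()))

# Pearson-type adjustment: Sigma6 = S_ww + B'(Omega_zz - S_zz)B, B = S_zz^{-1} S_zw.
B = np.linalg.solve(S_zz, S_wz.T)
Sigma6 = S_ww + B.T @ (Omega_zz - S_zz) @ B

v_z = 1.0 + 0.3 ** 2                                  # true var(z_t)
true_Sigma = np.array([[v_z + 1.0, 0.8 * v_z],
                       [0.8 * v_z, 0.64 * v_z + 1.0]])
print(Sigma6.round(2))
print(true_Sigma.round(2))
```

The unadjusted aggregate matrix S_ww is wildly biased here, while the adjusted estimate lands close to the true unit-level covariance matrix; with non-zero conditional group-level components a bias would remain, as noted below.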

Adjusted regression coefficients can then be calculated from Σ̂_6, that is

$$\hat{B}_{6yx} = \hat{\Sigma}^{-1}_{6xx}\,\hat{\Sigma}_{6xy}.$$

The adjusted estimator replaces the component of bias in (20.19) due to B'_wz(S^(2)_1zz − Ω_zz)B_wz by B'_wz(Ω̂_zz − Ω_zz)B_wz. If Ω̂_zz is an estimate based on an individual-level sample involving m_0 first-stage units then for many sample designs Ω̂_zz − Ω_zz = O(1/m_0), and so B'_wz(Ω̂_zz − Ω_zz)B_wz is O(1/m_0).

The adjusted estimator can be rewritten as

$$\hat{B}_{6yx} = B^{(2)}_{1yx|z} + \hat{B}_{6zx}\,B^{(2)}_{1yz|x} \qquad (20.20)$$

where B̂_6zx = Σ̂⁻¹_6xx B^(2)'_1xz Ω̂_zz. Corresponding decompositions apply at the group and individual levels:

$$B^{(2)}_{1yx} = B^{(2)}_{1yx|z} + B^{(2)}_{1zx}\,B^{(2)}_{1yz|x}$$

$$B^{(1)}_{1yx} = B^{(1)}_{1yx|z} + B^{(1)}_{1zx}\,B^{(1)}_{1yz|x}.$$

The adjustment is trying to correct for the bias in the estimation of B_zx by replacing B^(2)_1zx with B̂_6zx. The bias due to the conditional variance components Σ^(2)_z remains.

Steel, Tranmer and Holt (1999) carried out an empirical investigation into the effects of aggregation on multiple regression analysis using data from the Australian 1991 Population Census for the city of Adelaide. Group-level data were available in the form of totals for 1711 census Collection Districts (CDs), which contain an average of about 250 dwellings. The analysis was confined to people aged 15 or over and there was an average of about 450 such people per CD. To enable an evaluation to be carried out, data from the households sample file (HSF), a 1 % sample of households and the people within them released from the population census, were used.

The evaluation considered the dependent variable of personal income. The following explanatory variables were considered: marital status, sex, degree, employed-manual occupation, employed-managerial or professional occupation, employed-other, unemployed, born Australia, born UK and four age categories.

Multiple regression models were estimated using the HSF data and the CD data, weighted by CD population size. The results are summarized in Table 20.3. The R² of the CD-level equation, 0.880, is much larger than that of the individual-level equation, 0.496. However, the CD-level R² indicates how much of the variation in CD mean income is being explained. The difference between the two estimated models can also be examined by comparing their fit at the individual level. Using the CD-level equation to predict individual-level income gave an R² of 0.310. Generally the regression coefficients estimated at the two levels are of the same sign, the exceptions being married, which is non-significant at the individual level, and the coefficient for age 20–29. The values can be very different at the two levels, with the CD-level coefficients being larger than the corresponding individual-level coefficients in some cases and smaller in others. The differences are often considerable: for example, the

Page 364: Analysis of Survey Data (Wiley Series in Survey Methodology) · PDF fileEmail (for orders and ... List of Contributors xviii Chapter 1 Introduction 1 R. L. Chambers and C. J. Skinner

Table 20.3 Comparison of individual, CD and adjusted CD-level regression equations.

Individual level CD level Adjusted CD level

Variable Coefficient SE Coefficient SE Coefficient SE

Intercept 11 876.0 496.0 4 853.6 833.9 1 573.0 1 021.3Married 8.5 274.0 4 715.5 430.0 7 770.3 564.4Female 6 019.1 245.2 3 067.3 895.8 2 195.0 915.1Degree 8 471.5 488.9 21 700.0 1 284.5 23 501.0 1 268.9Unemp 962.5 522.1 390.7 1 287.9 569.5 1 327.2Manual 9 192.4 460.4 1 457.3 1101.2 2 704.7 1 091.9ManProf 20 679.0 433.4 23 682.0 1 015.5 23 037.0 1 023.7EmpOther 11 738.0 347.8 6 383.2 674.9 7 689.9 741.7Born UK 1 146.3 425.3 2 691.1 507.7 2 274.6 506.0Born Aust 1 873.8 336.8 2 428.3 464.6 2 898.8 491.8Age 15�19 9 860.6 494.7 481.9 1 161.6 57.8 1 140.6Age 20�29 3 529.8 357.6 2 027.0 770.2 1 961.6 758.4Age 45�59 585.6 360.8 434.3 610.2 1 385.1 1 588.8Age 60 255.2 400.1 1 958.0 625.0 2 279.5 1 561.4R 2 0.496 0.880 0.831

coefficient for degree increases from 8471 to 21 700. The average absolutedifference was 4533. The estimates and associated estimated standard errorsobtained at the two levels are different and hence so is the assessment of theirstatistical significance.
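The inflation of R² under aggregation is easy to reproduce by simulation. The sketch below is a minimal NumPy illustration with invented parameter values (not the census data), using equal-sized groups so no population-size weighting is needed; it fits the same linear model to individual observations and to group means:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated clustered population: G groups of size n (assumed values,
# chosen only to mimic a CD-style aggregation effect).
G, n = 200, 50
a = rng.normal(0, 1, G)                      # group-level component of x
x = a[:, None] + rng.normal(0, 1, (G, n))    # individual-level x
u = rng.normal(0, 1, G)                      # group random effect on y
e = rng.normal(0, 3, (G, n))                 # individual-level noise
y = 2.0 * x + u[:, None] + e

def ols_r2(xv, yv):
    """OLS of y on x with intercept; returns (slope, R^2)."""
    X = np.column_stack([np.ones_like(xv), xv])
    beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
    resid = yv - X @ beta
    return beta[1], 1.0 - resid.var() / yv.var()

b_ind, r2_ind = ols_r2(x.ravel(), y.ravel())            # individual level
b_grp, r2_grp = ols_r2(x.mean(axis=1), y.mean(axis=1))  # group-mean level

print(f"individual: slope={b_ind:.2f}, R2={r2_ind:.2f}")
print(f"group mean: slope={b_grp:.2f}, R2={r2_grp:.2f}")
```

Averaging within groups removes most of the individual-level noise, so the group-level R² is much higher even though the underlying slope is unchanged — the same pattern as the 0.496 versus 0.880 contrast in Table 20.3.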

Other variables could be added to the model, but the R² obtained was considered acceptable and this sort of model is indicative of what researchers might use in practice. The R² obtained at the individual level is consistent with those found in other studies of income (e.g. Davies, Joshi and Clarke, 1997). As with all regression models, there are likely to be variables with some explanatory power omitted from the model; however, this reflects the world of practical data analysis. This example shows the aggregation effects when a reasonable, but not necessarily perfect, statistical model is being used. The log transformation was also tried for the income variable but did not result in an appreciably better fit.

Steel, Tranmer and Holt (1999) reported the results of applying the adjusted ecological regression method to the income regression. The auxiliary variables used were: owner occupied, renting from government, housing type, aged 45–59 and aged 60+. These variables were considered because they had relatively high within-CD correlations, so their variances were subject to strong grouping effects, and also because it is reasonable to expect that individual-level data might be available for them. Because the adjustment relies on obtaining a good estimate of the unit-level covariance matrix of the adjustment variables, we need to keep the number of variables small. By choosing variables that characterize much of the difference between CDs we hope to have variables that will perform effectively in a range of situations.

USING AUXILIARY VARIABLES TO REDUCE AGGREGATION EFFECTS 341


These adjustment variables remove between 9% and 75% of the aggregation effect on the variances of the variables in the analysis. For the income variable the reduction was 32%, and the average reduction was 52%.

The estimates of the regression equation obtained from the adjusted estimator $\tilde{\beta}_{yx}$ in (20.20) are given in Table 20.3. In general the adjusted CD regression coefficients are no closer to the individual-level coefficients than those for the original CD-level regression equation. The adjusted R² is still considerably higher than that in the individual-level equation, indicating that the adjustment is not working well. The measure of fit at the individual level gives an R² of 0.284, compared with 0.310 for the unadjusted equation, so according to this measure the adjustment has had a small detrimental effect. The average absolute difference between the CD- and individual-level coefficients has also increased slightly, to 4771.

While the adjustment has eliminated about half the aggregation effects in the variables, it has not resulted in reducing the difference between the CD- and individual-level regression equations. The adjustment procedure will be effective if $B^{(2)}_{1yx|z} \approx B^{(1)}_{1yx|z}$, $B^{(2)}_{1yz|x} \approx B^{(1)}_{1yz|x}$ and $\tilde{\beta}_{zx} \approx B^{(1)}_{1zx}$. Steel, Tranmer and Holt (1999) found that the coefficients in $B^{(1)}_{1yx|z}$ and $B^{(2)}_{1yx|z}$ are generally very different, with an average absolute difference of 4919. Inclusion of the auxiliary variables in the regression has had no appreciable effect on the aggregation effect on the regression coefficients, and the R² is still considerably larger at the CD level than at the individual level.

The adjustment procedure replaces $B^{(2)}_{1zx}B^{(2)}_{1yz|x}$ by $\tilde{\beta}_{zx}B^{(2)}_{1yz|x}$. Analysis of these values showed that the adjusted CD values are considerably closer to the individual-level values than the CD-level values. The adjustment has had some beneficial effect on the estimation of $\beta_{zx}\beta_{yz|x}$, and the bias of the adjusted estimators is mainly due to the difference between the estimates of $\beta_{yx|z}$. The adjustment has altered the component of bias it is designed to reduce, but the remaining biases mean that the overall effect is largely unaffected. It appears that conditioning on the auxiliary variables has not sufficiently reduced the biases due to the random effects.
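In the scalar case, the decomposition behind (20.20) is the familiar omitted-variable identity: the short-regression coefficient of y on x equals the long-regression coefficient given z plus the product of the z-on-x coefficient and the y-on-z coefficient. A minimal numerical check on simulated data (the variable names and coefficient values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: scalar x and a single auxiliary variable z.
N = 100_000
x = rng.normal(size=N)
z = 0.8 * x + rng.normal(size=N)      # z correlated with x
y = 1.0 * x + 2.0 * z + rng.normal(size=N)

def coefs(cols, yv):
    """OLS coefficients of y on the given columns (intercept added)."""
    X = np.column_stack([np.ones(len(yv))] + list(cols))
    b, *_ = np.linalg.lstsq(X, yv, rcond=None)
    return b[1:]

b_yx = coefs([x], y)[0]             # short regression: y on x
b_yxz, b_yzx = coefs([x, z], y)     # long regression:  y on x and z
b_zx = coefs([x], z)[0]             # auxiliary regression: z on x

# Identity: B_yx = B_yx|z + B_zx * B_yz|x (holds exactly for OLS fits)
print(b_yx, b_yxz + b_zx * b_yzx)
```

The identity holds exactly in-sample, which is why the bias of the aggregate-level estimator can be attributed separately to its $B_{1yx|z}$, $B_{1zx}$ and $B_{1yz|x}$ components, as in the discussion above.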

Attempts were made to estimate the remaining variance components from purely aggregate data using MLn, but this proved unsuccessful. Plots of the squares of the residuals against the inverse of the population sizes of the groups showed that there was not always the increasing trend that would be needed to obtain sensible estimates. Given the results in Section 20.3 concerning the use of purely aggregate data, these results are not surprising.
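The moment relation underlying those plots is that the expected squared residual of a group mean is $\sigma^2_u + \sigma^2_e/N_g$, so regressing squared residuals on $1/N_g$ should recover the two variance components as intercept and slope. A simulated sketch (invented component values, not the MLn analysis; a pure random-effects model with no covariates, for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated group means: ybar_g = mu + u_g + ebar_g, where
# var(u_g) = sigma_u^2 and var(ebar_g) = sigma_e^2 / N_g.
G = 2000
N_g = rng.integers(10, 100, size=G)          # group population sizes
sigma_u2, sigma_e2 = 1.0, 25.0               # true variance components
ybar = (5.0 + rng.normal(0, np.sqrt(sigma_u2), G)
            + rng.normal(0, np.sqrt(sigma_e2 / N_g)))

# E[(ybar_g - mu)^2] = sigma_u^2 + sigma_e^2 / N_g, so regress squared
# residuals on 1/N_g: intercept estimates sigma_u^2, slope sigma_e^2.
d2 = (ybar - ybar.mean()) ** 2
X = np.column_stack([np.ones(G), 1.0 / N_g])
(intercept, slope), *_ = np.linalg.lstsq(X, d2, rcond=None)
print(intercept, slope)
```

When the squared residuals show no increasing trend in $1/N_g$, as in the application above, the slope — and hence the unit-level variance estimate — cannot be sensibly recovered.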

The multi-level model that incorporates grouping variables and random effects provides a general framework through which the causes of ecological biases can be explained. Using a limited number of auxiliary variables it was possible to explain about half the aggregation effects in income and a number of explanatory variables. Using individual-level data on these adjustment variables, the aggregation effects due to these variables can be removed. However, the resulting adjusted regression coefficients are no less biased.

This suggests that we should attempt to find further auxiliary variables that account for a very large proportion of the aggregation effects and for which it would be reasonable to expect that the required individual-level data are available. However, in practice there are always going to be some residual group-level effects, and because of the impact of the variance components in (20.19) there is still the potential for large biases.

20.6. CONCLUSIONS

This chapter has shown the potential for survey and aggregate data to be used together to produce better estimates of parameters at different levels. In particular, survey data may be used to remove biases associated with analysis using group-level aggregate data, even if it does not contain indicators for the groups in question. Aggregate data may be used to produce estimates of variance components when the primary data source is a survey that does not contain indicators for the groups. The model and methods described in this chapter are fairly simple. Development of models appropriate to categorical data, and more evaluation with real datasets, would be worthwhile.

Sampling and nonresponse are two mechanisms that lead to data being missing. The process of aggregation also leads to a loss of information and can be thought of as a missing data problem. The approaches in this chapter could be viewed in this light. Further progress may be possible through the use of methods that have been developed to handle incomplete data, such as those discussed by Little in this volume (Chapter 18).

ACKNOWLEDGEMENTS

This work was supported by the Economic and Social Research Council (grant number R000236135) and the Australian Research Council.


References

Aalen, O. and Husebye, E. (1991) Statistical analysis of repeated events forming renewal processes. Statistics in Medicine, 10, 1227–40.
Abowd, J. M. and Card, D. (1989) On the covariance structure of earnings and hours changes. Econometrica, 57, 411–45.
Achen, C. H. and Shively, W. P. (1995) Cross Level Inference. Chicago: The University of Chicago Press.
Agresti, A. (1990) Categorical Data Analysis. New York: Wiley.
Allison, P. D. (1982) Discrete-time methods for the analysis of event-histories. In Sociological Methodology 1982 (S. Leinhardt, ed.), pp. 61–98. San Francisco: Jossey-Bass.
Altonji, J. G. and Segal, L. M. (1996) Small-sample bias in GMM estimation of covariance structures. Journal of Business and Economic Statistics, 14, 353–66.
Amemiya, T. (1984) Tobit models: a survey. Journal of Econometrics, 24, 3–61.
Andersen, P. K., Borgan, O., Gill, R. D. and Keiding, N. (1993) Statistical Models Based on Counting Processes. New York: Springer-Verlag.
Anderson, T. W. (1957) Maximum likelihood estimates for the multivariate normal distribution when some observations are missing. Journal of the American Statistical Association, 52, 200–3.
Andrews, M. and Bradley, S. (1997) Modelling the transition from school and the demand for training in the United Kingdom. Economica, 64, 387–413.
Assakul, K. and Proctor, C. H. (1967) Testing independence in two-way contingency tables with data subject to misclassification. Psychometrika, 32, 67–76.
Baltagi, B. H. (2001) Econometric Analysis of Panel Data. 2nd Edn. Chichester: Wiley.
Basu, D. (1971) An essay on the logical foundations of survey sampling, Part 1. Foundations of Statistical Inference, pp. 203–42. Toronto: Holt, Rinehart and Winston.
Bellhouse, D. R. (2000) Density and quantile estimation in large-scale surveys when a covariate is present. Unpublished report.
Bellhouse, D. R. and Stafford, J. E. (1999) Density estimation from complex surveys. Statistica Sinica, 9, 407–24.
Bellhouse, D. R. and Stafford, J. E. (2001) Local polynomial regression techniques in complex surveys. Survey Methodology, 27, 197–203.
Berman, M. and Turner, T. R. (1992) Approximating point process likelihoods with GLIM. Applied Statistics, 41, 31–8.
Berthoud, R. and Gershuny, J. (eds) (2000) Seven Years in the Lives of British Families. Bristol: The Policy Press.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993) Efficient and Adaptive Estimation for Semiparametric Models. Baltimore, Maryland: Johns Hopkins University Press.
Binder, D. A. (1982) Non-parametric Bayesian models for samples from finite populations. Journal of the Royal Statistical Society, Series B, 44, 388–93.
Binder, D. A. (1983) On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51, 279–92.


Binder, D. A. (1992) Fitting Cox's proportional hazards models from survey data. Biometrika, 79, 139–47.
Binder, D. A. (1996) Linearization methods for single phase and two-phase samples: a cookbook approach. Survey Methodology, 22, 17–22.
Binder, D. A. (1998) Longitudinal surveys: why are these different from all other surveys? Survey Methodology, 24, 101–8.
Birnbaum, A. (1962) On the foundations of statistical inference (with discussion). Journal of the American Statistical Association, 53, 259–326.
Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1975) Discrete Multivariate Analysis: Theory and Practice. Cambridge, Massachusetts: MIT Press.
Bjørnstad, J. F. (1996) On the generalization of the likelihood function and the likelihood principle. Journal of the American Statistical Association, 91, 791–806.
Blau, D. M. and Robins, P. K. (1987) Training programs and wages – a general equilibrium analysis of the effects of program size. Journal of Human Resources, 22, 113–25.
Blossfeld, H. P., Hamerle, A. and Mayer, K. U. (1989) Event History Analysis. Hillsdale, New Jersey: L. Erlbaum Associates.
Boudreau, C. and Lawless, J. F. (2001) Survival analysis based on Cox proportional hazards models and survey data. University of Waterloo, Dept. of Statistics and Actuarial Science, Technical Report.
Box, G. E. P. (1980) Sampling and Bayes' inference in scientific modeling and robustness. Journal of the Royal Statistical Society, Series A, 143, 383–430 (with discussion).
Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26, 211–52.
Boyd, L. H. and Iversen, G. R. (1979) Contextual Analysis: Concepts and Statistical Techniques. Belmont, California: Wadsworth.
Breckling, J. U., Chambers, R. L., Dorfman, A. H., Tam, S. M. and Welsh, A. H. (1994) Maximum likelihood inference from sample survey data. International Statistical Review, 62, 349–63.
Breidt, F. J. and Fuller, W. A. (1993) Regression weighting for multiphase samples. Sankhya, B, 55, 297–309.
Breidt, F. J., McVey, A. and Fuller, W. A. (1996–7) Two-phase estimation by imputation. Journal of the Indian Society of Agricultural Statistics, 49, 79–90.
Breslow, N. E. and Holubkov, R. (1997) Maximum likelihood estimation of logistic regression parameters under two-phase outcome-dependent sampling. Journal of the Royal Statistical Society, Series B, 59, 447–61.
Breunig, R. V. (1999) Nonparametric density estimation for stratified samples. Working Paper, Department of Statistics and Econometrics, The Australian National University.
Breunig, R. V. (2001) Density estimation for clustered data. Econometric Reviews, 20, 353–67.
Brewer, K. R. W. and Mellor, R. W. (1973) The effect of sample structure on analytical surveys. Australian Journal of Statistics, 15, 145–52.
Brick, J. M. and Kalton, G. (1996) Handling missing data in survey research. Statistical Methods in Medical Research, 5, 215–38.
Brier, S. E. (1980) Analysis of contingency tables under cluster sampling. Biometrika, 67, 591–6.
Browne, M. W. (1984) Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 37, 62–83.
Bryk, A. S. and Raudenbush, S. W. (1992) Hierarchical Linear Models: Application and Data Analysis Methods. Newbury Park, California: Sage.
Bull, S. and Pederson, L. L. (1987) Variances for polychotomous logistic regression using complex survey data. Proceedings of the American Statistical Association, Survey Research Methods Section, pp. 507–12.


Buskirk, T. (1999) Using nonparametric methods for density estimation with complex survey data. Unpublished Ph.D. thesis, Arizona State University.
Carroll, R. J., Ruppert, D. and Stefanski, L. A. (1995) Measurement Error in Nonlinear Models. London: Chapman and Hall.
Cassel, C.-M., Sarndal, C.-E. and Wretman, J. H. (1977) Foundations of Inference in Survey Sampling. New York: Wiley.
Chamberlain, G. (1982) Multivariate regression models for panel data. Journal of Econometrics, 18, 5–46.
Chambers, R. L. (1996) Robust case-weighting for multipurpose establishment surveys. Journal of Official Statistics, 12, 3–22.
Chambers, R. L. and Dunstan, R. (1986) Estimating distribution functions from survey data. Biometrika, 73, 597–604.
Chambers, R. L. and Steel, D. G. (2001) Simple methods for ecological inference in 2x2 tables. Journal of the Royal Statistical Society, Series A, 164, 175–92.
Chambers, R. L., Dorfman, A. H. and Wang, S. (1998) Limited information likelihood analysis of survey data. Journal of the Royal Statistical Society, Series B, 60, 397–412.
Chambers, R. L., Dorfman, A. H. and Wehrly, T. E. (1993) Bias robust estimation in finite populations using nonparametric calibration. Journal of the American Statistical Association, 88, 268–77.
Chesher, A. (1997) Diet revealed? Semiparametric estimation of nutrient intake-age relationships (with discussion). Journal of the Royal Statistical Society, Series A, 160, 389–428.
Clayton, D. (1978) A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika, 65, 141–51.
Cleave, N., Brown, P. J. and Payne, C. D. (1995) Evaluation of methods for ecological inference. Journal of the Royal Statistical Society, Series A, 158, 55–72.
Cochran, W. G. (1977) Sampling Techniques. 3rd Edn. New York: Wiley.
Cohen, M. P. (1995) Sample sizes for survey data analyzed with hierarchical linear models. National Center for Educational Statistics, Washington, DC.
Cosslett, S. (1981) Efficient estimation of discrete choice models. In Structural Analysis of Discrete Data with Econometric Applications (C. F. Manski and D. McFadden, eds), pp. 191–205. New York: Wiley.
Cox, D. R. (1972) Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B, 34, 187–220.
Cox, D. R. and Isham, V. (1980) Point Processes. London: Chapman and Hall.
Cox, D. R. and Oakes, D. (1984) Analysis of Survival Data. London: Chapman and Hall.
David, M. H., Little, R. J. A., Samuhel, M. E. and Triest, R. K. (1986) Alternative methods for CPS income imputation. Journal of the American Statistical Association, 81, 29–41.
Davies, H., Joshi, H. and Clarke, L. (1997) Is it cash that the deprived are short of? Journal of the Royal Statistical Society, Series A, 160, 107–26.
Deakin, B. M. and Pratten, C. F. (1987) The economic effects of YTS. Employment Gazette, 95, 491–7.
Decady, Y. J. and Thomas, D. R. (1999) Testing hypotheses on multiple response tables: a Rao-Scott adjusted chi-squared approach. In Managing on the Digital Frontier (A. M. Lavack, ed.). Proceedings of the Administrative Sciences Association of Canada, 20, 13–22.
Decady, Y. J. and Thomas, D. R. (2000) A simple test of association for contingency tables with multiple column responses. Biometrics, 56, 893–896.
Deville, J.-C. and Sarndal, C.-E. (1992) Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376–82.


Diamond, I. D. and McDonald, J. W. (1992) The analysis of current status data. In Demographic Applications of Event History Analysis (J. Trussell, R. Hankinson and J. Tilton, eds), pp. 231–52. Oxford: Clarendon Press.
Diggle, P. and Kenward, M. G. (1994) Informative dropout in longitudinal data analysis (with discussion). Applied Statistics, 43, 49–94.
Diggle, P. J., Heagerty, P. J., Liang, K.-Y. and Zeger, S. L. (2002) Analysis of Longitudinal Data. 2nd Edn. Oxford: Clarendon Press.
Diggle, P. J., Liang, K.-Y. and Zeger, S. L. (1994) Analysis of Longitudinal Data. Oxford: Oxford University Press.
Dolton, P. J. (1993) The economics of youth training in Britain. Economic Journal, 103, 1261–78.
Dolton, P. J., Makepeace, G. H. and Treble, J. G. (1994) The Youth Training Scheme and the school-to-work transition. Oxford Economic Papers, 46, 629–57.
Dumouchel, W. H. and Duncan, G. J. (1983) Using survey weights in multiple regression analysis of stratified samples. Journal of the American Statistical Association, 78, 535–43.
Edwards, A. W. F. (1972) Likelihood. London: Cambridge University Press.
Efron, B. and Tibshirani, R. J. (1993) An Introduction to the Bootstrap. London: Chapman and Hall.
Elliott, M. R. and Little, R. J. (2000) Model-based alternatives to trimming survey weights. Journal of Official Statistics, 16, 191–209.
Ericson, W. A. (1969) Subjective Bayesian models in sampling finite populations (with discussion). Journal of the Royal Statistical Society, Series B, 31, 195–233.
Ericson, W. A. (1988) Bayesian inference in finite populations. Handbook of Statistics, 6, 213–46. Amsterdam: North-Holland.
Eubank, R. L. (1999) Nonparametric Regression and Spline Smoothing. New York: Marcel Dekker.
Ezzati, T. and Khare, M. (1992) Nonresponse adjustments in a National Health Survey. Proceedings of the American Statistical Association, Survey Research Methods Section, pp. 00.
Ezzati-Rice, T. M., Khare, M., Rubin, D. B., Little, R. J. A. and Schafer, J. L. (1993) A comparison of imputation techniques in the third national health and nutrition examination survey. Proceedings of the American Statistical Association, Survey Research Methods Section, pp. 00.
Ezzati-Rice, T., Johnson, W., Khare, M., Little, R., Rubin, D. and Schafer, J. (1995) A simulation study to evaluate the performance of model-based multiple imputations in NCHS health examination surveys. Proceedings of the 1995 Annual Research Conference, US Bureau of the Census, pp. 257–66.
Fahrmeir, L. and Tutz, G. (1994) Multivariate Statistical Modelling Based on Generalized Linear Models. New York: Springer-Verlag.
Fan, J. (1992) Design-adaptive nonparametric regression. Journal of the American Statistical Association, 87, 998–1004.
Fan, J. and Gijbels, I. (1996) Local Polynomial Modelling and its Applications. London: Chapman and Hall.
Fay, R. E. (1979) On adjusting the Pearson chi-square statistic for clustered sampling. Proceedings of the American Statistical Association, Social Statistics Section, pp. 402–6.
Fay, R. E. (1984) Application of linear and log-linear models to data from complex samples. Survey Methodology, 10, 82–96.
Fay, R. E. (1985) A jackknifed chi-square test for complex samples. Journal of the American Statistical Association, 80, 148–57.
Fay, R. E. (1988) CPLX, contingency table analysis for complex survey designs. Unpublished report, U.S. Bureau of the Census.
Fay, R. E. (1996) Alternative paradigms for the analysis of imputed survey data. Journal of the American Statistical Association, 91, 490–8 (with discussion).


Feder, M., Nathan, G. and Pfeffermann, D. (2000) Multilevel modelling of complex survey longitudinal data with time varying random effects. Survey Methodology, 26, 53–65.
Fellegi, I. P. (1980) Approximate test of independence and goodness of fit based on stratified multistage samples. Journal of the American Statistical Association, 75, 261–8.
Fienberg, S. E. and Tanur, J. M. (1986) The design and analysis of longitudinal surveys: controversies and issues of cost and continuity. In Survey Research Designs: Towards a Better Understanding of the Costs and Benefits (R. W. Pearson and R. F. Boruch, eds), Lecture Notes in Statistics 38, 60–93. New York: Springer-Verlag.
Fisher, R. (1994) Logistic regression analysis of CPS overlap survey split panel data. Proceedings of the American Statistical Association, Survey Research Methods Section, pp. 620–5.
Folsom, R., LaVange, L. and Williams, R. L. (1989) A probability sampling perspective on panel data analysis. In Panel Surveys (D. Kasprzyk, G. Duncan, G. Kalton and M. P. Singh, eds), pp. 108–38. New York: Wiley.
Fuller, W. A. (1984) Least-squares and related analyses for complex survey designs. Survey Methodology, 10, 97–118.
Fuller, W. A. (1987) Measurement Error Models. New York: Wiley.
Fuller, W. A. (1990) Analysis of repeated surveys. Survey Methodology, 16, 167–80.
Fuller, W. A. (1998) Replicating variance estimation for two-phase samples. Statistica Sinica, 8, 1153–64.
Fuller, W. A. (1999) Environmental surveys over time. Journal of Agricultural, Biological and Environmental Statistics, 4, 331–45.
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (1995) Bayesian Data Analysis. London: Chapman and Hall.
Ghosh, M. and Meeden, G. (1986) Empirical Bayes estimation of means from stratified samples. Journal of the American Statistical Association, 81, 1058–62.
Ghosh, M. and Meeden, G. (1997) Bayesian Methods for Finite Population Sampling. London: Chapman and Hall.
Godambe, V. P. (ed.) (1991) Estimating Functions. Oxford: Oxford University Press.
Godambe, V. P. and Thompson, M. E. (1986) Parameters of superpopulations and survey population: their relationship and estimation. International Statistical Review, 54, 37–59.
Goldstein, H. (1987) Multilevel Models in Educational and Social Research. London: Griffin.
Goldstein, H. (1995) Multilevel Statistical Models. 2nd Edn. London: Edward Arnold.
Goldstein, H., Healy, M. J. R. and Rasbash, J. (1994) Multilevel time series models with applications to repeated measures data. Statistics in Medicine, 13, 1643–55.
Goodman, L. (1961) Statistical methods for the mover-stayer model. Journal of the American Statistical Association, 56, 841–68.
Gourieroux, C. and Monfort, A. (1996) Simulation-Based Econometric Methods. Oxford: Oxford University Press.
Graubard, B. I., Fears, T. I. and Gail, M. H. (1989) Effects of cluster sampling on epidemiologic analysis in population-based case-control studies. Biometrics, 45, 1053–71.
Greenland, S., Robins, J. M. and Pearl, J. (1999) Confounding and collapsibility in causal inference. Statistical Science, 14, 29–46.
Greenlees, W. S., Reece, J. S. and Zieschang, K. D. (1982) Imputation of missing values when the probability of nonresponse depends on the variable being imputed. Journal of the American Statistical Association, 77, 251–61.
Gritz, M. (1993) The impact of training on the frequency and duration of employment. Journal of Econometrics, 57, 21–51.
Groves, R. M. (1989) Survey Errors and Survey Costs. New York: Wiley.


Guo, G. (1993) Event-history analysis for left-truncated data. Sociological Methodology, 23, 217–43.
Hacking, I. (1965) Logic of Statistical Inference. New York: Cambridge University Press.
Hansen, M. H., Hurwitz, W. N. and Madow, W. G. (1953) Sample Survey Methods and Theory. New York: Wiley.
Hansen, M. H., Madow, W. G. and Tepping, B. J. (1983) An evaluation of model-dependent and probability-sampling inferences in sample surveys. Journal of the American Statistical Association, 78, 776–93.
Hardle, W. (1990) Applied Nonparametric Regression Analysis. Cambridge: Cambridge University Press.
Hartley, H. O. and Rao, J. N. K. (1968) A new estimation theory for sample surveys, II. In New Developments in Survey Sampling (N. L. Johnson and H. Smith, eds), pp. 147–69. New York: Wiley Interscience.
Hartley, H. O. and Sielken, R. L., Jr (1975) A 'super-population viewpoint' for finite population sampling. Biometrics, 31, 411–22.
Heckman, J. (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables, and a simple estimator for such models. Annals of Economic and Social Measurement, 5, 475–92.
Heckman, J. J. and Singer, B. (1984) A method for minimising the impact of distributional assumptions in econometric models for duration data. Econometrica, 52, 271–320.
Heeringa, S. G., Little, R. J. A. and Raghunathan, T. E. (2002) Multivariate imputation of coarsened survey data on household wealth. In Survey Nonresponse (R. M. Groves, D. A. Dillman, J. L. Eltinge and R. J. A. Little, eds), pp. 357–371. New York: Wiley.
Heitjan, D. F. (1994) Ignorability in general complete-data models. Biometrika, 81, 701–8.
Heitjan, D. F. and Rubin, D. B. (1991) Ignorability and coarse data. Annals of Statistics, 19, 2244–53.
Hidiroglou, M. A. (2001) Double sampling. Survey Methodology, 27, 143–54.
Hinkins, S., Oh, F. L. and Scheuren, F. (1997) Inverse sampling design algorithms. Survey Methodology, 23, 11–21.
Hoem, B. and Hoem, J. (1992) The disruption of marital and non-marital unions in contemporary Sweden. In Demographic Applications of Event History Analysis (J. Trussell, R. Hankinson and J. Tilton, eds), pp. 61–93. Oxford: Clarendon Press.
Hoem, J. (1989) The issue of weights in panel surveys of individual behavior. In Panel Surveys (D. Kasprzyk et al., eds), pp. 539–65. New York: Wiley.
Hoem, J. M. (1985) Weighting, misclassification, and other issues in the analysis of survey samples of life histories. In Longitudinal Analysis of Labor Market Data (J. J. Heckman and B. Singer, eds), Ch. 5. Cambridge: Cambridge University Press.
Holland, P. (1986) Statistics and causal inference. Journal of the American Statistical Association, 81, 945–61.
Holt, D. (1989) Aggregation versus disaggregation. In Analysis of Complex Surveys (C. Skinner, D. Holt and T. M. F. Smith, eds), Ch. 10.1. Chichester: Wiley.
Holt, D. and Smith, T. M. F. (1979) Poststratification. Journal of the Royal Statistical Society, Series A, 142, 33–46.
Holt, D., McDonald, J. W. and Skinner, C. J. (1991) The effect of measurement error on event history analysis. In Measurement Errors in Surveys (P. P. Biemer et al., eds), pp. 665–85. New York: Wiley.
Holt, D., Scott, A. J. and Ewings, P. D. (1980) Chi-squared tests with survey data. Journal of the Royal Statistical Society, Series A, 143, 303–20.
Holt, D., Smith, T. M. F. and Winter, P. D. (1980) Regression analysis of data from complex surveys. Journal of the Royal Statistical Society, Series A, 143, 474–87.
Holt, D., Steel, D. and Tranmer, M. (1996) Area homogeneity and the modifiable areal unit problem. Geographical Systems, 3, 181–200.


Horvitz, D. G. and Thompson, D. J. (1952) A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–85.

Hougaard, P. (2000) Analysis of Multivariate Survival Data. New York: Springer-Verlag.

Hsiao, C. (1986) Analysis of Panel Data. Cambridge: Cambridge University Press.

Huster, W. J., Brookmeyer, R. L. and Self, S. G. (1989) Modelling paired survival data with covariates. Biometrics, 45, 145–56.

Jeffreys, H. (1961) Theory of Probability. 3rd Edn. Oxford: Oxford University Press.

Joe, H. (1997) Multivariate Models and Dependence Concepts. London: Chapman and Hall.

Johnson, G. E. and Layard, R. (1986) The natural rate of unemployment: explanation and policy. In Handbook of Labour Economics (O. Ashenfelter and R. Layard, eds). Amsterdam: North-Holland.

Jones, I. (1988) An evaluation of YTS. Oxford Review of Economic Policy, 4, 54–71.

Jones, M. C. (1989) Discretized and interpolated kernel density estimates. Journal of the American Statistical Association, 84, 733–41.

Jones, M. C. (1991) Kernel density estimation for length biased data. Biometrika, 78, 511–19.

Kalbfleisch, J. D. and Lawless, J. F. (1988) Likelihood analysis of multi-state models for disease incidence and mortality. Statistics in Medicine, 7, 149–60.

Kalbfleisch, J. D. and Lawless, J. F. (1989) Some statistical methods for panel life history data. Proceedings of the Statistics Canada Symposium on the Analysis of Data in Time, pp. 185–92. Ottawa: Statistics Canada.

Kalbfleisch, J. D. and Prentice, R. L. (2002) The Statistical Analysis of Failure Time Data. 2nd Edn. New York: Wiley.

Kalbfleisch, J. D. and Sprott, D. A. (1970) Application of likelihood methods to problems involving large numbers of parameters (with discussion). Journal of the Royal Statistical Society, Series B, 32, 175–208.

Kalton, G. and Citro, C. (1993) Panel surveys: adding the fourth dimension. Survey Methodology, 19, 205–15.

Kalton, G. and Kasprzyk, D. (1986) The treatment of missing survey data. Survey Methodology, 12, 1–16.

Kalton, G. and Kish, L. (1984) Some efficient random imputation methods. Communications in Statistics, Theory and Methods, 13, 1919–39.

Kasprzyk, D., Duncan, G. J., Kalton, G. and Singh, M. P. (1989) Panel Surveys. New York: Wiley.

Kass, R. E. and Raftery, A. E. (1995) Bayes factors. Journal of the American Statistical Association, 90, 773–95.

Khare, M., Little, R. J. A., Rubin, D. B. and Schafer, J. L. (1993) Multiple imputation of NHANES III. Proceedings of the American Statistical Association, Survey Research Methods Section, pp. 00.

Kim, J. K. and Fuller, W. A. (1999) Jackknife variance estimation after hot deck imputation. Proceedings of the American Statistical Association, Survey Research Methods Section, pp. 825–30.

King, G. (1997) A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton, New Jersey: Princeton University Press.

Kish, L. and Frankel, M. R. (1974) Inference from complex samples (with discussion). Journal of the Royal Statistical Society, Series B, 36, 1–37.

Klein, J. P. and Moeschberger, M. L. (1997) Survival Analysis. New York: Springer-Verlag.

Koehler, K. J. and Wilson, J. R. (1986) Chi-square tests for comparing vectors of proportions for several cluster samples. Communications in Statistics, Part A – Theory and Methods, 15, 2977–90.

Konijn, H. S. (1962) Regression analysis in sample surveys. Journal of the American Statistical Association, 57, 590–606.


Korn, E. L. and Graubard, B. I. (1990) Simultaneous testing of regression coefficients with complex survey data: use of Bonferroni t-statistics. American Statistician, 44, 270–6.

Korn, E. L. and Graubard, B. I. (1998a) Variance estimation for superpopulation parameters. Statistica Sinica, 8, 1131–51.

Korn, E. L. and Graubard, B. I. (1998b) Scatterplots with survey data. American Statistician, 52, 58–69.

Korn, E. L. and Graubard, B. I. (1999) Analysis of Health Surveys. New York: Wiley.

Korn, E. L., Graubard, B. I. and Midthune, D. (1997) Time-to-event analysis of longitudinal follow-up of a survey: choice of the time-scale. American Journal of Epidemiology, 145, 72–80.

Kott, P. S. (1990) Variance estimation when a first phase area sample is restratified. Survey Methodology, 16, 99–103.

Kott, P. S. (1995) Can the jackknife be used with a two-phase sample? Proceedings of the Survey Research Methods Section, Statistical Society of Canada, pp. 107–10.

Kreft, I. and De Leeuw, J. (1998) Introducing Multilevel Modeling. London: Sage.

Krieger, A. M. and Pfeffermann, D. (1992) Maximum likelihood from complex sample surveys. Survey Methodology, 18, 225–39.

Krieger, A. M. and Pfeffermann, D. (1997) Testing of distribution functions from complex sample surveys. Journal of Official Statistics, 13, 123–42.

Lancaster, T. (1990) The Econometric Analysis of Transition Data. Cambridge: Cambridge University Press.

Lawless, J. F. (1982) Statistical Models and Methods for Lifetime Data. New York: Wiley.

Lawless, J. F. (1987) Regression methods for Poisson process data. Journal of the American Statistical Association, 82, 808–15.

Lawless, J. F. (1995) The analysis of recurrent events for multiple subjects. Applied Statistics, 44, 487–98.

Lawless, J. F. (2002) Statistical Models and Methods for Lifetime Data. 2nd Edn. New York: Wiley.

Lawless, J. F. and Nadeau, C. (1995) Some simple robust methods for the analysis of recurrent events. Technometrics, 37, 158–68.

Lawless, J. F., Kalbfleisch, J. D. and Wild, C. J. (1999) Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society, Series B, 61, 413–38.

Lazarsfeld, P. F. and Menzel, H. (1961) On the relation between individual and collective properties. In Complex Organizations: A Sociological Reader (A. Etzioni, ed.). New York: Holt, Rinehart and Winston.

Lazzeroni, L. C. and Little, R. J. A. (1998) Random-effects models for smoothing poststratification weights. Journal of Official Statistics, 14, 61–78.

Lee, E. W., Wei, L. J. and Amato, D. A. (1992) Cox-type regression analysis for large numbers of small groups of correlated failure time observations. In Survival Analysis: State of the Art (J. P. Klein and P. K. Goel, eds), pp. 237–47. Dordrecht: Kluwer.

Lehtonen, R. and Pahkinen, E. J. (1995) Practical Methods for Design and Analysis of Complex Surveys. Chichester: Wiley.

Lepkowski, J. M. (1989) Treatment of wave nonresponse in panel surveys. In Panel Surveys (D. Kasprzyk et al., eds), pp. 348–74. New York: Wiley.

Lessler, J. T. and Kalsbeek, W. D. (1992) Nonsampling Errors in Surveys. New York: Wiley.

Liang, K.-Y. and Zeger, S. L. (1986) Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.

Lillard, L., Smith, J. P. and Welch, F. (1982) What do we really know about wages? The importance of nonreporting and census imputation. The Rand Corporation, Santa Monica, California.


Lillard, L., Smith, J. P. and Welch, F. (1986) What do we really know about wages? The importance of nonreporting and census imputation. Journal of Political Economy, 94, 489–506.

Lillard, L. A. and Willis, R. (1978) Dynamic aspects of earnings mobility. Econometrica, 46, 985–1012.

Lin, D. Y. (1994) Cox regression analysis of multivariate failure time data: the marginal approach. Statistics in Medicine, 13, 2233–47.

Lin, D. Y. (2000) On fitting Cox's proportional hazards models to survey data. Biometrika, 87, 37–47.

Lindeboom, M. and van den Berg, G. (1994) Heterogeneity in models for bivariate survival: the importance of the mixing distribution. Journal of the Royal Statistical Society, Series B, 56, 49–60.

Lindsey, J. K. (1993) Models for Repeated Measurements. Oxford: Clarendon Press.

Lipsitz, S. R. and Ibrahim, J. (1996) Using the E-M algorithm for survival data with incomplete categorical covariates. Lifetime Data Analysis, 2, 5–14.

Little, R. J. A. (1982) Models for nonresponse in sample surveys. Journal of the American Statistical Association, 77, 237–50.

Little, R. J. A. (1983a) Comment on 'An evaluation of model dependent and probability sampling inferences in sample surveys' by M. H. Hansen, W. G. Madow and B. J. Tepping. Journal of the American Statistical Association, 78, 797–9.

Little, R. J. A. (1983b) Estimating a finite population mean from unequal probability samples. Journal of the American Statistical Association, 78, 596–604.

Little, R. J. A. (1985) A note about models for selectivity bias. Econometrica, 53, 1469–74.

Little, R. J. A. (1988) Missing data in large surveys. Journal of Business and Economic Statistics, 6, 287–301 (with discussion).

Little, R. J. A. (1989) On testing the equality of two independent binomial proportions. The American Statistician, 43, 283–8.

Little, R. J. A. (1991) Inference with survey weights. Journal of Official Statistics, 7, 405–24.

Little, R. J. A. (1992) Incomplete data in event history analysis. In Demographic Applications of Event History Analysis (J. Trussell, R. Hankinson and J. Tilton, eds), Ch. 8. Oxford: Clarendon Press.

Little, R. J. A. (1993a) Poststratification: a modeler's perspective. Journal of the American Statistical Association, 88, 1001–12.

Little, R. J. A. (1993b) Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association, 88, 125–34.

Little, R. J. A. (1994) A class of pattern-mixture models for normal missing data. Biometrika, 81, 471–83.

Little, R. J. A. (1995) Modeling the drop-out mechanism in repeated-measures studies. Journal of the American Statistical Association, 90, 1112–21.

Little, R. J. A. and Rubin, D. B. (1983) Discussion of 'Six approaches to enumerative sampling' by K. R. W. Brewer and C. E. Sarndal. In Incomplete Data in Sample Surveys, Vol. 3: Proceedings of the Symposium (W. G. Madow and I. Olkin, eds). New York: Academic Press.

Little, R. J. A. and Rubin, D. B. (1987) Statistical Analysis with Missing Data. New York: Wiley.

Little, R. J. A. and Rubin, D. B. (2002) Statistical Analysis with Missing Data. 2nd Edn. New York: Wiley.

Little, R. J. A. and Wang, Y.-X. (1996) Pattern-mixture models for multivariate incomplete data with covariates. Biometrics, 52, 98–111.

Lohr, S. L. (1999) Sampling: Design and Analysis. Pacific Grove, California: Duxbury.

Longford, N. (1987) A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects. Biometrika, 74, 817–27.

Longford, N. (1993) Random Coefficient Models. Oxford: Oxford University Press.


Loughin, T. M. and Scherer, P. N. (1998) Testing association in contingency tables with multiple column responses. Biometrics, 54, 630–7.

Main, B. G. M. and Shelly, M. A. (1988) The effectiveness of YTS as a manpower policy. Economica, 57, 495–514.

McCullagh, P. and Nelder, J. A. (1983) Generalized Linear Models. London: Chapman and Hall.

McCullagh, P. and Nelder, J. A. (1989) Generalized Linear Models. 2nd Edn. London: Chapman and Hall.

Mealli, F. and Pudney, S. E. (1996) Occupational pensions and job mobility in Britain: estimation of a random-effects competing risks model. Journal of Applied Econometrics, 11, 293–320.

Mealli, F. and Pudney, S. E. (1999) Specification tests for random-effects transition models: an application to a model of the British Youth Training Scheme. Lifetime Data Analysis, 5, 213–37.

Mealli, F., Pudney, S. E. and Thomas, J. M. (1996) Training duration and post-training outcomes: a duration limited competing risks model. Economic Journal, 106, 422–33.

Meyer, B. (1990) Unemployment insurance and unemployment spells. Econometrica, 58, 757–82.

Molina, E. A., Smith, T. M. F. and Sugden, R. A. (2001) Modelling overdispersion for complex survey data. International Statistical Review, 69, 373–84.

Morel, J. G. (1989) Logistic regression under complex survey designs. Survey Methodology, 15, 203–23.

Mote, V. L. and Anderson, R. L. (1965) An investigation of the effect of misclassification on the properties of chi-square tests in the analysis of categorical data. Biometrika, 52, 95–109.

Narendranathan, W. and Stewart, M. B. (1993) Modelling the probability of leaving unemployment: competing risks models with flexible baseline hazards. Journal of the Royal Statistical Society, Series C, 42, 63–83.

Nascimento Silva, P. L. D. and Skinner, C. J. (1995) Estimating distribution functions with auxiliary information using poststratification. Journal of Official Statistics, 11, 277–94.

Neuhaus, J. M., Kalbfleisch, J. D. and Hauck, W. W. (1991) A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. International Statistical Review, 59, 25–36.

Neyman, J. (1938) Contribution to the theory of sampling human populations. Journal of the American Statistical Association, 33, 101–16.

Ng, E. and Cook, R. J. (1999) Robust inference for bivariate point processes. Canadian Journal of Statistics, 27, 509–24.

Nguyen, H. H. and Alexander, C. (1989) On χ² tests for contingency tables from complex sample surveys with fixed cell and marginal design effects. Proceedings of the American Statistical Association, Survey Research Methods Section, pp. 753–6.

Nusser, S. M. and Goebel, J. J. (1997) The National Resource Inventory: a long-term multi-resource monitoring programme. Environmental and Ecological Statistics, 4, 181–204.

Obuchowski, N. A. (1998) On the comparison of correlated proportions for clustered data. Statistics in Medicine, 17, 1495–1507.

Oh, H. L. and Scheuren, F. S. (1983) Weighting adjustments for unit nonresponse. In Incomplete Data in Sample Surveys, Vol. II: Theory and Annotated Bibliography (W. G. Madow, I. Olkin and D. B. Rubin, eds). New York: Academic Press.

Ontario Ministry of Health (1992) Ontario Health Survey: User's Guide, Volumes I and II. Queen's Printer for Ontario.

Østbye, T., Pomerleau, P., Speechley, M., Pederson, L. L. and Speechley, K. N. (1995) Correlates of body mass index in the 1990 Ontario Health Survey. Canadian Medical Association Journal, 152, 1811–17.


Pfeffermann, D. (1993) The role of sampling weights when modeling survey data. International Statistical Review, 61, 317–37.

Pfeffermann, D. (1996) The use of sampling weights for survey data analysis. Statistical Methods in Medical Research, 5, 239–61.

Pfeffermann, D. and Holmes, D. J. (1985) Robustness considerations in the choice of method of inference for regression analysis of survey data. Journal of the Royal Statistical Society, Series A, 148, 268–78.

Pfeffermann, D. and Sverchkov, M. (1999) Parametric and semi-parametric estimation of regression models fitted to survey data. Sankhya, B, 61, 166–86.

Pfeffermann, D., Krieger, A. M. and Rinott, Y. (1998) Parametric distributions of complex survey data under informative probability sampling. Statistica Sinica, 8, 1087–1114.

Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H. and Rasbash, J. (1998) Weighting for unequal selection probabilities in multilevel models. Journal of the Royal Statistical Society, Series B, 60, 23–40.

Platek, R. and Gray, G. B. (1983) Imputation methodology: total survey error. In Incomplete Data in Sample Surveys (W. G. Madow, I. Olkin and D. B. Rubin, eds), Vol. 2, pp. 249–333. New York: Academic Press.

Potter, F. (1990) A study of procedures to identify and trim extreme sample weights. Proceedings of the American Statistical Association, Survey Research Methods Section, pp. 225–30.

Prentice, R. L. and Pyke, R. (1979) Logistic disease incidence models and case-control studies. Biometrika, 66, 403–11.

Pudney, S. E. (1981) An empirical method of approximating the separable structure of consumer preferences. Review of Economic Studies, 48, 561–77.

Pudney, S. E. (1989) Modelling Individual Choice: The Econometrics of Corners, Kinks and Holes. Oxford: Basil Blackwell.

Ramos, X. (1999) The covariance structure of earnings in Great Britain. British Household Panel Survey Working Paper 99-5, University of Essex.

Rao, J. N. K. (1973) On double sampling for stratification and analytic surveys. Biometrika, 60, 125–33.

Rao, J. N. K. (1996) On variance estimation with imputed survey data. Journal of the American Statistical Association, 91, 499–506 (with discussion).

Rao, J. N. K. (1999) Some current trends in sample survey theory and methods. Sankhya, B, 61, 1–57.

Rao, J. N. K. and Scott, A. J. (1981) The analysis of categorical data from complex sample surveys: chi-squared tests for goodness of fit and independence in two-way tables. Journal of the American Statistical Association, 76, 221–30.

Rao, J. N. K. and Scott, A. J. (1984) On chi-squared tests for multi-way tables with cell proportions estimated from survey data. Annals of Statistics, 12, 46–60.

Rao, J. N. K. and Scott, A. J. (1992) A simple method for the analysis of clustered data. Biometrics, 48, 577–85.

Rao, J. N. K. and Scott, A. J. (1999a) A simple method for analyzing overdispersion in clustered Poisson data. Statistics in Medicine, 18, 1373–85.

Rao, J. N. K. and Scott, A. J. (1999b) Analyzing data from complex sample surveys using repeated subsampling. Unpublished Technical Report.

Rao, J. N. K. and Shao, J. (1992) Jackknife variance estimation with survey data under hot deck imputation. Biometrika, 79, 811–22.

Rao, J. N. K. and Sitter, R. R. (1995) Variance estimation under two-phase sampling with application to imputation for missing data. Biometrika, 82, 452–60.

Rao, J. N. K. and Thomas, D. R. (1988) The analysis of cross-classified categorical data from complex sample surveys. Sociological Methodology, 18, 213–69.

Rao, J. N. K. and Thomas, D. R. (1989) Chi-squared tests for contingency tables. In Analysis of Complex Surveys (C. J. Skinner, D. Holt and T. M. F. Smith, eds), pp. 89–114. Chichester: Wiley.


Rao, J. N. K. and Thomas, D. R. (1991) Chi-squared tests with complex survey data subject to misclassification error. In Measurement Errors in Surveys (P. P. Biemer et al., eds), pp. 637–63. New York: Wiley.

Rao, J. N. K., Hartley, H. O. and Cochran, W. G. (1962) On a simple procedure of unequal probability sampling without replacement. Journal of the Royal Statistical Society, Series B, 24, 482–91.

Rao, J. N. K., Kumar, S. and Roberts, G. (1989) Analysis of sample survey data involving categorical response variables: methods and software. Survey Methodology, 15, 161–85.

Rao, J. N. K., Scott, A. J. and Skinner, C. J. (1998) Quasi-score tests with survey data. Statistica Sinica, 8, 1059–70.

Rice, J. A. (1995) Mathematical Statistics and Data Analysis. 2nd Edn. Belmont, California: Duxbury.

Ridder, G. (1987) The sensitivity of duration models to misspecified unobserved heterogeneity and duration dependence. Mimeo, University of Amsterdam.

Riley, M. W. (1964) Sources and types of sociological data. In Handbook of Modern Sociology (R. Farris, ed.). Chicago: Rand McNally.

Roberts, G., Rao, J. N. K. and Kumar, S. (1987) Logistic regression analysis of sample survey data. Biometrika, 74, 1–12.

Robins, J. M., Greenland, S. and Hu, F. C. (1999) Estimation of the causal effect of a time-varying exposure on the marginal mean of a repeated binary outcome. Journal of the American Statistical Association, 94, 687–700.

Ross, S. M. (1983) Stochastic Processes. New York: Wiley.

Rotnitzky, A. and Robins, J. M. (1997) Analysis of semi-parametric regression models with nonignorable nonresponse. Statistics in Medicine, 16, 81–102.

Royall, R. M. (1976) Likelihood functions in finite population sampling theory. Biometrika, 63, 606–14.

Royall, R. M. (1986) Model robust confidence intervals using maximum likelihood estimators. International Statistical Review, 54, 221–6.

Royall, R. M. (1997) Statistical Evidence: A Likelihood Paradigm. London: Chapman and Hall.

Royall, R. M. (2000) On the probability of observing misleading statistical evidence. Journal of the American Statistical Association, 95, 760–8.

Royall, R. M. and Cumberland, W. G. (1981) An empirical study of the ratio estimator and estimators of its variance (with discussion). Journal of the American Statistical Association, 76, 66–88.

Rubin, D. B. (1976) Inference and missing data. Biometrika, 63, 581–92.

Rubin, D. B. (1977) Formalizing subjective notions about the effect of nonrespondents in sample surveys. Journal of the American Statistical Association, 72, 538–43.

Rubin, D. B. (1983) Comment on 'An evaluation of model dependent and probability sampling inferences in sample surveys' by M. H. Hansen, W. G. Madow and B. J. Tepping. Journal of the American Statistical Association, 78, 803–5.

Rubin, D. B. (1984) Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12, 1151–72.

Rubin, D. B. (1985) The use of propensity scores in applied Bayesian inference. In Bayesian Statistics 2 (J. M. Bernardo et al., eds). Amsterdam: Elsevier Science.

Rubin, D. B. (1987) Multiple Imputation for Nonresponse in Surveys. New York: Wiley.

Rubin, D. B. (1996) Multiple imputation after 18 years. Journal of the American Statistical Association, 91, 473–89 (with discussion).

Rubin, D. B. and Schenker, N. (1986) Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association, 81, 366–74.

Sarndal, C.-E. and Swensson, B. (1987) A general view of estimation for two phases of selection with application to two-phase sampling and nonresponse. International Statistical Review, 55, 279–94.


Sarndal, C.-E., Swensson, B. and Wretman, J. (1992) Model Assisted Survey Sampling. New York: Springer-Verlag.

SAS (1992) The mixed procedure. In SAS/STAT Software: Changes and Enhancements, Release 6.07. Technical Report P-229, SAS Institute, Inc., Cary, North Carolina.

SAS (2001) SAS/STAT Release 8.2. SAS Institute, Inc., Cary, North Carolina.

Schafer, J. L. (1996) Analysis of Incomplete Multivariate Data. London: Chapman and Hall.

Scharfstein, D. O., Robins, J. M. and Rotnitsky, A. (1999) Adjusting for nonignorable drop-out using semiparametric models (with discussion). Journal of the American Statistical Association, 94, 1096–1146.

Scott, A. J. (1977a) Some comments on the problem of randomisation in survey sampling. Sankhya, C, 39, 1–9.

Scott, A. J. (1977b) Large-sample posterior distributions for finite populations. Annals of Mathematical Statistics, 42, 1113–17.

Scott, A. J. and Smith, T. M. F. (1969) Estimation in multistage samples. Journal of the American Statistical Association, 64, 830–40.

Scott, A. J. and Wild, C. J. (1986) Fitting logistic models under case-control or choice based sampling. Journal of the Royal Statistical Society, Series B, 48, 170–82.

Scott, A. J. and Wild, C. J. (1989) Selection based on the response in logistic regression. In Analysis of Complex Surveys (C. J. Skinner, D. Holt and T. M. F. Smith, eds), pp. 191–205. New York: Wiley.

Scott, A. J. and Wild, C. J. (1997) Fitting regression models to case-control data by maximum likelihood. Biometrika, 84, 57–71.

Scott, A. J. and Wild, C. J. (2001) Maximum likelihood for generalised case-control studies. Journal of Statistical Planning and Inference, 96, 3–27.

Scott, A. J., Rao, J. N. K. and Thomas, D. R. (1990) Weighted least squares and quasilikelihood estimation for categorical data under singular models. Linear Algebra and its Applications, 127, 427–47.

Scott, D. W. (1992) Multivariate Density Estimation: Theory, Practice and Visualization. New York: Wiley.

Sen, P. K. (1988) Asymptotics in finite populations. In Handbook of Statistics, Vol. 6, Sampling (P. R. Krishnaiah and C. R. Rao, eds). Amsterdam: North-Holland.

Servy, E., Hachuel, L. and Wojdyla, D. (1998) Analisis de tablas de contingencia para muestras de diseno complejo. Facultad de Ciencias Economicas y Estadistica de la Universidad Nacional de Rosario, Argentina.

Shin, H. (1994) An approximate Rao–Scott modification factor in two-way tables with only known marginal deffs. Proceedings of the American Statistical Association, Survey Research Methods Section, pp. 600–1.

Silverman, B. W. (1986) Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.

Simonoff, J. S. (1996) Smoothing Methods in Statistics. New York: Springer-Verlag.

Singh, A. C. (1985) On optimal asymptotic tests for analysis of categorical data from sample surveys. Methodology Branch Working Paper SSMD 86-002, Statistics Canada.

Singh, A. C., Stukel, D. M. and Pfeffermann, D. (1998) Bayesian versus frequentist measures of error in small area estimation. Journal of the Royal Statistical Society, Series B, 60, 377–96.

Sitter, R. R. (1992) Bootstrap methods for survey data. Canadian Journal of Statistics, 20, 135–54.

Skinner, C. (1989) Introduction to Part A. In Analysis of Complex Surveys (C. Skinner, D. Holt and T. M. F. Smith, eds), Ch. 2. Chichester: Wiley.

Skinner, C. J. (1994) Sample models and weights. Proceedings of the American Statistical Association, Survey Research Methods Section, pp. 133–42.

Skinner, C. J. (1998) Logistic modelling of longitudinal survey data with measurement error. Statistica Sinica, 8, 1045–58.


Skinner, C. J. (2000) Dealing with measurement error in panel analysis. In Researching Social and Economic Change: the Use of Household Panel Studies (D. Rose, ed.), pp. 113–25. London: Routledge.

Skinner, C. J. and Humphreys, K. (1999) Weibull regression for lifetimes measured with error. Lifetime Data Analysis, 5, 23–37.

Skinner, C. J., Holt, D. and Smith, T. M. F. (eds) (1989) Analysis of Complex Surveys. Chichester: Wiley.

Smith, T. M. F. (1976) The foundations of survey sampling: a review (with discussion). Journal of the Royal Statistical Society, Series A, 139, 183–204.

Smith, T. M. F. (1984) Sample surveys. Journal of the Royal Statistical Society, Series A, 147, 208–21.

Smith, T. M. F. (1988) To weight or not to weight, that is the question. In Bayesian Statistics 3 (J. M. Bernardo, M. H. DeGroot and D. V. Lindley, eds), pp. 437–51. Oxford: Oxford University Press.

Smith, T. M. F. (1989) Introduction to Part B: aggregated analysis. In Analysis of Complex Surveys (C. Skinner, D. Holt and T. M. F. Smith, eds). Chichester: Wiley.

Smith, T. M. F. (1994) Sample surveys 1975–1990; an age of reconciliation? International Statistical Review, 62, 5–34 (with discussion).

Smith, T. M. F. and Holt, D. (1989) Some inferential problems in the analysis of surveys over time. Proceedings of the 47th Session of the ISI, Vol. 1, pp. 405–23.

Solon, G. (1989) The value of panel data in economic research. In Panel Surveys (D. Kasprzyk et al., eds), pp. 486–96. New York: Wiley.

Spiekerman, C. F. and Lin, D. Y. (1998) Marginal regression models for multivariate failure time data. Journal of the American Statistical Association, 93, 1164–75.

Stata Corporation (2001) Stata Reference Manual Set. College Station, Texas: Stata Press.

Statistical Sciences (1990) S-PLUS Reference Manual. Seattle: Statistical Sciences.

Steel, D. (1999) Variances of estimates of covariance components from aggregate and unit level data. School of Mathematics and Applied Statistics, University of Wollongong, Preprint.

Steel, D. and Holt, D. (1996a) Analysing and adjusting aggregation effects: the ecological fallacy revisited. International Statistical Review, 64, 39–60.

Steel, D. and Holt, D. (1996b) Rules for random aggregation. Environment and Planning, A, 28, 957–78.

Steel, D., Holt, D. and Tranmer, M. (1996) Making unit-level inference from aggregate data. Survey Methodology, 22, 3–15.

Steel, D., Tranmer, M. and Holt, D. (1999) Unravelling ecological analysis. School of Mathematics and Applied Statistics, University of Wollongong, Preprint.

Steel, D. G. (1985) Statistical analysis of populations with group structure. Ph.D. thesis, Department of Social Statistics, University of Southampton.

Stolzenberg, R. M. and Relles, D. A. (1990) Theory testing in a world of constrained research design: the significance of Heckman's censored sampling bias correction for nonexperimental research. Sociological Methods and Research, 18, 395–415.

Sugden, R. A. and Smith, T. M. F. (1984) Ignorable and informative designs in survey sampling inference. Biometrika, 71, 495–506.

Sverchkov, M. and Pfeffermann, D. (2000) Prediction of finite population totals under informative sampling utilizing the sample distribution. Proceedings of the American Statistical Association, Survey Research Methods Section, pp. 41–6.

Tanner, M. A. (1996) Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. 3rd Edn. New York: Springer-Verlag.

Tanner, M. A. and Wong, W. H. (1987) The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528–50.

Therneau, T. M. and Grambsch, P. M. (2000) Modeling Survival Data: Extending the Cox Model. New York: Springer-Verlag.


Therneau, T. M. and Hamilton, S. A. (1997) rhDNase as an example of recurrent event analysis. Statistics in Medicine, 16, 2029–47.

Thomas, D. R. (1989) Simultaneous confidence intervals for proportions under cluster sampling. Survey Methodology, 15, 187–201.

Thomas, D. R. and Rao, J. N. K. (1987) Small-sample comparisons of level and power for simple goodness-of-fit statistics under cluster sampling. Journal of the American Statistical Association, 82, 630–6.

Thomas, D. R., Singh, A. C. and Roberts, G. R. (1995) Tests of independence on two-way tables under cluster sampling: an evaluation. Working Paper WPS 95-04, School of Business, Carleton University, Ottawa.

Thomas, D. R., Singh, A. C. and Roberts, G. R. (1996) Tests of independence on two-way tables under cluster sampling: an evaluation. International Statistical Review, 64, 295–311.

Thompson, M. E. (1997) Theory of Sample Surveys. London: Chapman and Hall.

Trivellato, U. and Torelli, N. (1989) Analysis of labour force dynamics from rotating panel survey data. Proceedings of the 47th Session of the ISI, Vol. 1, pp. 425–44.

Trussell, J., Rodriguez, G. and Vaughan, B. (1992) Union dissolution in Sweden. In Demographic Applications of Event History Analysis (J. Trussell, R. Hankinson and J. Tilton, eds), pp. 38–60. Oxford: Clarendon Press.

Tukey, J. W. (1977) Exploratory Data Analysis. Reading, Massachusetts: Addison-Wesley.

US Department of Health and Human Services (1990) The Health Benefits of Smoking Cessation. Public Health Service, Centers for Disease Control, Center for Chronic Disease Prevention and Health Promotion, Office on Smoking and Health. DHHS Publication No. (CDC) 90-8416.

Valliant, R., Dorfman, A. H. and Royall, R. M. (2000) Finite Population Sampling and Inference: a Prediction Approach. New York: Wiley.

van den Berg, G. J. (1997) Association measures for durations in bivariate hazard models. Journal of Econometrics (Annals of Econometrics), 79, 221–45.

Wand, M. P. and Jones, M. C. (1995) Kernel Smoothing. London: Chapman and Hall.

Weeks, M. (1995) Circumventing the curse of dimensionality in applied work using computer intensive methods. Economic Journal, 105, 520–30.

Weisberg, S. (1985) Applied Linear Regression. 2nd Edn. New York: Wiley.

White, H. S. (1994) Estimation, Inference and Specification Analysis. Cambridge: Cambridge University Press.

Williams, R. L. (1995) Product-limit survival functions with correlated survival times. Lifetime Data Analysis, 1, 17–86.

Wolter, K. M. (1985) Introduction to Variance Estimation. New York: Springer-Verlag.

Woodruff, R. S. (1952) Confidence intervals for medians and other position measures. Journal of the American Statistical Association, 47, 635–46.

Xue, X. and Brookmeyer, R. (1996) Bivariate frailty model for the analysis of multivariate failure time. Lifetime Data Analysis, 2, 277–89.

Yamaguchi, K. (1991) Event History Analysis. Newbury Park, California: Sage.

Zeger, S. L. and Liang, K. (1986) Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42, 121–30.

REFERENCES 359


T. M. F. Smith: Publications up to 2002

JOURNAL PAPERS

1. R. P. Brooker and T. M. F. Smith. (1965), Business failures – the English insolvency statistics, Abacus, 1, 131–49.

2. T. M. F. Smith. (1966), Ratios of ratios and their applications, Journal of the Royal Statistical Society, Series A, 129, 531–3.

3. T. M. F. Smith. (1966), Some comments on the index of retail prices, Applied Statistics, 5, 128–35.

4. R. P. Brooker and T. M. F. Smith. (1967), Share transfer audits and the use of statistics, Secretaries Chronicle, 43, 144–7.

5. T. M. F. Smith. (1967), A comparison of some models for predicting time series subject to seasonal variation, The Statistician, 17, 301–5.

6. F. G. Foster and T. M. F. Smith. (1969), The computer as an aid in teaching statistics, Applied Statistics, 18, 264–70.

7. A. J. Scott and T. M. F. Smith. (1969), A note on estimating secondary characteristics in multivariate surveys, Sankhya, Series A, 31, 497–8.

8. A. J. Scott and T. M. F. Smith. (1969), Estimation in multi-stage surveys, Journal of the American Statistical Association, 64, 830–40.

9. T. M. F. Smith. (1969), A note on ratio estimates in multi-stage sampling, Journal of the Royal Statistical Society, Series A, 132, 426–30.

10. A. J. Scott and T. M. F. Smith. (1970), A note on Moran's approximation to Student's t, Biometrika, 57, 681–2.

11. A. J. Scott and T. M. F. Smith. (1971), Domains of study in stratified sampling, Journal of the American Statistical Association, 66, 834–6.

12. A. J. Scott and T. M. F. Smith. (1971), Interval estimates for linear combinations of means, Applied Statistics, 20, 276–85.

13. J. A. John and T. M. F. Smith. (1972), Two factor experiments in non-orthogonal designs, Journal of the Royal Statistical Society, Series B, 34, 401–9.

14. S. C. Cotter, J. A. John and T. M. F. Smith. (1973), Multi-factor experiments in non-orthogonal designs, Journal of the Royal Statistical Society, Series B, 35, 361–7.

15. A. J. Scott and T. M. F. Smith. (1973), Survey design, symmetry and posterior distributions, Journal of the Royal Statistical Society, Series B, 35, 57–60.

16. J. A. John and T. M. F. Smith. (1974), Sum of squares in non-full rank general linear hypotheses, Journal of the Royal Statistical Society, Series B, 36, 107–8.

17. A. J. Scott and T. M. F. Smith. (1974), Analysis of repeated surveys using time series models, Journal of the American Statistical Association, 69, 674–8.

18. A. J. Scott and T. M. F. Smith. (1974), Linear superpopulation models in survey sampling, Sankhya, Series C, 36, 143–6.

19. A. J. Scott and T. M. F. Smith. (1975), Minimax designs for sample surveys, Biometrika, 62, 353–8.


20. A. J. Scott and T. M. F. Smith. (1975), The efficient use of supplementary information in standard sampling procedures, Journal of the Royal Statistical Society, Series B, 37, 146–8.

21. D. Holt and T. M. F. Smith. (1976), The design of survey for planning purposes, The Australian Journal of Statistics, 18, 37–44.

22. T. M. F. Smith. (1976), The foundations of survey sampling: a review, Journal of the Royal Statistical Society, Series A, 139, 183–204 (with discussion) [Read before the Society, January 1976].

23. A. J. Scott, T. M. F. Smith and R. Jones. (1977), The application of time series methods to the analysis of repeated surveys, International Statistical Review, 45, 13–28.

24. T. M. F. Smith. (1978), Some statistical problems in accountancy, Bulletin of the Institute of Mathematics and its Applications, 14, 215–19.

25. T. M. F. Smith. (1978), Statistics: the art of conjecture, The Statistician, 27, 65–86.

26. D. Holt and T. M. F. Smith. (1979), Poststratification, Journal of the Royal Statistical Society, Series A, 142, 33–46.

27. D. Holt, T. M. F. Smith and T. J. Tomberlin. (1979), A model-based approach to estimation for small subgroups of a population, Journal of the American Statistical Association, 74, 405–10.

28. T. M. F. Smith. (1979), Statistical sampling in auditing: a statistician's viewpoint, The Statistician, 28, 267–80.

29. D. Holt, T. M. F. Smith and P. D. Winter. (1980), Regression analysis of data from complex surveys, Journal of the Royal Statistical Society, Series A, 143, 474–87.

30. B. Gomes da Costa, T. M. F. Smith and D. Whitley. (1981), German language proficiency levels attained by language majors: a comparison of U. S. A. and England and Wales results, The Incorporated Linguist, 20, 65–7.

31. I. Diamond and T. M. F. Smith. (1982), Whither mathematics? Comments on the report by Professor D. S. Jones. Bulletin of the Institute of Mathematics and its Applications, 18, 189–92.

32. G. Hoinville and T. M. F. Smith. (1982), The Rayner Review of Government Statistical Service, Journal of the Royal Statistical Society, Series A, 145, 195–207.

33. T. M. F. Smith. (1983), On the validity of inferences from non-random samples, Journal of the Royal Statistical Society, Series A, 146, 394–403.

34. T. M. F. Smith and R. W. Andrews. (1983), Pseudo-Bayesian and Bayesian approach to auditing, The Statistician, 32, 124–6.

35. I. Diamond and T. M. F. Smith. (1984), Demand for higher education: comments on the paper by Professor P. G. Moore. Bulletin of the Institute of Mathematics and its Applications, 20, 124–5.

36. T. M. F. Smith. (1984), Sample surveys: present position and potential developments: some personal views. Journal of the Royal Statistical Society, Series A, 147, 208–21.

37. R. A. Sugden and T. M. F. Smith. (1984), Ignorable and informative designs in survey sampling inference, Biometrika, 71, 495–506.

38. T. J. Murrells, T. M. F. Smith, J. C. Catford and D. Machin. (1985), The use of logit models to investigate social and biological factors in infant mortality I: methodology. Statistics in Medicine, 4, 175–87.

39. T. J. Murrells, T. M. F. Smith, J. C. Catford and D. Machin. (1985), The use of logit models to investigate social and biological factors in infant mortality II: stillbirths, Statistics in Medicine, 4, 189–200.

40. D. Pfeffermann and T. M. F. Smith. (1985), Regression models for grouped populations in cross-section surveys, International Statistical Review, 53, 37–59.

41. T. M. F. Smith. (1985), Projections of student numbers in higher education, Journal of the Royal Statistical Society, Series A, 148, 175–88.

42. E. A. Molina C. and T. M. F. Smith. (1986), The effect of sample design on the comparison of associations, Biometrika, 73, 23–33.

362 T. M. F. SMITH: PUBLICATIONS UP TO 2002


43. T. Murrells, T. M. F. Smith, D. Machin and J. Catford. (1986), The use of logit models to investigate social and biological factors in infant mortality III: neonatal mortality, Statistics in Medicine, 5, 139–53.

44. C. J. Skinner, T. M. F. Smith and D. J. Holmes. (1986), The effect of sample design on principal component analysis. Journal of the American Statistical Association, 81, 789–98.

45. T. M. F. Smith. (1987), Influential observations in survey sampling, Journal of Applied Statistics, 14, 143–52.

46. E. A. Molina Cuevas and T. M. F. Smith. (1988), The effect of sampling on operative measures of association with a ratio structure, International Statistical Review, 56, 235–42.

47. T. Murrells, T. M. F. Smith, D. Machin and J. Catford. (1988), The use of logit models to investigate social and biological factors in infant mortality IV: post-neonatal mortality, Statistics in Medicine, 7, 155–69.

48. T. M. F. Smith and R. A. Sugden. (1988), Sampling and assignment mechanisms in experiments, surveys and observational studies, International Statistical Review, 56, 165–80.

49. T. J. Murrells, T. M. F. Smith and D. Machin. (1990), The use of logit models to investigate social and biological factors in infant mortality V: a multilogit analysis of stillbirths, neonatal and post-neonatal mortality, Statistics in Medicine, 9, 981–98.

50. T. M. F. Smith. (1991), Post-stratification, The Statistician, 40, 315–23.

51. T. M. F. Smith and E. Njenga. (1992), Robust model-based methods for analytic surveys, Survey Methodology, 18, 187–208.

52. T. M. F. Smith. (1993), Populations and selection: limitations of statistics, Journal of the Royal Statistical Society, Series A, 156, 145–66 [Presidential address to the Royal Statistical Society].

53. T. M. F. Smith and P. G. Moore. (1993), The Royal Statistical Society: current issues, future prospects, Journal of Official Statistics, 9, 245–53.

54. T. M. F. Smith. (1994), Sample surveys 1975–90; an age of reconciliation?, International Statistical Review, 62, 5–34 [First Morris Hansen Lecture with discussion].

55. T. M. F. Smith. (1994), Taguchi methods and sample surveys, Total Quality Management, 5, No. 5.

56. D. Bartholomew, P. Moore and T. M. F. Smith. (1995), The measurement of unemployment in the UK, Journal of the Royal Statistical Society, Series A, 158, 363–417.

57. Sujuan Gao and T. M. F. Smith. (1995), On the nonexistence of a global non-negative minimum bias invariant quadratic estimator of variance components, Statistics and Probability Letters, 25, 117–20.

58. T. M. F. Smith. (1995), The statistical profession and the Chartered Statistician (CStat), Journal of Official Statistics, 11, 117–20.

59. J. E. Andrew, P. Prescott and T. M. F. Smith. (1996), Testing for adverse reactions using prescription event monitoring, Statistics in Medicine, 15, 987–1002.

60. T. M. F. Smith. (1996), Public opinion polls: the UK general election, 1992, Journal of the Royal Statistical Society, Series A, 159, 535–45.

61. R. A. Sugden, T. M. F. Smith and R. Brown. (1996), Chao's list sequential scheme for unequal probability sampling, Journal of Applied Statistics, 23, 413–21.

62. T. M. F. Smith. (1997), Social surveys and social science, The Canadian Journal of Statistics, 25, 23–44.

63. R. A. Sugden and T. M. F. Smith. (1997), Edgeworth approximations to the distribution of the sample mean under simple random sampling, Statistics and Probability Letters, 34, 293–9.

64. T. M. F. Smith and T. M. Brunsdon. (1998), Analysis of compositional time series, Journal of Official Statistics, 14, 237–54.



65. Sujuan Gao and T. M. F. Smith. (1998), A constrained MINQU estimator of correlated response variance from unbalanced data in complex surveys, Statistica Sinica, 8, 1175–88.

66. R. A. Sugden, T. M. F. Smith and R. P. Jones. (2000), Cochran's rule for simple random sampling, Journal of the Royal Statistical Society, Series B, 62, 787–94.

67. V. Barnett, J. Haworth and T. M. F. Smith. (2001), A two-phase sampling scheme with applications to auditing or sed quis custodiet ipsos custodes?, Journal of the Royal Statistical Society, Series A, 164, 407–22.

68. E. A. Molina, T. M. F. Smith and R. A. Sugden. (2001), Modelling overdispersion for complex survey data. International Statistical Review, 69, 373–84.

69. D. B. N. Silva and T. M. F. Smith. (2001), Modelling compositional time series from repeated surveys, Survey Methodology, 27, 205–15.

70. T. M. F. Smith. (2001), Biometrika centenary: sample surveys, Biometrika, 88, 167–94.

71. R. A. Sugden and T. M. F. Smith. (2002), Exact linear unbiased estimation in survey sampling, Journal of Statistical Planning and Inference, 102, 25–38 (with discussion).

BOOKS

1. B. Gomes da Costa, T. M. F. Smith and D. Whiteley. (1975), German Language Attainment: a Sample Survey of Universities and Colleges in the U.K., Heidelberg: Julius Groos Verlag.

2. T. M. F. Smith. (1976), Statistical Sampling for Accountants, London: Haymarket Press, 255pp.

3. C. J. Skinner, D. Holt and T. M. F. Smith (eds). (1989), Analysis of Complex Surveys, Chichester: Wiley, 309pp.

BOOK CONTRIBUTIONS

1. T. M. F. Smith. (1971), Appendix 0, Second Survey of Aircraft Noise Annoyance around London (Heathrow) Airport, London: HMSO.

2. A. Bebbington and T. M. F. Smith. (1977), The effect of survey design on multivariate analysis. In O'Muircheartaigh, C. A. and Payne, C. (eds) The Analysis of Survey Data, Vol. 2: Model Fitting, New York: Wiley, pp. 175–92.

3. T. M. F. Smith. (1978), Principles and problems in the analysis of repeated surveys. In N. K. Namboodiri (ed.) Survey Sampling and Measurement, New York: Academic Press, Ch. 13, pp. 201–16.

4. T. M. F. Smith. (1981), Regression analysis for complex surveys. In D. Krewski, J. N. K. Rao and R. Platek (eds) Current Topics in Survey Sampling, New York: Academic Press, pp. 267–92.

5. T. M. F. Smith. (1987), Survey sampling. Unit 12, Course M345, Milton Keynes: Open University Press, 45pp.

6. T. M. F. Smith. (1988), To weight or not to weight, that is the question. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith (eds) Bayesian Statistics 3, Oxford: Oxford University Press, pp. 437–51.

7. G. Nathan and T. M. F. Smith. (1989), The effect of selection on regression analysis. In C. J. Skinner, D. Holt and T. M. F. Smith (eds) Analysis of Complex Surveys, Chichester: Wiley, Ch. 7, pp. 149–64.

8. T. M. F. Smith. (1989), Introduction to Part B: aggregated analysis. In C. J. Skinner, D. Holt and T. M. F. Smith (eds) Analysis of Complex Surveys, Chichester: Wiley, Ch. 6, pp. 135–48.



9. T. M. F. Smith and D. J. Holmes. (1989), Multivariate analysis. In C. J. Skinner, D. Holt and T. M. F. Smith (eds) Analysis of Complex Surveys, Chichester: Wiley, Ch. 8, pp. 165–90.

PUBLISHED DISCUSSION AND COMMENTS

1. T. M. F. Smith. (1969), Discussion of 'A theory of consumer behaviour derived from repeat paired preference testing' by G. Horsnell, Journal of the Royal Statistical Society, Series A, 132, 186–7.

2. T. M. F. Smith. (1976), Comments on 'Some results on generalized difference estimation and generalized regression estimation for finite populations' by C. M. Cassel, C.-E. Sarndal and J. H. Wretman, Biometrika, 63, 620.

3. T. M. F. Smith. (1979), Discussion of 'Public Opinion Polls' by A. Stuart, N. L. Webb and D. Butler, Journal of the Royal Statistical Society, Series A, 142, 460–1.

4. T. M. F. Smith. (1981), Comment on 'An empirical study of the ratio estimator and estimators of its variance' by R. M. Royall and W. G. Cumberland, Journal of the American Statistical Association, 76, 83.

5. T. M. F. Smith. (1983), Comment on 'An Evaluation of Model-Dependent and Probability-Sampling Inference in Sample Surveys' by M. H. Hansen, W. G. Madow, and B. J. Tepping, Journal of the American Statistical Association, 78, 801–2.

6. T. M. F. Smith. (1983), Invited discussion on 'Six approaches to enumerative survey sampling' by K. R. W. Brewer and C.-E. Sarndal, in W. G. Madow and I. Olkin (eds) Incomplete Data in Sample Surveys, Vol. 3, New York: Academic Press, pp. 385–7.

7. T. M. F. Smith. (1990), Comment on 'History and development of the theoretical foundations of surveys' by J. N. K. Rao and D. R. Bellhouse, Survey Methodology, 16, 26–9.

8. T. M. F. Smith. (1990), Discussion of 'Public confidence in the integrity and validity of official statistics' by J. Hibbert, Journal of the Royal Statistical Society, Series A, 153, 137.

9. T. M. F. Smith. (1991), Discussion of A. Hasted et al., 'Statistical analysis of public lending right loans', Journal of the Royal Statistical Society, Series A, 154, 217–19.

10. T. M. F. Smith. (1992), Discussion of 'A National Statistical Commission' by P. G. Moore, Journal of the Royal Statistical Society, Series A, 155, 24.

11. T. M. F. Smith. (1992), Discussion of 'Monitoring the health of unborn populations' by C. Thunhurst and A. MacFarlane. Journal of the Royal Statistical Society, Series A, 155, 347.

12. T. M. F. Smith. (1992), The Central Statistical Office and agency status, Journal of the Royal Statistical Society, Series A, 155, 181–4.

CONFERENCE PROCEEDINGS

1. T. M. F. Smith and M. H. Schueth. (1983), Validation of survey results in market research, Proceedings of Business and Economics Section of the American Statistical Association, 757–63.

2. T. M. F. Smith and R. A. Sugden. (1985), Inference and ignorability of selection for experiments and surveys, Bulletin of the International Statistical Institute, 45th Session, Vol. LI, Book 2, 10.2, 1–12.

3. T. M. F. Smith and T. M. Brunsdon. (1987), The time series analysis of compositional data, Bulletin of the International Statistical Institute, 46th Session, Vol. LII, 417–18.



4. T. M. F. Smith. (1989), Invited discussion of 'History, development, and emerging methods in survey sampling', Proceedings of the American Statistical Association Sesquicentennial invited paper sessions, 429–32.

5. T. M. F. Smith and R. W. Andrews. (1989), A Bayesian analysis of the audit process. Bulletin of the International Statistical Institute, 47th Session, Vol. LIII, Book 1, 66–7.

6. T. M. F. Smith and R. W. Andrews. (1989), Statistical analysis of auditing, Proceedings of the American Statistical Association.

7. T. M. F. Smith and T. M. Brunsdon. (1989), Analysis of compositional time series, Proceedings of the American Statistical Association.

8. T. M. F. Smith and D. Holt (1989), Some inferential problems in the analysis of surveys over time, Bulletin of the International Statistical Institute, 47th Session, Vol. LIII, Book 2, 405–24.

9. T. M. F. Smith and E. Njenga. (1991), Robust model-based methods for sample surveys. Proceedings of a Symposium in honour of V. P. Godambe, University of Waterloo.

10. T. M. F. Smith. (1993), The Chartered Statistician (CStat), Bulletin of the International Statistical Institute, 49th Session, Vol. LV, Book 1, 67–78.

11. T. M. F. Smith. (1995), Problem of resource allocation, Proceedings of Statistics Canada Symposium 95. From Data to Information – Methods and Systems.

12. T. M. F. Smith. (1995), Public opinion polls: The UK General Election 1992, Bulletin of the International Statistical Institute, 50th Session, Vol. LVI, Book 2, 112–13.

13. T. M. F. Smith. (1995), Social surveys and social science, Proceedings of the Survey Methods Section, Statistical Society of Canada Annual Meeting, July 1995.

14. T. M. F. Smith. (1996), Disaggregation and inference from sample surveys, in 100 Anni di Indagini Campionare, Centro d'Informazione e Stampa Universitaria, Rome, 29–34.

15. N. Davies and T. M. F. Smith. (1999), A strategy for continuing professional education in statistics, Proceedings of the 5th International Conference on Teaching of Statistics, Vol. 1, 379–84.

16. T. M. F. Smith. (1999), Defining parameters of interest in longitudinal studies and some implications for design, Bulletin of the International Statistical Institute, 52nd Session, Vol. LVIII, Book 2, 307–10.

17. T. M. F. Smith. (1999), Recent developments in sample survey theory and their impact on official statistics, Bulletin of the International Statistical Institute, 52nd Session, Vol. LVIII, Book 1, 7–10 [President's invited lecture].

OTHER PUBLICATIONS

1. T. M. F. Smith. (1966), The variances of the 1966 sample census, Report to the G.R.O., October 1966, 30pp.

2. T. M. F. Smith. (1977), Statistics: a universal discipline, Inaugural Lecture, University of Southampton.

3. T. M. F. Smith. (1991), The measurement of value added in higher education, Report for the Committee of Vice-Chancellors and Principals.

4. T. M. F. Smith. (1995), Chartered status – a sister society's experience. How the Royal Statistical Society fared, OR Newsletter, February, 11–16.

5. N. Davies and T. M. F. Smith. (1999), Continuing professional development: view from the statistics profession. Innovation, 87–90.



Author Index

Aalen, O. 228
Abowd, J. M. 206
Achen, C. H. 323
Agresti, A. xvi, 78, 79, 81
Alexander, C. 89
Allison, P. D. 202
Altonji, J. G. 209, 216, 217, 219
Amato, D. A. 225, 235, 236
Amemiya, T. 305
Andersen, P. K. 222, 223, 225, 226, 236, 237, 239
Anderson, R. L. 106
Anderson, T. W. 311
Andrews, M. 247
Assakul, K. 107

Baltagi, B. H. 201, 202, 205, 206
Basu, D. 49
Bellhouse, D. R. 126, 133, 135, 136, 137, 143, 147
Berman, M. 236
Berthoud, R. 213
Bickel, P. J. 182, 185, 186
Binder, D. A. 22, 23, 41, 49, 54, 56, 105, 180, 181, 186, 236, 312
Birnbaum, A. 59, 61
Bishop, Y. M. M. 76
Bjornstad, J. F. 62, 70
Blau, D. M. 268
Blossfeld, H. P. 222, 225
Borgan, O. 222, 223, 225, 226, 236, 237, 239
Boudreau, C. 236
Box, G. E. P. 296, 305
Boyd, L. H. 324
Bradley, S. 247
Breckling, J. U. 14, 151
Breidt, F. J. 312, 320
Breslow, N. E. 113
Breunig, R. V. 174
Brewer, K. R. W. 50
Brick, J. M. 315, 319
Brier, S. E. 97

Brookmeyer, R. L. 235
Brown, P. J. 278
Browne, M. W. 216
Bryk, A. S. 328
Bull, S. 105
Buskirk, T. 133

Card, D. 206
Carlin, J. B. 52, 54, 290
Carroll, R. J. 229, 242
Cassel, C.-M. 296
Chamberlain, G. 207, 208
Chambers, R. L. 8, 14, 47, 77, 151, 163, 278, 315
Chesher, A. 133, 151
Citro, C. 200, 230
Clarke, L. 341
Clayton, D. 235
Cleave, N. 278
Cochran, W. G. 2, 7, 76, 163, 307
Cohen, M. P. 325
Cook, R. J. 225, 238
Cosslett, S. 114
Cox, D. R. xvi, 203, 221, 224, 225, 233, 235, 240, 305
Cumberland, W. G. 66, 69

David, M. H. 306
Davies, H. 341
De Leeuw, J. 327
Deakin, B. M. 268
Decady, Y. J. 108
Deville, J.-C. 6
Diamond, I. D. 241
Diggle, P. J. 6, 121, 201, 202, 206, 237, 305
Dolton, P. J. 245, 270
Dorfman, A. H. 2, 8, 14, 47, 151, 163
Dumouchel, W. H. 50
Duncan, G. J. 50, 175, 231, 242
Dunstan, R. 315

Edwards, A. W. F. 60
Efron, B. 141, 143


Elliott, M. R. 297
Ericson, W. A. 26, 49
Eubank, R. L. 133
Ewings, P. D. 89
Ezzati, T. 301
Ezzati-Rice, T. M. 301

Fahrmeir, L. 239
Fan, J. 133, 162
Fay, R. E. 86, 94, 95, 299
Fears, T. I. 110, 113
Feder, M. 207
Fellegi, I. P. 97
Fienberg, S. E. 76, 242
Fisher, R. 103
Folsom, R. 233
Frankel, M. R. 22, 117
Fuller, W. A. 207, 208, 308, 312, 313, 320, 322

Gail, M. H. 110, 113
Gelman, A. 52, 54, 290
Gershuny, J. 213
Ghosh, M. 26, 49, 296
Gijbels, I. 133
Gill, R. D. 222, 223, 225, 226, 236, 237, 239
Godambe, V. P. 22, 181, 242
Goebel, J. J. 320
Goldstein, H. 201, 206, 207, 210, 211, 212, 213, 218, 324, 328, 331
Goodman, L. 246, 252
Gourieroux, C. 255
Grambsch, P. M. 229, 237, 238
Graubard, B. I. 42, 82, 83, 84, 86, 110, 113, 133, 146, 236
Gray, G. B. 317
Greenland, S. 265, 268
Greenlees, W. S. 305
Gritz, M. 245, 252
Groves, R. M. 323
Guo, G. 228

Hachuel, L. 97, 98, 99, 100
Hacking, I. 59, 61
Hamerle, A. 222, 225
Hamilton, S. A. 237
Hansen, M. H. 37, 326
Hardle, W. 126, 133, 134, 152, 158
Hartley, H. O. 35, 133, 163
Hauck, W. W. 121
Heagerty, P. J. 6, 201, 202
Healy, M. J. R. 206
Heckman, J. J. 252, 305
Heeringa, S. G. 306
Heitjan, D. F. 306
Hidiroglou, M. A. 307
Hinkins, S. 120
Hoem, B. 227, 236
Hoem, J. M. 227, 229, 230, 231, 236, 242
Holland, P. W. 76, 266
Holmes, D. J. 49, 207, 210, 211, 212, 213, 218
Holt, D. xv, xvi, 2, 5, 6, 8, 9, 14, 18, 22, 23, 49, 51, 85, 86, 87, 89, 91, 104, 151, 175, 181, 207, 208, 209, 210, 213, 214, 229, 231, 241, 242, 292, 323, 325, 327, 329, 330, 338, 339, 340, 341, 342
Holubkov, R. 113
Horvitz, D. G. 55
Hougaard, P. 235, 237
Hsiao, C. 201, 202, 205, 206, 212
Hu, F. C. 265
Humphreys, K. 242
Hurwitz, W. N. 326
Husebye, E. 228
Huster, W. J. 235

Ibrahim, J. 229
Isham, V. 225
Iversen, G. R. 324

Jeffreys, H. 60
Joe, H. 225, 235
Johnson, G. E. 268
Johnson, W. 301
Jones, I. 268
Jones, M. C. 133, 137, 143, 147, 162
Joshi, H. 341

Kalbfleisch, J. D. 69, 113, 121, 221, 223, 225, 229, 231, 232, 236, 239
Kalsbeek, W. D. xvi
Kalton, G. 175, 200, 230, 231, 242, 315, 319
Kasprzyk, D. 175, 231, 242, 315
Kass, R. E. 60
Keiding, N. 222, 223, 225, 226, 236, 237, 239
Kenward, M. G. 305
Khare, M. 301
Kim, J. K. 322
King, G. 278, 323
Kish, L. 22, 117, 319
Klaassen, C. A. J. 182, 185, 186
Klein, J. P. 236, 240
Koehler, K. J. 97
Konijn, H. S. 50
Korn, E. L. 42, 82, 83, 84, 86, 133, 146, 236

368 AUTHOR INDEX


Kott, P. S. 312
Kreft, I. 327
Krieger, A. M. 20, 129, 152, 153, 171, 175, 176, 177, 182, 187, 189, 193
Kumar, S. 81, 101, 102, 103, 104

Lancaster, T. 249
LaVange, L. 233
Lawless, J. F. xvi, 113, 200, 203, 221, 225, 227, 229, 231, 232, 235, 236, 237, 238, 239, 240
Layard, R. 268
Lazarsfeld, P. F. 324
Lazzeroni, L. C. 296
Lee, E. W. 225, 235, 236
Lehtonen, R. 79, 86
Lepkowski, J. M. 209, 210
Lessler, J. T. xvi
Liang, K.-Y. 6, 47, 107, 121, 201, 202, 206, 237
Lillard, L. 206, 305
Lin, D. Y. 225, 236
Lindeboom, M. 252
Lindsey, J. K. 239
Lipsitz, S. R. 229
Little, R. J. A. 7, 49, 50, 51, 56, 175, 229, 242, 280, 290, 292, 296, 297, 299, 301, 302, 304, 305, 306, 308, 315
Lohr, S. L. 86
Longford, N. 326, 328
Loughin, T. M. 108

Madow, W. G. 37, 326
Main, B. G. M. 245
Makepeace, G. H. 245, 270
Mayer, K. U. 222, 225
McCullagh, P. xvi, 47, 176
McDonald, J. W. 229, 241, 242
McVey, A. 320
Mealli, F. 245, 248, 254, 255
Meeden, G. 26, 49, 296
Mellor, R. W. 50
Menzel, H. J. 324
Meyer, B. 246, 253
Midthune, D. 236
Moeschberger, M. L. 236, 240
Molina, E. A. 31, 35
Monfort, A. 255
Morel, J. G. 98, 106
Mote, V. L. 106

Nadeau, C. 225, 237, 238
Narendranathan, W. 253
Nascimento Silva, P. L. D. 315
Nathan, G. 207
Nelder, J. A. xvi, 47, 176
Neuhaus, J. M. 121
Neyman, J. 307
Ng, E. 225, 238
Nguyen, H. H. 89
Nusser, S. M. 320

Oakes, D. xvi, 203, 221, 225, 233
Obuchowski, N. A. 107
Oh, F. L. 120, 293
Ontario Ministry of Health 138
Østbye, T. 140

Pahkinen, E. J. 79, 86
Payne, C. D. 278
Pearl, J. 268
Pederson, L. L. 105, 140
Pfeffermann, D. 20, 42, 46, 49, 50, 83, 129, 151, 152, 153, 171, 172, 175, 176, 177, 178, 181, 182, 183, 184, 187, 189, 191, 193, 195, 207, 210, 211, 212, 213, 218, 296
Platek, R. 317
Pomerleau, P. 140
Potter, F. 295
Pratten, C. F. 268
Prentice, R. L. 112, 221, 223, 225, 232, 236, 239
Proctor, C. H. 107
Pudney, S. E. 245, 246, 248, 249, 254, 255
Pyke, R. 112

Raftery, A. E. 60
Raghunathan, T. E. 306
Ramos, X. 213
Rao, J. N. K. 79, 81, 86, 87, 88, 90, 91, 92, 97, 99, 101, 102, 103, 104, 105, 106, 107, 120, 133, 163, 178, 299, 307, 308, 312, 315, 317
Rasbash, J. 206, 207, 210, 211, 212, 213, 218
Raudenbush, S. W. 328
Reece, J. S. 305
Relles, D. A. 305
Rice, J. A. 143
Ridder, G. 245, 251
Riley, M. W. 324, 326
Rinott, Y. 129, 152, 153, 171, 175, 177, 182, 187, 189, 193
Ritov, Y. 182, 185, 186
Roberts, G. R. 81, 86, 97, 98, 99, 100, 101, 102, 103, 104
Robins, J. M. 179, 265, 268, 306
Robins, P. K. 268
Rodriguez, G. 227, 236, 238



Ross, S. M. 225
Rotnitzky, A. 179, 306
Royall, R. M. 2, 47, 60, 61, 64, 66, 69
Rubin, D. B. 7, 27, 30, 42, 49, 52, 53, 54, 175, 229, 280, 282, 290, 296, 297, 299, 301, 302, 304, 306, 308, 315
Ruppert, D. 229, 242

Samuhel, M. E. 306
Sarndal, C.-E. xvi, 2, 6, 7, 13, 147, 296, 308, 312
SAS 57, 306
Schafer, J. L. 301, 303
Scharfstein, D. O. 306
Schenker, N. 297
Scherer, P. N. 108
Scheuren, F. 120, 293
Scott, A. J. 30, 42, 49, 54, 86, 87, 88, 89, 91, 92, 105, 106, 107, 110, 112, 113, 117, 120, 121
Scott, D. W. 133, 137
Segal, L. M. 209, 216, 217, 219
Self, S. G. 235
Sen, P. K. 31
Servy, E. 97, 98, 99, 100
Shao, J. 312, 317
Shelley, M. A. 245
Shin, H. 89
Shively, W. P. 323
Sielken, R. L., Jr 35
Silverman, B. W. 126, 133, 137, 141, 158
Simonoff, J. S. 133
Singer, B. 252
Singh, A. C. 86, 97, 98, 99, 100, 296
Singh, M. P. 175, 231, 242
Sitter, R. R. 308, 312
Sitter, R. V. 187
Skinner, C. J. xv, xvi, 2, 5, 6, 8, 9, 14, 18, 22, 23, 85, 86, 87, 89, 91, 104, 105, 106, 107, 151, 152, 175, 181, 200, 207, 208, 209, 210, 211, 212, 213, 214, 218, 229, 231, 241, 242, 315, 325, 327

Smith, J. P. 305
Smith, T. M. F. xv, xvi, 2, 5, 6, 8, 9, 14, 18, 22, 23, 30, 31, 35, 42, 49, 50, 51, 85, 86, 87, 89, 91, 104, 151, 152, 170, 175, 181, 207, 208, 209, 210, 213, 214, 231, 242, 292, 325, 327, 339
Solon, G. 205
Speechley, K. N. 140
Speechley, M. 140
Spiekerman, C. F. 236
Sprott, D. A. 69
Stafford, J. E. 126, 133, 135, 136, 137, 143, 147

Stata Corporation 9
Statistical Sciences 173
Steel, D. G. 278, 287, 323, 328, 329, 330, 331, 338, 339, 340, 341, 342
Stefanski, L. A. 229, 242
Stern, H. S. 52, 54, 290
Stewart, M. B. 253
Stolzenberg, R. M. 305
Stukel, D. M. 296
Sugden, R. A. 30, 31, 35, 42, 175
Sverchkov, M. 152, 171, 172, 175, 176, 178, 181, 182, 183, 184, 191, 195
Swensson, B. xvi, 2, 7, 13, 147, 308, 312

Tam, S. M. 14, 151
Tanner, M. A. 51, 299
Tanur, J. M. 242
Tepping, B. J. 37
Therneau, T. M. 229, 236, 237, 238
Thomas, D. R. 79, 86, 90, 91, 92, 93, 97, 98, 99, 100, 106, 107, 108
Thomas, J. M. 248, 254
Thompson, D. J. 55
Thompson, M. E. 22, 181, 231
Tibshirani, R. J. 141, 143
Torelli, N. 242
Tranmer, M. 330, 338, 340, 341, 342
Treble, J. G. 245, 270
Triest, R. K. 306
Trivellato, U. 242
Trussell, J. 227, 236, 238
Tukey, J. W. 163
Turner, T. R. 236
Tutz, G. 239

US Department of Health and Human Services 138, 140

Valliant, R. 2
van den Berg, G. J. 252
Vaughan, B. 227, 236, 238

Wand, M. P. 133, 147
Wang, Y.-X. 304, 306
Wang, S. 8, 14
Weeks, M. 246
Wehrly, T. E. 47, 163
Wei, L. J. 225, 235, 236
Weisberg, S. xvi
Welch, F. 305
Wellner, J. A. 182, 185, 186
Welsh, A. H. 14, 151
White, H. S. 242
Wild, C. J. 110, 112, 113, 117, 121, 232



Williams, R. L. 233, 234
Willis, R. 206
Wilson, J. R. 97
Winter, P. D. 49, 210
Wojdyla, D. 97, 98, 99, 100
Wolter, K. M. 35, 44, 79, 208, 234
Wong, W. H. 299
Woodruff, R. S. 136
Wretman, J. H. xvi, 2, 7, 13, 147, 296, 312

Xue, X. 235

Yamaguchi, K. 202

Zeger, S. L. 6, 47, 107, 121, 201, 202, 206, 237

Zieschang, K. D. 305



Subject Index

Aggregated data 278, 286–287, 323–343
Aggregated analysis 5, 6, 9, 325
Aggregation bias 287, 339–343
Analytic surveys: see Surveys
ANOVA-type estimates 328
Atomistic fallacy 323
Attrition: see Nonresponse
Autoregressive model 206–207

Bayesian inference 14, 49–57, 277–278, 289–306

Best linear unbiased estimator 30, 34
Bias
  adjustment 141
  calibration 163
Binomial distribution 80, 101
Bonferroni procedure: see Testing
Bootstrap method 126, 141–143, 187, 192
Burr distribution 253–254, 258

Case control study 82, 83, 109–121, 155
Categorical data: see Part B
Censoring 226–229, 233
Census parameter 3, 23, 82, 84, 104, 180
Classification error: see Measurement error
Clustered data 107, 234–235, 242
Competing risks 239, 252, 258
Complex survey data 1
Confidence intervals 15
  simultaneous 93
Contextual analysis 323
Covariance structure model 207–210
Cross-classified data 76–81, 86–96
Current status data 241

Descriptive surveys: see Surveys
Design adjustment 158–163
Design-based inference: see Inference
Design effect 87, 88, 90, 107
Design variable 4, 14, 52, 154–156, 289
Diagnostics 103, 242

Disaggregated analysis 5, 6, 9, 325
Domain estimation 76, 80, 101–104, 314–315
Duration analysis: see Survival analysis
Duration dependence 257–259

Ecological inference 330, 338–343
  bias 323
  fallacy 278, 323, 330
  regression 330, 339

Effective sample size 107, 116
Efficiency 83, 112, 115–116
EM algorithm 328
Estimating equations 40, 112, 128, 162–163, 179–182, 233–235, 242–243

Estimating function 31, 43–44
Event history analysis 202–204, 221–274
Exploratory data analysis 151

Finite population 2, 3
  mean 30

Fisher fast scoring algorithm 328
Fixed effects models 201, 292–293

Generalised linear model 128, 175–195
Generalised least squares 208
Gibbs sampler 299, 303
Goodness-of-fit test: see Testing
Group dependence model 287
Group level variable 326

Hajek estimator 22
Hazard function 203, 223, 224, 232, 237
Heterogeneity 237, 251–252, 256, 263–265
Hierarchical models: see Multilevel models
Hierarchical population 286
Histogram estimate 134, 138–141
Horvitz–Thompson estimator 22, 181, 283, 312
Hypothesis testing: see Testing


Ignorable 1, 7, 8 (see also Sampling)
Independence 78, 97–98
Inference (see also Likelihood)
  design-based 6, 13, 29–42, 49, 56
  model-assisted 7, 13, 278, 296
  model-based 6, 13, 30–32, 42–47

Information
  function 15–17, 19, 283–284
  matrix 105, 185
  observed 16

Intensity function: see Hazard function
Intracluster correlation 110
Imputation 277, 278, 281–282, 284–285, 297, 315–319
  cells 281, 315–318
  controlled 282, 285, 319
  donor 316
  fractional 319
  improper 299
  multiple 289, 297–303
  random 317
  regression hot deck 282, 284–285
  variance 282, 285, 319

Integrated mean squared error 137
Iterated generalized least squares 328

Jackknife: see Variance estimation and Testing

Kaplan–Meier estimation 233, 242–243
Kernel smoothing 126–127, 135–149, 162

Likelihood inference 7, 13, 51, 59–61
  face value 17, 18, 21, 182
  full information 13–19, 21
  function 15, 27–28, 61, 63–71, 290
  maximum 78, 79, 81, 87, 328
  partial 226, 237
  profile 28, 68
  pseudo 14, 22–23, 79, 81–83, 87, 101, 104, 112, 126, 181, 210–211, 231, 283
  sample 13, 20–21, 127–130, 176, 179–180, 185
  semi-parametric maximum likelihood 112
  simulated 255
Linear estimator 32–37, 313
Linear parameter 24, 31, 32
Linearization variance estimation: see Variance estimation
Local regression: see Nonparametric methods
Log-linear model 77–79, 86–96
Longitudinal data: see Part D and 308

Mantel–Haenszel test: see Testing
Marginal density 15
Marginal parameter 327, 330
Marginal modelling 6, 201, 230
Markov chain Monte Carlo 26
Markov model 224, 225, 239
Measurement error
  misclassification 106–107
  recall and longitudinal 200, 229, 241, 242
Method of moments estimation 331
Misclassification: see Measurement error
Missing data 277–282, 289–306 (see also Nonresponse)
  at random 280, 290
  ignorable 290
  nonignorable 303–306

Misspecification 45, 51, 83, 84, 117–120, 231, 300

Misspecification effect 9
Mixed effects models: see Multilevel models
Model-assisted inference: see Inference
Model-based inference: see Inference
Model checking: see Diagnostics and Model selection
Model selection 255–256
Mover–stayer model 252
Multilevel models 121, 201–202, 205–219, 237, 249–256, 281, 286–287, 294–297, 324–326, 328–334, 338–343
Multinomial distribution 76–79, 87, 291 (see also Regression)
  Dirichlet 97
Multiple response tables 108

Negative binomial model 67–68
Nelson–Aalen estimator 237
Non-linear statistic 39–42
Nonparametric methods
  density estimation 126, 134–145
  local linear regression 127, 162
  local polynomial regression 133, 146–149
  regression estimation 126, 151–174
  scatterplot smoothing 151
  survival analysis 232–234
Nonresponse 200, 209–213, 229, 242, 246, 260–261, 277–282, 289–306 (see also Missing Data and Imputation)
  indicator 14, 280–281, 289–290
  informative 15
  item 277–278, 284, 289, 297–303
  noninformative 15, 280



  unit 278, 289, 291–297
  weighting 278, 281, 289, 291–297, 301

Nuisance parameter 51

Ordinary least squares 209
Outliers 51
Overdispersion 107, 237

Pattern-mixture model 304, 306
Pearson adjustment 18, 339
Plug-in estimator 128, 158–161
Poisson model 67, 107, 225
Polytomous data 104
Population regression 125, 151–152, 155–158, 325
Post-stratification 104, 105
Prior information 51
Prospective study 226

Quantile estimation 136, 315, 318
Quasi-score test: see Testing
Quota sampling 57

Random effects models: see Multilevel models

Rao–Scott adjustment: see Testing
Ratio estimation 37–38
Recurrent events 236–238
Regression 1, 16
  logistic regression 75, 80, 81, 84, 101–106, 109–121, 183, 202
  longitudinal 200–202
  multinomial regression models 239, 252
  survival models 224–225

Regression estimator 283, 295, 308–318
Renewal process 225
Replication methods 94
Residual 91, 93
Response homogeneity class: see Imputation cells
Retrospective measurement 229, 232
Robustness 31, 117–120, 231, 237, 238, 242, 296

Sample density 128, 155–156, 176
Sample distribution 20, 127–130, 175–179
Sample inclusion indicator 3, 14, 52, 152, 289
Sampling 1, 4
  array 160
  cluster 27, 97–98, 109, 325
  cut-off 19, 21
  ignorable 1, 7, 8, 30, 53, 155, 157, 183, 232, 290
  informative 4, 15, 19, 125, 130, 155, 171–173, 175–195
  inverse 120–121
  multi-stage 286, 325
  nonignorable 233
  noninformative 4, 5, 8, 15, 61, 158, 171, 281
  Poisson 179, 186
  probability proportional to size 20, 182
  rotation 200
  simple random 27, 54
  size-biased 20
  stratified 27, 55–56, 109–121, 159, 293–294
  systematic 56, 69
  three-phase 310–311
  two-phase 277, 282–285, 307–310, 312–318
  two-stage 27, 56–57, 326

Score function (including score equations and pseudo-score) 15–17, 19, 22–23, 40–42, 82, 112, 231, 234

Score test: see Testing
Secondary analysis 14, 153
Selection at random 53, 280, 290
Selection mechanism 52–57
Selection model 304
Semi-parametric methods 110, 178, 231, 235–237, 240–241, 252–253
Sensitivity analysis 306
Simulation 259–263, 265–270
Simultaneous confidence intervals: see Confidence intervals
Smoothed estimate 102
Superpopulation model 3, 30, 50, 230, 327
Survey
  analytic 1, 2, 6, 230–232
  data 14, 52, 153–154
  descriptive 2, 6, 13, 53
  weights: see Weighting

Survival analysis 203, 221, 224–226, 232–236

Synthetic estimator 314, 317

Tabular data 76–81
Testing 79, 82
  Bonferroni procedure 92–93, 97–100, 106
  goodness-of-fit 82, 87, 90, 93, 213–215
  jackknife 88, 94–101
  Hotelling 131, 183–185
  likelihood ratio 86–88, 102
  Mantel–Haenszel 107
  nested hypotheses 88, 90, 105, 106
  Pearson 86–88, 101–102



Testing (contd.)
  Rao–Scott adjustment 79, 88–91, 97–104, 106, 209, 214
  score and quasi-score 105
  Wald 79, 91–93, 97–101, 105, 131, 171–173
Total expectation 24
Total variance 24, 31, 34
Transformations 104
Transition models: see Event history analysis
Truncation 227
Twicing 163

Variance components models: see Multilevel models

Variance estimation
  bootstrap 187
  jackknife 105–106, 171
  Kaplan–Meier estimation 242–243
  likelihood-based 185, 233–235
  linearization 102, 105–106, 114, 213, 234, 235
  random groups 234
  sandwich 23, 127, 186

Wald test: see Testing
Weibull distribution 240, 254, 258
Weighting 6, 77, 82, 83, 104, 113–120, 125, 130–131, 153, 170–173, 207–212, 230–232



WILEY SERIES IN SURVEY METHODOLOGY
Established in Part by Walter A. Shewhart and Samuel S. Wilks

Editors: Robert M. Groves, Graham Kalton, J. N. K. Rao, Norbert Schwarz, Christopher Skinner

Wiley Series in Survey Methodology covers topics of current research and practical interests in survey methodology and sampling. While the emphasis is on application, theoretical discussion is encouraged when it supports a broader understanding of the subject matter.

The authors are leading academics and researchers in survey methodology and sampling. The readership includes professionals in, and students of, the fields of applied statistics, biostatistics, public policy, and government and corporate enterprises.

BIEMER, GROVES, LYBERG, MATHIOWETZ, and SUDMAN · Measurement Errors in Surveys
COCHRAN · Sampling Techniques, Third Edition
COUPER, BAKER, BETHLEHEM, CLARK, MARTIN, NICHOLLS, and O'REILLY (editors) · Computer Assisted Survey Information Collection
COX, BINDER, CHINNAPPA, CHRISTIANSON, COLLEDGE, and KOTT (editors) · Business Survey Methods
*DEMING · Sample Design in Business Research
DILLMAN · Mail and Telephone Surveys: The Total Design Method, Second Edition
DILLMAN · Mail and Internet Surveys: The Tailored Design Method
GROVES and COUPER · Nonresponse in Household Interview Surveys
GROVES · Survey Errors and Survey Costs
GROVES, DILLMAN, ELTINGE, and LITTLE · Survey Nonresponse
GROVES, BIEMER, LYBERG, MASSEY, NICHOLLS, and WAKSBERG · Telephone Survey Methodology
*HANSEN, HURWITZ, and MADOW · Sample Survey Methods and Theory, Volume I: Methods and Applications
*HANSEN, HURWITZ, and MADOW · Sample Survey Methods and Theory, Volume II: Theory
KISH · Statistical Design for Research
*KISH · Survey Sampling
KORN and GRAUBARD · Analysis of Health Surveys
LESSLER and KALSBEEK · Nonsampling Error in Surveys
LEVY and LEMESHOW · Sampling of Populations: Methods and Applications, Third Edition
LYBERG, BIEMER, COLLINS, de LEEUW, DIPPO, SCHWARZ, TREWIN (editors) · Survey Measurement and Process Quality
MAYNARD, HOUTKOOP-STEENSTRA, SCHAEFFER, VAN DER ZOUWEN · Standardization and Tacit Knowledge: Interaction and Practice in the Survey Interview
SIRKEN, HERRMANN, SCHECHTER, SCHWARZ, TANUR, and TOURANGEAU (editors) · Cognition and Survey Research
VALLIANT, DORFMAN, and ROYALL · Finite Population Sampling and Inference: A Prediction Approach
CHAMBERS and SKINNER (editors) · Analysis of Survey Data

*Now available in a lower priced paperback edition in the Wiley Classics Library.