Statistical Inference in Science SPROTT


Transcript of Statistical Inference in Science SPROTT

  • To Muriel


  • Preface

    This book is based on material presented in courses for fourth-year undergraduate statistics students and masters and Ph.D. statistics students over a number of years at the University of Waterloo, Ontario, and Centro de Investigación en Matemáticas (CIMAT), Guanajuato, Mexico. In any given course a selection of the material could be used depending on the time available. A knowledge of probability up to the level of J. G. Kalbfleisch, Probability and Statistical Inference, Volume 1, 2nd edition (1979 Springer-Verlag), or the level of Feller, An Introduction to Probability Theory and Its Applications, Volume 1, is assumed. The mathematical level requires only the calculation of probability functions and densities, and transformations of variables involving Jacobians. However, some algebraic dexterity would facilitate many of these calculations.

    I originally intended to end each chapter with a set of problems, as is customary. But many of the problems involve analysis of data from ongoing research in the journals, and it seemed that putting them at the end of a given chapter would limit and prejudge how they could be approached. In addition, depending on the particular aspect of the data being considered, the same problems would occur at the end of a number of chapters. Therefore, it seemed preferable and also more realistic to put them all at the end of the book in Chapter 11. This more closely emulates how they occur in practice. In some cases both forward and backward references between relevant sections and corresponding problems are given.

    I am grateful to Professors J. K. Lindsey and R. Viveros for having read a preliminary version of the manuscript and for offering many valuable comments and suggestions. I should like to thank the students of CIMAT who endured the courses and notes that have culminated in this book, particularly J. A. Domínguez, J. L. Batún, and I. Solís, for their helpful suggestions. A special thanks to Dr. Eloísa Díaz-Francés of CIMAT for her interest in the development of this book and its practical applications. Finally, I thank Dr. M. D. Vogel-Sprott for her enduring support and encouragement throughout the lengthy process of developing and writing this book.

    D. A. Sprott
    Guanajuato, Mexico

  • Contents

    Preface

    List of Examples

    1 Introduction
      1.1 Repeatable Experiments
      1.2 Probability and Repeatable Experiments
      1.3 Statistics and Repeatable Experiments
      1.4 Notation

    2 The Likelihood Function
      2.1 The Likelihood Model
      2.2 The Definition of the Likelihood Function
      2.3 Likelihood and Uncertainty
      2.4 The Relative Likelihood Function
      2.5 Continuous Random Variables
      2.6 Score Function and Observed Information
      2.7 Properties of Likelihood
        2.7.1 Nonadditivity of Likelihoods
        2.7.2 Combination of Observations
        2.7.3 Functional Invariance
      2.8 Likelihood Intervals
      2.9 Examples of Likelihood Inferences



      2.10 Normal Likelihoods

    3 Division of Sample Information I: Likelihood θ, Model f
      3.1 Minimal Sufficient Division
      3.2 The Likelihood Function Statistic
      3.3 Maximal Ancillary Division

    4 Division of Sample Information II: Likelihood Structure
      4.1 Separate Estimation: Nuisance Parameters
      4.2 Conditional Likelihood
      4.3 Marginal Likelihood
      4.4 Pivotal Likelihood
        4.4.1 Pivotal Quantities
        4.4.2 Linear Pivotals and Pivotal Likelihood
      4.5 Maximized or Profile Likelihood

    5 Estimation Statements
      5.1 Pivotal Probabilities
      5.2 Confidence Intervals
      5.3 Likelihood-Confidence Intervals
      5.4 Likelihood-Fiducial Intervals
      5.5 Likelihood-Bayes Intervals
      5.6 Maximum Likelihood Estimation
      5.7 Bias Reduction and Maximum Likelihood

    6 Tests of Significance
      6.1 The Ingredients
      6.2 Goodness of Fit
      6.3 Homogeneity
      6.4 Testing Parametric Hypotheses
        6.4.1 The Distinction Between Testing and Estimation
        6.4.2 Power
        6.4.3 Acceptance of the Null Hypothesis
        6.4.4 The 2 × 2 Table; Conditional Inference
      6.5 Notes and References for Chapters 1 to 6

    7 The Location-Scale Pivotal Model
      7.1 Basic Pivotals; Robust Pivotals
      7.2 Division of Pivotal Information
      7.3 Inferential Distributions
      7.4 Robust Pivotal Likelihoods; Estimation Statements
      7.5 Examples
      7.6 Robustness
        7.6.1 Nonadaptive or Marginal Robustness
        7.6.2 Adaptive Procedures
        7.6.3 Adaptive or Conditional Robustness
      7.7 Parametric Functions
      7.8 Pivotal Model: Paired Observations
      7.9 Nonpivotal Parameters
      7.10 Examples of Adaptive Robustness

    8 The Gauss Linear Model
      8.1 The Model
      8.2 The Transformation
      8.3 The Probability Distributions
      8.4 The Normal Linear Model
      8.5 Some Special Cases of Nonlinear Pivotals
      8.6 Notes and References for Chapters 7 and 8
        8.6.1 Fiducial Probability

    9 Maximum Likelihood Estimation
      9.1 Introduction
      9.2 Approximate N(0, 1) Linear Pivotals
      9.3 Approximate t(λ) Linear Pivotals
      9.4 Approximate log F(λ1, λ2) Linear Pivotals
      9.5 Use of the Profile Likelihood Function
      9.6 Notes and References for Chapter 9
      9.A Appendix
        9.A.1 Degrees of Freedom of log F
        9.A.2 Observed Information of the Profile Likelihood
        9.A.3 Symmetrizing the Likelihood Function

    10 Controlled Experiments
      10.1 Introduction
      10.2 Controlled Experiments and Causation
      10.3 Random Assignment
        10.3.1 Scientific Purpose: Causation Versus Correlation
        10.3.2 Statistical Purpose: Tests of Significance
      10.4 Causal Versus Statistical Inference
      10.5 Conclusion

    11 Problems

    References

    Index


  • List of Examples

    2.2.1 The binomial likelihood
    2.9.1 A capture-recapture problem
    2.9.2 Binomial likelihoods
    2.9.3 Two binomial likelihoods
    2.9.4 A multimodal likelihood
    2.9.5 Poisson likelihoods
    2.9.6 Poisson dilution series likelihood
    2.9.7 (a) Gamma likelihood
    2.9.7 (b) Censored exponential failure times
    2.9.8 Uniform likelihood
    2.9.9 Normal likelihood
    2.9.10 (a) Normal regression
    2.9.10 (b) Normal autoregression
    2.9.11 Logistic regression
    2.9.12 Exponential regression
    2.9.13 Two binomial likelihoods, the 2 × 2 contingency table
    2.9.14 ECMO trials: 2 × 2 table with adaptive treatment allocation
    2.10.1 The dilution series Example 2.9.6
    2.10.2 The gamma likelihood, Example 2.9.7
    2.10.3 The capture-recapture likelihood of Example 2.9.1

    3.2.1 Factoring the Poisson distribution
    3.2.2 The exponential family
    3.2.3 The inverse Gaussian distribution
    3.3.1 The location model
    3.3.2 A location regression model



    3.3.3 The location-scale model
    3.3.4 A model from particle physics

    4.2.1 The difference between two Poisson distributions
    4.2.2 Combination of Poisson differences
    4.2.3 Difference between two binomial likelihoods: random treatment assignment
    4.2.4 The combination of 2 × 2 tables
    4.2.5 The difference between two binomial likelihoods: adaptive treatment assignment
    4.2.6 The multinomial 2 × 2 table
    4.2.7 A capture-recapture model
    4.3.1 The location-scale model; the scale parameter, Example 3.3.3
    4.3.2 The common variance
    4.4.1 The location-scale model; the location parameter
    4.4.2 The common mean
    4.5.1 The normal mean
    4.5.2 The normal variance
    4.5.3 The common variance
    4.5.4 The difference between two normal likelihoods N[θ̂i, I(θ̂i; y)^(−1)]
    4.5.5 Profile likelihoods associated with Example 2.9.11

    5.6.1 The gamma likelihood of Example 2.10.2
    5.6.2 The gamma likelihood of Example 2.9.7
    5.6.3 Conditional likelihood, ramipril data
    5.6.4 Profile likelihood, ramipril data
    5.6.5 Measure of reliability, Example 4.2.6
    5.7.1 Exponential failure times of components connected in series
    5.7.2 Capture-recapture
    5.7.3 Dilution series

    6.2.1 Testing the Poisson dilution series model of Example 2.9.6
    6.2.2 Testing the logistic binomial model of Example 2.9.11
    6.3.1 Homogeneity of Poisson samples
    6.3.2 Homogeneity of binomial samples
    6.3.3 Factoring a Poisson dilution series model

    7.5.1 The normal model
    7.5.2 A guarantee life distribution f(pᵢ) = exp(−pᵢ), pᵢ > 0
    7.5.3 (a) A symmetric family of distributions
    7.5.3 (b) Extension to include asymmetric distributions
    7.7.1 The quantiles
    7.7.2 Coefficient of variation σ/μ
    7.7.3 Ratio of scale parameters
    7.7.4 Difference in location, δ = μ₁ − μ₂
    7.7.5 Difference in location δ = μ₁ − μ₂, ρ specified


    7.7.6 The common mean μ₁ = μ₂ = μ, ρ specified
    7.7.7 Difference in location, ρ unspecified, the Behrens-Fisher problem
    7.7.8 The common mean μ₁ = μ₂ = μ, ρ unspecified
    7.7.9 Ratio of locations μ₂/μ₁, ρ specified
    7.7.10 Predictive inference
    7.7.11 Length of the mean vector √(μ₁² + μ₂²)

    7.8.1 Paired differences
    7.8.2 Paired ratios
    7.8.3 Paired ratios, Assumption (a)
    7.8.4 Paired ratios, Assumption (b)
    7.8.5 Paired ratios, Assumption (c)
    7.8.6 The linear functional relationship
    7.10.1 Paired differences, the Darwin data
    7.10.2 Difference between two normal means
    7.10.3 Paired ratios: the Darwin data

    8.5.1 Extrema of a polynomial regression
    8.5.2 The x-coordinate of the intersection of two regression lines
    8.5.3 The linear calibration problem

    9.2.1 Failure times of systems of components in series; competing risks
    9.2.2 Number of viruses required to infect a cell
    9.2.3 The inverse Gaussian distribution
    9.3.1 Paired ratios, Example 7.8.5
    9.4.1 Genetics example, Fisher (1991a, pp. 323-331)
    9.5.1 Data from the extreme value density
    9.5.2 Extreme value data with arbitrary censoring


  • 1 Introduction

    1.1 Repeatable Experiments

    The purpose of this book is to present statistical methods appropriate for the analysis of repeatable experiments in science. By science is meant the study of repeatable natural phenomena. Its purpose is to predict nature and, when possible, to change or control nature. This requires repeatable procedures. The demonstration of a natural phenomenon cannot be based on a single event. "In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure" (Fisher 1991b, p. 14). Such a procedure will be called an experiment. Experiments occur in both observational science, where the factors of interest cannot be controlled or manipulated, and in experimental science, where some of the factors can be controlled. The distinction is not particularly relevant to Chapters 1 to 9 and so will be deferred to Chapter 10. But the requirement of repeatability is universal.

    This structure of repeatable experiments leads to statistical models in which the data are assumed to come from a hypothetical infinite population generated by repetitions of the phenomenon. By infinite population is meant that however many times the experiment has been repeated, it is always possible in principle to repeat it again. This is an operational definition of the infinite population. It is in 1 to 1 correspondence with the countably infinite set of positive integers. The infinite population generated by a repeatable experiment is therefore a countably infinite set.

    1.2 Probability and Repeatable Experiments

    From the point of view of pure mathematics, probabilities are simply additive measures normalized to add to one. Their numerical values are of no particular significance. However this is not a particularly helpful viewpoint for practical applications. Because of the diverse applications of probability, there has been much discussion about the definition and interpretation of probability. It is not the purpose here to enter into such a discussion. We merely outline a view of probability that seems applicable to the infinite population of repeatable experiments.

    If a fraction p of the population generated by a repeatable experiment has property A, then the probability of A is defined to be p, P(A) = p. The fact that the population of repeated experiments and the subpopulation with property A are both infinite, and even that the subpopulation A is in 1 to 1 correspondence with the whole population, causes no difficulty. For example, all of the foregoing is true for the population of integers. If A is the property of being an even integer, then p = 1/2.

    However, only in applications to mathematics will p be assumed known numerically, such as the above example of the positive integers and in games of chance. In applications to science, p will generally be unknown, such as in clinical trials. If the observed frequency of A in n independent repetitions of the experiment is y, then p̂ₙ = y/n is an estimate of p with the property lim p̂ₙ = p as n → ∞. This has been used as the definition of the probability p. This definition has been criticized in various ways. But its main deficiency as a definition of probability seems to be that it confuses the definition of p with the estimation of p. See Fisher (1958, 1959, 1991c, pp. 33-36) for a discussion of the definition of probability along with conditions for its applicability to a given observed trial.

    This leads to probability models of repeatable experiments in the form of probability functions f(yᵢ; θᵢ) of the observations yᵢ in terms of unknown parameters θᵢ = (δᵢ, ξᵢ). The quantities of interest are the parameters δ. They represent the repeatable phenomenon. The repeatability of the phenomenon is supposed to be embodied in the homogeneity of the δᵢ's. The remaining quantities are the parameters ξ, where ξᵢ is an incidental parameter associated with the ith experiment. The ξᵢ are not involved with the repeatability of the phenomenon and so are not assumed to be homogeneous. There is a distinction between the mathematical form of f, called the model, and θ, the parameters entering into the model. They are logically different and require separate treatment. The following chapters deal with the separation of the sample information into these two parts, f and θ, and also with the separation of the components of θ. The specification of the model f is not arbitrary, but is based on the phenomenon under study and on the way in which the data y were obtained, that is, on the experimental design. Usually the purpose of the experiments is to make inferences about the δᵢ conditional on the model f.


    1.3 Statistics and Repeatable Experiments

    The resulting problems of inference produced by this setup are (a) model assessment: the assessment of the assumed probability model f; (b) homogeneity: the separation of the δ's from the ξ's, and the assessment of repeatability δ₁ = δ₂ = ⋯ = δ conditional on f in (a); (c) estimation: quantitative statements of plausibility about δ based on the combined evidence from all of the experiments conditional on (a) and (b). This last step, (c), is usually the main reason for the experiments. The purpose is to amass sufficient evidence concerning the question at issue that the evidence eventually becomes conclusive. See Fisher (1952).

    Inferential estimation. Because estimation has many different interpretations, it is necessary to make its interpretation more explicit here. In what follows estimation means inferential estimation. This form of estimation is different from decision-theoretic or point estimation. The purpose of decision-theoretic estimation is to obtain optimal estimates based on loss functions and related external criteria such as unbiasedness and minimum variance. In contrast, the purpose of inferential estimation is to make estimation statements. These are quantitative statements about the extent to which values of δ are reasonable or plausible, or contradicted, using all of the parametric information in the data y. An example is μ = ȳ ± s t(n−1), appropriate for a sample of size n from a N(μ, σ²) distribution, where s is the estimated standard error and t(n−1) is the Student t variate with n − 1 degrees of freedom. Naturally, not all estimation statements will be this simple. They will depend on the model f and on the type of data.

    The structure (a), (b), and (c) may be illustrated by the data in Table 1.1 arising in four experiments taken from: (1) Dulbecco (1952); (2) Dulbecco and Vogt (1954); (3) Khera and Maurin (1958); and (4) De Maeyer (1960). The purpose of these experiments was to investigate the number of viruses required to infect a cell. A liquid medium containing a suspension of the virus particles was successively diluted to form a geometric series of k + 1 dilutions a^0 = 1, a, a^2, . . . , a^k. These were poured over replicate cell sheets, and after a period of growth the number of plaques occurring at dilution level a^j was observed. The results are recorded in Table 1.1 in the form yj (nj) at each dilution level j for each dilution series, where nj is the number of separate repetitions at dilution level j and yj is the total number of plaques observed on all nj repetitions. For example, in Table 1.1 experiment (2iv), k = 2; at dilution level 0 there were 2 repetitions with a total of 46 plaques; at dilution level 1 there were 6 repetitions with a total of 61 plaques; and at dilution level 2 there were 10 repetitions with a total of 36 plaques. The remaining dilution levels were not used in this experiment. According to the theory, the numbers of plaques in single repetitions at level j in a given dilution series should have independent Poisson(λa^(−jδ)) distributions, where λ is the expected number of plaques in the undiluted suspension (j = 0) and δ is the parameter of interest, the minimum number of virus particles required to infect a cell (so that δ should be an integer). Of interest is the evidence about δ = 1. Here (a) is the Poisson model for all experiments irrespective of the parametric structure; (b) is the separation of the δ's from the λ's in all of the experiments and the repeatability δ₁ = ⋯ = δ₁₅ = δ conditional on the Poisson model (a); (c) is the estimation of δ, in particular the evidence for δ = 1, conditional on homogeneity (b) and the Poisson model (a). This example will be discussed in Chapter 9.

    Table 1.1: Plaque counts and number of repetitions

                                          Dilution level
    Experiment                  0        1        2        3       4      5      6       a
    (1) Dulbecco (1952)
        i                    297 (2)  152 (2)                                             2
        ii                   112 (2)  124 (7)                                             3
        iii                   79 (1)   23 (1)                                             3
        iv                    50 (1)   12 (1)    2 (1)                                    2
        v                     26 (1)   10 (1)                                             3
    (2) Dulbecco and Vogt (1954)
        i                    305 (3)  238 (4)                                             2
        ii                    47 (1)   46 (2)                                             2
        iii                   82 (2)   84 (6)                                             3
        iv                    46 (2)   61 (6)   36 (10)                                   3
        v                    102 (4)   99 (8)   92 (16)                                   2
    (3) Khera and Maurin (1958)
        i                     66 (2)   44 (2)   27 (2)   17 (2)  11 (2)  4 (2)  4 (2)   10^(1/5)
        ii                   178 (2)   63 (2)    6 (2)    0 (2)                          10
        iii                  180 (4)   27 (2)    6 (2)    2 (2)                          10
    (4) De Maeyer (1960)
        i                    264 (2)   25 (2)                                            10
        ii                   476 (2)   39 (2)                                            10

    The following chapters present statistical methods that take into account the different forms and structures that can arise from these considerations. The analysis consists in the separation of the sample information into different parts, each part addressing the different problems (a), (b), and (c) above.

    Scientifically, the natural ordering of dealing with these problems is as above, (a), (b|a), and (c|a,b). It would not be sensible to test homogeneity using a model that is contradicted by the data, and it would not be sensible to combine the replications to estimate δ if either the model is deficient or the experiments are not replicable. However, since (c) is mathematically more easily formulated than (a) or (b), they are treated in reverse order. For (c), likelihood functions yield measures of relative plausibility or support. This is the subject of Chapters 2 to 5. For (a) and (b), P-values yield measures of contradiction or discrepancy, which is the subject of Chapter 6. The location-scale and Gauss linear (regression) models have an added structure in terms of pivotal quantities that is exploited in Chapters 7 and 8. Maximum likelihood estimation is discussed in Chapter 9, and Chapter 10 deals with the distinction between observational science and controlled experiments in experimental science mentioned in Section 1.1.

    1.4 Notation

    The notation f(y; t, θ) will denote the family of probabilities of the observations y indexed by the statistic t = t(y) and the parameter θ, both regarded as fixed. The notation f(y; θ|t) will denote the family of conditional probability functions indexed by the parameter θ, regarded as fixed, and mathematically conditioned on the statistic t, and hence still regarded as fixed.

    The logical difference between the two models is that f(y; t, θ) denotes a conditional model whereas f(y; θ|t) denotes a conditional submodel of a larger model obtained by the usual conditioning procedure, the joint distribution of y and t divided by the marginal distribution of t: f(y; θ|t) = f(y, t; θ)/f(t; θ). An example of the former is the binomial model with t independent trials and constant probability θ of success, f(y; t, θ) = (t choose y) θ^y (1 − θ)^(t−y), where the origin of t is unspecified. An example of the latter is two independent Poisson variates x, y with means (1 − θ)μ, θμ, respectively. Then x + y = t is a Poisson variate, and in this case mathematically, although not logically, f(y; θ|t) ≡ f(y; t, θ). The second model assumes more than the first, since the behavior of t is also modeled, implying a joint distribution of y, t. In this sense the first is scientifically more robust, since it does not model the behavior of t. This usually makes no difference to inferences about θ, but may affect the assessment of the model. The second model could be rejected because of the, possibly irrelevant, behavior of t. In the first model the behavior of t is irrelevant.
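    As a quick numerical check of the Poisson example (not in the text), the conditional distribution of y given t can be computed directly and compared with the binomial; scipy is assumed to be available.

    ```python
    # Numerical check: for independent Poisson x, y with means (1 - theta)*mu and
    # theta*mu, the distribution of y given x + y = t is binomial(t, theta),
    # whatever the value of mu.
    from scipy.stats import binom, poisson

    theta, mu, t = 0.3, 5.0, 7
    for y in range(t + 1):
        joint = poisson.pmf(t - y, (1 - theta) * mu) * poisson.pmf(y, theta * mu)
        cond = joint / poisson.pmf(t, mu)      # x + y is Poisson with mean mu
        assert abs(cond - binom.pmf(y, t, theta)) < 1e-12
    print("f(y; theta | t) matches the binomial f(y; t, theta)")
    ```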

    The word "specified" will be used to denote a parameter that is assumed to have specified values. This is to distinguish it from other parameters that are to be estimated, that is, are the subject of estimation statements and so are unspecified. Alternatively it might be said that θ is assumed known as opposed to unknown. But this would give a false impression, since parameters usually are not known exactly. Further, if θ were known, it could have only one value. A specified parameter will typically be specified to have various different values in order to see what effect this has on the inferences that are being made. This comes under the subject of robustness, the effect of changes in the assumptions on the inferences being made. If θ is not specified, then it is a parameter to be estimated, the subject of an estimation statement.


  • 2 The Likelihood Function

    2.1 The Likelihood Model

    The considerations of Chapter 1 lead to the standard formulation of a scientific experiment in terms of a statistical model f(y; θ). The data yᵢ are regarded as coming from a hypothetical infinite population of possible observations having probability function f(yᵢ; θ) depending on an unknown parameter θ. The inferential structure of the model is fully specified by the three elements: the sample space S = {y}, the parameter space Ω = {θ}, and the probability function f(y; θ). This is the usual statistical model of an experiment.

    All of the information of the sample y must be contained in f(y; θ), since there is no other mathematical entity assumed under this model. The function f, considered as a function of y for specified θ, gives the probability of all possible samples determined by the specified θ. This shows the role of probability in inference as the deduction of inferences about samples from the populations from which they are drawn. This is relevant before the experiment, when there are no observations. It is a closed axiomatic system, requiring the specification beforehand of all possible alternatives and their probabilities. From a given population can be calculated the probability with which any given sample will occur. But nothing new can be learned about the real world. Few scientists would claim to know all possible theories or explanations and their probabilities beforehand. Thus probability in general is not an appropriate measure of uncertainty for science. It lacks the flexibility required.

    The only other way of considering f is as a function of θ for an observed y. But f is not a probability function of θ. As a function of θ, f does not obey the laws of probability. To distinguish this use of f from probability the term "likelihood" is used. Likelihood is the most directly accessible inferential element for estimation statements about θ conditional on f. This model will therefore be called the likelihood model to distinguish it from the pivotal model of Chapters 7 and 8, which has an additional structure with inferential relevance.

    2.2 The Definition of the Likelihood Function

    Let y be a discrete random variable with probability function f(y; θ) = P(y; θ). Fisher (1921) defined the likelihood of any particular value of θ to be proportional to the probability of observing y = yₒ, the subscript o meaning "observed", based on that value of θ. The likelihood function of θ is thereby defined as

        L(θ; yₒ) = C(yₒ) f(yₒ; θ) ∝ P(y = yₒ; θ),                    (2.1)

    where C(yₒ) is an arbitrary positive bounded function of yₒ that does not depend on θ. In general θ can be a vector parameter, θ = θ₁, . . . , θₖ. For the case of a single scalar parameter θ, k = 1 and the suffix 1 will be omitted. The likelihood function (2.1) plays the underlying fundamental role in inferential estimation.

    In using (2.1) some care is required to ensure that C(yₒ) does not inadvertently contain any unspecified parameters. This issue arises in Section 7.9, where different models f are being compared.

    Example 2.2.1 Binomial likelihood. The binomial probability function is f(y; n, θ) = (n choose y) θ^y (1 − θ)^(n−y). It arises as the probability of obtaining y successes (S) and n − y failures (F) in n independent trials for each of which P(S) = θ, P(F) = 1 − θ. The likelihood function (2.1) based on an observed y = yₒ is therefore

        L(θ; yₒ, n) = C(yₒ, n) θ^yₒ (1 − θ)^(n−yₒ),

    where (n choose yₒ) has been incorporated into the constant C(yₒ, n).

    In contrast with probability, the role of likelihood is the deduction of inferences about populations from which observed samples have been drawn. This is relevant after the experiment. This must be an open-ended system to allow for the incorporation of new knowledge. Unlike samples from a given population, not all populations yielding a given sample can be specified. Therefore the likelihood of any given population, hypothesis, or model, is meaningless. To underline this the likelihood function (2.1) is defined as proportional, not equal, to the probability function f. This emphasizes that only likelihood ratios have meaning: the likelihood of one simple hypothesis versus another simple hypothesis, a simple hypothesis being one that allows the numerical calculation of the probability of observing y. In particular, estimation statements will most generally be in terms of relative likelihood.

    In what follows the subscript o will be omitted for notational simplicity, it being understood that the observed likelihood is determined by the numerically observed value of y.

    2.3 Likelihood and Uncertainty

    The likelihood function L(θ; y) supplies an order of preference or plausibility among possible values of θ based on the observed y. It ranks the plausibility of possible values of θ by how probable they make the observed y. If P(y; θ = θ′) > P(y; θ = θ″), then the observed y makes θ = θ′ more plausible than θ = θ″, and from (2.1), L(θ′; y) > L(θ″; y).

    The likelihood ratio L(θ′; y)/L(θ″; y) = f(y; θ′)/f(y; θ″) is a measure of the plausibility of θ′ relative to θ″ based on the observed fact y. The meaning of L(θ′; y)/L(θ″; y) = 4 is that θ′ is four times more plausible than θ″ in the sense that θ′ makes the observed y four times more probable than does θ″. For example, suppose an urn contains three similar balls, either (a) 1 black, 2 white, or (b) 2 black, 1 white. Two drawings are made with replacement, both giving a white ball. Under (a), the probability of this is (2/3)², and under (b) it is (1/3)². Thus having observed these two drawings makes condition (a) four times more plausible than condition (b) in the above sense. This statement is not affected by the possibility that the urns may have other hitherto unthought of compositions. It also illustrates the general point that probability statements are relevant before the experiment. Relative likelihood statements are relevant after the experiment.

    Likelihoods rank plausibilities of θ based only on the observed y. There may be other facts that would change these plausibilities, and other reasons for preferring θ′ to θ″, or conversely. Also, likelihoods compare plausibilities of different values of θ for the given fixed value of y. That is, the likelihood is a function of θ determined by the fixed observed numerical value of y.

    Because likelihood ratios are ratios of frequencies, they have an objective frequency interpretation that can be verified by simulations on a computer. The relative likelihood L(θ′; y)/L(θ″; y) = k means that the observed value y will occur k times more frequently in repeated samples from the population defined by the value θ′ than from the population defined by θ″.

    2.4 The Relative Likelihood Function

    Since only ratios of likelihoods are meaningful, it is convenient to standardize the likelihood with respect to its maximum to obtain a unique representation not involving an arbitrary constant. The result is the relative likelihood function, also called the normed likelihood, defined as

        R(θ; y) = L(θ; y) / sup_θ L(θ; y) = L(θ; y)/L(θ̂; y).                    (2.2)

    The relative likelihood function thus varies between 0 and 1. The quantity θ̂ = θ̂(y) that maximizes L(θ; y) is called the maximum likelihood estimate of θ. Since f(y; θ) is a probability function, it is necessarily bounded, and so the denominator of (2.2) exists and is finite.

    The maximum likelihood estimate θ̂ is the most plausible value of θ in that it makes the observed sample most probable. The relative likelihood (2.2) measures the plausibility of any specified value of θ relative to that of θ̂. For the simple case of a single parameter, a graph of R(θ; y) shows what values of θ are plausible, and outside what limits the likelihood becomes small and the corresponding values of θ become implausible. It summarizes all of the sample information about θ contained in the observation y.

    In Example 2.2.1, the maximum likelihood estimate is θ̂ = y/n. The binomial relative likelihood function (2.2) is

        R(θ; y, n) = (θ/θ̂)^y [(1 − θ)/(1 − θ̂)]^(n−y) = n^n θ^y (1 − θ)^(n−y) / [y^y (n − y)^(n−y)].          (2.3)
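    A minimal sketch of (2.3) in code (the data values are made up); working on the log scale avoids overflow for large n.

    ```python
    # Binomial relative likelihood R(theta; y, n) of (2.3), computed on the log scale.
    import numpy as np

    def relative_likelihood(theta, y, n):
        theta_hat = y / n
        loglik = lambda p: y * np.log(p) + (n - y) * np.log(1 - p)
        return np.exp(loglik(theta) - loglik(theta_hat))

    # With y = 2, n = 10 (theta_hat = 0.2), compare two values of theta:
    print(relative_likelihood(0.1, 2, 10), relative_likelihood(0.5, 2, 10))
    ```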

    2.5 Continuous Random Variables

    The likelihood function (2.1) is defined in terms of discrete random variables, so that f is a probability function. This involves no essential loss of generality, since all measuring instruments have finite precision. But if the precision is sufficiently high, a continuous approximation can be made. This considerably simplifies the model and the calculations.

    If the observations are continuous, then the statement y = yₒ means yₒ − ½ε ≤ y ≤ yₒ + ½ε, where ε is determined by the precision of the measuring instrument. If y has density function f(y; θ), then by the law of the mean for integration,

        P(y = yₒ) = P(yₒ − ½ε ≤ y ≤ yₒ + ½ε) = ∫ f(t; θ) dt = ε f(y′; θ),

    the integral being taken over yₒ − ½ε ≤ t ≤ yₒ + ½ε, where y′ is some value of y between yₒ − ½ε and yₒ + ½ε. If f(y; θ) is approximately constant in this range for all plausible θ, then f(y′; θ) ≈ f(yₒ; θ) in this range. If this approximation is adequate, and if ε does not involve θ, then the density function f(yₒ; θ) may be used in the definition of likelihood (2.1). One requirement for this approximation to be adequate is that the density function f should not have a singularity in the range y ± ½ε. See Problems 11.1 and 11.2.


    Frequently the likelihood function is defined to be (2.1) in the continuous case where f is the density function, without recourse to the limiting approximation above. This allows the likelihood function to have a singularity, which results in one value of θ being infinitely more plausible than any other value of θ. The relative likelihood function (2.2) is then 1 for this value of θ and 0 for all other values of θ. Usually this makes no scientific sense. This difficulty is usually avoided by reverting to the discrete model of Section 2.2 in a neighborhood of the singularity, thus taking into account the finite accuracy of all scientific measurements.

    2.6 Score Function and Observed Information; Numerical Calculation of θ̂

    Usually it is not possible to calculate the maximum likelihood estimate θ̂ analytically. Numerical procedures are necessary. These usually involve the score function and the observed information. The score function and the observed information for a scalar parameter θ are defined as

        Sc(θ; y) = ∂ log L(θ; y)/∂θ ≡ ∂ log f(y; θ)/∂θ,                    (2.4a)

        I(θ; y) = −∂² log L(θ; y)/∂θ² ≡ −∂² log f(y; θ)/∂θ² ≡ −∂ Sc(θ; y)/∂θ.          (2.5a)

    The observed information calculated at the numerical value θ = θ^(m) is I(θ^(m); y). The observed information calculated at the maximum likelihood estimate θ̂ is thus

        I(θ̂; y) = [−∂² log L(θ; y)/∂θ²]_{θ = θ̂}.                    (2.6a)

    This quantity itself is often called the observed information. The maximum likelihood estimate θ̂ is usually, but not always, a solution of Sc(θ; y) = 0. It may, for example, be a boundary point at which Sc(θ̂; y) ≠ 0, as in Example 2.9.2 with y = 0, n. When θ̂ is a solution of Sc(θ; y) = 0, the condition for it to be a maximum is [∂Sc(θ; y)/∂θ]_{θ = θ̂} < 0, or equivalently, I(θ̂; y) > 0.

    The maximum likelihood estimate θ̂ and the observed information I(θ̂; y) also have theoretical importance. They exhibit two features of the likelihood, the former being a measure of its position relative to the θ-axis, the latter a measure of its curvature, or local precision, in a neighborhood of its maximum. Hence the justification of the term "information". This is of particular importance when the likelihood is symmetric, or normal in shape, since then I(θ̂; y) is usually the main feature determining the shape of the likelihood function. This will be discussed in Section 2.10.

    One standard numerical method of calculating the maximum likelihood estimate and related quantities is the Newton-Raphson iterative method. Let a root of the equation Sc(θ; y) = 0 be θ̂ = θ̂(y). Expanding Sc(θ; y) about a neighboring point θ^(0) in a Taylor series up to the linear term leads to the linear approximation of Sc(θ; y)

        Sc(θ; y) ≈ Sc(θ^(0); y) + (θ − θ^(0)) [∂Sc(θ; y)/∂θ]_{θ = θ^(0)} = Sc(θ^(0); y) − I(θ^(0); y)(θ − θ^(0)).

    Setting this equal to zero gives

        θ = θ^(1) = θ^(0) + [I(θ^(0); y)]^(−1) Sc(θ^(0); y).

    This leads to the recurrence relation

        θ^(m+1) = θ^(m) + [I(θ^(m); y)]^(−1) Sc(θ^(m); y),                    (2.7)

    yielding a sequence of values θ^(0), . . . , θ^(m), . . . . If the initial value θ^(0) is sufficiently close to θ̂, this sequence converges to θ̂, the sequence Sc(θ^(m); y) converges to 0, and the sequence I(θ^(m); y) converges to the observed information I(θ̂; y).

    The above can be extended to vector parameters θ = θ₁, . . . , θₖ. The score function vector is a k × 1 vector of score functions Sc = (Scᵢ),

        Scᵢ(θ; y) = ∂ log L(θ; y)/∂θᵢ,                    (2.4b)

    and the observed information is a symmetric k × k matrix of second derivatives I(θ; y) = (Iᵢⱼ),

        Iᵢⱼ = Iᵢⱼ(θ; y) = −∂² log L(θ; y)/∂θᵢ∂θⱼ.                    (2.5b)

    The observed information calculated at θ = θ̂ is similarly the corresponding k × k positive definite symmetric matrix of second derivatives evaluated at θ̂,

        Iᵢⱼ(θ̂; y) = [−∂² log L(θ; y)/∂θᵢ∂θⱼ]_{θ = θ̂}.                    (2.6b)

    The resulting recurrence relation is still (2.7) written in vector notation with θ being a k × 1 vector.

    The difficulty of this procedure is the requirement of an adequate initial value θ^(0) to start the iteration. Its advantage is the calculation of (2.6a), which contains important inferential information (Section 2.10). The procedure may be complicated by the presence of multiple stationary values, and care must be taken to ensure that the overall maximum has been obtained.

    2.7 Properties of Likelihood

    2.7.1 Nonadditivity of Likelihoods

    The principal property of likelihoods is that they cannot be meaningfully added. This is the sharp distinction between likelihood and probability. Likelihood is a point function, unlike probability, which is a set function. The probability of the disjunction A or B is well-defined, P(A or B) = P(A) + P(B). But the likelihood of the disjunction θ₁ or θ₂ is not defined. This is because probabilities are associated with events, and the disjunction of two events A or B is an event. But likelihoods are associated with simple hypotheses H, a simple hypothesis being one that specifies completely the numerical probability of the observations. And the disjunction of two simple hypotheses H₁ or H₂ is not, in general, a simple hypothesis H. For example, if y is a Poisson variate, then θ = 1, θ = 2 each separately specifies numerically the probability function of y. But the hypothesis H: θ = 1 or 2 does not specify numerically the probability function of y, and so is not a simple hypothesis. No likelihood can be associated with H.

    The fact that a likelihood of a disjunction of exclusive alternatives cannot be determined from the likelihoods of the individual values gives rise to the principal difficulty in using likelihoods for inferences. For if θ = (δ, ξ), there is no difficulty in obtaining the joint likelihood of (δ, ξ), since θ can be a vector in (2.1). But a likelihood for δ or for ξ separately cannot in general be thereby obtained. This gives rise to the search for special structures (Chapter 4) capable of separating δ from ξ, and of the estimation of one in the absence of knowledge of the other.

    The lack of additivity also implies that likelihoods cannot be obtained for intervals. Likelihood is a point function assigning relative plausibilities to individual values of the parameter within the interval. Care must therefore be exercised in interpreting likelihood intervals. The likelihood intervals of Section 2.8 assign plausibilities to specific points within the intervals. They do not assign measures of plausibility to the intervals themselves (Chapter 5).

    2.7.2 Combination of Observations

    A convenient property of likelihoods is the simplicity they provide for combining data from different experiments. Since the joint probability of independent events is the product of their individual probabilities, the likelihood of θ based on independent data sets is from (2.1) the product of the individual likelihoods based on each data set separately. Thus the log likelihoods based on independent data sets are combined by addition to obtain the combined log likelihood based on all data.

    The generality should be emphasized. It applies to data from different experiments, provided the likelihoods apply to the same parameter. In particular, this implies that the appropriate way to combine data from different experiments estimating the same parameter θ is not, in general, by combining the individual estimates θ̂ᵢ arising from the individual experiments, weighted by their standard errors, variances, or otherwise. It is by adding their respective log likelihood functions of θ. Thus, if the individual likelihood functions based on independent experiments Eᵢ are Lᵢ = L(θ; Eᵢ), the log likelihood based on the combined experiments is log L = Σ log Lᵢ. The combined maximum likelihood estimate θ̂ is then obtained by maximizing Σ log Lᵢ. The resulting θ̂ will not in general be a function solely of the individual θ̂ᵢ's and their standard errors or variances. It is obvious from this that the combined relative log likelihood function is not the sum of the individual log relative likelihood functions. This sum must be restandardized with respect to the combined θ̂.
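    A minimal sketch (with made-up data) of combining two binomial experiments that estimate the same θ by adding their log likelihoods. For binomial samples the combined maximum is the pooled proportion, but in general the combined θ̂ need not be a simple function of the individual estimates.

    ```python
    # Combine evidence about a common theta from independent binomial experiments
    # by adding log likelihoods, then maximize the sum (crude grid search).
    import numpy as np

    experiments = [(7, 10), (30, 50)]            # (y_i, n_i) pairs, made up

    def combined_loglik(theta):
        return sum(y * np.log(theta) + (n - y) * np.log(1 - theta)
                   for y, n in experiments)

    grid = np.linspace(0.001, 0.999, 9999)
    theta_hat = grid[np.argmax([combined_loglik(t) for t in grid])]
    print(theta_hat)     # close to the pooled proportion (7 + 30)/(10 + 50)
    ```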

    Of course, in practice it should first be checked that the different experiments are in fact estimating the same θ. This requires tests of homogeneity (Chapter 6).

    Because of the simplicity of combining log likelihoods by addition, the log likelihood is used more than the likelihood itself. Most measures of information and estimation procedures are based on the log likelihood, as in Section 2.6.

    2.7.3 Functional Invariance

    Another convenient feature of likelihoods is that, like probabilities (but unlike probability densities), likelihood is functionally invariant. This means that any quantitative statement about θ implies a corresponding statement about any 1 to 1 function δ = δ(θ) by direct algebraic substitution θ = θ(δ).

    If R(θ; y) is the relative likelihood function of θ, the relative likelihood function of δ is R_δ(δ; y) = R[θ(δ); y]. Also, δ̂ = δ(θ̂). For example, if θ > 0 and δ = log θ, then δ̂ = log θ̂, R_δ(δ; y) = R(exp δ; y), and a ≤ θ ≤ b is equivalent to log a ≤ δ ≤ log b. These two equivalent statements should have the same uncertainty or plausibility. Likelihood and probability both satisfy this requirement.

    This is quite useful in practice, since in many cases some other parameter δ is of more interest than θ, as some of the examples in Section 2.9 will show.

    Also, often a change of parameter may simplify the shape of the likelihood function. For instance, R_δ(δ) may be more symmetric, or approximately normal in shape, than R(θ). Inferences in terms of δ will then be structurally simpler, but mathematically equivalent, to those in terms of θ. This is exemplified in Section 2.10.
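    A small numerical illustration (not from the text): maximizing the binomial likelihood directly in terms of δ = log θ gives δ̂ = log θ̂, as functional invariance requires; scipy is assumed.

    ```python
    # Functional invariance: the maximum in terms of delta = log(theta) is log(theta_hat).
    import numpy as np
    from scipy.optimize import minimize_scalar

    y, n = 2, 10
    neg_loglik_delta = lambda d: -(y * d + (n - y) * np.log(1 - np.exp(d)))
    res = minimize_scalar(neg_loglik_delta, bounds=(-10.0, -1e-6), method="bounded")
    print(res.x, np.log(y / n))      # both approximately -1.609
    ```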

    2.8 Likelihood Intervals

    Rather than comparing the relative plausibility or likelihood of two specific values of an unknown parameter as in Section 2.3, it is usually of more interest to specify ranges of most plausible values. These summarize the relative likelihood function in terms of likelihood intervals, or more generally likelihood regions.

    A level c likelihood region for θ is given by

        R(θ; y) ≥ c,    0 ≤ c ≤ 1.                    (2.8)

    When θ is a scalar the region will be an interval if R is unimodal, or possibly a union of disjoint intervals if R is multimodal. Every specific value of θ within the region has a relative likelihood R(θ) ≥ c, and every specific value of θ outside the region has a relative likelihood R(θ) < c. The region therefore separates the plausible values of θ from the implausible values at level c.


    When θ is a single scalar parameter, the intervals, or unions of intervals, are obtained by drawing a horizontal line through the graph of R(θ) at distance c above the θ-axis. Varying c from 0 to 1 produces a complete set of nested likelihood intervals that converges to the maximum likelihood estimate θ̂ as c → 1. Thus θ̂ is common to all of the intervals, and so serves to specify their location. This complete set of intervals is equivalent to the likelihood function, and reproduces the graph of R(θ).

    A single interval is not very informative and so does not suffice. It merely states that values of θ outside the interval have relative plausibilities less than c, while values inside have plausibilities greater than c. But it gives no indication of the behavior of plausibility within the interval. To do this the interval should at least be supplemented by θ̂ to give some indication of the statistical center of the intervals. The deviation of θ̂ from the geometrical center gives an idea of the skewness of the likelihood function and hence of the behavior of plausibility within the interval. Preferably, however, a nested set of likelihood intervals, such as c = .05, .15, .25, should be given along with θ̂.
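    A sketch of computing the endpoints of level-c likelihood intervals for the binomial example by solving R(θ) = c on each side of θ̂; scipy's root finder is assumed.

    ```python
    # Level-c likelihood intervals (2.8) for a binomial sample, by root finding.
    import numpy as np
    from scipy.optimize import brentq

    def likelihood_interval(y, n, c):
        theta_hat = y / n
        loglik = lambda th: y * np.log(th) + (n - y) * np.log(1 - th)
        g = lambda th: np.exp(loglik(th) - loglik(theta_hat)) - c
        return brentq(g, 1e-10, theta_hat), brentq(g, theta_hat, 1 - 1e-10)

    for c in (0.05, 0.15, 0.25):
        print(c, likelihood_interval(2, 10, c))     # intervals are nested as c increases
    ```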

    In light of Section 2.7.1 it is important to emphasize that a likelihood interval is not a statement of uncertainty about the interval. It is a statement about the relative plausibility of the individual points within the interval. Because of the nonadditivity of likelihoods, likelihoods of intervals cannot generally be obtained. This is discussed further in Chapter 5.

    2.9 Examples of Likelihood Inferences

    Example 2.9.1 A capture-recapture problem. Animal population sizes N are often estimated using mark-capture-recapture techniques. This entails catching animals, marking them, then releasing them and recapturing them repeatedly over a given period of time. The observations then take the form f₁, f₂, . . ., where fᵢ animals are caught i times. Then f₀ is unobserved, and hence unknown, and N = Σ_{i≥0} fᵢ. Denote the total number of animals caught by s = Σ_{i≥0} i fᵢ.

    One model for such an experiment is the classical occupancy model in which s objects are distributed randomly to N cells. The probability of an object being distributed to any particular cell is assumed to be 1/N, the same for all cells (the uniform distribution). Then the number of ways of selecting the f₀, f₁, . . . cells is N!/∏_{i≥0} fᵢ!. The number of ways of distributing the s objects to these cells is s!/∏_{i≥1} (i!)^fᵢ, since for each of the fᵢ cells the i objects they contain can be permuted in i! ways that do not produce different distributions. The total number of equally probable distributions is N^s. Letting r = Σ_{i≥1} fᵢ, so that f₀ = N − r, the probability of the observations is

        f(f₁, f₂, . . . ; N | s) = N^(−s) N! s! / ∏_{i≥0} fᵢ!(i!)^fᵢ = N^(−s) N! s! / [(N − r)! ∏_{i≥1} fᵢ!(i!)^fᵢ].

    Another argument leading to this model is given in Problem 11.4(a), Chapter 11. The likelihood function of N is L(N; r, s) ∝ N^(−s) N!/(N − r)!, N ≥ r.


    Figure 2.1: Relative likelihood, capture-recapture data

    One example of such an experiment, pertaining to butterfly populations, gave f₁ = 66, f₂ = 3, fᵢ = 0, i ≥ 3, for which r = 69, s = 72 (Craig 1953, Darroch and Ratcliff 1980). Since N is an integer, to obtain an initial value N^(0) consider L(N; r, s) = L(N − 1; r, s). This leads to

        1 − r/N = (1 − 1/N)^s ≈ 1 − s/N + s(s − 1)/(2N²).

    Solving for N gives N = N^(0) = s(s − 1)/[2(s − r)] = 852. Using Stirling's approximation log N! = (N + ½) log N − N + log √(2π) and considering N as a continuous variable, the quantities (2.4a) and (2.5a) are

        Sc = −s/N + 1/(2N) − 1/[2(N − r)] + log N − log(N − r),

        I = −s/N² + 1/(2N²) − 1/[2(N − r)²] − 1/N + 1/(N − r).

    Successive iterations of (2.7) give

        N^(0) = 852.00,   Sc(N^(0); r, s) = −1.049 × 10⁻⁴,   I(N^(0); r, s) = 4.117 × 10⁻⁶,
        N^(1) = 826.51,   Sc(N^(1); r, s) = 6.876 × 10⁻⁶,    I(N^(1); r, s) = 4.669 × 10⁻⁶,
        N^(2) = 827.99,   Sc(N^(2); r, s) = 2.514 × 10⁻⁸,    I(N^(2); r, s) = 4.635 × 10⁻⁶,

    after which there is little change. Since N must be an integer, N̂ = 828, I(N̂; r, s) = 4.6350 × 10⁻⁶.

    The resulting relative likelihood R(N; r = 69, s = 72) is shown in Figure 2.1. Its main feature is its extreme asymmetry. This asymmetry must be taken into account in estimating N, or the resulting inferences will be very misleading. They will in particular result in emphasizing small values of N that are very implausible and ignoring large values of N that are highly plausible, thus understating the magnitude of N. The likelihood inferences can be summarized by

        c = .25:   375,   828,   2,548,
        c = .15:   336,   828,   3,225,
        c = .05:   280,   828,   5,089,

    where N̂ has been included in each interval to mark its statistical center. These intervals exhibit the extreme variation in the upper likelihood limits as compared with the lower likelihood limits. Thus these data can put fairly precise lower limits on the population size, but not on the upper limits. A slight change in the relative likelihood produces a large change in the upper limit. Ignoring these facts can result in seriously understating the possible size of N. See Problem 11.4.
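    A sketch that recomputes this example with exact log factorials (gammaln) in place of Stirling's approximation; it should reproduce, at least approximately, the estimate and the likelihood limits quoted above. scipy is assumed to be available.

    ```python
    # Capture-recapture likelihood L(N; r, s) proportional to N^(-s) N!/(N - r)!,
    # maximized over integer N, with level-c likelihood limits read off directly.
    import numpy as np
    from scipy.special import gammaln

    r, s = 69, 72
    N = np.arange(r, 20001)
    loglik = -s * np.log(N) + gammaln(N + 1) - gammaln(N - r + 1)
    R = np.exp(loglik - loglik.max())
    print("N_hat =", N[np.argmax(R)])            # 828 in the text
    for c in (0.25, 0.15, 0.05):
        inside = N[R >= c]
        print(c, inside.min(), inside.max())     # compare with the table above
    ```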

The remaining examples are less specialized and yield likelihoods that more commonly occur in practice. However, the above points apply to most of them.

Example 2.9.2 Binomial likelihood, Example 2.2.1. Graphs of the relative likelihoods (2.3) with n = 10 and y = 0, 2, 5, 7, 10 are given in Figure 2.2. The shapes vary with y, being symmetric for y = 5, with increasing asymmetry as y deviates from 5. The most extreme cases are y = 0, 10, for which the maximum likelihood estimates are the boundary points not satisfying the equation of maximum likelihood Sc(θ; y) = 0 (Section 2.6). This variation in shape determines the kind of inferences that can be made and, perhaps more importantly, the kind that cannot be made, about θ. For example, symmetric likelihoods, like that arising from y = 5, yield inferences that can be expressed in the form θ = θ̂ ± u for various values of u. Inferences from asymmetric likelihoods, like that arising from y = 2, cannot take this simple form. To do so would not reproduce the shape of the likelihood, and would exclude large plausible values of θ while including small, but highly implausible, values, or conversely.

Boundary point cases y = 0, n are often thought to cause difficulties in estimating θ. However, the likelihood is still well-defined, and so inferences based on the likelihood function cause no difficulties. For y = 0 a level c likelihood interval takes the form 0 ≤ θ ≤ 1 − c^{.1}. Since the relative likelihood is R(θ; y = 0, n = 10) = (1 − θ)¹⁰ = P(y = 0; n = 10, θ), the inferences are even simpler than otherwise, since the relative likelihoods are actual probabilities, not merely proportional to them. Thus, for example, using c = .01, the resulting likelihood interval 0 ≤ θ ≤ .37 can be interpreted as saying that unless θ < .37, an event of probability less than .01 has occurred.
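The level c likelihood intervals just described are easy to read off numerically from (2.3). The following minimal sketch, assuming NumPy, does so, and checks the y = 0 case against the closed form 0 ≤ θ ≤ 1 − c^(1/n) noted above.

```python
import numpy as np

def binom_rel_lik(theta, y, n):
    """Binomial relative likelihood R(theta; y, n) of (2.3)."""
    if y == 0:
        return (1 - theta)**n
    if y == n:
        return theta**n
    th_hat = y / n
    return (theta/th_hat)**y * ((1 - theta)/(1 - th_hat))**(n - y)

theta = np.linspace(1e-6, 1 - 1e-6, 100001)
n, c = 10, 0.01
for y in (0, 2, 5):
    R = binom_rel_lik(theta, y, n)
    inside = theta[R >= c]              # level c likelihood interval
    print(y, round(inside.min(), 3), round(inside.max(), 3))

print(1 - c**(1/n))                     # closed form for y = 0: about 0.369
```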


    Figure 2.2: Binomial relative likelihoods, n = 10

Example 2.9.3 Two binomial likelihoods. In a clinical trial to investigate the efficacy of ramipril in enhancing survival after an acute myocardial infarction, there were 1986 subjects, of which 1004 randomly chosen subjects were given ramipril and the remaining 982 were given a placebo (control group) (AIRE Study Group 1993). The resulting observations are usually presented in the form of a 2 × 2 contingency table:

    Treatment    S      F     Total
    Ramipril    834    170    1004
    Placebo     760    222     982
    Total      1594    392    1986

The relative likelihoods are the binomial relative likelihoods (2.3), R(θ; 834, 1004) and R(θ; 760, 982), shown in Figure 2.3.

These data support the superiority of ramipril over the placebo. The likelihoods are almost disjoint. Plausible values of the survival probability are greater under ramipril than under the placebo. The relative likelihoods intersect at θ = .803, which has a relative likelihood of 8% under both ramipril and the placebo. Values of the survival probability θ > .803 have relative likelihoods greater than 8% under ramipril and less than 8% under the control. The maximum likelihood estimates are .77 for the placebo and .83 for ramipril. The 15% likelihood intervals are (.747, .799) for the placebo and (.807, .853) for ramipril. These facts are shown in Figure 2.3.


    Figure 2.3: ramipril ; control - - - -

Data in the form of a 2 × 2 table are discussed further in Example 2.9.13 and Chapter 4, Section 4.2.
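A minimal numerical sketch of these calculations, assuming NumPy, recovers the 15% likelihood intervals and the crossing point of the two relative likelihoods quoted above.

```python
import numpy as np

def rel_lik(theta, y, n):
    """Binomial relative likelihood R(theta; y, n) of (2.3)."""
    th_hat = y / n
    return (theta / th_hat)**y * ((1 - theta) / (1 - th_hat))**(n - y)

theta = np.linspace(0.70, 0.90, 20001)
R_ram = rel_lik(theta, 834, 1004)        # ramipril
R_pla = rel_lik(theta, 760, 982)         # placebo

for name, R in [("placebo", R_pla), ("ramipril", R_ram)]:
    inside = theta[R >= 0.15]            # 15% likelihood interval
    print(name, round(inside.min(), 3), round(inside.max(), 3))

# value of theta at which the two relative likelihoods cross
i = np.argmin(np.abs(np.log(R_ram) - np.log(R_pla)))
print(round(theta[i], 3), round(R_ram[i], 3))   # near .803, relative likelihood about .08
```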

Example 2.9.4 A multimodal likelihood. A sample y1, y2 from a Cauchy density

    f(y; θ) = (1/π) · 1/[1 + (y − θ)²]

provides an artificially simple example of a multimodal likelihood. The relative likelihood is

    R(θ; y1, y2) = {[1 + (y1 − θ̂)²]/[1 + (y1 − θ)²]} · {[1 + (y2 − θ̂)²]/[1 + (y2 − θ)²]}.

For y1 = −6, y2 = 6, there are two maximum likelihood estimates that can be obtained analytically as θ̂ = ±5.92, and a local minimum at θ = 0, at which R(0) = .105. See Figure 2.4. For c ≤ .105 the level c likelihood region is an interval. But the behavior of plausibility within this interval is more complicated than in the preceding example. It increases to a maximum at θ = −5.92, decreases to a minimum of .105 at θ = 0, increases again to a maximum at θ = 5.92, and then decreases. For c > .105 the intervals split into the union of two intervals. For c = .25 the likelihood region is −7.47 ≤ θ ≤ −3.77, 3.77 ≤ θ ≤ 7.47. See Problem 11.5, Chapter 11.

In more complicated cases the only adequate way of summarizing the evidence is to present a graph of the likelihood function.
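For the Cauchy example above, the bimodal shape and the c = .25 region can be traced numerically. The following is a minimal sketch assuming NumPy; the closed form ±√(y2² − 1) for the maxima holds only in this symmetric case.

```python
import numpy as np

y1, y2 = -6.0, 6.0
theta_hat = np.sqrt(y2**2 - 1)            # the two maxima are at about +/- 5.92

def rel_lik(theta):
    num = (1 + (y1 - theta_hat)**2) * (1 + (y2 - theta_hat)**2)
    den = (1 + (y1 - theta)**2) * (1 + (y2 - theta)**2)
    return num / den

theta = np.linspace(-12, 12, 240001)
R = rel_lik(theta)
print(rel_lik(0.0))                       # about 0.105 at the local minimum
inside = theta[R >= 0.25]                 # two pieces, roughly (-7.47, -3.77) and (3.77, 7.47)
print(inside.min(), inside.max())
```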


Figure 2.4: Cauchy relative likelihood, y1 = −6, y2 = 6

Example 2.9.5 Poisson likelihood. The Poisson distribution with mean θ has probability function f(y; θ) = θ^y exp(−θ)/y!. The joint probability function of a set of n independent Poisson observations y1, . . . , yn is

    f(y1, . . . , yn; θ) = Π θ^yi exp(−θ)/yi! = θ^t exp(−nθ)/Π yi!,   t = Σ yi.

Thus the likelihood function is proportional to L(θ; t, n) = θ^t exp(−nθ). The maximum likelihood estimate is θ̂ = t/n = ȳ. The relative likelihood function is

    R(θ; t, n) = (θ/θ̂)^t exp[−n(θ − θ̂)] = (nθ/t)^t exp(t − nθ).   (2.9)

This is shown in Figure 2.5 for y1 = 1, y2 = 5, so that n = 2, t = 6, θ̂ = 3. The main feature of R(θ; 6, 2) is its asymmetry. This must be taken into account in making inferences about θ. The plausibility drops off faster for values of θ < 3 than for values θ > 3.

To illustrate the effect of increasing the sample size, Figure 2.5 also shows the relative likelihood arising from n = 20, t = 60. The maximum likelihood estimate is still θ̂ = 3, so that the relative likelihoods are centered at the same point. The difference is in their shape, R(θ; 60, 20) being much more condensed around θ̂ = 3 than is R(θ; 6, 2). This reflects the increased precision in estimating θ arising from the larger sample containing more information about θ. A related consequence of the larger sample is that R(θ; 60, 20) is also much more symmetric about θ = 3 than is R(θ; 6, 2).


Figure 2.5: Poisson relative likelihoods. R(θ; 6, 2); R(θ; 60, 20) - - - -; N(3, .15)

The asymmetry of R(θ; 6, 2) must be taken into account in making inferences about θ based on t = 6, n = 2. Larger values of θ are more plausible than smaller values. Failure to take this into account would result in understating the magnitude of θ. The same is not true when t = 60, n = 20, illustrating how the shape of the likelihood function must influence parametric inferences.
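A minimal sketch, assuming NumPy, makes the contrast concrete by computing 15% likelihood intervals for the two samples from (2.9).

```python
import numpy as np

def poisson_rel_lik(theta, t, n):
    """Poisson relative likelihood R(theta; t, n) of (2.9)."""
    return (n * theta / t)**t * np.exp(t - n * theta)

theta = np.linspace(0.01, 12, 120001)
for n, t in [(2, 6), (20, 60)]:
    R = poisson_rel_lik(theta, t, n)
    inside = theta[R >= 0.15]
    print(n, round(inside.min(), 2), round(inside.max(), 2))
# the n = 2 interval is markedly asymmetric about 3; the n = 20 interval much less so
```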

Example 2.9.6 Poisson dilution series (Fisher 1922). Suppose the density of organisms in a given medium is θ per unit volume. To estimate θ the original medium is successively diluted by a dilution factor a to obtain a series of k + 1 solutions with densities θ/a⁰, θ/a, θ/a², . . . , θ/a^k. Suppose that a unit volume of the solution with density θ/a^i is injected into each of ni plates containing a nutrient upon which the organisms multiply, and that only the presence or absence of organisms can be detected. The observations are then y0, y1, . . . , yk, where yi is the number of sterile plates out of the ni at dilution level i.

Assuming a Poisson distribution of the organisms in the original medium, the probability of a sterile plate at level i is the probability that a given unit volume at dilution level i contains no organisms, which is

    pi = exp(−θ/a^i),   i = 0, 1, . . . , k.

The probability of a fertile plate at level i is 1 − pi. Assuming independence of the ni plates, yi has the binomial distribution (ni, pi).


    Figure 2.6: Relative likelihood, Poisson dilution series data

The probability of the observations is Π_{i=0}^{k} (ni choose yi) pi^yi (1 − pi)^(ni−yi). The log likelihood function of θ is

    log L(θ; y, a, n) = Σ_{i=0}^{k} [yi log pi + (ni − yi) log(1 − pi)] + a constant,   n = {ni}.

The maximum likelihood estimate cannot be obtained analytically, and so must be calculated numerically. The first and second derivatives of log L can be calculated as

    Sc(θ; y, a, n) = ∂ log L/∂θ = Σ (ni pi − yi)/[a^i (1 − pi)],

    I(θ; y, a, n) = −∂² log L/∂θ² = Σ pi (ni − yi)/[a^{2i} (1 − pi)²].

These calculations are facilitated by noting that dpi/dθ = −pi/a^i. Using these results, (2.7) can be used iteratively with a suitable initial value for θ. An initial value can be obtained by using a single dilution producing an observed frequency close to .5ni.

Fisher and Yates (1963, p. 9) give the following data: a = 2, k + 1 = 10, {ni} = 5, and {yi} = {0, 0, 0, 0, 1, 2, 3, 3, 5, 5}. The unit volume was 1 cc, which contained .04 gm of the material (potato flour) containing the organisms. Thus if θ is the number of organisms per cc, the number of organisms per gm of potato flour is 25θ.

For dilution level i = 6, y = 3 = .6ni. Setting exp(−θ/2⁶) = .6 yields an initial value θ = θ(0) = −2⁶ log .6 = 32.693. Then the above derivatives used in two iterations of (2.7) give

    θ(0) = 32.693,  Sc(θ(0); y) = −.02305,  I(θ(0); y) = .01039,
    θ(1) = 30.473,  Sc(θ(1); y) = .00217,   I(θ(1); y) = .01243,
    θ(2) = 30.648,  Sc(θ(2); y) = .00002,   I(θ(2); y) = .01225,

after which there is little change. The maximum likelihood estimate is θ̂ = 30.65 organisms/cc and I(θ̂; y) = .01225. The maximum likelihood estimate of the number of organisms per gm of potato flour is 25θ̂ = 766. But again the likelihood function is asymmetric (Figure 2.6) and this must be taken into account in making inferences about θ. For example, the 15% relative likelihood bounds on 25θ are 422, 1325. The deviation on the right of 25θ̂ is 60% more than that on the left. Thus the number of organisms is more likely to be much larger than 25θ̂ than much smaller.
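The iteration is easily reproduced. The following minimal sketch, assuming NumPy, enters the Fisher and Yates data directly and applies the score and information derived above.

```python
import numpy as np

# Dilution series data of Fisher and Yates (1963): a = 2, k + 1 = 10 levels,
# n_i = 5 plates per level, y_i sterile plates at level i
a = 2.0
y = np.array([0, 0, 0, 0, 1, 2, 3, 3, 5, 5], dtype=float)
n = np.full(10, 5.0)
ai = a ** np.arange(10)

def score_info(theta):
    p = np.exp(-theta / ai)                           # P(sterile plate) at level i
    Sc = np.sum((n * p - y) / (ai * (1 - p)))         # d log L / d theta
    I = np.sum(p * (n - y) / (ai**2 * (1 - p)**2))    # -d^2 log L / d theta^2
    return Sc, I

theta = -a**6 * np.log(0.6)       # initial value 32.693 from dilution level i = 6
for _ in range(4):
    Sc, I = score_info(theta)
    theta += Sc / I               # iteration (2.7); converges to about 30.65
print(theta, I)
```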

Example 2.9.7(a) Gamma likelihood. Let y1, . . . , yn be n independent exponential observations with mean θ, having density function f(y; θ) = (1/θ) exp(−y/θ). The likelihood function of θ based on y1, . . . , yn is proportional to the density function,

    L(θ; t, n) ∝ Π f(yi; θ) = Π (1/θ) exp(−yi/θ) = (1/θ)^n exp(−t/θ),   (2.10)

where t = Σ yi. The maximum likelihood estimate is θ̂ = t/n = ȳ. Since t has the gamma distribution, (2.10) may be called a gamma likelihood. The Poisson likelihood (2.9) has the same algebraic form as (2.10) and so is a gamma likelihood. The gamma relative likelihood with n = 7, t = 308, θ̂ = 44 is shown in Figure 2.7. The asymmetry is again apparent.

Example 2.9.7(b) Censored exponential failure times. Suppose n items were observed for fixed periods of time T1, . . . , Tn, and that r of the items were observed to fail at times t1, . . . , tr, and the remaining (n − r) items were observed to survive their periods of observation, and so were censored at times Tr+1, . . . , Tn. This gives rise to a combination of continuous and discrete variates. An item that fails contributes the probability density function (1/θ) exp(−ti/θ); a censored observation contributes the probability function P(ti > Ti) = exp(−Ti/θ). The resulting likelihood function is thus

    L(θ; t, r) ∝ Π_{i=1}^{r} (1/θ) exp(−ti/θ) · Π_{i=r+1}^{n} exp(−Ti/θ) = (1/θ)^r exp(−t/θ),   (2.11)

where t = Σ_{i=1}^{r} ti + Σ_{i=r+1}^{n} Ti.

This is also a gamma likelihood (2.10) with n replaced by r, for r ≠ 0. An example arising in the literature is n = 10, {Ti} = {81, 70, 41, 31, 31, 30, 29, 72, 60, 21} days, {ti} = {2, 51, 33, 27, 14, 24, 4}; the last three items were censored at times T8 = 72, T9 = 60, and T10 = 21 (Bartholomew 1957).


    Figure 2.7: Gamma relative likelihood, n=7, t=308

See also Sprott and Kalbfleisch (1969), Sprott (1990). Thus r = 7 and t = 308. This yields the same likelihood function as Example 2.9.7(a) (Figure 2.7).

As an example of the use of functional invariance (Section 2.7.3), the survivor function P(t > τ) = π = exp(−τ/θ) may be of more interest than θ in this example. The likelihood of π for any specified τ, or of τ for any specified π, can be obtained by substituting θ = −τ/log π into the likelihood of θ (2.11), giving the relative likelihood

    (−t log π/rτ)^r exp[r + (t log π/τ)] = (−t log π/rτ)^r π^{t/τ} exp(r).

Similarly, a likelihood interval a ≤ θ ≤ b is equivalent to −a log π ≤ τ ≤ −b log π at the same likelihood level. For example, in Figure 2.7, the 5% likelihood interval is 20 ≤ θ ≤ 130, giving the corresponding 5% likelihood region −20 log π ≤ τ ≤ −130 log π for τ and π.

Example 2.9.8 Uniform likelihood. Suppose y is a U(θ − ½, θ + ½) variate. Its density is then unity for θ − ½ ≤ y ≤ θ + ½ and zero elsewhere. In a sample of size n let y(1) ≤ y(2) ≤ · · · ≤ y(n) be the ordered observations. Then θ − ½ ≤ y(1) ≤ y(n) ≤ θ + ½. The density functions are all unity in this range, so that the relative likelihood function is the single likelihood interval

    R(θ; y(1), y(n)) = 1 for y(n) − ½ ≤ θ ≤ y(1) + ½, and 0 otherwise.   (2.12)


Here the precision of the likelihood depends on the range r = y(n) − y(1), the width of the likelihood interval being 1 − r. All values of θ within the interval are equally plausible. The data give no reason for preferring one value of θ over any other value. All values outside the likelihood interval are impossible. The least precision arises when r = 0, the precision being that of a single observation. The most precision arises when r = 1, its maximum value, for which θ = y(n) − ½ = y(1) + ½ is known.

Here the sample size n plays no role in specifying the observed likelihood, and hence the precision of the inferences. However, indirectly n plays a role, since as n increases, the probability of obtaining r = 1 increases. However, after r is observed this fact is irrelevant. The inferences must be conditioned on the value of r obtained, that is, on the observed likelihood. This exemplifies the statement in Section 2.3 that before the experiment probabilities are relevant; after the experiment likelihoods are relevant.

Example 2.9.9 Normal likelihood. If y1, . . . , yn are independent normal variates with mean θ and variance σ², yi ~ N(θ, σ²), where σ is assumed known, the likelihood function of θ is

    L(θ; θ̂, I) ∝ exp[−(1/2σ²) Σ (yi − θ)²] = exp{−(1/2σ²)[Σ (yi − ȳ)² + n(ȳ − θ)²]}
               ∝ exp[−(n/2σ²)(ȳ − θ)²] = exp[−½ I(θ̂ − θ)²] = R(θ; θ̂, I),   (2.13)

where θ̂ = ȳ is the maximum likelihood estimate and I = n/σ² = 1/var(ȳ) is the observed information (2.6a). This may be expressed by saying that θ has a N(θ̂, σ²/n) likelihood. This must not be confused with the N(θ, σ²/n) distribution of ȳ. It simply means that θ has a likelihood function that has the same shape as the normal density function, and is centered at the observed ȳ with variance σ²/n.

Some normal likelihoods with θ̂ = 0 and various values of I are shown in Figure 2.8. Unlike the previous likelihoods, they are completely symmetric about θ = θ̂ = ȳ. Their shape, or precision, is completely determined by I, or equivalently var(θ̂). This, in addition to the assumed prevalence of the normal distribution, may explain why so much attention is paid to the mean and variance of estimates by textbooks.

Note that up to now, no mention has been made of mean and variance, nor of any other properties of estimates, nor indeed of estimates per se. The emphasis is on the whole of the likelihood function, and in particular on its location and its shape. The special feature of the normal likelihood is that these are determined completely by the two quantities θ̂ and I. If the underlying distribution is normal, these are the mean and the reciprocal of the variance. But not in general.

Figure 2.8: N(0, 1/I) likelihoods, I = 4 ....; I = 1; I = .25 - - - -

Example 2.9.10(a) Normal regression. If the yi are independent N(θxi, σ²) variates, i = 1, . . . , n, the xi being known constant covariates, and where σ is assumed known, the likelihood function of θ is also

    L(θ; θ̂, I) ∝ exp[−(1/2σ²) Σ (yi − θxi)²] = exp{−(1/2σ²)[Σ (yi − θ̂xi)² + (θ − θ̂)² Σ xi²]}
               ∝ exp[−½ I(θ − θ̂)²] = R(θ; θ̂, I),   (2.14)

where θ̂ = Σ xiyi / Σ xi², I = Σ xi² / σ².

Example 2.9.10(b) Normal autoregression. Suppose {yi} is a sequence of observations ordered in time with conditional normal distributions yi | yi−1 ~ N(θ y_{i−1}, σ²) of yi given yi−1, i = 1, . . . , n. Suppose the initial observation y0 is regarded as fixed in advance, and as before, σ is known. The likelihood function of θ is the same,

    L(θ; θ̂, I) ∝ Π_{i=1}^{n} f(yi; θ, σ | yi−1) ∝ exp[−(1/2σ²) Σ (yi − θ y_{i−1})²]
               ∝ exp[−½ I(θ − θ̂)²] = R(θ; θ̂, I),

where θ̂ = Σ yi y_{i−1} / Σ y_{i−1}², I = Σ y_{i−1}² / σ².

This likelihood function has the same algebraic form as the regression likelihood (2.14). It is obtained from (2.14) by replacing xi by y_{i−1}. Thus the likelihood function makes no distinction between regression and autoregression. The same is true for the general linear regression model, yi ~ N(Σ_j xij βj, σ²).
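The point is easily seen in code. The following minimal sketch, assuming NumPy and using simulated data purely for illustration (the value theta_true is hypothetical), computes the estimate and the observed information by literally the same formula in both cases.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0          # assumed known, as in the text
theta_true = 0.7     # hypothetical value used only to simulate data

# Normal regression: y_i ~ N(theta * x_i, sigma^2) with known covariates x_i
x = np.linspace(1.0, 5.0, 20)
y = theta_true * x + rng.normal(0, sigma, x.size)
theta_hat_reg = np.sum(x * y) / np.sum(x**2)        # (2.14)
info_reg = np.sum(x**2) / sigma**2

# Normal autoregression: y_i | y_{i-1} ~ N(theta * y_{i-1}, sigma^2), y_0 fixed
z = np.empty(21)
z[0] = 2.0
for i in range(1, z.size):
    z[i] = theta_true * z[i-1] + rng.normal(0, sigma)
lagged, current = z[:-1], z[1:]
theta_hat_ar = np.sum(current * lagged) / np.sum(lagged**2)   # same formula, x_i -> y_{i-1}
info_ar = np.sum(lagged**2) / sigma**2

def normal_rel_lik(theta, theta_hat, info):
    """R(theta; theta_hat, I) of (2.13)/(2.14)."""
    return np.exp(-0.5 * info * (theta - theta_hat)**2)

print(theta_hat_reg, info_reg, theta_hat_ar, info_ar)
```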


The previous examples all involved a single parameter. The corresponding treatment of multiple parameters is more difficult. With two or three parameters it may be possible to make three- or four-dimensional plots. The next examples involve two parameters. The simplest presentation seems to be by contour plots.

Example 2.9.11 Logistic regression. This is somewhat similar to Example 2.9.6. Suppose the response to a stimulus is quantal (success, failure), as is often the case in assessing the potency of drugs. Suppose when applied at dose xi there is a probability pi = pi(xi) that the drug produces the required effect (success), and probability 1 − pi that the drug does not produce the required effect (failure). The dose xi is usually measured by the logarithm of the concentration of the drug, so that −∞ < xi < ∞.

For simplicity it is desirable to have the response linearly related to the xi. But since pi is confined to the range (0, 1), it is unlikely in principle that pi can be linearly related to xi over a wide range. Therefore it is advisable to change to a parameter θi(pi) having a doubly infinite range, so that a linear relation between θi and xi is at least possible in principle. The logistic transformation

    log[pi/(1 − pi)] = α + βxi,   pi = e^{α+βxi}/(1 + e^{α+βxi}),

is convenient for this purpose. This is a logistic regression model. It assumes that the log odds, log[p/(1 − p)], is linearly related to x. However, the parameter α is usually of little interest, since it is the log odds of success at x = 0, which is usually outside the range of x used in the experiment. Of more interest is the parameter δ = −α/β. In terms of the equivalent parametrization θ = (δ, β),

    log[pi/(1 − pi)] = β(xi − δ),   pi = e^{β(xi−δ)}/(1 + e^{β(xi−δ)}).   (2.15)

Setting xi = δ produces pi = .5. The parameter δ is therefore the dose required to produce the required effect with probability .5, and is called the ED50 dose, the median effective dose. Then δ is a summary measure of the strength of the drug, and β is a summary measure of its sensitivity to changes in the dose. Also the use of δ rather than α facilitates obtaining an initial value in the following iterations required to obtain the maximum likelihood estimate.

Assuming a binomial distribution of responses, the probability of yi successes in ni trials, i = 1, . . . , k, is (ni choose yi) pi^yi (1 − pi)^(ni−yi), where pi is given by (2.15). The likelihood function is therefore

    L(δ, β; s, t, {ni}) ∝ Π (ni choose yi) pi^yi (1 − pi)^(ni−yi)
                        ∝ Π e^{β(xi−δ)yi} [1 + e^{β(xi−δ)}]^{−ni}
                        = e^{β(t−δs)} Π [1 + e^{β(xi−δ)}]^{−ni},   (2.16)

where s = Σ yi, t = Σ xiyi.


The score function vector (2.4b) and observed information matrix (2.5b) are

    Sc1 = ∂ log L/∂δ = −β Σ (yi − ni pi),
    Sc2 = ∂ log L/∂β = Σ (xi − δ)(yi − ni pi),

    I11 = −∂² log L/∂δ² = β² Σ ni pi(1 − pi),   (2.17)
    I12 = −∂² log L/∂δ∂β = Σ (yi − ni pi) − β Σ ni(xi − δ) pi(1 − pi),
    I22 = −∂² log L/∂β² = Σ ni(xi − δ)² pi(1 − pi).

When δ, β are replaced by their maximum likelihood estimates to give the information matrix (2.6b), the quantity Σ (yi − ni pi) in I12 is zero because of the first maximum likelihood equation Sc1 = 0, and so disappears from (2.17). These calculations are facilitated by noting that 1 − pi = 1/[1 + e^{β(xi−δ)}]; differentiating this separately with respect to δ and with respect to β gives dpi/dδ = −β pi(1 − pi), dpi/dβ = (xi − δ) pi(1 − pi).

The following data cited by Finney (1971, p. 104) are the results of a test of the analgesic potency of morphine on mice. The analgesic was classified as effective (success) or ineffective (failure).

    xi    .18    .48    .78
    ni    103    120    123
    yi     19     53     83

The recursive method of Section 2.6 can be used to obtain the maximum likelihood estimates. Initial values can be obtained by using the two endpoints i = 1, k of (2.15), log(19/84) = β(.18 − δ), log(83/40) = β(.78 − δ), giving β = 3.69, δ = .583. Using these as initial values, a few iterations of (2.7) give the maximum likelihood estimates β̂ = 3.6418, δ̂ = .5671.
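These iterations can be sketched in a few lines. The following assumes NumPy and applies the score vector and observed information matrix (2.17) directly in a Newton-type update.

```python
import numpy as np

# Morphine data of Example 2.9.11 (Finney 1971, p. 104)
x = np.array([0.18, 0.48, 0.78])
n = np.array([103.0, 120.0, 123.0])
y = np.array([19.0, 53.0, 83.0])

def score_info(delta, beta):
    """Score vector and observed information matrix (2.17) for (delta, beta)."""
    p = 1.0 / (1.0 + np.exp(-beta * (x - delta)))
    resid = y - n * p
    w = n * p * (1 - p)
    Sc = np.array([-beta * resid.sum(),
                   ((x - delta) * resid).sum()])
    I12 = resid.sum() - beta * ((x - delta) * w).sum()
    I = np.array([[beta**2 * w.sum(), I12],
                  [I12, ((x - delta)**2 * w).sum()]])
    return Sc, I

delta, beta = 0.583, 3.69                 # initial values from the two endpoints
for _ in range(10):
    Sc, I = score_info(delta, beta)
    delta, beta = np.array([delta, beta]) + np.linalg.solve(I, Sc)
print(delta, beta)                        # about 0.5671 and 3.6418
```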

A likelihood contour at level c is the set of values of δ, β satisfying R(δ, β) = c, 0 ≤ c ≤ 1. The likelihood function can then be described by giving some representative likelihood contours, such as c = .05, .10, .25, .50, and (δ̂, β̂). These are shown in Figure 2.9. They zone off regions of plausibility for δ, β jointly. Values of δ and β outside of the 5% contour are relatively implausible, since their relative likelihoods are less than 5% of the maximum. That is, these values reduce the probability of the observed sample to less than 5% of the maximum possible.

The probit transformation

    pi = (1/√(2π)) ∫_{t=−∞}^{α+βxi} e^{−½t²} dt

is frequently used instead of (2.15), usually with very little numerical difference.

Figure 2.9: Likelihood contours, (δ, β), morphine data

Example 2.9.12 Exponential regression. Consider pairs of observations (xi, yi), −∞ < xi < ∞, −∞ < yi < ∞, where for a given covariate xi the variate yi has density

    f(yi; α, β | xi) = exp[(yi − α − βxi) − exp(yi − α − βxi)]

(the extreme value distribution). The likelihood function based on a sample of n pairs (xi, yi) is

    L(, ; {xi, yi})

    f(yi;, |xi) exp[n(+x)

    exp(yixi)]. (2.18)From (2.4b) the maximum likelihood equations are

    logL = n+ exp(yi xi),

    logL = nx+ xi exp(yi xi).

    The rst of these can be solved explicitly for to give exp() = [

    exp(yi xi)]/n.This can be substituted into the second equation to give a single equation for

    g() = nx+ nxi exp(yi xi)exp(yi xi) = 0.

    This can be solved iteratively for to give the maximum likelihood estimate usingSection 2.6 along with the derivative of g(). Then can be obtained withoutiteration from the rst maximum likelihood equation, the relative likelihood R(, )can be calculated for specic values of , , and likelihood contours can be graphed.

The data in Table 2.1 are for two groups of patients who died of acute myelogenous leukemia (Feigel and Zelen 1965). The patients were classified according to the presence, AG+, or absence, AG−, of a characteristic of white cells. Assuming the above exponential regression model, the variate yi is the logarithm of the survival time in weeks, and the covariate xi is the logarithm of the white blood count (WBC). It is numerically and statistically more convenient to center the covariates about their mean x̄, so that x̄ = 0 in (2.18) and subsequent equations. As with the use of δ in Example 2.9.11, this facilitates obtaining an initial value of β and results in likelihood contours that are more circular, somewhat like Figure 2.9.

    Table 2.1: Survival times in weeks of acute myelogenous leukemia patients

              AG+ (n = 17)                    AG− (n = 16)
        WBC         Survival Time       WBC         Survival Time
        exp(xi)     exp(yi)             exp(xi)     exp(yi)
          2300        65                  4400        56
           750       156                  3000        65
          4300       100                  4000        17
          2600       134                  1500         7
          6000        16                  9000        16
         10500       108                  5300        22
         10000       121                 10000         3
         17000         4                 19000         4
          5400        39                 27000         2
          7000       143                 28000         3
          9400        56                 31000         8
          2000        26                 26000         4
         35000        22                 21000         3
        100000         1                 79000        30
        100000         1                100000         4
         52000         5                100000        43
        100000        65

The maximum likelihood estimates are α̂ = 3.934, 2.862 and β̂ = −.4818, −.1541 for AG+ and AG−, respectively. Some representative likelihood contours are given in Figure 2.10. These suggest the survival time is greater for lower white blood counts, and that this dependence of survival time on white blood count is somewhat greater for the AG+ group. Also, α is somewhat larger for the AG+ group. From the .05 likelihood contour (Figure 2.10), in the absence of knowledge of β, α can range from 2.3 to 3.5 in the AG− group, and from 3.4 to 4.6 in the AG+ group, at the 5% level of likelihood.
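The following minimal sketch, assuming NumPy, solves g(β) = 0 for the AG+ column of Table 2.1 as transcribed above; bisection is used instead of the derivative-based iteration of Section 2.6 purely for brevity (g is strictly decreasing in β, so a single root is bracketed).

```python
import numpy as np

# AG+ column of Table 2.1: white blood count and survival time in weeks
wbc = np.array([2300, 750, 4300, 2600, 6000, 10500, 10000, 17000, 5400, 7000,
                9400, 2000, 35000, 100000, 100000, 52000, 100000], dtype=float)
surv = np.array([65, 156, 100, 134, 16, 108, 121, 4, 39, 143,
                 56, 26, 22, 1, 1, 5, 65], dtype=float)
x = np.log(wbc)
x -= x.mean()                    # centered covariate, so x-bar = 0
y = np.log(surv)
n = len(y)

def g(beta):
    """g(beta) of the text with x-bar = 0."""
    w = np.exp(y - beta * x)
    return n * np.sum(x * w) / np.sum(w)

lo, hi = -2.0, 1.0               # bracket for the root
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if g(lo) * g(mid) <= 0:
        hi = mid
    else:
        lo = mid
beta_hat = 0.5 * (lo + hi)
alpha_hat = np.log(np.exp(y - beta_hat * x).sum() / n)
print(alpha_hat, beta_hat)       # compare with the AG+ estimates quoted above
```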

Figure 2.10: Likelihood contours, leukemia data

Example 2.9.13 Two binomial likelihoods, the 2 × 2 contingency table. The most common case is that of two independent binomial (r, θ1), (n − r, θ2) distributions in which r randomly chosen subjects out of n are assigned treatment 1 and the remaining n − r treatment 2. The results can be presented in the form of a 2 × 2 contingency table as in Example 2.9.3:

    Distn     S            F              Total
      1       x            r − x          r
      2       y            n − r − y      n − r
    Total     t = x + y    n − t          n

The likelihood function is the product of the two binomial likelihoods (2.3) arising from the two binomial variates x and y,

    L(θ1, θ2; x, y, r, n) ∝ θ1^x (1 − θ1)^{r−x} θ2^y (1 − θ2)^{n−r−y}.   (2.19)

Gastric freezing was once widely promoted as the treatment for peptic ulcer. When finally a proper randomized trial was carried out with r = 69 patients randomized to the gastric freeze and n − r = 68 to the sham treatment, the results were x = 34 successes and y = 38 successes, respectively (Sackett, Haynes, Guyatt, and Tugwell 1991, pp. 193, 194).

If the probabilities of a success under the gastric freeze treatment and the control or sham treatment are θ1 and θ2, respectively, the maximum likelihood estimates are θ̂1 = .493, θ̂2 = .559. Some likelihood contours are shown in Figure 2.11 along with the line θ1 = θ2.


    Figure 2.11: Likelihood contours, gastric freeze data

This line, representing no difference between the gastric freeze treatment and a completely irrelevant treatment, passes almost squarely within the contours of highest likelihood, indicative of no benefits whatever due to the gastric freeze treatment. In fact, whatever evidence of a difference there is favors slightly the sham treatment over the gastric freeze treatment, the maximum likelihood estimates satisfying θ̂2 > θ̂1. The evidence here could also be presented as in Figure 2.3.
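A minimal sketch of the joint relative likelihood (2.19) for these data, assuming NumPy (a contour plotter such as matplotlib could then draw the levels), shows how close the line θ1 = θ2 comes to the maximum.

```python
import numpy as np

# Gastric freeze 2x2 table: x successes out of r on treatment, y out of m = n - r on sham
x, r = 34, 69
y, m = 38, 68

def log_lik(t1, t2):
    """Log of the two-binomial likelihood (2.19), constants dropped."""
    return (x * np.log(t1) + (r - x) * np.log(1 - t1) +
            y * np.log(t2) + (m - y) * np.log(1 - t2))

t1_hat, t2_hat = x / r, y / m
grid = np.linspace(0.30, 0.75, 401)
T1, T2 = np.meshgrid(grid, grid)
R = np.exp(log_lik(T1, T2) - log_lik(t1_hat, t2_hat))
# R could now be passed to a contour plotter at levels .05, .10, .25, .50

# maximum relative likelihood along the line theta1 = theta2
line = np.exp(log_lik(grid, grid) - log_lik(t1_hat, t2_hat))
print(line.max())        # roughly 0.75: the line passes well inside the highest contours
```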

Example 2.9.14 ECMO trials: a 2 × 2 table arising from an adaptive treatment allocation. In order to allay ethical criticisms of a completely randomized assignment of patients to a treatment and a control, as in the last example, consider an adaptive allocation of patients to two treatments 1 and 2, resulting in success or failure S, F. Suppose the treatment allocation is based on an urn containing type 1 and type 2 balls. A ball is chosen at random, and the subject is assigned the corresponding treatment. Initially the urn contains one type 1 and one type 2 ball. On trial i a type 1 ball is added to the urn if trial i − 1 was on treatment 1 and was S, or was on treatment 2 and was F. Otherwise a type 2 ball is added. Suppose there are n trials. The idea is to increase the probability that patients are assigned to the treatment with the higher success rate.

Let ui, vi be indicator variables for a success on trial i and for treatment 1 on trial i, that is, ui = 1 or 0 according as trial i is S or F, and vi = 1 or 0 according as trial i is on treatment 1 or 2. Let πi be the probability of treatment 1 on trial i, and let θ1, θ2 be the probabilities of S on treatments 1 and 2, respectively. On trial i there are exactly four mutually exclusive possibilities: S on treatment 1, F on treatment 1, S on treatment 2, and F on treatment 2, with probabilities πiθ1, πi(1 − θ1), (1 − πi)θ2,


and (1 − πi)(1 − θ2), and indicator variables uivi, (1 − ui)vi, ui(1 − vi), and (1 − ui)(1 − vi). The probability of the observations {ui, vi} can therefore be written

    Π_{i=1}^{n} (πiθ1)^{uivi} [πi(1 − θ1)]^{(1−ui)vi} [(1 − πi)θ2]^{ui(1−vi)} [(1 − πi)(1 − θ2)]^{(1−ui)(1−vi)}
        = θ1^x (1 − θ1)^{r−x} θ2^y (1 − θ2)^{n−r−y} Π πi^{vi} (1 − πi)^{1−vi},   (2.20)

where x = Σ uivi is the number of successes on treatment 1, r = Σ vi is the total number of trials on treatment 1, and y = Σ ui(1 − vi) is the number of successes on treatment 2. See Begg (1990). The results can be presented in the form of a 2 × 2 contingency table as in Example 2.9.13, and the likelihood function of θ1, θ2 is again (2.19), the same as in Example 2.9.13. Thus the likelihood analysis is the same as in Example 2.9.13. But unlike Example 2.9.13, the distributions are not binomial, since the trials are not independent.

This method of allocation was used in a trial to compare treatment 1, extracorporeal membrane oxygenation (ECMO), with treatment 2, conventional medical therapy (CMT), on infants with persistent pulmonary hypertension, because it was thought that there was a potential for a dramatic improvement in survival with ECMO (Bartlett, Roloff, Cornell, Andrews, Dillon, and Zwischenberger 1985). It was therefore thought to be unethical to randomize patients equally to what was thought to be a considerably inferior treatment.

The results of the experiment were n = 12, x = r = 11, y = 0. This implies that there were 11 successes and 0 failures on ECMO, and 0 successes and 1 failure on CMT. In particular, this means there was only one control trial. This in fact is the purpose of adaptive allocation, although in this case it was extreme. The maximum likelihood estimates are θ̂1 = 1, θ̂2 = 0. But the likelihood contours shown in Figure 2.12 show a considerable variability. It is somewhat difficult to make a definitive inference from these results. Except for the .01 likelihood contour, the line θ1 = θ2 lies above the likelihood contours, although not by much. Therefore there is some evidence for the superiority of ECMO over CMT. But many would not find it convincing or conclusive, nor on this basis could the difference be considered dramatic.
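To make the allocation scheme concrete, here is a small simulation sketch, assuming NumPy; the success probabilities passed to simulate() are hypothetical and are used only to show how the urn steers patients toward the better treatment.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta1, theta2, n=12):
    """Urn-based adaptive allocation of Example 2.9.14 for n trials."""
    type1, type2 = 1, 1                  # initial urn: one ball of each type
    x = y = r = 0                        # successes on 1, successes on 2, trials on 1
    for _ in range(n):
        on_treat1 = rng.random() < type1 / (type1 + type2)
        success = rng.random() < (theta1 if on_treat1 else theta2)
        if on_treat1:
            r += 1
            if success:
                x += 1
        elif success:
            y += 1
        # a type 1 ball is added after S on treatment 1 or F on treatment 2
        if (on_treat1 and success) or (not on_treat1 and not success):
            type1 += 1
        else:
            type2 += 1
    return x, r, y

print(simulate(0.9, 0.4))                # e.g. most of the 12 trials end up on treatment 1
```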

    2.10 Normal Likelihoods

The Taylor expansion of the logarithm of the relative likelihood function about the maximum likelihood estimate θ̂ is

    log R(θ; y) = log R(θ̂; y) + (θ − θ̂) [∂ log R(θ; y)/∂θ]_{θ=θ̂} + (1/2!)(θ − θ̂)² [∂² log R(θ; y)/∂θ²]_{θ=θ̂} + · · · .

The first term on the right is zero by definition (2.2), and the second term is zero in the usual case where the maximum likelihood estimate satisfies Sc = 0, (2.4a), so


    Figure 2.12: Likelihood contours, ECMO data

that from (2.6a) this expansion can be written log R(θ; y) = −½(θ − θ̂)² I(θ̂; y) + · · · . If the remaining terms of the Taylor expansion are negligible, the result is the normal relative likelihood function

    R_N[θ; θ̂, I(θ̂; y)] = exp(−½u²),   where u = (θ − θ̂)√I(θ̂; y).   (2.21)

The quantity θ can be said to have a N[θ̂, I(θ̂; y)⁻¹] likelihood, and u to have a N(0, 1) likelihood. These are not to be confused with normal distributions.

The likelihood (2.21) is more general than the normal likelihoods (2.13) and (2.14). These likelihoods arise from the estimate θ̂ itself having a normal distribution whose variance is I(θ̂; y)⁻¹. In general, the normal likelihood (2.21) does not imply that θ̂ has a normal distribution, nor is I(θ̂; y) related to the variance of θ̂. The autoregression likelihood of Example 2.9.10(b) exemplifies this.

The remaining terms of the Taylor expansion will usually be negligible, and so the likelihoods will usually be approximately normal, if the sample is large enough. For example, the Poisson likelihood (2.9) of Example 2.9.5 can easily be shown from (2.6a) to have I(θ̂; y) = n/θ̂ = 20/3 when n = 20, t = 60. The normal likelihood of θ with θ̂ = 3, I(θ̂; y) = 20/3 is N(3, .15). Figure 2.5 exhibits R(θ; 60, 20) along with the normal likelihood N(3, .15).
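As a rough check of this approximation, the following sketch (assuming NumPy) compares the exact Poisson relative likelihood (2.9) with the normal likelihood (2.21) for the two sample sizes of Example 2.9.5.

```python
import numpy as np

def poisson_rel_lik(theta, t, n):
    """Poisson relative likelihood (2.9)."""
    th_hat = t / n
    return (theta / th_hat)**t * np.exp(-n * (theta - th_hat))

def normal_rel_lik(theta, th_hat, info):
    """Normal relative likelihood (2.21) based on theta_hat and I(theta_hat; y)."""
    return np.exp(-0.5 * info * (theta - th_hat)**2)

theta = np.linspace(1.5, 5.0, 701)
for n, t in [(2, 6), (20, 60)]:
    R = poisson_rel_lik(theta, t, n)
    RN = normal_rel_lik(theta, 3.0, n / 3.0)      # I(theta_hat; y) = n / theta_hat
    print(n, np.max(np.abs(R - RN)))              # discrepancy shrinks as n grows
```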

    Sometimes the remaining terms can be made negligible, and so the likelihoodsmade approximately normal, in small samples if express