Statistical Methodologies with · 5.5 Estimator and confidence limits for the difference of two...



Statistical Methodologies with Medical Applications


Poduri S.R.S. Rao

Professor of Statistics
University of Rochester

Rochester, New York, USA


This edition first published 2017
© 2017 John Wiley & Sons, Ltd

Registered Office
John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data

Names: Rao, Poduri S.R.S., author.
Title: Statistical methodologies with medical applications / Poduri S.R.S. Rao.
Description: Chichester, West Sussex, United Kingdom ; Hoboken : John Wiley & Sons Inc., 2016. | Includes bibliographical references and index.
Identifiers: LCCN 2016022669 | ISBN 9781119258490 (cloth) | ISBN 9781119258483 (Adobe PDF) | ISBN 9781119258520 (epub)
Subjects: | MESH: Statistics as Topic
Classification: LCC RA409 | NLM WA 950 | DDC 610.2/1–dc23
LC record available at https://lccn.loc.gov/2016022669

A catalogue record for this book is available from the British Library.

Cover Image: Gun2becontinued/Gettyimages

Set in 10/12pt Times by SPi Global, Pondicherry, India

10 9 8 7 6 5 4 3 2 1


To my grandchildren
Asha, Sita,
Maya and Wyatt


Contents

Topics for illustrations, examples and exercises
Preface
List of abbreviations

1 Statistical measures
    1.1 Introduction
    1.2 Mean, mode and median
    1.3 Variance and standard deviation
    1.4 Quartiles, deciles and percentiles
    1.5 Skewness and kurtosis
    1.6 Frequency distributions
    1.7 Covariance and correlation
    1.8 Joint frequency distribution
    1.9 Linear transformation of the observations
    1.10 Linear combinations of two sets of observations
    Exercises

2 Probability, random variable, expected value and variance
    2.1 Introduction
    2.2 Events and probabilities
    2.3 Mutually exclusive events
    2.4 Independent and dependent events
    2.5 Addition of probabilities
    2.6 Bayes’ theorem
    2.7 Random variables and probability distributions
    2.8 Expected value, variance and standard deviation
    2.9 Moments of a distribution
    Exercises

3 Odds ratios, relative risk, sensitivity, specificity and the ROC curve
    3.1 Introduction
    3.2 Odds ratio
    3.3 Relative risk
    3.4 Sensitivity and specificity
    3.5 The receiver operating characteristic (ROC) curve
    Exercises

4 Probability distributions, expectations, variances and correlation
    4.1 Introduction
    4.2 Probability distribution of a discrete random variable
    4.3 Discrete distributions
        4.3.1 Uniform distribution
        4.3.2 Binomial distribution
        4.3.3 Multinomial distribution
        4.3.4 Poisson distribution
        4.3.5 Hypergeometric distribution
    4.4 Continuous distributions
        4.4.1 Uniform distribution of a continuous variable
        4.4.2 Normal distribution
        4.4.3 Normal approximation to the binomial distribution
        4.4.4 Gamma distribution
        4.4.5 Exponential distribution
        4.4.6 Chisquare distribution
        4.4.7 Weibull distribution
        4.4.8 Student’s t- and F-distributions
    4.5 Joint distribution of two discrete random variables
        4.5.1 Conditional distributions, means and variances
        4.5.2 Unconditional expectations and variances
    4.6 Bivariate normal distribution
    Exercises
    Appendix A4
        A4.1 Expected values and standard deviations of the distributions
        A4.2 Covariance and correlation of the numbers of successes x and failures (n – x) of the binomial random variable

5 Means, standard errors and confidence limits
    5.1 Introduction
    5.2 Expectation, variance and standard error (S.E.) of the sample mean
    5.3 Estimation of the variance and standard error
    5.4 Confidence limits for the mean
    5.5 Estimator and confidence limits for the difference of two means
    5.6 Approximate confidence limits for the difference of two means
        5.6.1 Large samples
        5.6.2 Welch-Aspin approximation (1949, 1956)
        5.6.3 Cochran’s approximation (1964)
    5.7 Matched samples and paired comparisons
    5.8 Confidence limits for the variance
    5.9 Confidence limits for the ratio of two variances
    5.10 Least squares and maximum likelihood methods of estimation
    Exercises
    Appendix A5
        A5.1 Tschebycheff’s inequality
        A5.2 Mean square error

6 Proportions, odds ratios and relative risks: Estimation and confidence limits
    6.1 Introduction
    6.2 A single proportion
    6.3 Confidence limits for the proportion
    6.4 Difference of two proportions or percentages
    6.5 Combining proportions from independent samples
    6.6 More than two classes or categories
    6.7 Odds ratio
    6.8 Relative risk
    Exercises
    Appendix A6
        A6.1 Approximation to the variance of ln p1

7 Tests of hypotheses: Means and variances
    7.1 Introduction
    7.2 Principle steps for the tests of a hypothesis
        7.2.1 Null and alternate hypotheses
        7.2.2 Decision rule, test statistic and the Type I & II errors
        7.2.3 Significance level and critical region
        7.2.4 The p-value
        7.2.5 Power of the test and the sample size
    7.3 Right-sided alternative, test statistic and critical region
        7.3.1 The p-value
        7.3.2 Power of the test
        7.3.3 Sample size required for specified power
        7.3.4 Right-sided alternative and estimated variance
        7.3.5 Power of the test with estimated variance
    7.4 Left-sided alternative and the critical region
        7.4.1 The p-value
        7.4.2 Power of the test
        7.4.3 Sample size for specified power
        7.4.4 Left-sided alternative with estimated variance
    7.5 Two-sided alternative, critical region and the p-value
        7.5.1 Power of the test
        7.5.2 Sample size for specified power
        7.5.3 Two-sided alternative and estimated variance
    7.6 Difference between two means: Variances known
        7.6.1 Difference between two means: Variances estimated
    7.7 Matched samples and paired comparison
    7.8 Test for the variance
    7.9 Test for the equality of two variances
    7.10 Homogeneity of variances
    Exercises

8 Tests of hypotheses: Proportions and percentages
    8.1 A single proportion
    8.2 Right-sided alternative
        8.2.1 Critical region
        8.2.2 The p-value
        8.2.3 Power of the test
        8.2.4 Sample size for specified power
    8.3 Left-sided alternative
        8.3.1 Critical region
        8.3.2 The p-value
        8.3.3 Power of the test
        8.3.4 Sample size for specified power
    8.4 Two-sided alternative
        8.4.1 Critical region
        8.4.2 The p-value
        8.4.3 Power of the test
        8.4.4 Sample size for specified power
    8.5 Difference of two proportions
        8.5.1 Right-sided alternative: Critical region and p-value
        8.5.2 Right-sided alternative: Power and sample size
        8.5.3 Left-sided alternative: Critical region and p-value
        8.5.4 Left-sided alternative: Power and sample size
        8.5.5 Two-sided alternative: Critical region and p-value
        8.5.6 Power and sample size
    8.6 Specified difference of two proportions
    8.7 Equality of two or more proportions
    8.8 A common proportion
    Exercises

9 The Chisquare statistic
    9.1 Introduction
    9.2 The test statistic
        9.2.1 A single proportion
        9.2.2 Specified proportions
    9.3 Test of goodness of fit
    9.4 Test of independence: (r x c) classification
    9.5 Test of independence: (2x2) classification
        9.5.1 Fisher’s exact test of independence
        9.5.2 Mantel-Haenszel test statistic
    Exercises
    Appendix A9
        A9.1 Derivations of 9.4(a)
        A9.2 Equality of the proportions

10 Regression and correlation
    10.1 Introduction
    10.2 The regression model: One independent variable
        10.2.1 Least squares estimation of the regression
        10.2.2 Properties of the estimators
        10.2.3 ANOVA (Analysis of Variance) for the significance of the regression
        10.2.4 Tests of hypotheses, confidence limits and prediction intervals
    10.3 Regression on two independent variables
        10.3.1 Properties of the estimators
        10.3.2 ANOVA for the significance of the regression
        10.3.3 Tests of hypotheses, confidence limits and prediction intervals
    10.4 Multiple regression: The least squares estimation
        10.4.1 ANOVA for the significance of the regression
        10.4.2 Tests of hypotheses, confidence limits and prediction intervals
        10.4.3 Multiple correlation, adjusted R² and partial correlation
        10.4.4 Effect of including two or more independent variables and the partial F-test
        10.4.5 Equality of two or more series of regressions
    10.5 Indicator variables
        10.5.1 Separate regressions
        10.5.2 Regressions with equal slopes
        10.5.3 Regressions with the same intercepts
    10.6 Regression through the origin
    10.7 Estimation of trends
    10.8 Logistic regression and the odds ratio
        10.8.1 A single continuous predictor
        10.8.2 Two continuous predictors
        10.8.3 A single dichotomous predictor
    10.9 Weighted Least Squares (WLS) estimator
    10.10 Correlation
        10.10.1 Test of the hypothesis that two random variables are uncorrelated
        10.10.2 Test of the hypothesis that the correlation coefficient takes a specified value
        10.10.3 Confidence limits for the correlation coefficient
    10.11 Further topics in regression
        10.11.1 Linearity of the regression model and the lack of fit test
        10.11.2 The assumption that V(εi | Xi) = σ², same at each Xi
        10.11.3 Missing observations
        10.11.4 Transformation of the regression model
        10.11.5 Errors of measurements of (Xi, Yi)
    Exercises
    Appendix A10
        A10.1 Square of the correlation of Yi and Ŷi
        A10.2 Multiple regression
        A10.3 Expression for SSR in (10.38)

11 Analysis of variance and covariance: Designs of experiments
    11.1 Introduction
    11.2 One-way classification: Balanced design
    11.3 One-way random effects model: Balanced design
    11.4 Inference for the variance components and the mean
    11.5 One-way classification: Unbalanced design and fixed effects
    11.6 Unbalanced one-way classification: Random effects
    11.7 Intraclass correlation
    11.8 Analysis of covariance: The balanced design
        11.8.1 The model and least squares estimation
        11.8.2 Tests of hypotheses for the slope coefficient and equality of the means
        11.8.3 Confidence limits for the adjusted means and their differences
    11.9 Analysis of covariance: Unbalanced design
        11.9.1 Confidence limits for the adjusted means and the differences of the treatment effects
    11.10 Randomized blocks
        11.10.1 Randomized blocks: Random and mixed effects models
    11.11 Repeated measures design
    11.12 Latin squares
        11.12.1 The model and analysis
    11.13 Cross-over design
    11.14 Two-way cross-classification
        11.14.1 Additive model: Balanced design
        11.14.2 Two-way cross-classification with interaction: Balanced design
        11.14.3 Two-way cross-classification: Unbalanced additive model
        11.14.4 Unbalanced cross-classification with interaction
        11.14.5 Multiplicative interaction and Tukey’s test for nonadditivity
    11.15 Missing observations in the designs of experiments
    Exercises
    Appendix A11
        A11.1 Variance of σ²_α in (11.25) from Rao (1997, p. 20)
        A11.2 The total sum of squares (Txx, Tyy) and sum of products (Txy) can be expressed as the within and between components as follows

12 Meta-analysis
    12.1 Introduction
    12.2 Illustrations of large-scale studies
    12.3 Fixed effects model for combining the estimates
    12.4 Random effects model for combining the estimates
    12.5 Alternative estimators for σ²_α
    12.6 Tests of hypotheses and confidence limits for the variance components
    Exercises
    Appendix A12

13 Survival analysis
    13.1 Introduction
    13.2 Survival and hazard functions
    13.3 Kaplan-Meier Product-Limit estimator
    13.4 Standard error of Ŝ(tm) and confidence limits for S(tm)
    13.5 Confidence limits for S(tm) with the right-censored observations
    13.6 Log-Rank test for the equality of two survival distributions
    13.7 Cox’s proportional hazard model
    Exercises
    Appendix A13 Expected value and variance of Ŝ(tm) and confidence limits for S(tm)

14 Nonparametric statistics
    14.1 Introduction
    14.2 Spearman’s rank correlation coefficient
    14.3 The Sign test
    14.4 Wilcoxon (1945) Matched-pairs Signed-ranks test
    14.5 Wilcoxon’s test for the equality of the distributions of two non-normal populations with unpaired sample observations
        14.5.1 Unequal sample sizes
    14.6 McNemer’s (1955) matched pair test for two proportions
    14.7 Cochran’s (1950) Q-test for the difference of three or more matched proportions
    14.8 Kruskal-Wallis one-way ANOVA test by ranks
    Exercises


15 Further topics
    15.1 Introduction
    15.2 Bonferroni inequality and the Joint Confidence Region
    15.3 Least significant difference (LSD) for a pair of treatment effects
    15.4 Tukey’s studentized range test
    15.5 Scheffe’s simultaneous confidence intervals
    15.6 Bootstrap confidence intervals
    15.7 Transformations for the ANOVA
    Exercises
    Appendix A15
        A15.1 Variance stabilizing transformation

Solutions to exercises
Appendix tables
References
Index


Topics for illustrations, examples and exercises

Heights, weights and BMI (Body Mass Index) of sixteen and twenty-year-old boys from growth charts

Immunization coverage of one-year-olds: Measles, DTP3 and HEP B3 from WHO reports

Medical insurance for children

Sudden Infant Death Syndrome (SIDS)

Population growth rates and fertility

Age, family size, income and health insurance

Healthcare expenditure in Africa, Asia and Europe

Vaccination for flu for different age groups

Emergency department visits for cold symptoms, injuries and other reasons.

Overweight and obesity

Trends of adult obesity

BMI and mortality

Smoking, heart disease and cancer risk

Air pollution and cancer risk

Hypertension, systolic and diastolic blood pressures (SBP, DBP) of males and females.

Cholesterol levels: LDL and HDL

Effects of overweight on LDL

Low-dose aspirin and reduction of certain types of cancer

Celiac disease and the benefits of gluten-free diet

Statins and the reduction of LDL

Exercise and its benefits for blood pressure levels

Weight loss with diets of combinations of low and high-levels of fatty acids and protein


Medical rehabilitation of stroke patients

Functional independence measures of stroke patients from medical rehabilitation

Sources: Reports of WHO, CDC, U.S. Health Statistics; Journal of the American Medical Association (JAMA); New England Journal of Medicine (NEJM), Lancet and other published literature.


Preface

Statistical analysis, evaluation and inference are essential for every type of medical study and clinical experiment. Physicians and medical clinics and laboratories routinely record the blood pressures, cholesterol levels and other relevant diagnostic measurements of patients. Clinical experiments evaluate and compare the effects of medical treatments and procedures. Medical journals report the research findings on the relative risks and odds ratios related to hypertension, abnormal cholesterol levels, obesity, harmful effects of smoking habits and excessive alcohol consumption and similar topics.

Estimation of the means, standard deviations, proportions, odds ratios, relative risks and related statistical measures of health-related characteristics is of importance for the above types of medical studies. Evaluation of the errors of estimation, ascertaining the confidence limits for the population characteristics of interest, tests of hypotheses and statistical inference, and Chisquare tests for independence and association of categorical variables are important aspects of many medical studies and clinical experiments. Statistical inference is employed, for instance, to assess the relationship between obesity and hypertension and the association between air pollution and bronchial problems. A variety of similar problems require statistical investigations and inference. Regression analysis is widely used to determine the relationship of clinical outcomes and physical attributes. In several clinical investigations, correlations between diagnostic observations are examined to search for the causal factors. Analysis of Variance and Covariance procedures are extensively employed to examine the differences between the effects of medical treatments. All the above types of statistical methods, procedures and techniques required for medical studies, research and evaluations are presented in the following chapters. Topics such as the Meta-analysis, Survival Analysis and Hazard Ratios, and nonparametric statistics are also included.

Following the descriptive statistical measures in the first chapter, definitions of probability, odds ratios and relative risk appear in Chapters 2 and 3. Binomial, normal, Chisquare and related probability distributions essential for the statistical methods and applications are presented in Chapter 4. Estimation of the means, variances, proportions and percentages, odds ratios and relative risks, Standard Errors (S.E.) of the estimators and confidence intervals appear in Chapters 5 and 6. Tests of hypotheses of means, proportions and variances, p-values, power of a test, and the sample size required for a specified power are the topics for Chapters 7 and 8. The Chisquare tests for goodness of fit and independence are presented in Chapter 9. Linear, multiple and logistic regressions and correlation are the topics for Chapter 10. Chapter 11 presents the Analysis of Variance (ANOVA) and Covariance procedures, Randomized blocks, Latin square designs, fixed and random effects models, and two-way cross-classification with and without interaction. Meta-analysis and Survival Analysis in Chapters 12 and 13 are followed by the nonparametric statistics in Chapter 14. The final chapter contains topics in ANOVA and tests of hypotheses including the Simultaneous Confidence Intervals and Bootstrap Confidence Intervals.

Examples, illustrations and exercises with solutions are presented in each chapter. They are constructed from the observations of practical situations, research studies appearing in The New England Journal of Medicine (NEJM), Journal of the American Medical Association (JAMA), Lancet and other medical journals, and the summaries presented in the Health Statistics of the Center for Disease Control (CDC) in the United States and the World Health Organization (WHO). They are related to a variety of medical topics of general interest including the following: (a) heights, weights and Body Mass Index (BMI) of ten-to-twenty-year-old boys and girls; (b) immunization of children; (c) overweight, obesity, hypertension and high cholesterol levels of adults; (d) benefits of fat-free and gluten-free diets and exercise, and (e) healthcare expenditures and medical insurance.

BMI is the ratio of the weight in kilograms to the square of the height in meters. A person is considered to be of normal weight if the BMI is 18.5–24, overweight if it is 25–29, and obese if it is 30 or more. For the blood cholesterol levels of adults, LDL less than 100 mg/dL and HDL higher than 40 mg/dL are considered optimal. Systolic and diastolic blood pressures, SBP and DBP of 120/80 mmHg are considered desirable. Illustrations and examples and exercises throughout the chapters are related to these medical measurements and other health-related topics. Readily available software programs in Excel, Minitab and R are utilized for the solutions of the illustrations, examples and exercises.

The various topics in these chapters are presented at the level of comprehension of the students pursuing statistics, biostatistics, medicine, biological, physical and natural sciences and epidemiological studies. Each topic is illustrated through examples. More than one hundred exercises with solutions are included. This book can be recommended for a one-semester or two-quarter course for the above types of students, and also for self-study. One or two semesters of training in the principles and applications of statistical methods provides adequate preparation to pursue the different topics. The various statistical methods for medical studies presented in this manuscript can also be of interest to clinicians, physicians, and medical students and residents.

I would like to thank the editor, Ms. Kathryn Sharples, for her interest in this project. Thanks to Charles Heckler, Kevin Rader and Nicholas Zaino for their expert reviews of the manuscript. Thanks also to Sarah Briscoe, Isabelle Weir and Patricia Digiorgio for their assistance in assembling the manuscript on the word processor. Special thanks to my wife and daughter, Drs. K.R. Poduri, MD and Ann Hug Poduri, MD, MPH for sharing their medical expertise in selecting the various topics and illustrations throughout the chapters.

Poduri S.R.S. Rao
Professor of Statistics
University of Rochester


List of abbreviations

WHO: World Health Organization
CDC: Center for Disease Control
LDL: Low Density Lipoprotein
HDL: High Density Lipoprotein
LDL and HDL are measures of cholesterol levels in units of milligrams per deciliter (mg/dL)
SBP: Systolic Blood Pressure
DBP: Diastolic Blood Pressure
SBP and DBP are measures of pressure in the blood vessels in units of millimeters of mercury (mmHg)
BMI: Body Mass Index


1

Statistical measures

1.1 Introduction

Medical professionals, hospitals and healthcare centers record heights, weights and other relevant physical measurements of patients along with their blood pressures, cholesterol levels and similar diagnostic measurements. National organizations such as the Center for Disease Control (CDC) in the United States, the World Health Organization (WHO) and several national and international organizations record and analyze various aspects of the healthcare status of the citizens of all age groups. Epidemiological studies and surveys collect and analyze health-related information of the people around the globe. Clinical trials and experiments are conducted for the development of effective and improved medical treatments.

Statistical measures are utilized to analyze the various diagnostic measurements as well as the outcomes of clinical experiments. The mean, mode and median described in the following sections locate the centers of the distributions of the above types of observations. The variance, standard deviation (S.D.) and the related coefficient of variation (C.V.) are the measures of dispersion of a set of observations. The quartiles, deciles and percentiles divide the data respectively into four, ten and one hundred equal parts. The skewness coefficient exhibits the departure of the data from its symmetry, and the kurtosis coefficient its peakedness. The measurements on the heights, weights and Body Mass Indexes (BMIs) of a sample of twenty-year-old boys obtained from the Chart Tables of the CDC (2008) are presented in Table 1.1. These measurements for the ten- and sixteen-year-old boys and girls are presented in Appendix Tables T1.1–T1.4.

Statistical Methodologies with Medical Applications, First Edition. Poduri S.R.S. Rao.
© 2017 John Wiley & Sons, Ltd. Published 2017 by John Wiley & Sons, Ltd.


1.2 Mean, mode and median

The diagnostic measurements of a sample of n individuals can be represented by $x_i$, $i = 1, 2, \ldots, n$. Their mean or average is

\[
\bar{x} = \sum_{i=1}^{n} x_i / n = (x_1 + x_2 + \cdots + x_n)/n. \tag{1.1}
\]

For the heights of the boys in Table 1.1, the mean becomes $\bar{x} = (162 + 163 + \cdots + 188)/20 = 175.2$ cm. Similarly, the mean of their weights is 73.1 kg. For the BMI, which is Weight/Height² with the weight in kilograms and the height in meters, the mean becomes 23.59.

The mode is the observation occurring more frequently than the remaining observations. For the heights of the boys, it is 176 cm. The median is the middle value of the observations arranged in ascending order. If the number of observations n is odd, it is the ((n + 1)/2)th observation. If n is an even number, it is the average of the (n/2)th and the next observation. Both the mode and median of the twenty heights of the boys in Table 1.1 are equal to 176 cm, which is slightly larger than the mean of 175.2 cm.
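As a rough check of these values, the short Python sketch below recomputes the mean, median and mode of the twenty heights with the standard-library `statistics` module. The list `heights_cm` is simply the Height column of Table 1.1 typed in by hand, and the variable names are illustrative assumptions rather than anything used in the book.

```python
from statistics import mean, median, multimode

# Heights (cm) of the twenty boys, copied from Table 1.1.
heights_cm = [162, 163, 167, 168, 170, 172, 172, 173, 174, 176,
              176, 176, 177, 178, 178, 180, 184, 184, 186, 188]

mean_height = mean(heights_cm)      # sum of the observations divided by n
median_height = median(heights_cm)  # n = 20 is even: average of the 10th and 11th ordered values
modes = multimode(heights_cm)       # list of the most frequent value(s)

print(f"mean = {mean_height:.1f} cm, median = {median_height} cm, mode(s) = {modes}")
```

`multimode` is used instead of `mode` so that a tie between several equally frequent values would show up as a list with more than one entry.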

Table 1.1 Heights (cm), weights (kg) and BMIs of twenty-year-old boys.

Height   Weight   BMI
162       54      20.58
163       55      20.70
167       58      20.80
168       59      20.90
170       60      20.76
172       62      20.96
172       63      21.30
173       66      22.05
174       68      22.46
176       72      23.24
176       75      24.21
176       75      24.21
177       78      24.90
178       80      25.25
178       82      25.88
180       84      25.93
184       86      25.40
184       88      25.99
186       95      27.46
188      102      28.86

BMI = Weight/(Height)².
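The BMI column can be reproduced from the other two columns. The sketch below is one way to do so in Python, assuming (as stated in the Preface) that the weight is in kilograms and the height is converted from centimeters to meters before squaring. The function name `bmi` and the list `height_weight` are hypothetical names introduced only for this illustration.

```python
# (height in cm, weight in kg) for the twenty boys, copied from Table 1.1.
height_weight = [
    (162, 54), (163, 55), (167, 58), (168, 59), (170, 60),
    (172, 62), (172, 63), (173, 66), (174, 68), (176, 72),
    (176, 75), (176, 75), (177, 78), (178, 80), (178, 82),
    (180, 84), (184, 86), (184, 88), (186, 95), (188, 102),
]

def bmi(weight_kg, height_cm):
    """Body Mass Index: weight (kg) divided by the square of the height (m)."""
    height_m = height_cm / 100  # the table lists heights in cm
    return weight_kg / height_m ** 2

# BMI of each boy, rounded to two decimals as in the table.
bmis = [round(bmi(w, h), 2) for h, w in height_weight]
mean_bmi = sum(bmis) / len(bmis)

print(bmis[0], bmis[-1], round(mean_bmi, 2))
```

Forgetting the conversion from centimeters to meters would shrink every BMI by a factor of 10,000, which is an easy mistake to spot against the tabulated values.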


The mean, mode and median locate the center of the observations. The mean is also known as the first moment $m_1$ of the observations. For the healthcare policies, for instance, it is of importance to examine the average amount of the medical expenditures incurred by families of different sizes or specified ranges of income. At the same time, useful information is provided by the median and modal values of their expenditures. Figure 1.1 is the Stem and Leaf display of the heights in Table 1.1. The cumulative number of observations below and above the median appear in the first column. The second and third columns are the stems, with the attached leaves.

1.3 Variance and standard deviation

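The variance is a measure of the dispersion among the observations, and it is given by

\[
s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1) = \left[ (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2 \right] / (n - 1). \tag{1.2}
\]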

The divisor $(n-1)$ in this expression represents the degrees of freedom (d.f.). If $(n-1)$ of the observations and the sum or mean of the $n$ observations are known, the remaining observation is automatically determined. The expression in (1.2) can also be expressed as $\sum_{i \neq j} (x_i - x_j)^2 / [2n(n-1)]$, which is half the average of the squared differences of the $n(n-1)$ ordered pairs of the observations. The standard deviation (S.D.) is given by $s$, the positive square root of the variance. The second central moment of the observations, $m_2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / n$, is the same as $(n-1)s^2/n$. For the twenty heights of boys in Table 1.1, $s^2 = [(162 - 175.2)^2 + (163 - 175.2)^2 + \dots + (188 - 175.2)^2]/19 = 51.33$ and $m_2 = (19/20)(51.33) = 48.76$. The standard deviation becomes $s = 7.16$ cm.

 2    16   23
 4    16   78
 9    17   02234
(6)   17   666788
 5    18   044
 2    18   68

Figure 1.1 Stem and leaf display of the heights of the twenty boys. Leaf unit = 1.0. The median class has (6) observations. The cumulative numbers of observations below and above the median class are (2, 4, 9) and (5, 2).

The unit of measurement is attached to both the mean and standard deviation: kg for weight and cm for height. It is kg/(meter-squared) for the BMI. The coefficient of variation (C.V.) is the ratio of the standard deviation to the mean, $s/\bar{x}$, and is devoid of the unit of measurement of the observations. The mean, variance, standard deviation and C.V. for the above three characteristics for the 20 boys in Table 1.1 are presented in Table 1.2.
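The dispersion measures for the heights can be reproduced with a short Python sketch. It is only an illustration: the variable names are assumptions, and the pairwise form is taken here as the sum of $(x_i - x_j)^2$ over all ordered pairs $i \neq j$ divided by $2n(n-1)$, which agrees numerically with (1.2).

```python
from itertools import permutations
from math import isclose, sqrt

# Heights (cm) of the twenty boys in Table 1.1
heights_cm = [162, 163, 167, 168, 170, 172, 172, 173, 174, 176,
              176, 176, 177, 178, 178, 180, 184, 184, 186, 188]

n = len(heights_cm)
x_bar = sum(heights_cm) / n

# Sum of squared deviations from the mean
ss = sum((x - x_bar) ** 2 for x in heights_cm)

s2 = ss / (n - 1)            # sample variance, divisor n - 1 as in (1.2)
m2 = ss / n                  # second central moment, equal to (n - 1) s^2 / n
s = sqrt(s2)                 # standard deviation (cm)
cv_percent = 100 * s / x_bar # coefficient of variation, unit-free

# Pairwise form: squared differences over the n(n - 1) ordered pairs (by position)
s2_pairs = sum((a - b) ** 2 for a, b in permutations(heights_cm, 2)) / (2 * n * (n - 1))

print(f"s2 = {s2:.2f}, m2 = {m2:.2f}, s = {s:.2f} cm, C.V. = {cv_percent:.2f}%")
print("pairwise form agrees with (1.2):", isclose(s2, s2_pairs))
```

The printed values can be compared with the Height column of Table 1.2.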

1.4 Quartiles, deciles and percentiles

Any set of data can be arranged in an ascending order and divided into four parts with one quarter of the observations in each part. Twenty-five percent of the observations are below the first quartile $Q_1$ and 75 percent above. Similarly, half the number of observations are below the median, which is the second quartile $Q_2$, and half above. Three-quarters of the observations are below the third quartile $Q_3$ and one-fourth above. As seen in Section 1.2, the median of the heights in Table 1.1 is 176 cm. The average of the fifth and sixth observations is 171 cm, which is the first quartile. Similarly, the third quartile is 179 cm, which is the average of the fifteenth and sixteenth observations. The box and whiskers plot in Figure 1.2 presents the positions of these quartiles.

Ten percent of the observations are below the first decile and 90 percent above. Ninety percent of the observations are below the ninth decile and 10 percent above. One percent of the observations are below the first percentile and 99 percent above. Similarly, 99 percent of the observations are below the ninety-ninth percentile and 1 percent above.
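The quartiles quoted above for the heights correspond to taking the medians of the lower and upper halves of the ordered observations. The sketch below follows that reading; the function name `quartiles` is hypothetical, and dropping the middle observation when $n$ is odd is an assumption of this sketch, since the text only illustrates the even case. Statistical packages use several other interpolation rules, so their quartiles may differ slightly.

```python
from statistics import median

# Heights (cm) of the twenty boys in Table 1.1
heights_cm = [162, 163, 167, 168, 170, 172, 172, 173, 174, 176,
              176, 176, 177, 178, 178, 180, 184, 184, 186, 188]


def quartiles(values):
    """Return (Q1, Q2, Q3) as the medians of the lower half, the whole set
    and the upper half of the ordered observations."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For odd n the middle observation is left out of both halves (assumption).
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


q1, q2, q3 = quartiles(heights_cm)
print(f"Q1 = {q1} cm, Q2 = {q2} cm, Q3 = {q3} cm")
```

For the twenty heights, the lower half consists of the first ten ordered observations, so $Q_1$ is the average of the fifth and sixth; the upper half gives $Q_3$ as the average of the fifteenth and sixteenth.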

Table 1.2 Summary figures for the heights, weights and BMIs of the 20 boys in Table 1.1.

                     Height    Weight    BMI
Mean ($\bar{x}$)     175.2     73.1      23.59
Variance ($s^2$)     51.33     188.09    6.53
$m_2$                48.76     178.69    6.21
S.D. ($s$)           7.16      13.71     2.56
C.V. (%)             4.09      18.76     10.85
$m_3$                -18.86    913.69    5.24
$K_1$                -0.055    0.383     0.341
$m_4$                5690      70901     75.93
$K_2$                2.39      2.22      1.97


1.5 Skewness and kurtosis

Physical or diagnostic measurements $x_i$, $i = 1, 2, \dots, n$, of a group of individuals may not be symmetrically distributed about their mean. The third central moment, $m_3 = \sum_{i=1}^{n} (x_i - \bar{x})^3 / n$, will be zero if the observations are symmetrically distributed about the mean. It will be positive if the observations are skewed to the right and negative if they are skewed to the left. For symmetrically distributed observations, the third, fifth, seventh and all the odd central moments will be zero. The Pearsonian coefficient of skewness is given by $K_1 = m_3 / m_2^{3/2}$, which does not depend on the unit of measurement of the observations, unlike $m_2$ and $m_3$. For any set of observations symmetrically distributed about its mean, $m_3 = 0$ and hence $K_1 = 0$. For positively skewed observations, $m_3$ and $K_1$ are positive. For negatively skewed observations, they are negative. For the heights of the boys in Table 1.1, $m_3 = -18.86$ and $K_1 = -18.86/(48.76)^{3/2} = -0.055$. These heights are slightly negatively skewed.

The fourth central moment of the observations, $m_4 = \sum_{i=1}^{n} (x_i - \bar{x})^4 / n$, becomes large as the distribution of the observations becomes peaked and small as it becomes flat. The Pearsonian coefficient of kurtosis is given by $K_2 = m_4 / m_2^2$, which does not depend on the unit of measurement. For the normal distribution, which is extensively employed for statistical analysis and inference, $K_1 = 0$ and $K_2 = 3$. For the observations on all the three characteristics in Table 1.1, the fourth moments are large, as seen from Table 1.2, but $K_2$ is smaller than three.

Figure 1.2 Box and whiskers plot of the heights of boys in Table 1.1, obtained from Minitab (vertical axis: heights, 160 to 190 cm). The middle line of the box is the median $Q_2$. The bottom and top lines are the first and third quartiles $Q_1$ and $Q_3$. The tips of the vertical line, the whiskers, are the upper and lower limits $Q_3 + 1.5(Q_3 - Q_1)$ and $Q_1 - 1.5(Q_3 - Q_1)$.
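The moment-based coefficients for the heights can be verified with a few lines of Python. This is a sketch under the definitions given in this section, with the divisor $n$ for every central moment; the helper name `central_moment` is an illustrative choice.

```python
# Heights (cm) of the twenty boys in Table 1.1
heights_cm = [162, 163, 167, 168, 170, 172, 172, 173, 174, 176,
              176, 176, 177, 178, 178, 180, 184, 184, 186, 188]


def central_moment(values, r):
    """r-th central moment with divisor n: sum((x - mean)^r) / n."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** r for x in values) / n


m2 = central_moment(heights_cm, 2)
m3 = central_moment(heights_cm, 3)
m4 = central_moment(heights_cm, 4)

k1 = m3 / m2 ** 1.5  # Pearsonian coefficient of skewness
k2 = m4 / m2 ** 2    # Pearsonian coefficient of kurtosis

print(f"m3 = {m3:.2f}, K1 = {k1:.3f}")
print(f"m4 = {m4:.0f}, K2 = {k2:.2f}")
```

Both coefficients are ratios of moments of matching total power, so rescaling the heights (for example from cm to meters) leaves $K_1$ and $K_2$ unchanged.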

1.6 Frequency distributions

Any set of clinical measurements or medical observations can be classified into a convenient number of groups and presented as the frequency distribution. The CDC, National Center for Health Statistics (NCHS) and other organizations present various health-related measurements of the U.S. population in the form of summary tables. These measurements are obtained from periodic or continual surveys of the population in the country and also from the administrative medical records of the population. They are arranged according to age groups, education, income levels, male-female classification and other characteristics of interest. Similar summary figures are presented by the WHO and healthcare organizations throughout the world. For the sake of illustration, the twenty heights of the boys in Table 1.1 are arranged in Table 1.3 into seven classes of the same width of five, and displayed as the histogram in Figure 1.3.

In general, the $n$ observations can be divided into $k$ classes with $n_i$ observations in the $i$th class, $n = \sum_{i=1}^{k} n_i$. The mid-values of the classes can be denoted by $(x_1, x_2, \dots, x_k)$. With the above notation, the mean of the $n$ observations becomes

$$\bar{x} = \sum_{i=1}^{k} f_i x_i = \frac{n_1 x_1 + n_2 x_2 + \dots + n_k x_k}{n} \tag{1.3}$$

where $f_i = n_i/n$ is the relative frequency in the $i$th class and $\sum_{i=1}^{k} f_i = 1$. From Table 1.3 and (1.3), the mean of the heights is

$$\bar{x} = \frac{1 \times 160 + 2 \times 165 + \dots + 1 \times 190}{20} = 175.25$$

Since the 20 observations are grouped, this mean differs slightly from the actual value of 175.2 cm.

Table 1.3 Frequency distribution of the heights of the 20 boys in Table 1.1.

Class          Mid-value ($x_i$)   Frequency ($n_i$)   Relative frequency ($f_i$)
157.5–162.5    160                 1                   0.05
162.5–167.5    165                 2                   0.10
167.5–172.5    170                 4                   0.20
172.5–177.5    175                 6                   0.30
177.5–182.5    180                 3                   0.15
182.5–187.5    185                 3                   0.15
187.5–192.5    190                 1                   0.05
Total                              $n = 20$            $\sum f_i = 1$
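The grouping in Table 1.3 and the grouped mean in (1.3) can be reproduced as follows. This is a minimal sketch: the class limits (starting at 157.5 with a width of 5) are taken from Table 1.3, while the variable names are illustrative assumptions.

```python
# Heights (cm) of the twenty boys in Table 1.1
heights_cm = [162, 163, 167, 168, 170, 172, 172, 173, 174, 176,
              176, 176, 177, 178, 178, 180, 184, 184, 186, 188]

lower_edge = 157.5  # lower limit of the first class in Table 1.3
width = 5.0         # common class width
k = 7               # number of classes

# Count the observations falling into each class
counts = [0] * k
for h in heights_cm:
    counts[int((h - lower_edge) // width)] += 1

# Class mid-values and relative frequencies
mids = [lower_edge + width * (i + 0.5) for i in range(k)]
n = sum(counts)
rel_freqs = [c / n for c in counts]

# Grouped mean from (1.3): sum(n_i * x_i) / n
grouped_mean = sum(c * x for c, x in zip(counts, mids)) / n

for x, c, f in zip(mids, counts, rel_freqs):
    print(f"{x:5.0f}  {c:2d}  {f:.2f}")
print(f"grouped mean = {grouped_mean:.2f} cm")
```

Because every observation in a class is represented by the class mid-value, the grouped mean need not equal the mean of the individual observations.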


For the grouped data, the second moment becomes

$$m_2 = \sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{n_1 (x_1 - \bar{x})^2 + n_2 (x_2 - \bar{x})^2 + \dots + n_k (x_k - \bar{x})^2}{n} \tag{1.4}$$

Now, $s^2 = \sum_{i=1}^{k} n_i (x_i - \bar{x})^2 / (n-1) = n m_2 / (n-1)$. From (1.4), for the heights of the boys, $m_2 = 56.19$ and $s^2 = 59.14$, which differ from the actual values 48.76 and 51.33 as a result of the grouping. From the grouped data, the third and fourth central moments are obtained from $m_3 = \sum_{i=1}^{k} f_i (x_i - \bar{x})^3$ and $m_4 = \sum_{i=1}^{k} f_i (x_i - \bar{x})^4$. In general, the $r$th central moment for the grouped data is given by $m_r = \sum_{i=1}^{k} f_i (x_i - \bar{x})^r$.
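The grouped moments can be evaluated from the mid-values and frequencies of Table 1.3. In the sketch below, the function name `grouped_central_moment` is hypothetical, and $\bar{x}$ is taken to be the grouped mean from (1.3), which is an assumption consistent with the figures quoted for $m_2$ and $s^2$.

```python
# Mid-values and frequencies of the seven classes in Table 1.3
mids = [160, 165, 170, 175, 180, 185, 190]
counts = [1, 2, 4, 6, 3, 3, 1]


def grouped_central_moment(mids, counts, r):
    """r-th central moment of grouped data: sum(f_i * (x_i - mean)^r),
    with f_i = n_i / n and the mean computed from the grouped data."""
    n = sum(counts)
    x_bar = sum(c * x for c, x in zip(counts, mids)) / n
    return sum(c * (x - x_bar) ** r for c, x in zip(counts, mids)) / n


n = sum(counts)
m2 = grouped_central_moment(mids, counts, 2)  # equation (1.4)
s2 = n * m2 / (n - 1)                         # s^2 = n * m2 / (n - 1)

print(f"m2 = {m2:.2f}, s2 = {s2:.2f}")
```

The same function with `r = 3` or `r = 4` gives the grouped third and fourth central moments.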

Figure 1.3 Histogram of the distribution of the heights of the boys in Table 1.3, obtained from Minitab (horizontal axis: heights, 160 to 190 cm; vertical axis: frequency, 0 to 6).

1.7 Covariance and correlation

The heights and weights of the 20 boys in Table 1.1 can be denoted by $(x_i, y_i)$, $i = 1, 2, \dots, n$. With the subscripts $(x, y)$ for these characteristics, as presented in Table 1.2, the standard deviations of these characteristics are $s_x = 7.16$ and $s_y = 13.71$. Their covariance is given by

$$s_{xy} = \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + (x_2 - \bar{x})(y_2 - \bar{y}) + \dots + (x_n - \bar{x})(y_n - \bar{y})}{n-1} \tag{1.5}$$

It is the sum of the cross-products of the deviations of $(x_i, y_i)$ from their means divided by $(n-1)$. It can also be expressed as $s_{xy} = \left( \sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y} \right) / (n-1)$. The sample correlation coefficient of $(x, y)$ is

$$r = \frac{s_{xy}}{s_x s_y} \tag{1.6}$$

The correlation coefficient will be positive if $y$ increases with $x$, and negative if $y$ decreases as $x$ increases. In general, the covariance can be positive or negative. It can range from a large negative value to a large positive value, and the units of measurement of both $x$ and $y$ are attached to it. The correlation coefficient, however, ranges from $-1$ to $+1$, and it is devoid of the units of measurement of the two characteristics. If $x$ and $y$ increase together or decrease together, their covariance and correlation will be positive; otherwise they will be negative. If $x$ and $y$ are not related, $s_{xy}$ and $r$ will be close to zero. For the heights and weights of the twenty-year-old boys in Table 1.1, from (1.5), (1.6) and Table 1.2, $s_{xy} = 1814.61/19 = 95.51$ and $r = 95.51/(7.16 \times 13.71) = 0.97$. In this case, these two characteristics are highly positively correlated, as expected. Figure 1.4 displays the relationship of the weights and heights of the twenty boys in Table 1.1.
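The covariance and correlation of the heights and weights can be checked with the following Python sketch. The variable names are illustrative; the covariance is computed both from the cross-products of the deviations, as in (1.5), and from the equivalent form based on the sum of the products $x_i y_i$.

```python
from math import sqrt

# Heights (cm) and weights (kg) of the twenty boys in Table 1.1
heights_cm = [162, 163, 167, 168, 170, 172, 172, 173, 174, 176,
              176, 176, 177, 178, 178, 180, 184, 184, 186, 188]
weights_kg = [54, 55, 58, 59, 60, 62, 63, 66, 68, 72,
              75, 75, 78, 80, 82, 84, 86, 88, 95, 102]

n = len(heights_cm)
x_bar = sum(heights_cm) / n
y_bar = sum(weights_kg) / n

# Covariance from the cross-products of the deviations, equation (1.5)
sxy = sum((x - x_bar) * (y - y_bar)
          for x, y in zip(heights_cm, weights_kg)) / (n - 1)

# Equivalent form: (sum of x_i * y_i - n * x_bar * y_bar) / (n - 1)
sxy_alt = (sum(x * y for x, y in zip(heights_cm, weights_kg))
           - n * x_bar * y_bar) / (n - 1)

# Standard deviations with the divisor n - 1
sx = sqrt(sum((x - x_bar) ** 2 for x in heights_cm) / (n - 1))
sy = sqrt(sum((y - y_bar) ** 2 for y in weights_kg) / (n - 1))

# Sample correlation coefficient, equation (1.6)
r = sxy / (sx * sy)

print(f"sxy = {sxy:.2f}, sx = {sx:.2f}, sy = {sy:.2f}, r = {r:.2f}")
```

The covariance carries the unit cm × kg, whereas $r$ is a pure number between $-1$ and $+1$.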

Figure 1.4 Plot of the weights of the twenty-year-old boys on their heights from the observations in Table 1.1 (horizontal axis: height, 160 to 190 cm; vertical axis: weight, 50 to 100 kg).
