STATISTICAL ANALYSIS OF BLUEGILL SUNFISH DATA
USING LINEAR LOGISTIC REGRESSION
Susan Ng
B.A. (Honors), University of Hong Kong, 1973
PROJECT SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
in the Department
of
Mathematics and Statistics
© Susan Ng 1986
SIMON FRASER UNIVERSITY
August 1986
All rights reserved. This work may not be reproduced in whole or in part, by photocopy
or other means, without permission of the author.
Name: Susan Ng
Degree: Master of Science
Title of project: Statistical Analysis of Bluegill Sunfish Data
Using Linear Logistic Regression
Examining Committee:
Chairman: Dr. A.R. Freedman
Dr. R. Lockhart, Senior Supervisor
Dr. D. Eaves
Dr. R. Routledge, External Examiner
Mathematics and Statistics Department
Simon Fraser University
Date Approved: 8 August 1986
PARTIAL COPYRIGHT LICENSE
I hereby grant to Simon Fraser University the right to lend
my thesis, project or extended essay (the title of which is shown
below) to users of the Simon Fraser University Library, and to
make partial or single copies only for such users or in response
to a request from the library of any other university, or other
educational institution, on its own behalf or for one of its
users. I further agree that permission for multiple copying of
this work for scholarly purposes may be granted by me or the Dean
of Graduate Studies. It is understood that copying or publication
of this work for financial gain shall not be allowed without my
written permission.
Title of Thesis/Project/Extended Essay
Author:
(signature)
(name)
(date)
ABSTRACT
A data set from an artificial breeding experiment on the
bluegill sunfish is analysed. The aim of the experiment is to
test whether alternative reproductive patterns in the bluegill
sunfish are genetically inherited, and to identify factors that
contribute to the different reproductive patterns. Linear
logistic regression models using the maximum likelihood
estimation method are employed. Improvement chi-square
statistics are used to select explanatory variables. Various
goodness-of-fit statistics are used to check the adequacy of the
fitted models. Finally, a Monte-Carlo study is carried out to
check the appropriateness of the selection criterion and the
sensitivity of the goodness-of-fit statistics. The model
identified before the Monte-Carlo study is then reassessed.
ACKNOWLEDGMENTS
I would like to take this opportunity to thank my Senior
supervisor, Dr. R. Lockhart, for his invaluable advice and for
his many helpful suggestions during my M.Sc. studies.
I would also like to thank the members of my Supervising
committee for their valuable comments and advice.
In particular, I am grateful to Dr. D. Eaves for his
supervision during the initial analysis of the data set.
Finally, my special thanks to Dr. M. Gross for supplying the
details of the sunfish experiment.
TABLE OF CONTENTS
Approval
Abstract
Acknowledgments
List of Figures
List of Tables
FOREWORD
I. EXPERIMENT
    Data
II. CHOICE OF MODEL
    Problem Set-up
    Choice of Link Function
    Linear Logistic Model
    BMDP Statistical Software
III. EXPLANATORY VARIABLES
    Improvement Chi-square Statistic
    Result of Selection of Explanatory Variables
    Monte-Carlo Study of Improvement Chi-square Statistic
    Reassessment of Fitted Model
    Genetic-Environmental Interaction
IV. CHECKING ADEQUACY OF MODELS - RESIDUAL ANALYSIS
    Residuals for Linear Logistic Models
    Checking Adequacy of Models
V. CHECKING ADEQUACY OF MODELS - GOODNESS-OF-FIT TEST
    Hosmer's Goodness-of-fit Statistic
    C.C. Brown's Goodness-of-fit Statistic
    Likelihood Ratio Statistic
    Pearson's Chi-square Statistic
    Monte-Carlo Study of Quality of Chi-square Approximation
    Checking Adequacy of Models
    Comparison of Model-A and Model-B
VI. TESTS
VII. ANALYSIS OF ENVIRONMENTAL EFFECTS
    Results
VIII. CONCLUSION
Appendix A
Bibliography
LIST OF FIGURES
Figure
A.1   Plot of Residual versus Predicted Proportion Precocious : Model-A
A.2   Normal Probability Plot : Model-A
A.3   Plot of Observed versus Predicted Proportion Precocious : Model-A
B1.1  Plot of Residual versus Predicted Proportion Precocious : Model-B (Collapsed Cells)
B1.2  Normal Probability Plot : Model-B (Collapsed Cells)
B1.3  Plot of Observed versus Predicted Proportion Precocious : Model-B (Collapsed Cells)
B2.1  Plot of Residual versus Predicted Proportion Precocious : Model-B (96 Cells)
B2.2  Normal Probability Plot : Model-B (96 Cells)
B2.3  Plot of Observed versus Predicted Proportion Precocious : Model-B (96 Cells)
LIST OF TABLES
Table
2     Variable Selection by Improvement Chi-square Statistic
3     Simulation Results of Improvement Chi-square Statistic
4     Simulation Results of Goodness-of-fit Statistics
1.1   Number of Fish Survived Including Unsacrificed
1.2   Number of Fish Survived and Sacrificed
1.3   Number of Female Fish Sacrificed
1.4   Number of Male Fish Sacrificed
1.5   Number of Precocious Males Sacrificed
      Ranking of Individual Cuckolder Fathers : Model-A
      Ranking of Individual Cuckolder Fathers : Model-B
      Ranking of Individual Mothers : Model-A
      Ranking of Individual Mothers : Model-B
FOREWORD
An important recent discovery in biology is the existence of
alternative mating strategies in some fish populations. In these
fish populations, some males mature precociously while only 5-25% of
the body size of 'normal' adult males and mate by sneaking into the
latter's nests.
In the bluegill sunfish population in Lake Opinicon in
Ontario, 'normal' males (called parentals) mature at the age of
seven to eight, build nests and provide care for the offspring.
The 'precocious' males (called cuckolders) mature at the age of
one to two years and fertilize eggs in nests of parental males
either by sneaking or mimicking the behavior of females. These
small cuckolder males provide no parental care for their
offspring.
An artificial breeding experiment on the bluegill sunfish
was conducted in 1982/1983 by Dr. M. Gross of the Biological
Sciences Department of Simon Fraser University and Dr. D.
Philipp of Illinois Natural History Survey. The aim of the
experiment was to test whether alternative reproductive patterns
in the bluegill sunfish (precocial maturity) are genetically
inherited, and to identify factors, genetic or environmental,
which contribute to the different mating patterns.
The experiment is described briefly in Chapter 1. Various
models for analysing the data are studied in Chapter 2. Linear
logistic regression was found to be the most appropriate model
and the PLR program of the BMDP Statistical Software was used
for analysing the data set. Chapter 3 describes how the
explanatory variables are selected. It also discusses the result
of a small-scale Monte-Carlo study aimed at investigating the
appropriateness of the selection criterion. Two models were
found appropriate. They are checked for adequacy by residual
analysis in Chapter 4 and by goodness-of-fit tests in Chapter 5.
Also in Chapter 5, the results of a Monte-Carlo study to test
the quality of the chi-square approximation to the distributions
of various goodness-of-fit statistics are discussed. Results of
various tests and inferences are summarized in Chapter 6.
Lastly, in Chapter 7, an environmental factor, the pond effect,
is further analysed.
CHAPTER I
EXPERIMENT
Reproductively mature parental males, cuckolder males, and
female bluegill sunfish were collected on 27 June 1982 from an
active breeding colony in Sandy Bay, Lake Opinicon. Eight
females of age five were crossed with six cuckolder males of age
three and six parental males of age eight. They were selected to
be of similar body size and of the mean age for their
type. Sperm from each of the parental and cuckolder males was
used to fertilize approximately 200 eggs from each female.
Approximately 19200 progeny of the 96 crosses were reared
for 24 hours in a laboratory at Queen's University Biological
Station, Lake Opinicon. They were then transferred to a
laboratory at the Illinois Natural History Survey (INHS),
Champaign, Illinois. The eggs and emerging fry from each cross
were placed in 96 separate 40-litre tanks. In August, the
crosses were transferred to a greenhouse containing 96 200-litre
tanks.
In September 1982, about 30 progeny per tank were chosen
randomly and tagged to identify both mother and father. They
were then transferred to four ponds at the INHS Aquatic Research
Field Laboratory where they were reared together on a natural
diet. Each pond contained all the progeny of two females. The
ponds were also stocked with six mature parental and twelve
mature female bluegill sunfish from a local Illinois population
to simulate natural breeding conditions.
In June 1983, about one year after establishing the crosses
and at the height of natural bluegill breeding activity, the
ponds were drained and the progeny captured.
190 fish from Pond 2 were chosen randomly and kept alive for
future experiments and the remaining 1159 progeny were killed.
Their parents were identified and their weight and length were
measured. Their gonads, while fresh, were examined under a
dissecting microscope. The sex and reproductive state of all but
58 (31 from parental fathers and 27 from cuckolder fathers) were
readily determined. Of the remaining 1101 fish, 432 were male.
166 males were identified and categorized as precociously mature
by the presence of free sperm in their testes.
The 190 fish which were retained alive were tested for the
presence of sperm by light pressure on the lower abdomen body
wall. Individuals that shed sperm were tagged by left opercular
clips and those that did not with right opercular clips. Then
all fish were released into a pond containing mature adult male
and female bluegill sunfish. Observation of spawning activity in
the ponds in the next year revealed that only those tagged by
left opercular clips were behaving as cuckolders. Thus, in this
experiment, the presence of free sperm at the age of one could
be used to clearly distinguish cuckolder males from parental males.
Data
The counts (number of fish survived, number of fish
sacrificed, number of females sacrificed, number of males
sacrificed and number of precocious males sacrificed), classified
by crosses (families), are summarised in Tables 1.1 to 1.5.
There were altogether 432 male offspring killed and
identified, that is, an average of 4.5 male offspring from each
family. However, they were not evenly distributed. There were 9
families with no male offspring, 46 with one to four, 34 with
five to nine and 7 with ten or more.
Of these 432 male offspring, 195 were fathered by parental
males. Of these 195, 55 or 28% were precocious. On the other
hand, 111 or 47% out of 237 sons fathered by cuckolders were
precocious.
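The proportions quoted in this section can be checked directly from the counts. The following is a minimal Python sketch of that arithmetic; the variable names are illustrative and are not part of the original analysis.

```python
# Counts quoted in the text (male offspring sacrificed and identified).
sons_of_parentals = 195
precocious_of_parentals = 55
sons_of_cuckolders = 237
precocious_of_cuckolders = 111

total_males = sons_of_parentals + sons_of_cuckolders
total_precocious = precocious_of_parentals + precocious_of_cuckolders

# Proportion precocious by father type, as percentages.
pct_parental = 100.0 * precocious_of_parentals / sons_of_parentals
pct_cuckolder = 100.0 * precocious_of_cuckolders / sons_of_cuckolders

# Average number of male offspring per family (96 crosses).
mean_males_per_family = total_males / 96

print(f"{total_males} males, {total_precocious} precocious")
print(f"parental fathers:  {pct_parental:.1f}% precocious")
print(f"cuckolder fathers: {pct_cuckolder:.1f}% precocious")
print(f"mean males per family: {mean_males_per_family:.1f}")
```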
CHAPTER II
CHOICE OF MODEL
Problem Set-up
The objective of the experiment is to test whether precocial
maturity in the bluegill sunfish is genetically inherited. This
hypothesis predicts a greater proportion of precocious sons
among the male progeny of cuckolder fathers as compared to those
of parental fathers.
Within family or cell $i$ ($i = 1, \ldots, N$), there are $n_i$ male
offspring, $Y_{ij}$ ($j = 1, \ldots, n_i$). Each of these $Y_{ij}$'s takes one of
two possible forms: precocious or non-precocious. Let $Y_{ij}$ take
on value 1 if it is precocious and 0 if it is non-precocious.
Let
$$Y_i = \sum_{j=1}^{n_i} Y_{ij}.$$
Then $Y_i$ is the number of precocious sons in the $i$th family.
Assuming the probability of becoming precocious, $\theta_i$, is constant
for all individuals in the same family and the observations on
all individuals are independent, then the distribution of $Y_i$
given $n_i$ is binomial with index $n_i$ and parameter $\theta_i$.
The saturated model describing the data has $N$ parameters $\theta_1$,
..., $\theta_N$ where $\theta_i$ is estimated by
$$\hat\theta_i = y_i / n_i.$$
The saturated model yields a perfect fit to the data but
does not help us understand the underlying process generating
the data, or the underlying structure of the data. To get this
understanding, we need to replace the individual data values,
the $\hat\theta_i$'s, by a summary (model) that describes their general
characteristics in terms of fewer parameters.
We suppose that the probability of becoming precocious, $\theta_i$,
when suitably transformed depends linearly on some explanatory
variables or covariates. That is,
$$h(\theta_i) = X_i \beta,$$
where $h$ is a function of $\theta_i$ and is usually called the link
function,
$X_i$ is the $i$th row of the $N \times p$ matrix of explanatory
variables (known constants), and
$\beta$ is a vector of $p$ unknown parameters.
Choice of Link Function
An obvious candidate for the link function is the identity
transform $h(\theta_i) = \theta_i$. However it has the serious restriction
that the model may lead to some fitted values of $X_i\beta$, and thus
some fitted values of $\theta_i$, outside the range [0, 1].
The variance-stabilizing angular transform
$$\lambda_i = h(\theta_i) = \arcsin\sqrt{\theta_i}$$
has the nice property that $\arcsin\sqrt{y_i/n_i}$ (where $y_i$ is the
number of successes in cell $i$) has asymptotic normal
distribution with mean $\lambda_i$ and variance $1/(4n_i)$, which is
independent of $\theta_i$. However, similar to the identity
transformation, it is limited for general usefulness by its
finite range.
Two other transformations have received considerable
attention. One is the integral normal transform
$$\lambda_i = \Phi^{-1}(\theta_i),$$
where $\Phi(t)$ is the value of the cumulative normal curve at the
point $t$.
The other is the log-odds transform
$$\lambda_i = \log\frac{\theta_i}{1-\theta_i}.$$
Both transformations always satisfy the constraint $0 \le \theta_i \le 1$
and are very similar over the range $0.1 \le \theta_i \le 0.9$ when
appropriately standardized. However the integral normal
transform has the disadvantage, compared with the log-odds
transform, of the absence of sufficient statistics. Also, the log-odds
transform is preferred in this analysis because of its simple
interpretation as the logarithm of the odds. Models
involving such a transformation are logistic models.
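The similarity of the log-odds and integral normal transforms over the central range can be illustrated numerically. The sketch below, written in Python with only the standard library, tabulates both transforms and their ratio; the grid of probabilities is an arbitrary choice made for illustration. A ratio that stays nearly constant means that one transform is close to a rescaled version of the other.

```python
import math
from statistics import NormalDist

def logit(theta):
    """Log-odds transform."""
    return math.log(theta / (1.0 - theta))

def probit(theta):
    """Integral normal transform: inverse of the standard normal CDF."""
    return NormalDist().inv_cdf(theta)

def angular(theta):
    """Variance-stabilizing angular transform (finite range [0, pi/2])."""
    return math.asin(math.sqrt(theta))

# Illustrative grid; 0.5 is skipped because both transforms are zero there.
thetas = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ratios = [logit(t) / probit(t) for t in thetas]

for t, r in zip(thetas, ratios):
    print(f"theta={t:.1f}  logit={logit(t):+.3f}  "
          f"probit={probit(t):+.3f}  ratio={r:.3f}")
```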
Linear Logistic Model
The model
$$\lambda_i = \log\frac{\theta_i}{1-\theta_i} = X_i\beta,$$
where $X_i = [x_{i1} \ldots x_{ip}]$
and $\beta = [\beta_1 \ldots \beta_p]^T$,
is a linear logistic model.
There are two ways of estimating the unknown parameters $\beta$.
One is based on non-iterative weighted least squares and the
other is the maximum likelihood method.
1. Non-iterative Weighted Least Squares Method
The number of successes $Y_i$ in each cell $i$ of size $n_i$ is
distributed as binomial with index $n_i$ and parameter $\theta_i$. For
large $n_i$, provided that $\theta_i$ is not too close to 0 or 1,
$$u_i = \log\frac{y_i/n_i}{1-y_i/n_i}$$
is nearly normally distributed. As $n_i$ tends to infinity, the
asymptotic mean and variance are respectively
$$\log\frac{\theta_i}{1-\theta_i} \quad\text{and}\quad \frac{1}{n_i\theta_i(1-\theta_i)}.$$
The asymptotic variance is consistently estimated by
$$\frac{n_i}{y_i(n_i-y_i)}.$$
These estimated variances can be used to obtain weights and
thus the weighted least squares method can be applied.
The expression for $u_i$ is undefined when $y_i$ equals 0 or
$n_i$. The $u_i$'s need to be transformed to
$$u_i = \log\frac{y_i/n_i + 1/(2n_i)}{1 - y_i/n_i + 1/(2n_i)}. \qquad (*)$$
The estimate for the asymptotic variance is also modified
accordingly.
The computation in the weighted least squares method is
less complex and no iteration is involved. Also various
graphical analyses are possible by treating the $u_i$'s as
approximately normally distributed with known variances.
However if some $n_i$'s are small, the Central Limit Theorem
cannot be applied. Moreover, if some $n_i$'s are small and some
$y_i$'s are equal to 0 or $n_i$, the addition of the term $1/(2n_i)$ in
the expression (*) might change the relative magnitude of
the data in an undesirable way. Lastly, since this method is
not based on sufficient statistics, some efficiency may be
lost.
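The weighted least squares procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the computation used in this project: the cell counts are hypothetical, there is a single 0/1 covariate, and the variance estimate 1/(y + 1/2) + 1/(n - y + 1/2) is one commonly used modified form that is assumed here for the sake of a runnable example.

```python
import math

def empirical_logit(y, n):
    """Modified logit (adds 1/2 to each count) and an assumed variance estimate."""
    u = math.log((y + 0.5) / (n - y + 0.5))
    v = 1.0 / (y + 0.5) + 1.0 / (n - y + 0.5)  # assumption: one common choice
    return u, v

def weighted_line_fit(x, u, w):
    """Closed-form weighted least squares for u = b0 + b1 * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ubar = sum(wi * ui for wi, ui in zip(w, u)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxu = sum(wi * (xi - xbar) * (ui - ubar) for wi, xi, ui in zip(w, x, u))
    b1 = sxu / sxx
    b0 = ubar - b1 * xbar
    return b0, b1

# Hypothetical cells: x = 1 for a cuckolder father, 0 for a parental father.
x = [0, 0, 0, 1, 1, 1]
y = [1, 0, 2, 3, 5, 4]   # precocious sons per cell
n = [5, 4, 8, 6, 9, 7]   # male offspring per cell

pairs = [empirical_logit(yi, ni) for yi, ni in zip(y, n)]
u = [p[0] for p in pairs]
w = [1.0 / p[1] for p in pairs]   # weights are reciprocal variances
b0, b1 = weighted_line_fit(x, u, w)
print(f"intercept={b0:.3f}, father-type effect={b1:.3f}")
```

No iteration is needed, and the cell with y = 0 still yields a finite value of u because of the added 1/2.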
2. Maximum Likelihood Method
The second method for estimating the parameter $\beta$ is the
maximum likelihood method. This method uses the sufficient
statistics
$$t_k = \sum_{i=1}^{N} x_{ik} y_i, \qquad k = 1, \ldots, p.$$
The likelihood is proportional to
$$\prod_{i=1}^{N} \theta_i^{y_i}(1-\theta_i)^{n_i-y_i} = \prod_{i=1}^{N} \frac{\exp(y_i X_i\beta)}{[1+\exp(X_i\beta)]^{n_i}},$$
so that, apart from a constant, the log likelihood is
$$L(\beta) = \sum_{k=1}^{p} \beta_k t_k - \sum_{i=1}^{N} n_i \log[1+\exp(X_i\beta)].$$
The $jk$th entry of the information matrix $I(\beta)$ is given
by
$$I_{jk}(\beta) = \sum_{i=1}^{N} n_i x_{ij} x_{ik} \theta_i (1-\theta_i).$$
Under regularity conditions, the estimate $\hat\beta$ has an
approximate multivariate normal distribution with
covariance matrix $I^{-1}$ which can be estimated by
$I^{-1}(\hat\beta)$.
There are nice asymptotic results for some test
statistics. They can be used to select the 'best' set of
explanatory variables and to test whether a model fits
the data adequately.
The main problem in using the maximum likelihood
method is a computational one of maximizing the log
likelihood. This can be solved by using statistical
packages like BMDP.
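To make the computational problem concrete, the sketch below maximizes the binomial log likelihood by Newton-Raphson iteration in plain Python. It is an illustration only and is not the algorithm of any particular package: the data are hypothetical, the helper names are invented for this example, and the score and information expressions assumed are the standard ones for the logistic model, namely score_k = sum_i x_ik (y_i - n_i theta_i) and I_jk = sum_i n_i x_ij x_ik theta_i (1 - theta_i).

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    m = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(m):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [ar - f * ac for ar, ac in zip(aug[r], aug[c])]
    return [aug[r][m] / aug[r][r] for r in range(m)]

def fitted(X, beta):
    """theta_i = exp(X_i beta) / (1 + exp(X_i beta))."""
    return [1.0 / (1.0 + math.exp(-sum(x * b for x, b in zip(row, beta))))
            for row in X]

def information(X, n, theta):
    """I_jk = sum_i n_i x_ij x_ik theta_i (1 - theta_i)."""
    p = len(X[0])
    return [[sum(ni * ti * (1.0 - ti) * row[j] * row[k]
                 for row, ni, ti in zip(X, n, theta))
             for k in range(p)] for j in range(p)]

def fit_logistic(X, y, n, tol=1e-10, max_iter=50):
    """Return the maximum likelihood estimate and its estimated covariance."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        theta = fitted(X, beta)
        score = [sum(row[k] * (yi - ni * ti)
                     for row, yi, ni, ti in zip(X, y, n, theta))
                 for k in range(p)]
        step = solve(information(X, n, theta), score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    info = information(X, n, fitted(X, beta))
    # Inverse information, built column by column from unit vectors.
    cov = [solve(info, [1.0 if j == k else 0.0 for j in range(p)])
           for k in range(p)]
    return beta, cov

# Hypothetical cells: constant term plus a 0/1 father-type indicator.
x = [0, 0, 0, 1, 1, 1]
y = [1, 0, 2, 3, 5, 4]
n = [5, 4, 8, 6, 9, 7]
X = [[1.0, float(xi)] for xi in x]
beta, cov = fit_logistic(X, y, n)
print("estimates:", [round(b, 4) for b in beta])
print("standard errors:", [round(math.sqrt(cov[k][k]), 4) for k in range(2)])
```

With a single 0/1 covariate the fitted probabilities equal the pooled proportions in the two groups, which gives a simple check on the iteration.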
BMDP Statistical Software
The PLR program of the BMDP Statistical Software was used
for analysing this data set. It computes maximum likelihood
estimates using an iteratively reweighted Gauss-Newton method.
It also calculates the asymptotic variance-covariance matrix of
the estimates.
It can handle continuous or categorical explanatory
variables and generates design variables for the categorical
ones. It can select explanatory variables to be included in the
model in a stepwise manner based on the maximum likelihood ratio.
It also provides, at each step, the log-likelihood, the
change in log-likelihood from the previous step, and three
goodness-of-fit statistics, namely the likelihood ratio
statistic, Hosmer's statistic and Brown's statistic.
Predicted probabilities, standardized residuals and various
scatter plots are also available on request.
CHAPTER III
EXPLANATORY VARIABLES
The biologists believe that the factors affecting whether a
male offspring will become precocious or not can be classified
into environmental and genetic.
The only environmental explanatory factor for which the
experiment provides data was the pond effect. The genetic factor
was accounted for by the father-type effect, an individual
father effect, an individual mother effect and possibly
individual father crossed individual mother interaction effects.
The individual father effect was nested in father-type
because there were six fathers of father-type parental and six
fathers of father-type cuckolder. Also, due to the design of the
experiment, the individual mother effect was nested in pond
because the offspring of two mothers were stocked in each of the
four ponds.
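If $n_{ijkl}$ and $Y_{ijkl}$ are respectively the number of males and
the number of precocious males in the family of father $k$ of
father-type $i$ and mother $l$ of pond $j$, then the full model
suggested was
$$Y_{ijkl} \sim \mathrm{Bin}(n_{ijkl}, \theta_{ijkl}),$$
$$\log\frac{\theta_{ijkl}}{1-\theta_{ijkl}} = \mu + \alpha_i + \gamma_j + \phi_{k(i)} + \delta_{l(j)} + (\phi\delta)_{k(i)l(j)},$$
where $\mu$ is the constant term, $\alpha_i$ is the father-type effect,
$\gamma_j$ is the pond effect, $\phi_{k(i)}$ is the individual father effect
nested in father-type, $\delta_{l(j)}$ is the individual mother effect
nested in pond and $(\phi\delta)_{k(i)l(j)}$ is the individual father crossed
individual mother interaction effect, subject to constraints
that make the parameters identifiable.
Apart from 'pond' and 'father-type', the rest were nested effects. Therefore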
the design variables for these nested effects had to be
generated manually. As a result, they were treated as separate
continuous variables by the program and the stepwise selection
feature could not be used. Rather, the explanatory variables to
be included in the model had to be specified every time the
program was run.
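The manual generation of design variables for a nested effect can be sketched as follows. The Python example below builds indicator columns for 'individual father' nested in father-type; the reference-level coding (the first father of each type as baseline, giving five columns for six fathers) and the helper name `nested_indicators` are assumptions made for illustration and are not taken from the original program runs.

```python
def nested_indicators(level, group, target_group, n_levels):
    """Indicator columns for levels 2..n_levels of a factor nested in target_group.

    Rows belonging to another group get zeros in every column; level 1 is the
    reference level (an assumed coding, chosen for illustration).
    """
    rows = []
    for lev, grp in zip(level, group):
        rows.append([1 if (grp == target_group and lev == k) else 0
                     for k in range(2, n_levels + 1)])
    return rows

# Hypothetical family records: father-type ("P" parental, "C" cuckolder)
# and the father's number (1..6) within his type.
father_type = ["P", "P", "C", "C", "C"]
father_no = [1, 3, 1, 2, 6]

# Five columns per father-type, matching the 5 parameters quoted for
# 'individual cuckolder father'.
cf_columns = nested_indicators(father_no, father_type, "C", 6)
pf_columns = nested_indicators(father_no, father_type, "P", 6)

for row in cf_columns:
    print(row)
```

Because each column is an ordinary numeric variable, a program that does not know about the nesting treats the columns as separate continuous covariates, which is why a group of them has to be entered or removed together by hand.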
The selection process of the explanatory variables was
sequential and the order in which they were considered was a
hierarchical one, that is, main effects followed by nested
effects and lastly by interaction effects. The variables that
significantly improved the fit of the model were retained while
those that did not were excluded from the model.
Improvement Chi-square Statistic
Whether a variable or a group of variables significantly
improves the fit of the model can be quantified by the
'improvement chi-square statistic'. It is twice the logarithm of
the ratio of the current to the previous maximized likelihood
function values. Suppose there are two models, model(1) and model(2), and
that model(2) contains a variable or a group of variables in
addition to the variables in model(1). The likelihood ratio
statistic of model(2) is
$$G^2(2) = 2[L(\hat\theta) - L(\hat\beta_2)],$$
where $L(\hat\theta)$, $L(\hat\beta_2)$ are the log likelihood of the saturated model
and of model(2) respectively. Then
$$G^2[(2)|(1)] = G^2(1) - G^2(2) = 2[L(\hat\beta_2) - L(\hat\beta_1)].$$
Under model(1), $G^2[(2)|(1)]$ is asymptotically distributed as chi-square
with degrees of freedom equal to the number of parameters that
are in model(2) but not in model(1). To test the hypothesis that
model(1) holds against the alternative hypothesis that model(2)
holds but model(1) does not, we use $G^2[(2)|(1)]$ as the test
statistic. A small p-value indicates that the new variable or
group of variables added significantly improves the fit of the
model to the data. $G^2[(2)|(1)]$ is called the improvement
chi-square statistic or the reduction in deviance.
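The calculation can be illustrated with a short Python sketch. The two log-likelihood values are those reported in Table 2 for Model-E and Model-F (4 added parameters for 'mother'); the function `chi2_sf` is a hand-written upper tail probability for integer degrees of freedom, included only so that the example is self-contained, and is not part of the BMDP output.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability P(chi-square with df degrees of freedom > x)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: normal tail plus a finite sum.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))      # 2 * (1 - Phi(sqrt(x)))
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)               # x^(r-1/2) / (1*3*...*(2r-1))
        total += term
    pdf = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    return tail + 2.0 * pdf * total

# Log likelihoods from Table 2.
loglik_model_e = -255.459   # C, P, T, CF
loglik_model_f = -247.071   # C, P, T, CF, M
added_parameters = 4

improvement = 2.0 * (loglik_model_f - loglik_model_e)
p_value = chi2_sf(improvement, added_parameters)
print(f"improvement chi-square = {improvement:.3f}, p-value = {p_value:.4f}")
```

The resulting p-value can be compared with the entry for Model-F in Table 2.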
Result of Selection of Explanatory Variables
The data in Tables 1.4 and 1.5 are the $n_{ijkl}$'s and the
$Y_{ijkl}$'s respectively in this analysis. There were altogether 96
families, 9 of which had no sacrificed male offspring. Thus
there were 87 effective covariate patterns.
The log-likelihood of each model considered and the
corresponding improvement chi-square statistic for each group of
variables added are presented in Table 2.
Table 2
Variable Selection by Improvement Chi-square Statistic

MODEL  PARAMETERS IN MODEL             PARAMETERS ADDED                   LOG         D.F.  IMPROV. CHI-SQ.
                                                                          LIKELIHOOD        P-VALUE
B      C, Pond (P)                     Pond                               -270.079     3    < 0.001
D      C, P, T, Parental Father (PF)   Parental Father                    -259.099     5    0.220
E      C, P, T, Cuckolder Father (CF)  Cuckolder Father                   -255.459     5    0.014
                                       (over Model-C)
F      C, P, T, CF, Mother (M)         Mother                             -247.071     4    0.002
                                       (over Model-E)
G      C, P, T, CF, M,                 Parental Fa./Mother Inter.         -227.386    20    0.006
       PF/M Interaction                (over Model-F)
H      C, P, T, CF, M,                 Cuckolder Fa./Mother Inter.        -230.467    20    0.032
       CF/M Interaction                (over Model-F)
I      C, P, T, CF, M,                 Parental Fa./Mother Inter.         -210.787    20    0.006
       PF/M Inter., CF/M Inter.        (over Model-H)
                                       Cuckolder Fa./Mother Inter.        -210.787    20    0.032
                                       (over Model-G)
The first variable to enter the model was 'pond'. The
improvement chi-square statistic over the null model with the
constant term was very significant with p-value less than 0.005.
'Father-type' was the next variable considered. The
improvement chi-square statistic was again very significant
(p-value = 0.0001) indicating that it should be in the model.
However when 'individual parental father' was added to the
model, the improvement chi-square statistic had a p-value of 0.22,
meaning that these effects did not contribute significant improvement to
the model; therefore they were left out of the model.
Next, 'individual cuckolder father' was tried and the
improvement chi-square statistic was significant (p-value =
0.014).
The 'individual mother' effect was incorporated into the
model. The improvement chi-square statistic showed a highly
significant p-value of 0.002 indicating that it should be
included.
To see whether the order in which the explanatory variables
were entered would affect the result of the variable selection,
the 'parental father', 'cuckolder father', and 'mother' effects
were entered in different orders into the model after 'pond' and
'father-type'. In all cases, the 'parental father' effect was
insignificant while the other two were significant.
The next group of variables to enter the model was the
interaction effect. Both the 20-parameter 'parental father
crossed mother interactions' and the 20-parameter 'cuckolder
father crossed mother interactions' were significant when they
were entered separately into the model.
Also the 'parental father crossed mother interactions' were
significant when 'cuckolder father crossed mother interactions'
were in the model and vice versa.
Based on the improvement chi-square statistics, the model
selected includes the explanatory variables 'pond',
'father-type', 'individual cuckolder father', 'individual
mother', 'parental father crossed mother interaction' and
'cuckolder father crossed mother interaction'. The
number of parameters is 54, which is not a small fraction of the
number of effective covariate patterns (87). Thus it is desirable to
study the appropriateness of the improvement chi-square
statistic in analysing this data set.
Monte-Carlo Study of Improvement Chi-square Statistic
1. Objective
The objective of this study was to investigate the
reliability of the improvement chi-square statistic in the
selection of explanatory variables. The decision of whether
to include a parameter or a group of parameters in the
current model was based on the improvement chi-square
statistic which has approximately a chi-square distribution.
If the approximation is good, then one is more confident
about the selected model. Otherwise, this selection
procedure may have included more variables than it should,
or may have left out some variables which are important.
Moreover the quality of the chi-square approximation may
differ with the number of degrees of freedom relative to the
number of covariate patterns.
2. Simulation I
The model with 14 parameters (1 for 'father-type', 3 for 'pond',
5 for 'individual cuckolder father', 4 for 'individual mother' and
1 for the general mean) was fitted to the original data set.
The proportion precocious for each
cell predicted from this model was taken as the true
probability of success. These 'true' probabilities of
success, $\theta_i$, together with the total number of male fish in
each cell, $n_i$, were used to simulate 200 binomial random
variables with index $n_i$ and parameter $\theta_i$ for each cell. The GGBN
routine of the IMSL library was used.
For each simulated sample, four goodness-of-fit
statistics were calculated for further analysis. Then the
data were fitted separately to two extended models. Extended
Model 1 has 20 additional parameters (cuckolder father
crossed mother interactions) and Extended Model 2 has 40
additional terms (parental father and cuckolder father
crossed mother interactions). The improvement chi-square
statistic was calculated for each of the 200 data sets for
each model. As can be seen from Table 3 below, when the
nominal 0.05 level test was used, the Extended Model 1 (20
additional parameters) was significant 12% of the time while
the Extended Model 2 (40 additional parameters) was
significant 24% of the time. The result suggests that for
this analysis, the selection criterion using improvement
chi-square statistics tends to include more explanatory
variables than it actually should. A possible reason is that
the chi-square approximation to its distribution is not good
when the number of degrees of freedom is large.
3. Simulation II
To investigate further, another Monte-Carlo study was
carried out. This time, the model with 10 parameters (1 for
'father-type', 3 for 'pond', 5 for 'individual cuckolder
father' and 1 for the grand mean) was fitted to the original
data set. Following the same
procedure as in Simulation I, simulated samples were generated.
For each simulated sample, the goodness-of-fit statistics
were computed. Also the data were fitted to an extended
model with 4 additional parameters ('individual mother'
effect). The improvement chi-square statistic was then
calculated for each data set. Since the analyses of these
simpler models cost much less than the ones involving
interaction terms, altogether 500 samples were simulated and
analysed. The nominal 5% level test was used. The
improvement chi-square statistics this time suggest that the
extra 4 parameters are needed 5.2% of the time.
Table 3
Simulation Results of Improvement Chi-square Statistic

                                                              Observed Significance Level
                                                              at a Nominal 5% Level
Simulation I    Extended Model 1 (20 additional parameters)   12.0%
                Extended Model 2 (40 additional parameters)   24.0%
Simulation II   Extended Model (4 additional parameters)       5.2%
4. Summary of Results
The two Monte-Carlo studies suggest that the chi-square
approximation for the improvement chi-square statistic works
well when based on 4 degrees of freedom but is poor with 20
or more degrees of freedom. Thus it is a good selection
criterion for explanatory variables when the number of
additional parameters is small. However it tends to include
more parameters than necessary when the number of additional
parameters is large.
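The design of such a Monte-Carlo study can be sketched in Python. The example below is a deliberately simplified analogue and not the study reported here: the smaller model is a single common probability, the extended model is the saturated one (so both fits are available in closed form and no iterative fitting is needed), and the cell sizes, the probability 0.4 and the random seed are all hypothetical. It estimates how often a nominal 5% test on 20 degrees of freedom rejects when the smaller model is true.

```python
import math
import random

def chi2_sf_even(x, df):
    """Upper tail probability of chi-square for even df (closed form)."""
    half = max(x, 0.0) / 2.0
    term, total = 1.0, 1.0
    for r in range(1, df // 2):
        term *= half / r
        total += term
    return math.exp(-half) * total

def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def binom_loglik(y, n, theta):
    """Binomial log likelihood, omitting the binomial coefficients."""
    return sum(xlogy(yi, ti) + xlogy(ni - yi, 1.0 - ti)
               for yi, ni, ti in zip(y, n, theta))

def improvement_stat(y, n):
    """Twice the gain in log likelihood of the saturated over the pooled model."""
    pooled = sum(y) / sum(n)
    ll_small = binom_loglik(y, n, [pooled] * len(n))
    ll_big = binom_loglik(y, n, [yi / ni for yi, ni in zip(y, n)])
    return 2.0 * (ll_big - ll_small)

def simulate_rate(n, theta0, reps, alpha, rng):
    """Fraction of simulated data sets in which the extra parameters look needed."""
    df = len(n) - 1
    rejections = 0
    for _ in range(reps):
        y = [sum(rng.random() < theta0 for _trial in range(ni)) for ni in n]
        if chi2_sf_even(improvement_stat(y, n), df) < alpha:
            rejections += 1
    return rejections / reps

rng = random.Random(1986)                     # arbitrary seed
cell_sizes = [1 + (i % 5) for i in range(21)]  # 21 small cells, 20 extra parameters
observed_rate = simulate_rate(cell_sizes, 0.4, 200, 0.05, rng)
print(f"observed significance level at nominal 5%: {100 * observed_rate:.1f}%")
```

An observed level far from 5% in such a simulation indicates that the chi-square approximation is unreliable for that combination of cell sizes and degrees of freedom.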
Reassessment of Fitted Model
The selected model for the actual data set was a logistic
model with explanatory variables 'pond', 'father-type',
'individual cuckolder father', 'mother', 'parental father
crossed mother interaction' and 'cuckolder father crossed mother
interaction'.
Judging from the result of the Monte-Carlo studies, the
'individual parental father' effect should definitely be
excluded from the model while 'pond', 'father-type', 'individual
cuckolder father' and 'mother' should be in the model. However
there is doubt whether the 40 interaction terms should be
included since the Monte-Carlo studies showed that the
chi-square approximation of the improvement statistic tends to
understate the p-value when degrees of freedom are large. The
p-value for including the 'cuckolder father crossed mother
interaction' was 0.032. Assuming a 0.03 level of significance
test, the Monte-Carlo studies indicated that 9.2% of the time
the 20 additional parameters were unnecessarily included. Thus there
is further evidence that 'cuckolder father crossed mother
interaction' can be left out of the model.
On the other hand, the p-value of the improvement of fit by
including the 'parental father crossed mother interaction' was
0.006. With a significance level of 0.006, the Monte-Carlo
studies indicated that 6 out of 200 times, or 3% of the time,
the 20 additional parameters were unnecessarily included. Thus the
'parental father crossed mother interaction' effect was not as
significant as it appeared. It may be desirable to exclude it
from the model as well.
Therefore there are two possible models, A and B. Model-A
has larger improvement in log likelihood and has 54 parameters.
Model-B is more parsimonious with 14 parameters.
Genetic-Environmental Interaction
With Model-B, genetic-environmental interaction effect was
studied by including pond by father-type interaction parameters
in the model. The improvement chi-square statistic was not
significant at the 5% level (p-value = 0.09). Thus this data set
cannot be said to reveal an effect of interactions upon
precociality.
CHAPTER IV
CHECKING ADEQUACY OF MODELS - RESIDUAL ANALYSIS
In the last Chapter, two models A and B were identified.
Model-A has explanatory variables 'pond', 'father-type',
'individual cuckolder father', 'individual mother', 'parental
father crossed mother interaction' and 'cuckolder father crossed
mother interaction'. Model-B is similar to Model-A but with no
interaction effects.
In this chapter and the next, we examine whether the two
models adequately fit the data. Residual analysis and
goodness-of-fit tests are employed. The former is dealt with in
the following paragraphs and the latter will be discussed in
Chapter 5.
Residual analysis plays an important role in checking the
adequacy of a model. Residual plots can reveal points with large
residuals or patterns in the residuals. Points with large
residuals may indicate outliers that should be carefully checked
in the original data. Patterns in the plots suggest the need for
a better model.
Residuals for Linear Logistic Models
There are different ways of constructing residuals for
linear logistic models. McCullagh and Nelder [1983] suggest the
Pearson residual and the adjusted deviance residual. The PLR program
uses a residual standardized by an adjusted estimate of the variance.
1. Pearson Residual
The Pearson residual is the simple residual scaled by the
estimated standard deviation of $Y_i/n_i$:
\[ \frac{y_i/n_i - \hat\theta_i}{\sqrt{\hat\theta_i(1-\hat\theta_i)/n_i}} \]
It ignores the variation in the estimate of $\theta_i$.
2. Residual used in PLR
The residual used in the PLR program is the residual
scaled by the standard error of the residual:
\[ \frac{y_i/n_i - \hat\theta_i}{\text{standard error of residual}} \]
It takes into account the variation in $\hat\theta_i$. To the first
order of approximation, the square of the standard error of the
residual is given by
\[ \frac{\hat\theta_i(1-\hat\theta_i)}{n_i} - \left(\frac{\partial\theta_i}{\partial\beta}\right)^{T} I^{-1} \left(\frac{\partial\theta_i}{\partial\beta}\right) \qquad (**) \]
where
\[ \theta_i = \frac{\exp(X_i\beta)}{1+\exp(X_i\beta)} \]
and $I$ is the information matrix of the data.
The standardized residuals are approximately standard
normal.
3. Adjusted Deviance Residual
McCullagh and Nelder [1983] recommend that if $n_i$ is
small or $\hat\theta_i$ is near 0 or 1, the adjusted deviance residual
defined as follows should be used:
[equation not legible in the source]
where the sign is that of $y_i - n_i\hat\theta_i$.
A program was written to compute these residuals for
Model-A. It was found that when $n_i$ is small and $\hat\theta_i$ is close
to 0 or 1, the term
[expression not legible in the source]
dominates. As a result the residual becomes very large even
when $\hat\theta_i$ is very close to $y_i/n_i$. Thus the adjusted deviance
residual is not a very appropriate measure of model adequacy
for this data set.
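The first two residuals described above can be sketched in a few lines. This is a minimal illustration, not the PLR implementation: it assumes grouped binomial cells with a design matrix X (one row per cell), uses the usual logistic information matrix (the sum of n_i θ_i (1-θ_i) x_i x_i^T), and takes the fitted probabilities as given. The function names are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def residuals(X, y, n, theta_hat):
    """Return (Pearson residuals, standardized residuals) for grouped binomial cells."""
    p = len(X[0])
    w = [ni * t * (1 - t) for ni, t in zip(n, theta_hat)]  # n_i * theta_i * (1 - theta_i)
    info = [[sum(wi * xi[a] * xi[b] for wi, xi in zip(w, X)) for b in range(p)]
            for a in range(p)]                              # information matrix
    pearson, standardized = [], []
    for xi, yi, ni, t in zip(X, y, n, theta_hat):
        raw = yi / ni - t
        var_simple = t * (1 - t) / ni                       # variance of y_i / n_i
        d = [t * (1 - t) * v for v in xi]                   # d theta_i / d beta
        quad = sum(di * si for di, si in zip(d, solve(info, d)))
        pearson.append(raw / math.sqrt(var_simple))
        standardized.append(raw / math.sqrt(var_simple - quad))
    return pearson, standardized

# Intercept-only illustration with three cells; the fitted probability is 24/60 = 0.4.
X = [[1.0], [1.0], [1.0]]
pearson, standardized = residuals(X, [2, 9, 13], [10, 20, 30], [0.4, 0.4, 0.4])
```

Because the adjusted variance is never larger than the simple binomial variance, each standardized residual is at least as large in absolute value as the corresponding Pearson residual.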
Checking Adequacy of Models
In view of the above comparison, the standardized residual
provided by PLR was used in residual analysis for this data set.
Two residual plots were used. One plots residuals against fitted
values, $\hat\theta_i$. It can be used to check if the spread of residuals
is approximately constant and independent of $\hat\theta_i$. The other plots
ordered residuals against the quantiles of a standard normal
distribution (normal probability plot). It can be used to check
whether the residuals are approximately normal as the asymptotic
theory predicts.
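The coordinates of a normal probability plot can be computed as in the following sketch. The plotting positions (k - 0.5)/m are an assumed convention, since the text does not say which positions were used, and the residual values are made up for illustration.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair each ordered residual with a standard normal quantile."""
    ordered = sorted(residuals)
    m = len(ordered)
    # Plotting positions (k - 0.5) / m are an assumption, not taken from the source.
    quantiles = [NormalDist().inv_cdf((k - 0.5) / m) for k in range(1, m + 1)]
    return list(zip(quantiles, ordered))

# Hypothetical residuals; a point far from the line through the others suggests an outlier.
points = normal_plot_points([0.3, -1.2, 2.8, 0.1, -0.4])
```

If the residuals are approximately standard normal, the points lie close to a straight line with unit slope through the origin.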
1. Model-A
The plot of the residuals versus $\hat\theta_i$ for Model-A is shown
in Figure A.1. Except for two points in the upper left
corner, the residuals do not exhibit any significant
pattern of variation with the fitted values.
The two cells with large variances were both fathered by
parental fish and were stocked in the same pond. The
residuals of these two cells are also the only residuals
that deviate significantly from the other points in the
normal probability plot (Figure A.2).
There is no evidence that these two cells are outliers
in the sense of misrecording or misclassification. They
indicate that the model under investigation cannot remove
[text missing in the source]
Other than these two points, the residual plots show that
Model-A is adequate.
2. Model-B
The explanatory variables in this model were 'pond',
'father-type', 'individual cuckolder father' and 'individual
mother'. Since there was no parental father effect involved,
the PLR program automatically collapsed the data of individual
parental fathers, resulting in 56 covariate patterns rather
than 96. As a result only 56 residuals were computed by the
PLR program, and large errors, if they existed, in some of
the collapsed cells would not be revealed. Thus residuals
for the collapsed cells were computed individually using
the expression (**). Plots with 56 cells and 96 cells are
presented in Figures B1.1, B1.2, B2.1, B2.2.
With 56 cells, the plot of residuals versus $\hat\theta_i$ does not
demonstrate any obvious pattern. Also on the normal
probability plot, all residuals lie very close to a
straight line.
With 96 cells, there is a point which lies far away from
the rest of the points in the plot of residuals versus $\hat\theta_i$.
Also it does not lie close to a straight line as the others
do in the normal probability plot. This point is one of the
two points which have large variances in Model-A.
Other than this point, the residual plots suggest that
Model-B is also adequate.
CHAPTER V
CHECKING ADEQUACY OF MODELS - GOODNESS-OF-FIT TEST
Another powerful tool in checking the adequacy of a model is
the use of goodness-of-fit tests. Goodness of fit in this
context refers to a measure of how well the model fits the data,
that is, how well the $\theta_i$'s are modelled in the form
\[ \theta_i = \frac{\exp(X_i\beta)}{1+\exp(X_i\beta)} \]
It does not check whether the sampling distribution is a
binomial distribution or not.
In this chapter, some summary measures of goodness-of-fit
are described. Then the results of a Monte-Carlo study to assess
the quality of the chi-square approximation to the distribution
of these statistics are discussed. Lastly, the Models A and B
are checked using these goodness-of-fit statistics.
There are three summary statistics provided by the PLR
program for testing goodness-of-fit of the model to the data.
They are Hosmer's goodness-of-fit test, Brown's goodness-of-fit
test and the likelihood ratio statistic. Apart from these three
statistics, Pearson's chi-square statistic was also included in
this process of checking the adequacy of the model.
Hosmer's Goodness-of-fit Statistic
Hosmer's goodness-of-fit statistic is similar to Pearson's
chi-square statistic except that it groups the N cells into
fewer cells, say g. The cells whose predicted probabilities $\hat\theta_i$
lie between $C_{j-1}$ and $C_j$ are grouped into one cell, where
\[ 0 = C_0 < C_1 < \cdots < C_g = 1 \]
The $C_j$'s can depend on the data such that $N/g$ values of $\hat\theta_i$
fall in each interval, or the $C_j$'s can be fixed constants.
In the PLR program, g is taken to be 10 and $C_j = j/g$ for
$j = 0, 1, \ldots, g$.
The expected probability of each of these g cells is
estimated by a weighted average of the $\hat\theta_i$ in the grouped cell.
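The grouping with fixed cutpoints $C_j = j/g$ can be sketched as below. This is an illustration only: the treatment of values that fall exactly on a cutpoint, the use of cell sizes as weights, and all names and data are assumptions, not details taken from the PLR program.

```python
def group_cells(theta_hat, n, y, g=10):
    """Group cells by fitted probability into g intervals with cutpoints C_j = j / g."""
    groups = [{"n": 0, "observed": 0, "expected": 0.0} for _ in range(g)]
    for t, ni, yi in zip(theta_hat, n, y):
        # Assumed convention: half-open intervals [C_j, C_{j+1}); theta_hat = 1 joins the last group.
        j = min(int(t * g), g - 1)
        groups[j]["n"] += ni
        groups[j]["observed"] += yi
        groups[j]["expected"] += ni * t   # divide by groups[j]["n"] for the size-weighted mean
    return groups

# Hypothetical cells: fitted probabilities, cell sizes and observed successes.
groups = group_cells([0.05, 0.12, 0.17, 0.95], [10, 20, 30, 5], [1, 2, 6, 5])
```

Each non-empty group then supplies one column of observed and expected counts for the grouped contingency table.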
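Thus a 2xg contingency table can be formed:
[The table, with columns for grouped cells 1, 2, ..., g and a total, and the text that followed it are not legible in the source.]
where
\[ \frac{d}{d\delta}F(\delta, m_1, m_2) = \frac{\exp(\delta m_1)\,(1+\exp\delta)^{-(m_1+m_2)}}{B(m_1, m_2)} \]
and $B(m_1, m_2)$ is the beta function.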
This class of models includes the logistic, probit, extreme
minimum value, extreme maximum value, double exponential,
exponential and reflected exponential models.
Brown's test assumes that $X_i\beta$ is correct but that the link
function g might be wrong.
It uses a score test in which the test statistic
asymptotically has a chi-square distribution with two degrees of
freedom. Readers are referred to a paper by Prentice [1976].
Likelihood Ratio Statistic
The likelihood ratio statistic is defined as
\[ G^2 = 2\sum_{i}\left[ y_i \log\frac{y_i}{n_i\hat\theta_i} + (n_i - y_i)\log\frac{n_i - y_i}{n_i(1-\hat\theta_i)} \right] \]
where $y_i$ is the number of successes in cell i of size $n_i$,
and $\hat\theta_i$ is the predicted probability of success in cell i.
$G^2$ is twice the difference between the maximum log
likelihood achievable and that achieved by the model under
investigation.
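A direct computation of the likelihood ratio statistic over a set of cells might look like the sketch below. It assumes the usual binomial form of the statistic, with terms of the form 0 log 0 taken as zero; the function name and the example values are hypothetical.

```python
import math

def likelihood_ratio_statistic(y, n, theta_hat):
    """G^2 = 2 * sum[ y log(y / (n theta)) + (n - y) log((n - y) / (n (1 - theta))) ]."""
    total = 0.0
    for yi, ni, t in zip(y, n, theta_hat):
        if yi > 0:                      # 0 * log(0) is taken as 0
            total += yi * math.log(yi / (ni * t))
        if yi < ni:
            total += (ni - yi) * math.log((ni - yi) / (ni * (1 - t)))
    return 2.0 * total

# A sparse cell with no successes still contributes to the statistic.
g2_sparse = likelihood_ratio_statistic([0], [3], [0.2])
g2_perfect = likelihood_ratio_statistic([2, 5], [4, 10], [0.5, 0.5])
```

A perfectly fitted cell contributes nothing, while sparse cells with fitted probabilities near 0 or 1 can contribute heavily.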
As discussed in the previous chapter, the PLR program
automatically collapses cells based on the parameters specified
in the model. The likelihood ratio statistic provided by the
program is then based on the collapsed data. The statistic thus
computed may have a better chi-square approximation when there
are many sparse cells. However, it may not be able to reflect
the true structure of the data. Very different observed
probabilities of success may be averaged out and thus important
information may be lost. Therefore a program has been written to
obtain the likelihood ratio statistic (and also Pearson's
chi-square statistic) based on all 96 cells, irrespective of
the parameters in the model.
The likelihood ratio statistic is distributed asymptotically
as a chi-square variate with degrees of freedom M-p where p is
the number of unknown parameters. When the statistic is computed
using the collapsed data, M is the number of effective covariate
patterns. Based on the uncollapsed data, M is the number of
non-empty cells.
Pearson's Chi-square Statistic
Pearson's chi-square statistic is defined as
\[ X^2 = \sum_{i} \frac{(y_i - n_i\hat\theta_i)^2}{n_i\hat\theta_i(1-\hat\theta_i)} \]
where $y_i$, $n_i$ and $\hat\theta_i$ are the same as for the likelihood ratio
statistic.
The asymptotic distribution of $X^2$ is chi-square with degrees
of freedom M-p, the same as for $G^2$.
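Pearson's statistic is a sum of per-cell terms, so the share of the statistic due to any one cell is easy to inspect. The sketch below assumes the usual binomial form of the statistic; the data are hypothetical and the function name is not from the source.

```python
def pearson_contributions(y, n, theta_hat):
    """Per-cell terms (y - n theta)^2 / (n theta (1 - theta)) of Pearson's X^2."""
    return [(yi - ni * t) ** 2 / (ni * t * (1 - t))
            for yi, ni, t in zip(y, n, theta_hat)]

# Hypothetical cells with a common fitted probability of 0.4.
terms = pearson_contributions([2, 9, 13], [10, 20, 30], [0.4, 0.4, 0.4])
x2 = sum(terms)
shares = [term / x2 for term in terms]   # fraction of X^2 due to each cell
```

Here the first cell accounts for most of the statistic, which is the kind of cell one would go back and check in the original data.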
Monte-Carlo Study of Quality of Chi-square Approximation
All four goodness-of-fit statistics are approximated
asymptotically by a chi-square distribution. Thus it is
desirable to assess the quality of the approximation using
Monte-Carlo simulation.
As described in Chapter 2, 500 samples were simulated using
the explanatory variables 'pond', 'father-type', 'individual
cuckolder father'. Another 200 samples were simulated using the
same set of explanatory variables together with 'individual
mother' effect. All four goodness-of-fit statistics were
computed and a nominal level of 0.05 was used to assess the
quality of the chi-square approximation.
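The idea of such a study can be illustrated with a much simplified sketch. Unlike the study described above, it does not refit a logistic model to each simulated sample: the success probability is treated as known, so Pearson's statistic is compared with a chi-square distribution whose degrees of freedom equal the number of cells. The cell sizes, probability, number of replications and seed are all hypothetical.

```python
import random

CRITICAL_5PCT_DF10 = 18.307   # upper 5% point of chi-square with 10 degrees of freedom

def simulate_level(theta, n_per_cell, cells=10, replications=2000, seed=1):
    """Share of replications in which Pearson's X^2 exceeds the nominal 5% critical value."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replications):
        x2 = 0.0
        for _ in range(cells):
            # One binomial draw, built from n_per_cell Bernoulli trials.
            y = sum(rng.random() < theta for _ in range(n_per_cell))
            x2 += (y - n_per_cell * theta) ** 2 / (n_per_cell * theta * (1 - theta))
        if x2 > CRITICAL_5PCT_DF10:
            rejections += 1
    return rejections / replications

level = simulate_level(theta=0.3, n_per_cell=20)
print(f"observed level at nominal 5%: {level:.3f}")
```

An observed level close to 0.05 indicates that the chi-square approximation is adequate in the simulated setting; a markedly larger or smaller level indicates that it is not.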
The results of the Monte-Carlo study are summarized in Table 4
below.
Table 4
Simulation Results of Goodness-of-fit Statistics
Observed Significance Level at a Nominal 5% Level

GOODNESS-OF-FIT STATISTIC                      SIMULATION I    SIMULATION II
Hosmer's Statistic                             (not legible)   0.2%
Brown's Statistic                              5.0%            5.4%
Pearson's Statistic
  Based on Collapsed Cells                     5.5%            4.4%
  Based on 96 Cells                            6.5%            5.2%
Likelihood Ratio Statistic
  Based on Collapsed Cells    Original         9.0%            9.4%
                              Corrected        0.5%            1.8%
  Based on 96 Cells           Original         19.5%           27.4%
                              Corrected        0%              0.2%
1. Hosmer's Statistic
Results in Table 4 indicate that Hosmer's statistic
is not a sensitive test for this data set. However it must
be noted that the PLR program assumes the statistic is
approximately chi-square distributed with g-2 degrees of
freedom. It ignores the fact that when the number of
parameters p is greater than g, the first term in the
expression for the asymptotic distribution of Hosmer's
statistic, $\chi^2(g-p)$, is meaningless.
The model used to simulate the first 200 samples has 14
parameters, which exceeds the value of 10 set for g by the PLR
program. Thus these 200 simulated data sets cannot be used to
judge the quality of Hosmer's statistic.
The model used to simulate the second 500 samples has 10
parameters. Hosmer's test rejects the true model only once
out of 500 times when the significance level is set to be
0.05. Thus Hosmer's statistic as provided by the PLR program
should not be used to test models with 10 or more
parameters.
2. Brown's Statistic
Brown's statistic achieved the desired level in the
Monte-Carlo study. At a nominal level of 0.05, it rejected
the true model 10 times out of the first 200 simulated
samples and 27 times out of the second 500 simulated
samples. The results based on this small-scale Monte-Carlo
study showed that the chi-square approximation to Brown's
statistic was very good. Thus it is an appropriate statistic
for checking the adequacy of a model.
3. Pearson's Statistic
Based on the collapsed cells and at a nominal level of
5%, Pearson's chi-square test rejected the true model 11
times out of the first 200 simulated samples and 22 times
out of the second 500 simulated samples. Based on all 96
cells, the observed significance levels were 6.5% and 5.2%
in Simulations I and II respectively. Expected cell sizes as
small as 0.2 did not seem to jeopardize its performance in
this Monte-Carlo study. Thus the general guideline suggested
by some statisticians, that Pearson's statistic be used only
when the minimal expected cell size exceeds 5, may be too
conservative in this context.
4. Likelihood Ratio Statistic
Based on the collapsed cells and at a significance level
of 5%, the likelihood ratio test rejected the true model 9%
of the time in the first 200 simulated data sets and 9.4% of
the time in the second 500 simulated data sets. Based on all
96 cells, the respective observed significance levels were
19.5% and 27.4%.
It has been suggested in McCullagh and Nelder [1983]
that the chi-square approximation of the likelihood ratio
statistic can be improved by means of a first-order
correction term $(1+c)^{-1}$ where
[equation not legible in the source]
and N, p, $n_i$, $\hat\theta_i$ are defined as before.
This correction was included in the Monte-Carlo study. Based
on collapsed cells, the corrected likelihood ratio statistic
rejected the true model only 0.5% and 1.8% of the time in Simulations I
and II respectively. Based on all 96 cells, the respective
observed significance levels were 0% and 0.2%. The results
suggest that the corrected statistic is not appropriate for
analysing this data set.
5. Summary of Simulation Results
The following is a summary of the results of the
Monte-Carlo study. Hosmer's statistic should not be used
when the number of parameters in the model is 10 or more.
The PLR program should be able to adjust the value of g for
different models. Brown's goodness-of-fit statistic and
Pearson's chi-square statistic are well approximated by the
chi-square distribution. They provide reliable tests for
checking adequacy of a model. On the other hand, the
likelihood ratio statistic tends to reject a model more than
it should. Its p-value tends to be understated.
Unfortunately the correction factor does not seem to improve
the quality of the chi-square approximation. Thus the
likelihood ratio statistic should be used with caution.
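The rejection counts quoted above can be turned into observed levels and compared with the nominal 5% level. The sketch below uses the binomial standard error under the nominal level as a rough yardstick; that yardstick is an added assumption, not part of the original study, and the function name is hypothetical.

```python
import math

def observed_level(rejections, samples, nominal=0.05):
    """Observed rejection rate and its distance from the nominal level,
    measured in binomial standard errors computed under the nominal level."""
    rate = rejections / samples
    se = math.sqrt(nominal * (1 - nominal) / samples)
    return rate, (rate - nominal) / se

# Rejection counts reported in the text.
counts = [("Brown, first 200 samples", 10, 200),
          ("Brown, second 500 samples", 27, 500),
          ("Pearson (collapsed), first 200 samples", 11, 200),
          ("Pearson (collapsed), second 500 samples", 22, 500),
          ("Hosmer, second 500 samples", 1, 500)]
for name, rejected, total in counts:
    rate, z = observed_level(rejected, total)
    print(f"{name}: {rate:.1%} (z = {z:+.2f})")
```

Values of z within about two standard errors of zero are consistent with the nominal level; the Hosmer count falls far below that range.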
Checking Adequacy of Models
Taking into consideration the results of the Monte-Carlo
study, the various goodness-of-fit statistics are now used to
check the Models A and B.
1. Model-A
Brown's goodness-of-fit statistic was not significant
(p-value = 0.65). This indicates that with the same set of
explanatory variables, a logistic model is appropriate
relative to the class of models with density given on Page
33.
Hosmer's goodness-of-fit statistic provided by the PLR
program could not be used because the number of parameters
is 54 which is much larger than the value of g of 10 set by
the PLR program.
The likelihood ratio statistic had a p-value of 0.08
suggesting that the logistic model with 54 parameters fits
the data.
Pearson's chi-square statistic was significant (p-value
= 0.03). Individual cells' contributions to the Pearson's
statistic were examined. The two cells with large errors
that were identified in the residual analysis account for
nearly 50% of the statistic. If these two cells were
ignored, the Pearson's statistic would have been
insignificant.
To conclude, other than the two cells with large errors,
Model-A can be considered adequate.
2. Model-B
Brown's goodness-of-fit statistic was again not
significant (p-value = 0.16), suggesting the logistic model
is appropriate.
Hosmer's goodness-of-fit statistic could not be used to
test this model either because the number of parameters,
14, is greater than the number of grouped cells, 10.
The likelihood ratio statistic provided by the PLR
program was based on collapsed cells (56) and had a p-value
of 0.04. This suggests the model does not fit the data
adequately. The corresponding Pearson's goodness-of-fit
statistic had a p-value of 0.15.
When these statistics were computed using 96 cells, they
were both very significant. Both p-values were about 0.001
with degrees of freedom of 73.
It may appear that Model-B is not adequate since
Pearson's statistic was significant. However, a careful
examination of individual cells showed that the cell
detected in the residual analysis to have a large error
accounted for 22% of Pearson's statistic. Ignoring this
cell, the Pearson's statistic would be insignificant.
The likelihood ratio statistic also had a highly
significant p-value of 0.001. The Monte-Carlo study results
showed that its p-value tends to be understated. The
observed significance level was 19.5% as compared to the
expected level of 5%. At a 0.001 significance level, the
observed level was 0.005. Moreover, the cell that had large
error accounted for "7% of the statistic. Thus there is
evidence that the Model-B under study is adequate as
suggested by Brown's and Pearson's statistics.
Comparison of Model-A and Model-B
From residual analyses and goodness-of-fit tests performed,
it was found that Model-A and Model-B were both adequate.
Model-A has a larger improvement in log-likelihood than Model-B.
Also by examining the plots of the observed versus the predicted
proportion precocious in Figures A.3, B1.3 and B2.3, Model-A
produced points which lie more closely to a straight line than
Model-B. Thus Model-A resulted in a closer fit to the data but
at the expense of using 54 parameters whereas Model-B is more
parsimonious with 14 parameters. They were both accepted and
tests were carried out on each of them.
CHAPTER VI
TESTS
Using the same notation as in Chapter 3, Model-A is:
[equation not legible in the source]
and Model-B is:
[equation not legible in the source]
If $\alpha_1$ and $\alpha_2$ denote parental and cuckolder father-types
respectively, then $\beta_{k(1)} = 0$ for all k because there is no
individual parental father effect.
The following tests are performed individually for each
model.
1. Estimation of Ratio of Odds of being Precocious given Father was a Cuckolder to Odds given Father was a Parental
One of the main objectives of the analysis is to test
whether cuckolder fathers produce more precocious sons than
parental fathers. This is equivalent to determining whether
the odds of being precocious given that the father was a
cuckolder is significantly higher than the odds of being
precocious given that the father was a parental. Let
\[ \omega_{ijkl} = \frac{\theta_{ijkl}}{1-\theta_{ijkl}} \]
Then $\omega_{ijkl}$ is the odds of being precocious for a male of
father k of father-type i and of mother l in pond j. Also
let
[equation defining $\omega_{i\cdot\cdot\cdot}$ in terms of the $\omega_{ijkl}$ is not legible in the source]
Then $\omega_{1\cdot\cdot\cdot}$ is the odds of being precocious given father was
a parental. Similarly $\omega_{2\cdot\cdot\cdot}$ is the odds of being precocious
given father was a cuckolder.
For both Models,
\[ \frac{\omega_{2\cdot\cdot\cdot}}{\omega_{1\cdot\cdot\cdot}} = \exp(2\alpha_2) \]
a. Model-A
An estimate of $\alpha_2$ is $\hat\alpha_2 = 0.4154$. Thus the ratio
$\omega_{2\cdot\cdot\cdot}/\omega_{1\cdot\cdot\cdot}$ is estimated to be 2.30. In other words the
odds of becoming precocious are approximately 2.3 times
as high with a cuckolder father as with a parental
father.
An approximate 95% confidence interval for $\alpha_2$ is
given by
\[ \hat\alpha_2 \pm t_{0.975,\,df}\ \mathrm{s.e.}(\hat\alpha_2) \]
where df = 96-9-54 = 33,
and s.e.($\hat\alpha_2$) is the estimated standard error of $\hat\alpha_2$.
Thus an approximate 95% confidence interval for
$\omega_{2\cdot\cdot\cdot}/\omega_{1\cdot\cdot\cdot}$ is (1.27, 4.15). Since it does not include
1, it can be concluded that cuckolder males produce more
precocious progeny than parental males.
b. Model-B
The ratio $\omega_{2\cdot\cdot\cdot}/\omega_{1\cdot\cdot\cdot}$ is estimated to be 2.37 and
its approximate 95% confidence interval is given by
(1.50, 3.74). Since the confidence interval does not
include 1, it can be concluded that cuckolder males
produce more precocious progeny than parental males.
Compared with the confidence interval from
Model-A, this one is tighter because of a smaller standard
error and larger degrees of freedom.
c. Summary of Results
Results of both models suggest that cuckolder males
produce more precocious progeny than parental males.
Thus there is evidence that precocial maturity in the
bluegill sunfish is genetically inherited.
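The reported figures can be checked with a little arithmetic. The sketch below assumes that the odds ratio equals exp(2 x the cuckolder father-type parameter), as would follow from father-type parameters that sum to zero; this is an inference from the reported numbers (exp(2 x 0.4154) is about 2.30), not a formula legible in the source. It also uses the fact that an interval obtained by exponentiating a symmetric interval has the point estimate as the geometric mean of its end points.

```python
import math

alpha2_hat = 0.4154                       # Model-A estimate reported in the text
ratio_a = math.exp(2 * alpha2_hat)        # assumption: ratio = exp(2 * alpha_2)

# Reported 95% intervals for the ratio of odds.
lower_a, upper_a = 1.27, 4.15             # Model-A
lower_b, upper_b = 1.50, 3.74             # Model-B

# Back-transform the Model-A interval to the scale of alpha_2.
alpha_lower, alpha_upper = math.log(lower_a) / 2, math.log(upper_a) / 2
midpoint = (alpha_lower + alpha_upper) / 2
half_width = (alpha_upper - alpha_lower) / 2

# Geometric means of the end points should reproduce the point estimates.
centre_a = math.sqrt(lower_a * upper_a)
centre_b = math.sqrt(lower_b * upper_b)
print(round(ratio_a, 2), round(midpoint, 4), round(half_width, 3))
```

The half-width on the parameter scale is the product of the quantile and the standard error used in the interval, so either can be recovered once the other is known.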
2. Ranking Individual Cuckolder Fathers
For the purpose of identifying 'high lines' and 'low
lines' of cuckolder fathers for future breeding experiments,
it is desirable to rank individual cuckolder fathers by
their contribution to the odds of being precocious.
The estimated value of the parameter $\beta_{k(2)}$ can be
regarded as a measure of the contribution of the k-th
cuckolder father to the odds of being precocious.
To determine whether a cuckolder father is statistically
different from another in terms of their contribution to the
odds of being precocious, Bonferroni confidence intervals
are calculated.
a. Model-A
By comparing the estimated $\beta_{k(2)}$'s, the individual
cuckolder fathers can be ranked, from largest to
smallest:
[ranking not legible in the source]
From the 90% Bonferroni confidence intervals
presented in Table 5, we can conclude that C5 is
different from C3 and C6 is different from C3.
b. Model-B
Similarly by comparing the $\beta_{k(2)}$'s estimated for
Model-B, the individual cuckolder fathers can be ranked,
from largest to smallest:
[ranking not legible in the source]
In this model, C4 ranks higher than C2 while C2 ranks
higher than C4 in Model-A.
The 90% Bonferroni confidence intervals in Table 6
suggest that no pairs of cuckolder fathers are
significantly different. To have some indication of
difference among the cuckolder fathers, the 80%
Bonferroni confidence intervals were also calculated. This time, C5
is significantly different from C3.
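A Bonferroni interval for one pairwise difference splits the overall error rate equally over all pairs being compared. The sketch below is illustrative only: the number of fathers, the estimated difference and its standard error are hypothetical, and a normal quantile is used where the original analysis may well have used a t quantile.

```python
from statistics import NormalDist

def bonferroni_interval(diff, se, groups, family_confidence=0.90):
    """Interval for one pairwise difference, with the error rate split over all pairs."""
    pairs = groups * (groups - 1) // 2
    alpha_each = (1 - family_confidence) / pairs
    z = NormalDist().inv_cdf(1 - alpha_each / 2)   # normal quantile (an assumption)
    return diff - z * se, diff + z * se

# Hypothetical values: 5 fathers, estimated difference 1.2 with standard error 0.4.
low, high = bonferroni_interval(1.2, 0.4, groups=5)
```

Two fathers are declared different when the interval for their difference excludes zero. Lowering the family confidence from 90% to 80% narrows every interval, which is why more pairs can appear different at the 80% level.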
3. Ranking Individual Mothers
For future breeding purposes, the individual mothers
were also ranked according to their contribution to the odds of
being precocious.
The sum of the parameters $\tau_j$ and $\gamma_{l(j)}$ is used as a
measure of the contribution to the odds of being precocious of
the l-th mother in the j-th pond. By comparing the estimated
values of these parameters, both models yield the same
ranking from largest to smallest:
[ranking not legible in the source]
The Bonferroni confidence intervals are also calculated to
determine whether two mothers are statistically different.
a. Model-A
The 90% Bonferroni intervals in Table 7 show that M1
is significantly different from M7, M4, M6, M5 and M3
whereas M3 is significantly different from M2, M8, M7
and M4.
b. Model-B
The 90% Bonferroni intervals in Table 8 show that M3
is significantly different from M1, M2, M8 and M7.
From the results in ranking of cuckolder fathers and
mothers, Model-A seems to suggest more variation within
the fathers and the mothers than Model-B.
CHAPTER VII
ANALYSIS OF ENVIRONMENTAL EFFECTS
With either Model-A or Model-B, the environmental effect,
namely the 'pond' effect, is a very important effect. In this
chapter, it is further analysed.
Three candidates are explored as agents for the 'pond'
effect:
1. the density of fish in the pond,
2. the sex ratio of fish in the pond in terms of number of
males per 100 fish, and
3. growth rate due to the pond environment.
A true density variable is not available as the density of
fish in the pond changes with time. However, since the four
ponds are approximately of the same size, initial and final
populations for each pond can be used as substitutes for
density. Moreover it is likely that most of the mortality in the
pond occurred shortly after the initial stocking. Therefore the
final population may reflect the true densities in the ponds.
Previous analysis by Dr. M. Gross has indicated that
precocious sons are significantly larger than other offspring at
the age of two. Variables such as average body length of all
fish in a pond cannot be used as a measure of growth rate due to
the pond environment. This is because ponds with a large
proportion of precocious sons would be expected to have a longer
average body length. However, by assuming that the effect of the
environment on how fast fish grow is the same for both males and
females, the average body length of females in a pond can be
used as a measure of growth rate for that pond. The average body
lengths of females in the four ponds were 55.50 cm., 55.97 cm.,
59.12 cm. and 54.87 cm. respectively.
For each of the Models A and B, the 'pond' effect is taken
out of the model, and replaced by 'initial population', 'final
population', 'sex ratio' and 'growth rate' one at a time.
The results for the two models are similar. 'Final
population' is found to be the most significant effect. With
'final population' in the model, are the other three factors
needed in the model? Adding in 'initial population', 'sex ratio'
or 'growth rate' after 'final population' is in the model
results in no significant improvement. But adding in 'final
population' after any of these three effects is in the model
results in a very significant improvement. Therefore, 'final
population' should definitely be in the model and there is no point in
including any of the other three effects.
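The comparison of a model with and without one additional covariate can be sketched as follows. The sketch assumes that each candidate enters as a single parameter, so that the improvement chi-square has one degree of freedom; the log-likelihood values are hypothetical.

```python
import math

def improvement_p_value(loglik_small, loglik_large):
    """Improvement chi-square and its p-value for one added parameter (1 degree of freedom).
    The larger model must contain the smaller one."""
    statistic = 2.0 * (loglik_large - loglik_small)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return statistic, math.erfc(math.sqrt(statistic / 2.0))

# Hypothetical log-likelihoods, for illustration only.
stat, p = improvement_p_value(-250.0, -248.079)
print(f"improvement chi-square = {stat:.3f}, p-value = {p:.3f}")
```

Running the comparison in both orders, as described above, shows whether a covariate adds anything once the other is already in the model.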
The result suggests that whether a male becomes precocious
or not depends on the number of fish in the pond, but does not
depend on how fast it grows or the sex ratio in the pond.
Since the estimated final population parameter in both
models is negative, the odds of being precocious increase as the
final population decreases.
CHAPTER VIII
CONCLUSION
Linear logistic regression is an appropriate model for
analysing this data set. The PLR program of the BMDP Statistical
Software is a powerful program for logistic models. However it
has the drawback of collapsing cells, sometimes undesirably.
Moreover, in calculating Hosmer's goodness-of-fit statistic,
it should be able to adjust the number of grouped cells, g, to
cater for different models.
From the results of the analysis of the data set, it is
found that precocial maturity in the bluegill sunfish is
genetically inherited, that is, cuckolder fathers tend to
produce more precocious sons than parental fathers. Findings
show that maternal and paternal genes are important and perhaps
maternal and paternal genetic effects interact. Environmental
effects exist and there is evidence that the odds of being
precocious increase as the final population decreases. However
there is no significant evidence for environmental crossed
genetic interaction.
The small scale Monte-Carlo studies suggest that the
improvement chi-square statistic is a good criterion for
selecting variables when the number of additional parameters is
small. However it tends to include too many high order
interactions because the chi-square approximation to its
distribution is not good when the number of degrees of freedom
is large.
Hosmer's goodness-of-fit statistic provided by the PLR
program of the BMDP Statistical Software is not sensitive when
the number of parameters is large. Brown's goodness-of-fit
statistic and Pearson's chi-square statistic are well-behaved.
On the other hand, the likelihood ratio statistic tends to
reject a true model more than it should.
Thus the improvement chi-square statistic and the
goodness-of-fit statistics should be used with caution. The
appropriateness of these statistics is likely to depend on the
data set and the complexity of the model. It is advisable to
perform Monte-Carlo studies, whenever possible, to investigate
the appropriateness of these statistics to the data set being
analysed.
Figure A.1. Plot of Residual versus Predicted Proportion Precocious, Model-A (horizontal axis: predicted proportion precocious)
Figure A.2. Normal Probability Plot, Model-A (horizontal axis: quantile of normal distribution)
Figure A.3. Plot of Observed versus Predicted Proportion Precocious, Model-A (horizontal axis: predicted proportion precocious, 0.0 to 1.00)
Figure B1.1. Plot of Residual versus Predicted Proportion Precocious, Model-B (Collapsed Cells) (horizontal axis: predicted proportion precocious)
Figure B1.2. Normal Probability Plot, Model-B (Collapsed Cells) (horizontal axis: quantile of normal distribution)
Figure B1.3. Plot of Observed versus Predicted Proportion Precocious, Model-B (Collapsed Cells) (horizontal axis: predicted proportion precocious)
Figure B2.1. Plot of Residual versus Predicted Proportion Precocious, Model-B (96 Cells) (horizontal axis: predicted proportion precocious)
Figure B2.2. Normal Probability Plot, Model-B (96 Cells) (horizontal axis: quantile of normal distribution)
Figure B2.3. Plot of Observed versus Predicted Proportion Precocious, Model-B (96 Cells) (horizontal axis: predicted proportion precocious)
[The plots themselves are not reproduced in the source.]
Table 1.1. Number of Fish Survived Including Unsacrificed (columns: Mother 1 to 8, Total)
Table 1.2. Number of Fish Survived and Sacrificed (columns: Mother 1 to 8, Total)
Table 1.3. Number of Female Fish Sacrificed (columns: Mother 1 to 8, Total)
Table 1.4. Number of Male Fish Sacrificed (columns: Mother 1 to 8, Total)
Table 1.5. Number of Precocious Males Sacrificed (columns: Mother 1 to 8, Total)
Table 5. Ranking of Individual Cuckolder Fathers: Model-A, 90% Bonferroni Confidence Intervals
Table 6. Ranking of Individual Cuckolder Fathers: Model-B, Bonferroni Confidence Intervals (90% Confidence and 80% Confidence)
Table 7. Ranking of Individual Mothers: Model-A, 90% Bonferroni Confidence Intervals
Table 8. Ranking of Individual Mothers: Model-B, 90% Bonferroni Confidence Intervals
[The entries of these tables are not legible in the source.]
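APPENDIX A
To Find Variance of Residual used in PLR
Using the same notation as before, the variance of the i-th
residual used in PLR is
[equation not legible in the source]
Since
\[ \theta_i = \frac{\exp(X_i\beta)}{1+\exp(X_i\beta)} \quad\text{and}\quad \hat\theta_i = \frac{\exp(X_i\hat\beta)}{1+\exp(X_i\hat\beta)}, \]
by Taylor's Theorem,
[equation not legible in the source]
Therefore,
[equation not legible in the source]
and
[equation not legible in the source]
The log of the likelihood of $y_1, \ldots, y_N$ is
\[ \log L = \text{constant} + \sum_{i=1}^{N} y_i \log\theta_i + \sum_{i=1}^{N} (n_i - y_i)\log(1-\theta_i). \]
Again by Taylor's Theorem,
[equation not legible in the source]
where the information matrix, $I = -E[\partial U(\beta)/\partial\beta] = -(\partial U(\beta)/\partial\beta)$,
and finally,
[equation not legible in the source]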
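The steps of this appendix can be sketched as follows. This is the standard first-order (delta-method) argument for independent binomial cells, written with $d_i = \partial\theta_i/\partial\beta$ and score vector $U(\beta)$, with $X_i$ a row vector. It is offered as a plausible reconstruction that ends at the expression for the squared standard error of the residual given in Chapter IV, not as a transcription of the missing equations.

```latex
\begin{align*}
\operatorname{Var}\!\left(\frac{y_i}{n_i}-\hat\theta_i\right)
  &= \operatorname{Var}\!\left(\frac{y_i}{n_i}\right) + \operatorname{Var}(\hat\theta_i)
     - 2\operatorname{Cov}\!\left(\frac{y_i}{n_i},\hat\theta_i\right), \\
\hat\theta_i &\approx \theta_i + d_i^{T}(\hat\beta-\beta), \qquad
  d_i = \frac{\partial\theta_i}{\partial\beta} = \theta_i(1-\theta_i)X_i^{T}, \\
\hat\beta-\beta &\approx I^{-1}U(\beta), \qquad
  U(\beta) = \sum_{j=1}^{N} X_j^{T}(y_j-n_j\theta_j), \qquad
  I = \sum_{j=1}^{N} n_j\theta_j(1-\theta_j)X_j^{T}X_j, \\
\operatorname{Var}(\hat\theta_i) &\approx d_i^{T}I^{-1}d_i, \\
\operatorname{Cov}\!\left(\frac{y_i}{n_i},\hat\theta_i\right)
  &\approx d_i^{T}I^{-1}\operatorname{Cov}\!\left(U(\beta),\frac{y_i}{n_i}\right)
   = d_i^{T}I^{-1}X_i^{T}\theta_i(1-\theta_i) = d_i^{T}I^{-1}d_i, \\
\operatorname{Var}\!\left(\frac{y_i}{n_i}-\hat\theta_i\right)
  &\approx \frac{\theta_i(1-\theta_i)}{n_i} - d_i^{T}I^{-1}d_i .
\end{align*}
```

The covariance term is twice the size of the variance of the fitted probability, which is why the adjustment is subtracted from, rather than added to, the simple binomial variance.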
BIBLIOGRAPHY
Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W. [1975]. Discrete Multivariate Analysis, MIT Press, Cambridge, MA.
Cox, D.R. [1970]. The Analysis of Binary Data, Methuen, London.
Cox, D.R. and Snell, E.J. [1968]. "A General Definition of Residuals", J. R. Statist. Soc., B, 30, 248-275.
Fienberg, S.E. [1980], The Analysis of Cross-Classified Data, 2nd Ed., MIT Press, Cambridge, MA.
Haberman, S.J. [1974]. The Analysis of Frequency Data, University of Chicago Press, Chicago.
Hosmer, D.W. and Lemeshow, S. [1980]. "Goodness of Fit Tests for the Multiple Logistic Regression Model", Commun. Statist. - Part A Theor. Meth. A9(10), 1043-1069.
Larntz, K. [1978]. "Small Sample Comparisons of Exact Levels for Chi-Squared Goodness-of-Fit Statistics", J. Amer. Statist. Assoc., 73, 253-263.
McCullagh, P. and Nelder, J.A. [1983]. Generalized Linear Models, Chapman and Hall, London.
Plackett, R.L. [1981]. The Analysis of Categorical Data, Charles Griffin, London.
Prentice, R.L. [1976]. "A Generalization of the Probit and Logit Methods for Dose Response Curves", Biometrics, 32, 761-768.
UCLA [1983]. BMDP Statistical Software, University of California Press.