
Behavior Genetics, Vol. 33, No. 3, May 2003 (© 2003)

Markov Chain Monte Carlo Approaches to Analysis of Genetic and Environmental Components of Human Developmental Change and G × E Interaction

Lindon Eaves1,3 and Alaattin Erkanli2

Received 12 Sept. 2001—Final 10 Oct. 2002

The linear structural model has provided the statistical backbone of the analysis of twin and family data for 25 years. A new generation of questions cannot easily be forced into the framework of current approaches to modeling and data analysis because they involve nonlinear processes. Maximizing the likelihood with respect to parameters of such nonlinear models is often cumbersome and does not yield easily to current numerical methods. The application of Markov Chain Monte Carlo (MCMC) methods to modeling the nonlinear effects of genes and environment in MZ and DZ twins is outlined. Nonlinear developmental change and genotype × environment interaction in the presence of genotype-environment correlation are explored in simulated twin data. The MCMC method recovers the simulated parameters and provides estimates of error and latent (missing) trait values. Possible limitations of MCMC methods are discussed. Further studies are necessary to explore the value of an approach that could extend the horizons of research in developmental genetic epidemiology.

KEY WORDS: Growth curves; Bayesian inference; Gibbs sampling; Markov Chain Monte Carlo methods; twins; longitudinal studies; G × E interaction; hierarchical mixed models.

0001-8244/03/0500-0279/0 © 2003 Plenum Publishing Corporation

INTRODUCTION

From relatively modest and controversial beginnings (Jinks and Fulker, 1970), the last quarter-century has seen the emergence of linear structural modeling as little short of an industry in behavioral genetics and genetic epidemiology, to the point where it has supplanted almost all other approaches to the analysis of family resemblance. The reasons for its success are fairly clear. The approach is flexible, allowing for a very wide range of models to be specified for the effects of genes and environment. Models have been developed and tested for biological and cultural inheritance, various patterns of assortative mating, and developmental change in longitudinal data. One major factor accounting for its appeal has been the ready extension of models for family resemblance to incorporate multivariate measures (Martin and Eaves, 1977). On the assumption of underlying multivariate normality (i.e., probit regression), the approach can be further extended to encompass dichotomous (Fulker, 1973) and other categorical data (Eaves et al., 1978) in a rich variety of complex patterns (Neale and Kendler, 1995).

The widespread adoption of structural modeling techniques has been further facilitated by the development and dissemination of remarkably flexible, well-supported, robust, efficient and user-friendly software such as the Mx package (Neale et al., 1999). Over the last decade, structural modeling has provided the unifying platform for teaching the analysis of family resemblance at a widely subscribed series of international workshops that are now credited with more than 500 alumni.

1 Virginia Institute for Psychiatric and Behavioral Genetics, Department of Human Genetics, Virginia Commonwealth University.

2 Department of Biostatistics and Bioinformatics, Duke University Medical Center.

3 To whom correspondence should be addressed at Virginia Institute for Psychiatric and Behavioral Genetics, PO Box 980003, Virginia Commonwealth University, Richmond, Virginia 23298-0003.

The structural modeling approach is typically based on maximizing the likelihood of the data with respect to the parameters of one or more theoretical models. Thus, the approach yields tests of goodness of fit and standard errors of parameter estimates, minimizing, though not removing, one element of subjectivity in deciding between alternative hypotheses.

The transformation of many areas of statistical genetic research accomplished by these approaches over the last three decades has been remarkable and due in no small measure to the parallel development of computer hardware and software for numerical analysis that has rendered as matters of routine analyses that were scarcely conceivable 30 years ago. In few areas has the impact of these developments been greater than in behavioral and psychiatric genetics.

When a method works well and is being productive, there is a danger that limitations may be ignored and significant scientific questions deferred because there are many others that can be answered with the existing methods.

Nonlinear models incorporating random genetic and environmental effects comprise one significant arena in which there are serious scientific questions vying for the attention of behavioral geneticists. Examples of issues requiring such models are the analysis of nonlinear developmental change (e.g., genetic differences in growth patterns) and non-additive effects (e.g., epistasis and G × E interaction). Such processes cannot readily be forced into the Procrustean bed of the linear structural models that have, historically, proved so productive.

Although it is a relatively easy matter to write the likelihood functions associated with such nonlinear models (see, e.g., Eaves et al., 1986), obtaining parameter estimates has been a laborious process because the likelihood involves integration over the unknown random effects in the model. For example, in a (multivariate) nonlinear model, such as a survival model or growth curve model, the likelihood of a twin pair depends on the values of the random genetic and environmental effects of the individual twins for all the parameters of the growth curve model (e.g., initial value, asymptote, slope, inflection point). The likelihood of the pair requires that the likelihood of the phenotype, given particular (unknown) genetic and environmental effects, be integrated over all values of the latent genetic and environmental variables. This integration has as many dimensions as there are latent genetic and environmental effects in the model. Thus, a four-parameter growth curve model in which the growth parameters of twins are each influenced by separate genetic and environmental factors requires integration over 4 × 2 × 2 = 16 dimensions (four growth parameters, two sources of variation, and two twins). Even though this is numerically tractable in simple cases through methods such as Gaussian quadrature, it rapidly becomes tedious in practically important contexts, such as the analysis of nonlinear growth curves in twins, because of the relatively large number of dimensions and the lack of robust, accurate and relatively general adaptive procedures for numerical integration. The problem of multidimensional integration is encountered in the context of hierarchical mixed models in which there are random individual differences in parameters of a nonlinear model for a response variable (e.g., Lindstrom and Bates, 1990). Even in those cases where estimates can be obtained numerically, the additional computation required to obtain estimates of confidence intervals is prohibitive.
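To make the dimensionality argument concrete, the following minimal Python sketch applies Gauss-Hermite quadrature to a one-dimensional normal random effect and then counts the function evaluations a full tensor-product rule would need in 16 dimensions. The helper names and the choice of 10 nodes per dimension are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def gauss_hermite_expectation(g, n_nodes=10):
    """Approximate E[g(u)] for u ~ N(0, 1) by Gauss-Hermite quadrature."""
    # hermgauss gives nodes and weights for the weight function exp(-x**2)
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # The substitution u = sqrt(2) * x maps that weight onto the N(0, 1) density
    return float(np.sum(weights * g(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi))

def tensor_grid_size(n_nodes, n_dims):
    """Function evaluations needed by a full tensor-product quadrature rule."""
    return n_nodes ** n_dims

# One dimension is cheap: the second moment of a standard normal is 1
second_moment = gauss_hermite_expectation(lambda u: u ** 2)

# 4 growth parameters x 2 sources (genetic, environmental) x 2 twins
n_dims = 4 * 2 * 2
evaluations = tensor_grid_size(10, n_dims)

print(second_moment, n_dims, evaluations)
```

With only 10 nodes per dimension, a single likelihood evaluation for one twin pair would already require 10^16 integrand evaluations, which is why brute-force quadrature does not scale to this setting.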

In this paper, we introduce a fairly general Markov Chain Monte Carlo (MCMC) framework of the Bayesian methodology (see, e.g., Gilks et al., 1996) to the analysis of twin data that is free of many of the limitations inherent in some of the current widely used methods. In addition to providing numerical estimates of the parameters of complex models for twin resemblance, the MCMC approach provides, at little extra computational cost, information that is more difficult to obtain through the classic likelihood-based approaches, such as the joint posterior probability distribution of a latent trait for a twin pair. The use of “non-informative” prior distributions provides inferences that are comparable with the classic MLE approaches. Additionally, if prior information is available, it too can be incorporated into the existing models through the use of informative prior distributions; such information cannot be utilized with ML estimation. At one level, we may think of MCMC as an approach to parameter estimation that uses a carefully constructed sequence of Monte Carlo simulations to construct the integrals that are so elusive in ML approaches. However, the way the sequence of simulations is constructed provides a wealth of information and possible insight that is not available through the usual approaches to numerical integration. Although MCMC has been used quite widely in the analysis of linkage and pedigree data (see, e.g., Thomas and Gauderman, 1996), it has not been applied much outside that arena. Part of the reason for this is almost certainly the convenience of linear structural modeling and the conceptual impact of its applications in a wide range of contexts over the last quarter century. Do et al.’s (2000) application of MCMC to survival analysis of twin data is a notable exception that also reflects the relative intractability of nonlinear genetic models to the usual likelihood-based approaches.

MARKOV CHAIN MONTE CARLO METHODS

Since their inception in physics in the early 1950s (Metropolis et al., 1953; Hastings, 1970), MCMC methods have been used in a variety of areas requiring complex statistical modeling such as image analysis (Besag and Green, 1993; Gelfand and Smith, 1990; Geman and Geman, 1984) and generalized linear mixed models (Clayton, 1996; Zeger and Karim, 1991). Although the approach has been implemented in some genetic contexts (see, e.g., Shoemaker et al., 1999, and citations in Thomas and Gauderman, 1996), there have been relatively few published examples of its application to twin data (e.g., Burton et al., 1999; Do et al., 2000). The latter application was developed appropriately in the context of survival analysis, which has hitherto proved numerically cumbersome (Meyer and Eaves, 1988; Meyer et al., 1991) for the same reasons that compelled us to explore the application of MCMC methods to the problems described here. Apart from apparently solving some practical problems, MCMC methods commend themselves intellectually because they “provide a unifying framework within which many complex problems can be analyzed using generic software” (Gilks et al., 1996, p. 1). Indeed, the feasibility of applying MCMC methods to behavior-genetic applications is enhanced enormously by the free dissemination of a Windows version (WinBUGS 1.3) of the program BUGS (“Bayesian Inference Using Gibbs Sampling”; Spiegelhalter et al., 2000) that we used in all the applications described here.

Superficially, we may regard MCMC methods as one further way of obtaining estimates of parameters of a genetic model when the usual approaches of maximum likelihood are too tedious. Thus, MCMC methods use Monte Carlo methods to approximate the integrals that have proved tiresome in likelihood-based approaches, such as obtaining the normalizing constants in Bayesian analyses, and marginal likelihood functions in the generalized linear mixed-effects models. However, from another perspective, MCMC’s foundation in a Bayesian approach to modeling raises more profound questions about the theoretical framework within which the task of modeling is conceived (see, e.g., Gelman et al., 1995; Sivia, 1996). In the classic ML framework, we seek the values of model parameters that maximize the likelihood of the data, given certain assumptions about the distribution of the data and the underlying parameters. Within the Bayesian framework, we seek the joint posterior distribution of the unknown parameters given the data under, it is hoped, appropriate assumptions about the prior distribution of the parameters and data. Whether or not we choose explicitly to adopt a Bayesian paradigm for genetic modeling, it turns out that some of the algorithms developed within the Bayesian context have advantages for exploring a range of models that appear less tractable within the conventional context of maximum likelihood.

Briefly, the MCMC approach constructs a Markov Chain on the (parameter) space of unknown quantities such that, starting with a series of trial values (e.g., means, regressions, genetic variances, genetic and environmental effects, etc.), after an initial series of iterations (the “burn-in”) successive iterations represent samples from the unknown joint distribution. This is the so-called stationary distribution of the Markov Chain, and in the Bayesian context, it is the joint posterior distribution of all the parameters. We note that in this context, the “parameters” do not just include the usual parameters of the structural model (means, genetic and environmental variances, etc.) but also the latent genetic and environmental deviations of the individual twins and any missing values. The MCMC iterations are furnished by simulating values for the unknown parameters, conditional upon the given data, using specially tailored transition probability kernels (also called proposal distributions) that are not only easy to simulate from, but also guarantee the convergence (in distribution) of the simulated Markov Chain to the true joint posterior distribution. After a burn-in period, the probability distribution of any unknown quantity, and its moments (such as the expected value of any function of that quantity), can be obtained to any desired degree of precision by taking an (ergodic) average of the successive values over a sufficiently large number of iterations.
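The roles of the trial value, the proposal distribution, the burn-in, and the ergodic average can be seen in a toy random-walk Metropolis sampler. The sketch below is not the sampler used in this paper: the data, the flat prior, the step size, and the function names `log_post` and `metropolis` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(10.0, 1.0, size=50)   # toy data; unit variance treated as known

def log_post(mu):
    # Flat prior on mu, so the log posterior equals the log-likelihood
    # up to an additive constant
    return -0.5 * np.sum((y - mu) ** 2)

def metropolis(start, n_iter, step=0.3):
    """Random-walk Metropolis chain for mu with a normal proposal."""
    chain = np.empty(n_iter)
    current, current_lp = start, log_post(start)
    for t in range(n_iter):
        proposal = current + step * rng.normal()
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, posterior ratio)
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        chain[t] = current
    return chain

# Start well away from the data mean, discard a burn-in, then average
chain = metropolis(start=15.0, n_iter=7000)
kept = chain[2000:]
posterior_mean = kept.mean()

print(posterior_mean, y.mean())
```

Under the flat prior assumed here, the ergodic average of the retained draws should sit close to the sample mean, which is also the ML estimate of the mean.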

The Gibbs sampler (Creutz, 1979; Gelfand and Smith, 1990; Geman and Geman, 1984; Ripley, 1979) is perhaps the most popular MCMC approach to constructing a Markov chain with the desired properties. In the Gibbs sampling approach, the Markov kernels consist of the conditional distributions of each variable of interest given all the other variables. As an example, consider random variables X and Y having an unknown joint distribution [X, Y], and assume further that each of the conditional distributions [X | Y] and [Y | X] is available in analytically closed form. Here the Markov kernels are the conditionals [X | Y] and [Y | X]. So, if X_0 and Y_0 are initial values, then the Markov Chain is constructed on the XY space by simulating successively a sequence of {X_r, Y_r} from the known conditionals [Y | X_{r-1}] and [X | Y_{r-1}] for r = 1, 2, . . . , R. It can be shown that the joint distribution of the sequence {X_r, Y_r} converges to the joint distribution [X, Y] as R tends toward infinity, as long as these conditionals are bona fide probability distributions. Thus, for a sufficiently large R, the {X_r, Y_r} resemble draws from the true joint distribution [X, Y]. Within the Bayesian context, X and Y are usually unknown parameters (e.g., the mean and variance), [X | Y] and [Y | X] (suppressing the conditioning on the data) are the conditional posterior distributions, and [X, Y] is the joint posterior distribution of X and Y. For example, the marginal posterior distribution [X] of X can be approximated by the Monte Carlo integration, for each X = x,

\[ [x] \approx \frac{1}{R} \sum_{r} [x \mid Y_r], \]

where the summation is over r = 1, 2, . . . , R. Similarly, the expectation of a function g(X) is approximated by the Monte Carlo average

\[ E\,g(X) \approx \frac{1}{R} \sum_{r} g(X_r). \]

Note that a desired byproduct of the MCMC approach is that not only the expectations, but the entire posterior distribution of g(X) is approximated by using the sequence {g(X_r)}. There are also several other ways, such as general Metropolis-Hastings algorithms, to construct an MCMC sampler; in fact, Gibbs sampling is a special case of these general algorithms. A more thorough account of the approach, which is beyond the scope of this paper, may be found in Tierney (1994), Gilks et al. (1996), and Brooks (1998). A recent paper by Besag (2000) reviews several MCMC approaches and provides a comprehensive list of references.

Example 1: The Classic Multivariate Genetic Model for MZ and DZ Twins

It is convenient to introduce the approach with an example familiar to those with practice in linear structural models, that is, the estimation of the additive genetic and within-family environmental covariance matrices from data on a pair of variables measured on samples of 100 MZ and 100 DZ twin pairs. The data were simulated using SAS on the assumption of multivariate normality. Table I summarizes the population parameter values used to generate the simulated data and the observed mean vectors and 4 × 4 covariance matrices for the MZ and DZ twins. The table also shows the maximum-likelihood estimates of the MZ and DZ means and the additive genetic and within-family environmental covariance matrices obtained by fitting the structural model to the raw data vectors in Mx (Neale et al., 1999).

The same data were analyzed by MCMC using WinBUGS to implement the Gibbs sampler. The first step is to construct a directed graph (Fig. 1) that expresses the logical and stochastic process by which the twin data are generated. WinBUGS provides a graphical user interface (GUI) that allows most of the elements of the model to be specified graphically as “doodles”. Subsequently the graphical model may be automatically translated into a script with syntax similar to S-Plus, and compiled with object data prior to initialization with trial values and execution. The automated code may be modified manually to carry out side-computations that cannot be implemented directly in the GUI, such as the computation of standardized estimates.
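WinBUGS builds the Gibbs sampler automatically from the graph. For intuition about what such a sampler does, the sketch below hand-codes one in Python for a far simpler problem than the twin model: the mean and precision of a single normal sample, alternating draws from the two full conditionals. The toy data, the semi-conjugate priors, the starting values, and the variable names are assumptions made for illustration and are not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(10.0, 1.0, size=200)   # toy data, not the twin data
n, y_sum = y.size, y.sum()

prior_mean, prior_prec = 0.0, 1e-4    # vague normal prior on the mean
a, b = 1e-3, 1e-3                     # vague gamma prior on the precision

n_burn, n_keep = 2000, 5000
mu, tau = 15.0, 6.0                   # deliberately poor trial values
draws = np.empty((n_keep, 2))

for r in range(n_burn + n_keep):
    # Full conditional [mu | tau, y]: normal
    post_prec = prior_prec + n * tau
    post_mean = (prior_prec * prior_mean + tau * y_sum) / post_prec
    mu = rng.normal(post_mean, 1.0 / np.sqrt(post_prec))
    # Full conditional [tau | mu, y]: gamma (NumPy takes shape and scale)
    rate = b + 0.5 * np.sum((y - mu) ** 2)
    tau = rng.gamma(a + 0.5 * n, 1.0 / rate)
    if r >= n_burn:
        # Store the mean and the variance (inverse precision)
        draws[r - n_burn] = mu, 1.0 / tau

mean_mu, mean_var = draws.mean(axis=0)
print(mean_mu, mean_var)
```

The retained draws of the variance are simply a function, 1/tau, of a sampled node, which mirrors the way side-computations such as inverting a precision matrix can be added to a BUGS script.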

The underlying model follows that of Jinks and Fulker (1970) in recognizing that each twin of a DZ pair is realized, in the absence of shared environmental effects, by sampling three separate (normal) variables: a “between families” genetic component, g2; a “within families” genetic effect, g1; and a “within families” environmental effect, e1. The expected variances of the between- and within-family genetic effects may be parameterized in terms of additive and non-additive genetic effects. If genetic effects are additive and mating is random, the variances of g2 and g1 are equal. The process that generates MZ twins is similar, but both twins of a pair share the same within- and between-family genetic effects (so the genetic effects of MZs are identical). Each MZ twin still receives its own unique within-family environmental effect, e1.

The logic of the genetic model follows closely the underlying process by which genetic and environmental effects arise. The “between family” genetic effects reflect the average genetic differences between parents and the within-family effects those of genetic segregation. The “environment” within families is thus a further process that differentiates between individuals of known genetic constitution. This “natural” logic is mirrored in the way in which BUGS models are written for twin data (Fig. 1).

Table I. Population Parameter Values Used in Simulation of Bivariate Twin Data and Values Realized Using Mx for ML Estimation (n = 100 MZ and 100 DZ Pairs)

Parameter         ML estimate   Population value
mu[1]             9.993         10.0
mu[2]             10.047        10.0
sigma2.g[1,1]     0.704         0.8
sigma2.g[1,2]     0.371         0.4
sigma2.g[2,2]     0.741         0.8
sigma2.e[1,1]     0.194         0.2
sigma2.e[1,2]     0.098         0.1
sigma2.e[2,2]     0.254         0.2

Figure 1 represents many, but not all, of the features of a “BUGS” model. A fuller description and examples are provided in the manuals accompanying the downloaded WinBUGS software (Spiegelhalter et al., 2000). Constants are represented by rectangles and variables by ellipses. Stochastic dependence is represented by a single one-headed arrow. The distribution of stochastic nodes is not shown in the figure, but is selected from a menu of options available when building the doodle interactively. The rectangles (“plates”) represent the domain over which subscripts vary in subscripted variables and translate into “for” loops in the code derived from the doodle.

The graphical model is in two parts. The left half of the diagram represents the process by which data on DZ pairs are generated. The element ydz[i,j,k] contains the observed value of the jth twin of the ith pair on the kth variable. The model allows generally for N variables. Similarly, the corresponding typical datum for an MZ individual is stored in ymz[i,j,k], represented on the right of the diagram. Elements such as ydz[i,j,k] are termed “nodes” and are represented in ellipses in the BUGS GUI.

The observed values of the twins depend stochastically on prior nodes whose interrelationships are specified in the graph. Starting at the top, we have the population mean vector, mu. It is assumed initially that these means have a relatively uninformative prior distribution and are sampled from a multivariate normal distribution with mean vector mean = (0,0) and precision matrix (precis). The precision matrix is equal to the inverse of the prior 2 × 2 covariance matrix of the population means. These so-called “meta-parameters” are typically chosen, in our application, to reflect our relative lack of prior information about the population parameters. This amounts to making the diagonal elements of precis fairly small (we used 0.0001). These prior values are supplied as data to BUGS. Notice that with non-informative priors, the Bayesian inference usually produces results that are comparable to those obtained using MLE since the joint posterior distribution is highly dominated by the likelihood function relative to the prior distribution.
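The effect of a small prior precision can be checked in closed form for a simplified case. The Python sketch below assumes, purely for illustration, that the 2 × 2 data covariance is known, so that the posterior mean of the mean vector is available analytically; the data, the covariance values, and the variable names other than `precis` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])   # data covariance, assumed known here
y = rng.multivariate_normal([10.0, 10.0], sigma, size=200)
n, ybar = y.shape[0], y.mean(axis=0)

prior_mean = np.zeros(2)
precis = np.diag([1e-4, 1e-4])               # prior precision, as in the text
prior_cov = np.linalg.inv(precis)            # prior variances of 10,000

# Conjugate normal update: precisions add, means are precision-weighted
data_prec = n * np.linalg.inv(sigma)
post_prec = precis + data_prec
post_mean = np.linalg.solve(post_prec, precis @ prior_mean + data_prec @ ybar)

print(np.diag(prior_cov), post_mean, ybar)
```

A precision of 0.0001 corresponds to a prior variance of 10,000, so the prior mean of zero has almost no pull on the posterior mean, which is numerically very close to the sample mean.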

The next layer of the process in DZ twins, which corresponds to the way twin pairs are produced, comprises generation of the means of the ndz twin pairs. The node g2dz[i,k] denotes the expected value of the mean of the ith pair on the kth variable. The vector of between-family effects is generated from the multivariate normal distribution with mean vector mu and precision matrix tau.g. The matrix tau.g is the inverse of half the additive genetic covariance matrix. Our initial model assumes that this matrix is of general positive definite form. It is possible to devise models that reflect other hypotheses about the genetic covariance structure. In the GUI notation of WinBUGS, a single-headed arrow represents stochastic dependence. This should not be confused with the path coefficient familiar in LISREL formulations. For simplicity we ignore the effects of the shared environment in this early example, but there is no difficulty adding an environmental component to the differences between families. Our choice of “g2” to denote the between-pair genetic effects emphasizes the parallel between the development of the MCMC model for twin data and the early formulation of the variance components model for twin data by Jinks and Fulker (1970). The fact that each ith pair of the set of ndz DZ pairs has its own genetic deviation is represented by the large rectangle (“plate”) encompassing the g2dz[i,1:N]. The plate corresponds to a “for” loop in many programming languages.

Fig. 1. Graphical model for random additive genetic and within-family environmental effects on multivariate MZ and DZ twin data.

The expected values of the individual DZ twins, g1dz[i,j,k], are obtained by adding bivariate normal within-pair genetic deviations to the expected values of the corresponding pair. The genetic deviations are again sampled from a multivariate normal distribution with covariance matrix equal to half the additive genetic covariance matrix. Thus, in our GUI, the individual expected values of the DZ twins depend stochastically on the g2dz[i,k] and the “genetic” precision matrix, tau.g. The fact that a single-headed arrow (“edge”) links both g2dz and g1dz to tau.g automatically imposes the constraint that the genetic variances are equal within and between DZ (sib) pairs when gene action is additive.

Finally, the individual twin observations (ydz[i,j,k]) are represented as samples from the multivariate normal distribution with expected values equal to the genetic effects, g1dz, and precision equal to the inverse, tau.e, of the within-family environmental covariance matrix. The fact that the individual twins are nested within pairs is represented in the graph by the second plate nested within the outer plate. The inner plate contains the expected individual values, g1dz, and the observations, ydz, but excludes the g2dz that are only captured within the domain of the outer plate.

The graph is completed by representing, on the right-hand side of Figure 1, the process that generates pairs of MZ twins. The MZ part of the graph differs only in the fact that both the between-family (“g2”) and within-family (“g1”) genetic deviations contribute to the expected values of the twin pair in MZ twins, because they have identical alleles segregating from their parental genotypes. The individual MZ twin observations are assumed to be multivariate normal with expected values equal to the pair means (g1mz[i,k]) and dispersion equal to the within-family environmental covariance matrix. The fact that both between- and within-family genetic deviations contribute to differences between MZ pairs is indicated by excluding the g1mz effects from the inner plate in the case of MZ pairs (Fig. 1). Note that the nodes on the MZ part of the diagram depend stochastically on (have arrows coming from) the same precision nodes as those for DZ twins. By putting MZ and DZ twins on the same diagram in this way, the implied equalities of the components of covariance in MZ and DZ twins are automatically imposed.
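The generative process encoded in the graph can be sketched outside BUGS. The Python simulation below follows the layers described above (g2, then g1, then the observations) using the population values of Table I; it is an illustration of the model structure, not the authors' SAS or BUGS code, and the helper names and the large number of pairs are assumptions chosen so that sample moments are stable.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([10.0, 10.0])
A = np.array([[0.8, 0.4], [0.4, 0.8]])   # additive genetic covariance (Table I)
E = np.array([[0.2, 0.1], [0.1, 0.2]])   # within-family environmental covariance
n_pairs = 20000                          # illustrative; the paper used 100 per zygosity

def mvn(mean, cov):
    """Add zero-mean bivariate normal deviations to every row of `mean`."""
    return mean + rng.multivariate_normal(np.zeros(2), cov, size=mean.shape[:-1])

# DZ pairs: pair-level g2, twin-level g1, then observations
g2dz = mvn(np.tile(mu, (n_pairs, 1)), A / 2)                # (pairs, variables)
g1dz = mvn(np.repeat(g2dz[:, None, :], 2, axis=1), A / 2)   # (pairs, twins, variables)
ydz = mvn(g1dz, E)

# MZ pairs: both genetic layers are shared by the two twins
g2mz = mvn(np.tile(mu, (n_pairs, 1)), A / 2)
g1mz = mvn(g2mz, A / 2)                                     # one value per pair
ymz = mvn(np.repeat(g1mz[:, None, :], 2, axis=1), E)

def cross_twin_cov(y):
    """Covariance between twin 1 and twin 2 across pairs."""
    d1 = y[:, 0, :] - y[:, 0, :].mean(axis=0)
    d2 = y[:, 1, :] - y[:, 1, :].mean(axis=0)
    return d1.T @ d2 / (y.shape[0] - 1)

print(cross_twin_cov(ymz))
print(cross_twin_cov(ydz))
```

Under this additive model the cross-twin covariance should approach the full additive genetic covariance matrix in MZ pairs and half of it in DZ pairs, which is the constraint the shared precision node tau.g imposes in the graph.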

The BUGS code generated automatically from the graph was amended to invert the genetic and environmental precision matrices to yield the additive genetic and within-family environmental covariance matrices. The code is supplied in Appendix 1.

Electronic copies of this and other doodles, code and simulated data used in this paper can be obtained from the first author.

The MCMC simulation was started with trial values of (15, 15) for the mean vector, the identity matrix, I, for the “genetic” precision matrix (tau.g) and 6I for the “environmental” precision matrix (tau.e). These values were selected to be of the right order but sufficiently far from the ML estimates as to pose a realistic challenge to the MCMC algorithm. The MCMC algorithm converged very rapidly, but a 2000-iteration “burn-in” preceded a further 5000 iterations that were sampled to characterize the stationary distribution. With the relatively small data set and simple model, the CPU time was very short on a laptop computer, though significantly longer than for Mx applied to the raw observations. WinBUGS offers a rich range of tools for monitoring the progress of the algorithm, including active traces of the iterations of any desired nodes, histories of selected bands of iterations, kernel density plots and a variety of summary statistics. Table II gives the statistical summaries of the first 5000 MCMC iterations after a 2000-iteration burn-in.

The first part of the table corresponds to Table I and gives the MCMC values of the parameters that were also obtained by ML using the conventional structural modeling approach. Note that the agreement between the ML estimates and those obtained by averaging the sequence of Monte Carlo simulations is very close. Table II, however, gives a number of other statistics that are not readily available from the ML algorithm. The MCMC algorithm gives the median parameter values and the upper and lower 2.5% confidence limits. Mx experienced difficulty in obtaining these confidence intervals for this small sample. The standard deviations of the parameters are also obtained directly from the MCMC algorithm. These may be time-consuming to calculate numerically within the ML algorithm since they require numerical computation of the Hessian matrix, which usually requires a series of additional steps after the ML estimates have been obtained.

Table II. Summary Statistics for 5000 MCMC Iterations of Bivariate AE Model after 2000 Iteration “Burn in”

Node              Mean      SD        MC error   2.5%      Median    97.5%
deviance          985.7     67.7      2.91       856.7     985.4     1118.0
mu[1]             9.991     0.05906   0.001452   9.877     9.992     10.11
mu[2]             10.05     0.06184   0.001871   9.924     10.05     10.17
sigma2.g[1,1]     0.705     0.08122   0.003302   0.5567    0.7018    0.8731
sigma2.g[1,2]     0.3719    0.06418   0.0024     0.252     0.3711    0.5065
sigma2.g[2,2]     0.7394    0.08337   0.002786   0.5885    0.7367    0.9113
sigma2.e[1,1]     0.197     0.02872   0.001223   0.1476    0.1943    0.2606
sigma2.e[1,2]     0.09904   0.02398   9.144E-4   0.05537   0.09788   0.1486
sigma2.e[2,2]     0.2581    0.03528   0.00123    0.1977    0.2555    0.3337
g1mz[1,1]         10.03     0.2981    0.00454    9.451     10.03     10.61
g1mz[1,2]         9.623     0.3315    0.005314   8.988     9.624     10.27
g1mz[2,1]         9.321     0.2961    0.004655   8.738     9.312     9.911
g1mz[2,2]         10.46     0.335     0.005125   9.801     10.47     11.09
g1mz[3,1]         10.8      0.2972    0.004429   10.21     10.8      11.38
g1mz[3,2]         10.11     0.3344    0.00493    9.463     10.11     10.79
g1mz[4,1]         10.82     0.2915    0.004629   10.24     10.82     11.38
g1mz[4,2]         10.98     0.3303    0.004751   10.34     10.98     11.64
g1mz[5,1]         11.22     0.3009    0.004413   10.61     11.23     11.8
g1mz[5,2]         11.87     0.3405    0.005285   11.2      11.87     12.53
g1dz[1,1,1]       11.53     0.3756    0.007243   10.77     11.53     12.27
g1dz[1,1,2]       10.94     0.4263    0.007021   10.1      10.94     11.78
g1dz[1,2,1]       11.44     0.3763    0.006501   10.7      11.43     12.18
g1dz[1,2,2]       10.93     0.4286    0.007602   10.08     10.95     11.76
g1dz[2,1,1]       9.231     0.3776    0.007866   8.499     9.228     9.992
g1dz[2,1,2]       10.25     0.4248    0.007331   9.409     10.25     11.09
g1dz[2,2,1]       9.331     0.373     0.005969   8.606     9.329     10.07
g1dz[2,2,2]       9.505     0.4188    0.007975   8.694     9.5       10.31
g1dz[3,1,1]       10.72     0.3864    0.007354   9.957     10.73     11.48
g1dz[3,1,2]       9.448     0.4249    0.008252   8.601     9.449     10.26
g1dz[3,2,1]       9.563     0.3835    0.007174   8.81      9.561     10.29
g1dz[3,2,2]       9.309     0.4141    0.007627   8.486     9.306     10.12
g1dz[4,1,1]       9.928     0.3801    0.007233   9.189     9.926     10.67
g1dz[4,1,2]       10.34     0.4218    0.007382   9.514     10.34     11.15
g1dz[4,2,1]       9.96      0.3739    0.005286   9.231     9.963     10.7
g1dz[4,2,2]       10.24     0.4178    0.007408   9.422     10.24     11.06
g1dz[5,1,1]       10.05     0.3862    0.006506   9.288     10.05     10.8
g1dz[5,1,2]       10.51     0.4265    0.007701   9.681     10.51     11.35
g1dz[5,2,1]       8.541     0.3917    0.008129   7.78      8.533     9.334
g1dz[5,2,2]       8.759     0.4353    0.007574   7.924     8.759     9.617

Note: The nodes are defined as follows: deviance = minus twice the log-likelihood; mu[1], mu[2] = means of first and second variable; sigma2.g = elements of 2 × 2 genetic covariance matrix; sigma2.e = elements of 2 × 2 environmental covariance matrix; g1mz[i,k] = genetic effect on kth variable in ith MZ twin pair (“genetic score”); g1dz[i,j,k] = genetic effect on kth variable in jth twin of ith DZ pair.
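The columns of Table II are ordinary summaries of the retained draws for each node. The Python sketch below computes the same kinds of quantities from a chain; the batch-means estimate of the Monte Carlo error is one common choice and is assumed here rather than taken from the WinBUGS documentation, and the chain itself is an artificial stand-in with roughly the location and spread reported for mu[1].

```python
import numpy as np

def summarize(chain, n_batches=50):
    """Mean, SD, batch-means MC error and percentiles of a sampled node."""
    chain = np.asarray(chain, dtype=float)
    batch_len = chain.size // n_batches
    # Split the chain into consecutive batches and average within each
    batches = chain[: batch_len * n_batches].reshape(n_batches, batch_len)
    batch_means = batches.mean(axis=1)
    lo, med, hi = np.percentile(chain, [2.5, 50.0, 97.5])
    return {
        "mean": chain.mean(),
        "sd": chain.std(ddof=1),
        "mc_error": batch_means.std(ddof=1) / np.sqrt(n_batches),
        "2.5%": lo,
        "median": med,
        "97.5%": hi,
    }

rng = np.random.default_rng(5)
# Independent draws standing in for 5000 retained iterations of one node
fake_chain = rng.normal(9.99, 0.059, size=5000)
summary = summarize(fake_chain)
print(summary)
```

The SD column describes posterior uncertainty about the node, whereas the MC error describes how precisely the posterior mean has been computed from a finite chain; the latter shrinks as more iterations are sampled.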

The flexibility of the MCMC method, however, is illustrated by the fact that it yields, almost as a by-product, estimates of the individual genetic deviations of the MZ and DZ twins, together with a variety of statistics for evaluating their precision. The fact that MCMC is a carefully constructed simulation algorithm means that, at each iteration, the latent genetic and environmental effects of individual twins are simulated conditional on their phenotypes and the parameters of the genetic model. These latent scores and, indeed, any missing values are thus treated like every other parameter in the MCMC model and, after the algorithm has converged, can be sampled and summarized to provide mean values and estimates of error for the genetic and environmental deviations of individual twins. In multivariate genetic analyses, MCMC can be expected to evaluate genetic or environmental “factor scores” (see, e.g., Molenaar et al., 1990) and their confidence intervals at the same time as fitting the genetic factor model at very little extra computational cost. We regard the availability of estimates of individual scores and missing values as a significant bonus of the Bayesian approach. These estimates are given for the first 10 MZ and DZ pairs in Table II. We note in passing that, although we did not simulate missing data in this basic example, in principle the MCMC method can also routinely simulate missing values at no additional computational cost. However, in the current version of BUGS, the simulation of multivariate normal deviates cannot accommodate missing values. The method thus appears to accomplish in a relatively straightforward manner much of what can be achieved by other tailor-made algorithms.
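For intuition about the size of the uncertainty attached to an individual genetic score, a simplified closed-form calculation is possible. The Python sketch below assumes a single variable, an MZ pair, and known parameters (additive genetic variance 0.8 and environmental variance 0.2, the population values of Table I); the function name and the phenotype values are hypothetical, and the MCMC analysis itself is bivariate and also propagates uncertainty in the parameters.

```python
import numpy as np

def mz_genetic_score(y1, y2, mu, var_a, var_e):
    """Conditional mean and SD of a shared MZ genetic effect given both phenotypes.

    Assumes g ~ N(mu, var_a) and y_j ~ N(g, var_e) independently for j = 1, 2,
    with all parameters treated as known.
    """
    prec = 1.0 / var_a + 2.0 / var_e                    # prior plus two observations
    mean = (mu / var_a + (y1 + y2) / var_e) / prec      # precision-weighted mean
    return mean, np.sqrt(1.0 / prec)

score, score_sd = mz_genetic_score(10.4, 9.8, mu=10.0, var_a=0.8, var_e=0.2)
print(score, score_sd)
```

Under these simplifying assumptions the conditional SD is about 0.30, of the same order as the SDs reported for the g1mz nodes in Table II, and the score is shrunk slightly from the pair mean toward the population mean.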

Example 2: Nonlinear Growth Curve Models for Twin Data

The graph in Figure 1 is modified simply to take into account a non-linear developmental model in which there is random genetic and environmental variation in the parameters of a non-linear growth curve (Fig. 2). Introducing a third plate, nested within individual twins, to reflect the repeated measures of the outcome variable extends the model. In the figure, the response of the jth twin of the ith DZ pair on the kth occasion is denoted by rdz[i,j,k]. The responses for MZ twins are denoted by rmz[i,j,k]. The model assumes that the responses on the kth occasion have expected values ydz[i,j,k] and ymz[i,j,k]. The residual variances are assumed, for simplicity, to be constant and independent over occasions (i.e., any correlation across time is explained by the underlying developmental model).


The residual variance is the inverse of the precision tau.res, which is assumed to be the same for MZ and DZ twins. The expected values of the MZ and DZ responses are assumed to be logical functions of underlying random latent variables whose (multivariate) genetic and environmental structure can be captured by the model in Figure 1. In this case we assume that the responses are measured at ten equally spaced intervals (k = 1 ... 10), and that the temporal change within individuals follows a four-parameter logistic function with random genetic and environmental variation between individuals in each of the four parameters. Thus we write (for DZ twins)

$$E(\mathrm{rdz}[i,j,k]) = \mathrm{ydz}[i,j,k] = y[i,j,1] + \frac{y[i,j,2]}{1 + \exp\{-y[i,j,4]\,(k - y[i,j,3])\}}.$$

The individual parameters thus correspond to initial value, y[i,j,1]; range from initial value to asymptotic value, y[i,j,2]; time of maximum rate of change, y[i,j,3]; and rate of change, y[i,j,4]. Similar parameters are defined for MZ pairs. Other functional forms can be incorporated by relatively minor changes in the code given in Appendix A.2.

Data were simulated using SAS for 500 pairs each of MZ and DZ twins on the assumption that the four growth curve parameters each had independent additive genetic (A) and within-family environmental (E) components. The heritabilities of the four components were all assumed to be 0.8. In the population, initial values were assumed to be distributed N[1, 1], ranges N[10, 1], times of maximum growth N[5.5, 1], and rates of change N[1, 0.1]. The residual within-occasion error variance was assumed to be 0.5. The genetic and environmental correlations between the random growth-curve parameters were all assumed to be zero in the original data simulation, but the genetic and environmental covariances were allowed to be free parameters in the Bayesian analysis. Noninformative multivariate normal priors were assumed for the mean response vectors, and Wishart priors (omega.g and omega.e respectively) were assumed for the precisions of the genetic and environmental covariance matrices.

The 10,000 MCMC iterations took approximately 45 minutes on a laptop computer. The statistics for the principal parameters of the model sampled over the last 2000 MCMC iterations are summarized in Table III. The means of the growth curve parameters and error variances are reproduced quite precisely. The heritabilities of the four random components (based on the median values of the genetic and environmental components) are: initial value 0.932; range 0.825; time of maximum change 0.762; and rate of change 0.766. The apparent upward bias in the estimated heritability of the initial value is not resolved, but cannot be explained simply by the correlations between twins observed in the simulated initial values, since these were 0.826 (MZ) and 0.404 (DZ) respectively, close to their expected values. The MCMC estimates of the genetic and environmental covariances are all close to the zero values assumed in the simulation. As in the case of the basic multivariate model (Example 1), the MCMC analysis automatically estimates the genetic effects contributing to growth curve parameters of the individual twins, but illustrative values are not presented here because of space limitations. These individual effects, for example, can be further analyzed to identify interesting subgroups or populations of twins that are hidden in the observed data, which would be extremely difficult to do in a standard MLE approach.
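The four-parameter logistic function used in Example 2 can be sketched in Python as follows. This is an illustration only, not the SAS program used to simulate the data: the function names `logistic4` and `simulate_responses` are hypothetical, and the default residual variance of 0.5 and the ten occasions are taken from the description of the simulation.

```python
import math
import random


def logistic4(k, initial, growth, t_max, rate):
    """Expected response on occasion k under the four-parameter logistic curve.

    initial : value before growth begins            (y[i,j,1])
    growth  : range from initial to asymptote       (y[i,j,2])
    t_max   : time of maximum rate of change        (y[i,j,3])
    rate    : rate of change                        (y[i,j,4])
    """
    return initial + growth / (1.0 + math.exp(-rate * (k - t_max)))


def simulate_responses(params, occasions=10, resid_var=0.5, rng=None):
    """Add independent normal residual error to the curve on each occasion."""
    rng = rng or random.Random()
    resid_sd = math.sqrt(resid_var)
    return [
        logistic4(k, *params) + rng.gauss(0.0, resid_sd)
        for k in range(1, occasions + 1)
    ]


if __name__ == "__main__":
    # Population mean parameters from the simulation: 1, 10, 5.5 and 1.
    mean_params = (1.0, 10.0, 5.5, 1.0)
    print([round(logistic4(k, *mean_params), 3) for k in range(1, 11)])
    print(simulate_responses(mean_params, rng=random.Random(1)))
```

At k equal to the time of maximum change, the exponential term equals 1, so the expected response is the initial value plus half the range. This gives a quick check on any reimplementation of the curve.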

Example 3: Genotype × Environment Interaction and Correlation

Our third example addresses a challenging problem that has been frustratingly insoluble within the framework of structural modeling or conventional regression analysis, namely that of the genetic control of sensitivity to the environment (G × E interaction). Available methods for the analysis of G × E have depended on stratification of a sample of relatives by values of an environmental covariate (e.g., twins discordant for an environmental factor or siblings stratified by a putative environmental covariate) and testing for heterogeneity among genetic parameters within strata. This approach assumes that the environmental measures are fixed rather than random and independent of genetic effects on the outcome of interest. Environments are not fixed, and their independence can seldom be guaranteed in practice when, for example, aspects of the family environment that interact with genetic liability are themselves correlated with genetic liability. Our preliminary studies suggest that the MCMC approach may provide a framework for removing this restriction and enhancing our capability for a more rigorous and flexible analysis of G × E interaction.

Clearly, there are many different ways in which nonadditive effects, including G × E, may feature in a family study. We consider one example likely to be relevant to future analyses of developmental twin data. We assume that we have measured an outcome, symptoms of depression, for example, that is influenced by genetic and environmental factors that may interact. We further assume that we have measured a covariate (endophenotype) that is itself partly genetically determined and indexes the genetic sensitivity of the individual to a specific environmental factor. Finally, we assume that we have measured a variable that is hypothesized to be a covariate (partly “environmental”) of the outcome. The regression of the outcome on the environmental covariate varies between subjects as a function of differential (possibly genetic) sensitivity of individuals to the measured environmental covariate. We could, for example, envision a system in which pre-pubertal anxiety was genetically correlated with post-pubertal depression, but was also an index of genetic sensitivity to the impact of life events. The problem may be further complicated by the fact that there may be genetic effects on exposure to life events and that these may correlate genetically with anxiety and/or depression.

Fig. 2. Graphical model for random additive genetic and environmental effects in four-parameter logistic growth curve model for MZ and DZ twin data.

Figure 3 represents the graph that specifies the model for this particular problem. As before, the model is based on the underlying multivariate linear model represented in Figure 1, with modifications to allow for the unique aspects of the GxE model.


As before, the model comprises two similar components representing the relationships among the nodes for DZ (left side of the figure) and MZ twin individuals nested within pairs. In this case, we define three variables, each having its own independent errors. For DZ twins, we define rdz[i,j] as the response variable for the jth twin in the ith DZ pair; the covariates sdz[i,j] and edz[i,j] represent the corresponding index of sensitivity to the environment and the measured environment respectively. The model for MZ manifest variables follows the same basic pattern, with modifications to allow for the fact that both within-family and between-family genetic differences contribute to differences between MZ pairs (see Fig. 3, cf. Jinks and Fulker, 1970). The DZ responses have expected values

$$\mathrm{erdz}[i,j] = \mathrm{g1dz}[i,j,3] + \mathrm{esdz}[i,j]\,\mathrm{edz}[i,j]$$

and precision tau.r equal to the inverse of the within-family environmental variance in the responses, sigma2.r. The product term represents the interaction between genes and the measured environment, edz[i,j], as the product of the measured environment and a random coefficient, esdz[i,j], that depends on the genotype of the individual such that esdz[i,j] = g1dz[i,j,2].

Table III. Summary Statistics for Four-Parameter Logistic Model for 10-Occasion Longitudinal MZ and DZ Twin Data from 2000 MCMC Iterations after 8000 Iteration “Burn-in”

Node            Mean       SD        MC error   2.5%       Median     97.5%

mu[1]           1.055      0.02848   0.001005   0.9997     1.056      1.111
mu[2]           10.04      0.02754   0.001018   9.987      10.04      10.09
mu[3]           5.522      0.02766   0.001356   5.467      5.523      5.575
mu[4]           1.003      0.008608  3.276E-4   0.9863     1.003      1.02

sigma2.g[1,1]   0.9451     0.03889   0.001958   0.8679     0.9442     1.023
sigma2.g[1,2]   -0.07922   0.02532   0.001042   -0.1282    -0.0794    -0.02951
sigma2.g[1,3]   0.05634    0.02609   0.001382   0.004993   0.05594    0.1087
sigma2.g[1,4]   -0.01235   0.008509  5.598E-4   -0.02899   -0.01197   0.003468
sigma2.g[2,2]   0.8235     0.03886   0.00215    0.7472     0.821      0.9045
sigma2.g[2,3]   -0.0197    0.02618   0.001536   -0.06924   -0.01916   0.03014
sigma2.g[2,4]   0.005297   0.008833  6.184E-4   -0.01222   0.005127   0.0222
sigma2.g[3,3]   0.6971     0.03652   0.002406   0.6257     0.6966     0.7698
sigma2.g[3,4]   0.003212   0.008325  6.365E-4   -0.01387   0.003397   0.01841
sigma2.g[4,4]   0.07236    0.003782  2.267E-4   0.06502    0.07238    0.07988
sigma2.e[1,1]   0.06867    0.004513  3.425E-4   0.06029    0.06856    0.07788
sigma2.e[1,2]   0.0968     0.006521  4.504E-4   0.08467    0.09652    0.1102
sigma2.e[1,3]   -0.01891   0.005676  3.872E-4   -0.03016   -0.01882   -0.007335
sigma2.e[1,4]   -0.01453   0.002015  1.55E-4    -0.01872   -0.01445   -0.01076
sigma2.e[2,2]   0.1752     0.01076   6.975E-4   0.1552     0.1749     0.1968
sigma2.e[2,3]   0.02002    0.008789  5.566E-4   0.002384   0.0203     0.03775
sigma2.e[2,4]   0.001372   0.002974  1.936E-4   -0.0043    0.001348   0.007316
sigma2.e[3,3]   0.2184     0.01431   0.001042   0.1908     0.2175     0.2473
sigma2.e[3,4]   -0.001946  0.00311   2.306E-4   -0.007892  -0.002043  0.004432
sigma2.e[4,4]   0.02215    0.001642  1.592E-4   0.01918    0.02206    0.02553

sigma2.res      0.49       0.005364  4.599E-4   0.4796     0.4899     0.5014

Notes: Subscripts 1 ... 4 on estimates of means (mu) and genetic and environmental covariances (sigma2.g and sigma2.e) refer to parameters of the four-parameter logistic model as follows: 1 = initial value (1.0); 2 = total growth (final asymptotic value minus initial value, 10.0); 3 = age of maximum rate of change (5.5); 4 = growth rate (1.0). “True” values are given in parentheses. Sigma2.res is the residual, within-occasion, variance (0.5), assumed to be constant across occasions. Residual effects are assumed to be uncorrelated across occasions.
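The heritabilities quoted for Example 2 follow from the median variance components in Table III. As a worked check, the short Python sketch below recomputes them; the function name `heritability` and the dictionary layout are illustrative assumptions, and the numbers are the Table III medians of the diagonal elements of sigma2.g and sigma2.e.

```python
def heritability(var_g, var_e):
    """Proportion of variance in a growth-curve parameter that is genetic."""
    return var_g / (var_g + var_e)


# (median sigma2.g[k,k], median sigma2.e[k,k]) copied from Table III.
TABLE_III_MEDIANS = {
    "initial value": (0.9442, 0.06856),
    "range": (0.821, 0.1749),
    "time of maximum change": (0.6966, 0.2175),
    "rate of change": (0.07238, 0.02206),
}

if __name__ == "__main__":
    for name, (var_g, var_e) in TABLE_III_MEDIANS.items():
        print(f"{name}: {heritability(var_g, var_e):.3f}")
```

The values obtained this way agree with those reported in the text to within about 0.001. Any remaining difference in the third decimal place is presumably due to rounding of the tabled medians.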

The environmental measures, edz[i,j], have expected values eedz[i,j] = g1dz[i,j,1] and precision tau.e equal to the inverse of the within-family environmental effect on the environmental measure.

The sensitivities to the environment, esdz[i,j], are assumed to be indexed by the measured variable sdz[i,j], which has expected value esindz[i,j] = beta[1] + beta[2] esdz[i,j] and precision tau.s equal to the inverse of the within-family environmental variance in the index of sensitivity to the environment, sigma2.s. The coefficients beta[1] and beta[2] are the intercept and slope respectively of the regression of the index of sensitivity on the latent genetic sensitivity to the environment.

The structure of the three latent genetic variables, g1dz[i,j,k], k = 1 ... 3, follows that already described above for the basic multivariate linear structural model for MZ and DZ twin data, requiring specification of their mean vector, mu, and the inverse, tau.g, of half the additive genetic covariance matrix, sigma2.g. In its current form, the model assumes that there is no within-family environmental correlation between the three manifest variables except for that introduced between the response variable and the environmental measure by virtue of the regression of the former on the latter. Other parameterizations may be devised that enable this assumption to be relaxed.
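The role of "half the additive genetic covariance matrix" can be illustrated with a univariate Python sketch. It is an assumption-laden illustration rather than the authors' code: the function names, the seed and the genetic variance of 0.4 are chosen for the example. A pair-level genetic mean is drawn with variance sigma2.g/2 (the role of g2dz or g2mz in the appendices), and individual genetic values are drawn around it with the same variance. DZ twins receive separate individual draws, whereas MZ twins share one.

```python
import math
import random


def simulate_dz_pair(mu, var_g, rng):
    """Genetic values of the two twins in one DZ pair."""
    half_sd = math.sqrt(var_g / 2.0)
    pair_mean = rng.gauss(mu, half_sd)  # between-pair part, cf. g2dz[i]
    # Each twin gets an independent within-pair deviation, cf. g1dz[i,j].
    return [rng.gauss(pair_mean, half_sd) for _ in range(2)]


def simulate_mz_pair(mu, var_g, rng):
    """Genetic values of the two twins in one MZ pair (identical genotype)."""
    half_sd = math.sqrt(var_g / 2.0)
    pair_mean = rng.gauss(mu, half_sd)      # cf. g2mz[i]
    shared = rng.gauss(pair_mean, half_sd)  # cf. g1mz[i], one draw per pair
    return [shared, shared]


def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    rng = random.Random(2003)  # arbitrary seed for reproducibility
    dz = [simulate_dz_pair(0.0, 0.4, rng) for _ in range(20000)]
    mz = [simulate_mz_pair(0.0, 0.4, rng) for _ in range(20000)]
    print("DZ genetic correlation:", correlation([p[0] for p in dz], [p[1] for p in dz]))
    print("MZ genetic correlation:", correlation([p[0] for p in mz], [p[1] for p in mz]))
```

Under this construction each twin's genetic value has total variance sigma2.g, and the covariance between DZ co-twins is sigma2.g/2. This is the additive genetic expectation of a DZ correlation of one half and an MZ correlation of one.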

Appendix A.3 reproduces the code generated by WinBUGS, with some minor modifications to allow direct inspection of genetic covariances and environmental variances.

The population parameters used to generate the simulated data are provided in Table V. The three latent variables are all assumed to be independent with unit total variance. The assumption of independence is made in the simulation to simplify inspection of the simulated data points but not in the data analysis. Thus, although the data do not include G-E correlation in the first example, they do allow for it in the analysis through the estimation of the positive-definite genetic covariance matrix. If the analysis is correct, the covariances between the latent variables should all be zero. The simulations assumed that (additive) genetic effects contributed to random variation in sensitivity to the environment (h² = 0.4), the environmental index (h² = 0.4) and the outcome variable prior to GxE interaction (h² = 0.8). Although the mean sensitivity to the environment is zero, there is considerable variation, some of it genetic, among individuals in sensitivity to the environment. In this case, some individuals will show a positive regression of phenotype on the environment and others a negative response. The latent variables are assumed to be normal in the current formulation.

Fig. 3. Graphical model for random additive genetic effects and within-family environmental effects on sensitivity to partly heritable environmental effects (G × E interaction in presence of genotype-environment correlation).

Table IV summarizes the statistics realized in a simulation of 1000 MZ and 1000 DZ pairs from these population parameters. We note that the simulated values for the measured environment, sensitivity to the environment and the outcome prior to GxE yield statistics quite close to those expected. The response variable, which includes the effects of GxE interaction, is significantly less correlated in MZ and DZ twins than the raw outcome variable in the absence of GxE. However, we note that GxE interaction adds about 40% to the variance of the raw outcome in the absence of GxE.
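The figure of about 40% can be recovered from the population values listed in Table V. The derivation below is a sketch under the stated simulation assumptions: the latent genetic sensitivity s (variance 0.4, mean 0) is independent of the measured environment e (variance 0.4 + 0.6 = 1, mean 0), and both are independent of the raw outcome (variance 0.8 + 0.2 = 1). The symbols s, e, and y_0 are shorthand introduced here for esdz, edz, and the outcome in the absence of GxE.

```latex
\begin{align*}
\operatorname{Var}(s\,e)
  &= \operatorname{Var}(s)\operatorname{Var}(e)
   + \operatorname{Var}(s)\,[E(e)]^{2}
   + \operatorname{Var}(e)\,[E(s)]^{2} \\
  &= (0.4)(1.0) + (0.4)(0)^{2} + (1.0)(0)^{2} = 0.4, \\[4pt]
\operatorname{Var}(y_{0} + s\,e)
  &= \operatorname{Var}(y_{0}) + \operatorname{Var}(s\,e)
   = 1.0 + 0.4 = 1.4 .
\end{align*}
```

The cross term between y_0 and the product s e vanishes because the three variables are independent and e has zero mean. The expected standard deviation of the response is therefore about 1.18, in line with the realized values of Resp1 and Resp2 in Table IV (1.14 to 1.22).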

Table V summarizes the results of the MCMC analysis of the simulated data when allowance is made in the model for random effects on sensitivity to the environment. We note that BUGS correctly recovers the latent genetic covariance structure of the influences on the outcome, the environment and sensitivity to the environment, and correctly characterizes the relationship between the index of sensitivity to the environment and the latent sensitivity to the environment. The estimated genetic contribution to the environment and the genetic contribution to sensitivity are about right, and the genetic covariances between the measures are also about zero, as assumed in the simulation.

The analysis of GxE does not address directly the question of the impact of removing GxE in the model for data in which there is known to be substantial GxE. Each MCMC iteration in BUGS computes an approximate realization of the posterior distribution of the deviance (minus twice the logarithm of the likelihood), which can be traced and summarized over iterations in exactly the same way as any other node in the model. Subsequently, different models can be compared based on their deviance distributions. Alternatively, one can use the DIC (Deviance Information Criterion), which is a penalized version of the posterior mean of the deviance function (Spiegelhalter et al., 1998). For the full GxE model, the average deviance over 5000 iterations (Table V) was 23450.0.

The crucial element of our GxE model is the random component in sensitivity to the environment. We thus recast the model of Figure 3 to retain a fixed effect of the measured environment (Fig. 4) while removing the random effect. After a 2000 iteration “burn-in” we obtained the revised parameter estimates also given in Table V. The average deviance was 26410.0, which is significantly higher than the value obtained under the full model, indicating strong support for GxE in the simulated example.
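The comparison of average deviances, and the DIC mentioned above, can be sketched in a few lines of Python. This is a hedged illustration, not WinBUGS output: the function names are hypothetical, and the DIC is written in its commonly stated form, the posterior mean deviance plus an effective number of parameters p_D, where p_D is the posterior mean deviance minus the deviance evaluated at the posterior means of the parameters. The deviance at the posterior means is not reported in this paper, so it appears only as a function argument.

```python
import statistics


def mean_deviance(deviance_draws):
    """Posterior mean of the deviance node over the retained iterations."""
    return statistics.fmean(deviance_draws)


def dic(deviance_draws, deviance_at_posterior_means):
    """Return (DIC, p_D) from deviance draws and the plug-in deviance.

    p_D = mean deviance - deviance at posterior means of the parameters
    DIC = mean deviance + p_D
    """
    d_bar = mean_deviance(deviance_draws)
    p_d = d_bar - deviance_at_posterior_means
    return d_bar + p_d, p_d


if __name__ == "__main__":
    # Average deviances reported in Table V for the two models.
    full_model, reduced_model = 23450.0, 26410.0
    print("Increase in average deviance without G x E:", reduced_model - full_model)
```

For the two models in Table V, the average deviance rises by 2960.0 when the random sensitivity term is removed. This difference is large relative to the posterior standard deviations of the deviance (about 240 in each model).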

Table IV. Summary Statistics for Simulated Data on 1000 MZ and 1000 DZ Twins in Presence of G × E and Genetic Effects on Environmental Index (MZ Correlations Are in the Upper Triangle and DZ Correlations in the Lower Triangle)

Correlations  V11      V21      V31      S1       Resp1    V12      V22      V32      S2       Resp2

V11           1.0000   -0.0017  -0.0345  0.0342   -0.0895  0.4006   0.0284   -0.0140  0.0342   -0.0419
V21           0.0369   1.0000   -0.0222  0.6355   -0.0052  0.0155   0.4179   -0.0054  0.6355   -0.0197
V31           -0.0558  -0.0058  1.0000   -0.0278  0.8402   0.0013   0.0017   0.7938   -0.0278  0.6836
S1            0.0507   0.6530   0.0239   1.0000   -0.0248  -0.0111  0.6308   -0.0423  1.0000   -0.0533
Resp1         -0.0183  0.0074   0.8218   0.0625   1.0000   -0.0298  -0.0154  0.6661   -0.0248  0.6863
V12           0.2034   0.0042   -0.0131  0.0407   -0.0274  1.0000   -0.0303  0.0007   -0.0111  -0.0566
V22           0.0599   0.2003   0.0083   0.3218   0.0479   0.0401   1.0000   0.0108   0.6308   -0.0142
V32           0.0490   0.0114   0.4151   0.0176   0.3699   0.0362   0.0339   1.0000   -0.0423  0.8460
S2            0.0620   0.3030   0.0291   0.5398   0.0477   0.0337   0.6318   0.0342   1.0000   -0.0533
Resp2         0.0454   0.0062   0.3576   0.0123   0.3595   -0.0022  0.0123   0.8662   0.0238   1.0000

MZ Mean       0.0025   0.9350   10.0343  -0.0503  10.0552  -0.0210  0.9468   10.0306  -0.0503  10.0248
MZ S.D.       0.9810   1.0004   0.9809   0.6295   1.1417   0.9751   1.0248   1.0045   0.6295   1.1825
DZ Mean       0.0124   1.0132   10.0756  0.0104   10.1089  -0.0362  1.0143   10.0125  0.0082   10.0335
DZ S.D.       1.1061   0.9993   0.9931   0.6469   1.2014   0.9681   0.9948   1.0315   0.6524   1.2163

Note: Variables are labeled as follows: V11 = measured environment on first twin; V21 = index of environmental sensitivity of first twin; V31 = phenotype of first twin in absence of G × E (not used in analysis); S1 = genetic sensitivity of first twin to the environment; Resp1 = phenotype of first twin. V12, V22, V32, S2 and Resp2 are the corresponding variables for the second twin.

Fig. 4. Reduced graphical model for environmental effects in absence of G × E interaction (cf. Fig. 3).

Table V. Results for Fitting Full Model for G × E Interaction and Reduced Model Without G × E to Simulated Twin Data with Genetic Effects on Measured Environment

Full model (with G × E)

Node            Mean      SD       MC error  2.5%      Median    97.5%     True value

deviance        23450.0   237.8    10.86     23010.0   23450.0   23930.0   n/a
mu[1]           -0.0102   0.0184   0.0007    -0.0469   -0.0102   0.0265    0.0
mu[2]           -0.0473   0.0266   0.0002    -0.0992   -0.0475   0.0057    0.0
mu[3]           10.04     0.0212   0.0006    9.9980    10.04     10.08     10.0
beta[1]         1.029     0.0269   0.0003    0.9777    1.028     1.081     1.0
beta[2]         1.021     0.0465   0.0003    0.9306    1.020     1.115     1.0
e2.e            0.5849    0.0231   0.0001    0.5395    0.5845    0.6313    0.6
e2.r            0.2013    0.0113   0.0005    0.1806    0.2009    0.2247    0.2
e2.s            0.6020    0.0215   0.0009    0.5616    0.6018    0.6462    0.6
sigma2.g[1,1]   0.3852    0.0272   0.0020    0.3329    0.3848    0.4406    0.4
sigma2.g[1,2]   0.0171    0.0136   0.0007    -0.0107   0.0170    0.0436    0.0
sigma2.g[1,3]   0.0127    0.0265   0.0020    -0.0406   0.0127    0.0650    0.0
sigma2.g[2,2]   0.3943    0.0270   0.0021    0.3433    0.3931    0.4496    0.4
sigma2.g[2,3]   0.0148    0.0155   0.0006    -0.0161   0.0152    0.0449    0.0
sigma2.g[3,3]   0.8143    0.0316   0.0015    0.7542    0.8136    0.8799    0.8

Reduced model (no G × E)

Node            Mean      SD       MC error  2.5%      Median    97.5%     True value

deviance        26410.0   240.7    13.67     25930.0   26410.0   26880.0   n/a
mu[1]           -0.0098   0.0176   0.0006    -0.0433   -0.0094   0.0242    0.0
mu[2]           0.9803    0.0181   0.0007    0.9447    0.9805    1.015     0.0
mu[3]           10.06     0.0233   0.0007    10.01     10.06     10.1      10.0
beta[1]         -0.0578   0.0246   0.0014    -0.1083   -0.0572   -0.0110   1.0
beta[2]         n/a       n/a      n/a       n/a       n/a       n/a       1.0
e2.e            0.5826    0.0234   0.0012    0.5375    0.5822    0.6300    0.6
e2.r            0.4267    0.0188   0.0008    0.3920    0.4257    0.4658    0.2
e2.s            0.5971    0.0243   0.0013    0.5520    0.5964    0.6464    0.6
sigma2.g[1,1]   0.3885    0.0267   0.0016    0.3366    0.3885    0.4404    0.4
sigma2.g[1,2]   0.0203    0.0156   0.0008    -0.0089   0.0200    0.0531    0.0
sigma2.g[1,3]   0.0115    0.0281   0.0018    -0.0420   0.0117    0.0659    0.0
sigma2.g[2,2]   0.4114    0.0282   0.0021    0.3585    0.4107    0.4682    0.4
sigma2.g[2,3]   -0.0006   0.0205   0.0009    -0.0404   0.0006    0.0407    0.0
sigma2.g[3,3]   0.9872    0.0395   0.0016    0.9135    0.9849    1.0700    0.8

DISCUSSION AND CONCLUSION

Bayesian statisticians and other MCMC enthusiasts speak of the experience of “model liberation” much as some behavior-geneticists spoke of their first encounter with nonlinear optimization and structural modeling 25 years ago. The primary goal of this report is to encourage researchers in behavior genetics to see how far these approaches will indeed help them move beyond some of the limitations inherent in their current methods. We have shown that, with relatively uninformative priors, the MCMC approach yields virtually identical results to the conventional analysis of twin covariance in the relatively simple case of a multivariate (linear) model for genetic and environmental variation. We note, however, that even in this setting, the MCMC approach may yield information that is not so readily available through the current maximum-likelihood approaches. We then showed how the method seems to work quite adequately for two conceptually very important but less tractable kinds of model that seem to be on the growing edge of inquiry in behavior genetics. A relatively minor alteration of the basic multivariate genetic model allows the specification of random genetic and environmental effects in a four-parameter logistic growth curve that would be difficult to specify generally in conventional genetic analysis software that seeks to maximize the likelihood numerically. The genetic and environmental covariance structure of the initial values, ranges, ages of maximum growth and rates of change can all be estimated in a relatively straightforward manner by the MCMC approach. Similarly, the parameters of a non-linear model specifying the effects of genetically determined sensitivity to random, measured environmental influences could be explored (G × E interaction). Since the simulated data also allowed for genetic effects on the salient environmental influences, we must conclude provisionally that the same basic model can incorporate certain forms of epistasis and G × E interaction in the presence of gene-environment correlation. Again, such models have defied the ingenuity of exponents of the linear structural model.

MCMC methods are not necessarily a panacea. Although these early results look promising, the field is rife with warnings. As with current software for linear structural modeling, it is relatively easy to specify models in WinBUGS and to obtain plausible parameter estimates for models that are poorly identified or even nonsense. On the other hand, long experience in the application of linear structural models to real and simulated behavioral-genetic problems has generated a critical mass of researchers who are familiar with the pitfalls that await the unwary. Further exploration alone will tell whether these early hopes for MCMC are justified by important new insights that take us to significant new scientific territory. So far, we have shown that we can develop and run BUGS scripts that produce output that compares favorably with the underlying simulated mechanism. However, the number of examples is small and there is a significant need for a more systematic exploration of specific cases through simulation and application to real data. The properties of the linear model are well understood within the framework of ML estimation. The properties of MCMC estimates of parameters from nonlinear genetic models are nowhere near so clearly understood and need to be explored in more detail for each new class of model. Our current study suggests that the exploration might usefully start with models for nonlinear developmental change and G × E interaction. This is a research program in its own right.

Many of the problems experienced with ML methods, such as those associated with uncertain convergence, collinearity, inefficient parameterization, etc., do not disappear when changing to a different approach. For the case of the linear model, we have shown that MCMC recovers the underlying genetic and environmental parameters as well as the conventional structural modeling approach. Many of the diagnostic approaches used in ML have analogues in MCMC. For example, the problem of testing for convergence and model identification can be addressed in MCMC by running multiple chains in parallel beginning from different starting values. The autocorrelations of estimates from successive MCMC iterations serve a diagnostic role in indicating poorly identified parameters and/or parameters for which convergence is slow. As with ML, reparameterization of a model can significantly improve the performance of the MCMC algorithm.
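Both diagnostics mentioned above can be sketched in plain Python. The paper does not say which multiple-chain statistic was used, so the sketch substitutes a basic form of the Gelman-Rubin potential scale reduction factor as one common way of comparing parallel chains; this choice, and the function names, are assumptions made for illustration.

```python
import statistics


def autocorrelation(chain, lag=1):
    """Lag-k autocorrelation of successive draws for one node."""
    n = len(chain)
    mean = statistics.fmean(chain)
    denom = sum((x - mean) ** 2 for x in chain)
    num = sum((chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag))
    return num / denom


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin statistic for equal-length chains of one node.

    Values close to 1 suggest that chains started from different values
    are sampling the same distribution; values well above 1 do not.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


if __name__ == "__main__":
    # Hypothetical chains: two that overlap and one that is stuck elsewhere.
    a = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0]
    b = [1.0, 2.0, 1.0, 0.0, 1.0, 3.0, 2.0, 0.0]
    stuck = [x + 50.0 for x in a]
    print("lag-1 autocorrelation:", autocorrelation(a))
    print("R-hat, overlapping chains:", potential_scale_reduction([a, b]))
    print("R-hat, one chain stuck:", potential_scale_reduction([a, stuck]))
```

A high lag-1 autocorrelation means that successive draws carry little new information, so more iterations (or a reparameterization) are needed to achieve the same precision.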

The application of Bayesian methods and MCMC to problems related to behavior genetics is still relatively young, and there are several issues that have still to be resolved. Paramount among these difficulties is the lack of a clear framework for comparing models analogous to that available through likelihood-ratio testing within the framework of maximum likelihood estimation. Whereas it is possible to specify a distribution for likelihood ratios in ML, the difference between average deviances under two MCMC models does not have a distribution that has generally been specified. The problem of model comparison within a Bayesian framework is still being addressed (e.g., Spiegelhalter et al., 2002).

It remains to be seen how far MCMC can help researchers in behavior genetics to shake off the shackles of the linear model in response to pressing new questions. Practice will tell whether reservations that might be felt today are justified or whether they reflect the same resistance to change that accompanied the introduction of linear structural modeling to the analysis of twin data twenty-five years ago. We hope these basic examples will encourage others to explore some new horizons that they may have imagined but were yet unable to view easily through current lenses.

APPENDIX A.1

BUGS Code for MCMC Analysis of Genetic and Environmental Covariances for MZ and DZ Twin Pairs

##### Code is generated from doodle by WinBUGS 1.3. Comments and derived parameters added manually

model;
{
   ##### AE model for DZ pairs
   for (i in 1:ndz) {
      g2dz[i, 1:N] ~ dmnorm(mu[1:N], tau.g[1:N, 1:N])
   }
   tau.g[1:N, 1:N] ~ dwish(omega.g[,], N)
   tau.e[1:N, 1:N] ~ dwish(omega.e[,], N)
   for (i in 1:nmz) {
      g2mz[i, 1:N] ~ dmnorm(mu[1:N], tau.g[1:N, 1:N])
   }
   mu[1:N] ~ dmnorm(mean[], precis[,])
   for (j in 1:2) {
      for (i in 1:ndz) {
         ydz[i, j, 1:N] ~ dmnorm(g1dz[i, j, 1:N], tau.e[1:N, 1:N])
      }
   }
   for (j in 1:2) {
      for (i in 1:ndz) {
         g1dz[i, j, 1:N] ~ dmnorm(g2dz[i, 1:N], tau.g[1:N, 1:N])
      }
   }

   ##### AE model for MZ pairs
   for (j in 1:2) {
      for (i in 1:nmz) {
         ymz[i, j, 1:N] ~ dmnorm(g1mz[i, 1:N], tau.e[1:N, 1:N])
      }
   }
   for (i in 1:nmz) {
      g1mz[i, 1:N] ~ dmnorm(g2mz[i, 1:N], tau.g[1:N, 1:N])
   }

   ########### Derived parameters (covariance matrices) ###########
   for (k in 1:N) {
      for (l in 1:N) {
         # sigma2.g is genetic covariance matrix
         sigma2.g[k, l] <- inverse(tau.g[,], k, l) * 2
         # sigma2.e is within-family environmental covariance matrix
         sigma2.e[k, l] <- inverse(tau.e[,], k, l)
      }
   }
}


APPENDIX A.2

BUGS Code for Genetic and Environmental Random Effects in Four-parameter Logistic Growth-CurveModel (cf. Fig. 2).

model;

##### Model for four-parameter logistic developmental model for twin data #####
##### Program written from corresponding doodle and modified to yield genetic
##### and environmental covariance matrices #####

{
   for (i in 1:ndz) {
      mdz[i, 1:4] ~ dmnorm(mu[1:4], tau.g[1:4, 1:4])
   }
   tau.g[1:4, 1:4] ~ dwish(omega.g[,], 4)
   tau.e[1:4, 1:4] ~ dwish(omega.e[,], 4)
   for (i in 1:nmz) {
      mmz[i, 1:4] ~ dmnorm(mu[1:4], tau.g[1:4, 1:4])
   }
   mu[1:4] ~ dmnorm(mean[], precis[,])

   # Growth-curve parameters of individual DZ twins
   for (j in 1:2) {
      for (i in 1:ndz) {
         y[i, j, 1:4] ~ dmnorm(xdz[i, j, 1:4], tau.e[1:4, 1:4])
      }
   }
   for (j in 1:2) {
      for (i in 1:ndz) {
         xdz[i, j, 1:4] ~ dmnorm(mdz[i, 1:4], tau.g[1:4, 1:4])
      }
   }

   # Growth-curve parameters of individual MZ twins
   for (j in 1:2) {
      for (i in 1:nmz) {
         z[i, j, 1:4] ~ dmnorm(xmz[i, 1:4], tau.e[1:4, 1:4])
      }
   }
   for (i in 1:nmz) {
      xmz[i, 1:4] ~ dmnorm(mmz[i, 1:4], tau.g[1:4, 1:4])
   }

   # Prior on the residual precision, as printed. A precision must be positive,
   # so a gamma prior such as that used for tau.e in Appendix A.3 may have been intended.
   tau.res ~ dnorm(0.0, 1.0E-6)

   # Expected DZ responses on each of T occasions
   for (j in 1:2) {
      for (i in 1:ndz) {
         for (k in 1:T) {
            ydz[i, j, k] <- y[i, j, 1] + y[i, j, 2] / (1 + exp((-y[i, j, 4]) * (k - y[i, j, 3])))
         }
      }
   }
   for (j in 1:2) {
      for (i in 1:ndz) {
         for (k in 1:T) {
            rdz[i, j, k] ~ dnorm(ydz[i, j, k], tau.res)
         }
      }
   }

   # Expected MZ responses on each of T occasions
   for (j in 1:2) {
      for (i in 1:nmz) {
         for (k in 1:T) {
            zmz[i, j, k] <- z[i, j, 1] + z[i, j, 2] / (1 + exp((-z[i, j, 4]) * (k - z[i, j, 3])))
         }
      }
   }
   for (j in 1:2) {
      for (i in 1:nmz) {
         for (k in 1:T) {
            # Printed as zdz[i, j, k]; zmz is the node defined above
            rmz[i, j, k] ~ dnorm(zmz[i, j, k], tau.res)
         }
      }
   }

   ########### Derived parameters (covariance matrices) ###########

   sigma2.res <- 1 / tau.res

   for (k in 1:4) {
      for (l in 1:4) {
         # sigma2.g is genetic covariance matrix
         sigma2.g[k, l] <- inverse(tau.g[,], k, l) * 2
         # sigma2.e is within-family environmental covariance matrix
         sigma2.e[k, l] <- inverse(tau.e[,], k, l)
      }
   }
}

APPENDIX A.3

BUGS Code for Model with GxE in Presence of Genotype-Environment Correlation

model;
{
   for (i in 1:ndz) {
      g2dz[i, 1:N] ~ dmnorm(mu[1:N], tau.g[1:N, 1:N])
   }
   tau.g[1:N, 1:N] ~ dwish(omega.g[,], N)
   for (i in 1:nmz) {
      g2mz[i, 1:N] ~ dmnorm(mu[1:N], tau.g[1:N, 1:N])
   }
   mu[1:N] ~ dmnorm(mean[], precis[,])
   for (i in 1:ndz) {
      for (j in 1:2) {
         g1dz[i, j, 1:N] ~ dmnorm(g2dz[i, 1:N], tau.g[1:N, 1:N])
      }
   }
   # The next loop is missing from the listing as extracted (page break). It is
   # restored by analogy with Appendix A.1, since g1mz is used below.
   for (i in 1:nmz) {
      g1mz[i, 1:N] ~ dmnorm(g2mz[i, 1:N], tau.g[1:N, 1:N])
   }

   # Measured environment and index of sensitivity, DZ twins
   for (i in 1:ndz) {
      for (j in 1:2) {
         edz[i, j] ~ dnorm(eedz[i, j], tau.e)
      }
   }
   for (i in 1:ndz) {
      for (j in 1:2) {
         sdz[i, j] ~ dnorm(esindz[i, j], tau.s)
      }
   }

   # Expected MZ response: genetic effect plus sensitivity x environment
   for (i in 1:nmz) {
      for (j in 1:2) {
         ermz[i, j] <- g1mz[i, 3] + esmz[i] * emz[i, j]
      }
   }
   for (i in 1:nmz) {
      for (j in 1:2) {
         emz[i, j] ~ dnorm(eemz[i], tau.e)
      }
   }
   tau.e ~ dgamma(0.001, 0.001)
   tau.s ~ dgamma(0.001, 0.001)
   for (i in 1:ndz) {
      for (j in 1:2) {
         rdz[i, j] ~ dnorm(erdz[i, j], tau.r)
      }
   }
   for (i in 1:nmz) {
      for (j in 1:2) {
         rmz[i, j] ~ dnorm(ermz[i, j], tau.r)
      }
   }
   tau.r ~ dgamma(0.001, 0.001)

   # Expected DZ response: genetic effect plus sensitivity x environment
   for (i in 1:ndz) {
      for (j in 1:2) {
         erdz[i, j] <- g1dz[i, j, 3] + esdz[i, j] * edz[i, j]
      }
   }
   for (i in 1:ndz) {
      for (j in 1:2) {
         eedz[i, j] <- g1dz[i, j, 1]
      }
   }
   for (i in 1:ndz) {
      for (j in 1:2) {
         esdz[i, j] <- g1dz[i, j, 2]
      }
   }
   for (i in 1:nmz) {
      esmz[i] <- g1mz[i, 2]
   }
   for (i in 1:nmz) {
      eemz[i] <- g1mz[i, 1]
   }
   for (i in 1:nmz) {
      for (j in 1:2) {
         smz[i, j] ~ dnorm(esinmz[i], tau.s)
      }
   }

   # Regression of the index of sensitivity on latent genetic sensitivity
   beta[1:2] ~ dmnorm(mbeta[], prbeta[,])
   for (i in 1:ndz) {
      for (j in 1:2) {
         esindz[i, j] <- beta[1] + beta[2] * esdz[i, j]
      }
   }
   for (i in 1:nmz) {
      esinmz[i] <- beta[1] + beta[2] * esmz[i]
   }

   for (k in 1:N) {
      for (l in 1:N) {
         # sigma2.g is genetic covariance matrix
         sigma2.g[k, l] <- inverse(tau.g[,], k, l) * 2
      }
   }

   e2.e <- 1 / tau.e   # residual variance in environment
   e2.s <- 1 / tau.s   # residual variance in sensitivity
   e2.r <- 1 / tau.r   # residual variance in response
}

APPENDIX A.4

BUGS Code for Reduced Model with Measured Environmental Effects but No G × E Interaction (cf. Appendix A.3)

model;
{
   for (i in 1:ndz) {
      g2dz[i, 1:N] ~ dmnorm(mu[1:N], tau.g[1:N, 1:N])
   }
   tau.g[1:N, 1:N] ~ dwish(omega.g[,], N)
   for (i in 1:nmz) {
      g2mz[i, 1:N] ~ dmnorm(mu[1:N], tau.g[1:N, 1:N])
   }
   mu[1:N] ~ dmnorm(mean[], precis[,])
   for (i in 1:ndz) {
      for (j in 1:2) {
         g1dz[i, j, 1:N] ~ dmnorm(g2dz[i, 1:N], tau.g[1:N, 1:N])
      }
   }
   # The next loop is missing from the listing as extracted (page break). It is
   # restored by analogy with Appendix A.1, since g1mz is used below.
   for (i in 1:nmz) {
      g1mz[i, 1:N] ~ dmnorm(g2mz[i, 1:N], tau.g[1:N, 1:N])
   }

   # Measured environment and index of sensitivity, DZ twins
   for (i in 1:ndz) {
      for (j in 1:2) {
         edz[i, j] ~ dnorm(eedz[i, j], tau.e)
      }
   }
   for (i in 1:ndz) {
      for (j in 1:2) {
         sdz[i, j] ~ dnorm(esdz[i, j], tau.s)
      }
   }

   # Expected MZ response: genetic effect plus fixed effect of environment
   for (i in 1:nmz) {
      for (j in 1:2) {
         ermz[i, j] <- g1mz[i, 3] + beta * emz[i, j]
      }
   }
   for (i in 1:nmz) {
      for (j in 1:2) {
         emz[i, j] ~ dnorm(eemz[i], tau.e)
      }
   }
   tau.e ~ dgamma(0.001, 0.001)
   tau.s ~ dgamma(0.001, 0.001)
   for (i in 1:ndz) {
      for (j in 1:2) {
         rdz[i, j] ~ dnorm(erdz[i, j], tau.r)
      }
   }
   for (i in 1:nmz) {
      for (j in 1:2) {
         rmz[i, j] ~ dnorm(ermz[i, j], tau.r)
      }
   }
   tau.r ~ dgamma(0.001, 0.001)

   # Expected DZ response: genetic effect plus fixed effect of environment
   for (i in 1:ndz) {
      for (j in 1:2) {
         erdz[i, j] <- g1dz[i, j, 3] + beta * edz[i, j]
      }
   }
   for (i in 1:ndz) {
      for (j in 1:2) {
         eedz[i, j] <- g1dz[i, j, 1]
      }
   }
   for (i in 1:ndz) {
      for (j in 1:2) {
         esdz[i, j] <- g1dz[i, j, 2]
      }
   }
   for (i in 1:nmz) {
      esmz[i] <- g1mz[i, 2]
   }
   for (i in 1:nmz) {
      eemz[i] <- g1mz[i, 1]
   }
   for (i in 1:nmz) {
      for (j in 1:2) {
         smz[i, j] ~ dnorm(esmz[i], tau.s)
      }
   }
   beta ~ dnorm(0.0, 1.0E-6)

   for (k in 1:N) {
      for (l in 1:N) {
         # sigma2.g is genetic covariance matrix
         sigma2.g[k, l] <- inverse(tau.g[,], k, l) * 2
      }
   }

   e2.e <- 1 / tau.e   # residual variance in environment
   e2.s <- 1 / tau.s   # residual variance in sensitivity
   e2.r <- 1 / tau.r   # residual variance in response
}


ACKNOWLEDGMENTS

This work is part of the Statistical Genetics Core (P.I., L. J. Eaves) of the NIMH Center for Developmental Epidemiology (MH57761, P.I., Adrian Angold) and is also partly supported by MH45268 (P.I., L. J. Eaves). The statistical analysis in this paper was conducted using the program WinBUGS 1.3, freely available on the Internet through the generosity of the MRC BUGS project at Cambridge, England. We thank an anonymous referee for a careful and thoughtful critical reading of the manuscript.

REFERENCES

Besag, J. E. (2000). Markov Chain Monte Carlo for Statistical Inference. Working Paper #9. Center for Statistics and Social Sciences, University of Washington, WA.

Besag, J. E., and Green, P. J. (1993). Spatial statistics and Bayesian computation (with Discussion). J. R. Stat. Soc. B 55:25–37.

Brooks, S. P. (1998). Markov Chain Monte Carlo and its applications. Statistician 47:69–100.

Burton, P. R., Tiller, K. J., Gurrin, L. C., Cookson, W. O. C. M., Musk, A. W., and Palmer, L. J. (1999). Genetic variance components analysis of binary phenotypes using generalized linear mixed models and Gibbs sampling. Genet. Epidemiol. 17:118–140.

Clayton, D. G. (1999). Linear mixed models. In W. R. Gilks, S. Richardson, and D. Spiegelhalter (eds.), Markov Chain Monte Carlo in practice. London: Chapman and Hall.

Creutz, M. (1979). Confinement and the critical dimensionality of space-time. Physics Rev. Lett. 43:553–556.

Do, K. A., Broom, B. M., Kuhnert, P., Duffy, D. L., Todorov, A. A., Treloar, S. A., and Martin, N. G. (2000). Genetic analysis of the age at menopause by using estimating equations and Bayesian random effects models. Stat. Med. 19:1217–1235.

Eaves, L. J., Last, K. A., Young, P. A., and Martin, N. G. (1978). Model-fitting approaches to the analysis of human behavior. Heredity 41:249–320.

Eaves, L. J., Martin, N. G., Heath, A. C., and Kendler, K. S. (1987). Testing genetic models for multiple symptoms: An application to the genetic analysis of liability to depression. Beh. Genet. 17:331–341.

Fulker, D. W. (1973). A biometrical genetic approach to intelligence and schizophrenia. Soc. Biol. 20:266–275.

Gelfand, A. E., and Smith, A. F. M. (1990). Sampling based approaches to calculating marginal densities. J. Am. Stat. Assoc. 85:398–409.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995). Bayesian data analysis. New York: Chapman and Hall.

Geman, S., and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. Inst. Electrical Electronics Eng. Trans. Pattern Analysis Machine Intelligence 6:721–741.

Gilks, W. R., Richardson, S., and Spiegelhalter, D. (1996). Markov Chain Monte Carlo in practice. London: Chapman and Hall.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97–109.

Jinks, J. L., and Fulker, D. W. (1970). Comparison of the biometrical genetical, MAVA and classical approaches to the analysis of human behavior. Psychol. Bull. 73:311–349.

Lindstrom, M. J., and Bates, D. M. (1990). Nonlinear mixed effects models for repeated measures data. Biometrics 46:673–687.

Martin, N. G., and Eaves, L. J. (1977). The genetical analysis of covariance structure. Heredity 38:79–95.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953). Equations of state calculations by fast computing machines. J. Chem. Physics 21:1087–1092.

Meyer, J. M., and Eaves, L. J. (1988). Estimating genetic parameters of survival distributions. Genet. Epidemiol. 5:265–275.

Meyer, J. M., Eaves, L. J., Heath, A. C., and Martin, N. G. (1991). Estimating genetic influences on age at menarche: a survival analysis approach. Am. J. Med. Genet. 39:148–154.

Molenaar, P. C. M., Boomsma, D. I., Neeleman, D., and Dolan, C. V. (1990). Using factor scores to detect G × E interactive origin of ‘pure’ genetic or environmental factors obtained in genetic covariance structure analysis. Genet. Epidemiol. 7:83–100.

Neale, M. C., Boker, S. M., Xie, G., and Maes, H. H. (1999). Mx: Statistical modeling. 5th ed. Department of Psychiatry. Box 126 VCU, Richmond, VA 23298.

Neale, M. C., and Kendler, K. S. (1995). Models of comorbidity for multifactorial disorders. Am. J. Hum. Genet. 57:935–953.

Ripley, B. D. (1979). Algorithm AS 137: Simulating spatial patterns: Dependent samples from a multivariate density. Appl. Statist. 28:109–112.

Shoemaker, J. S., Painter, I. S., and Weir, B. S. (1999). Bayesian statistics in genetics—a guide for the uninitiated. Trends Genet. 15:354–358.

Sivia, D. S. (1996). Data analysis: a Bayesian tutorial. Oxford: Oxford University Press.

Spiegelhalter, D. J., Thomas, A., and Best, N. G. (1999). WinBugs version 1.2 User Manual. Cambridge: MRC Biostatistics Unit.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002). Bayesian measures of model complexity and fit. J. R. Stat. Soc. Ser. B 64:1–34.

Thomas and Gauderman (1996). Gibbs sampling methods in genetics. In W. R. Gilks, S. Richardson, and D. Spiegelhalter (eds.), Markov Chain Monte Carlo in Practice. London: Chapman and Hall.

Tierney, L. (1994). Markov chains for exploring posterior distributions (with Discussion). Ann. Stat. 22:1701–1762.

Zeger, S. L., and Karim, M. R. (1991). Generalized linear models with random effects: A Gibbs sampling approach. J. Am. Stat. Assoc. 86:79–86.

Edited by Stacey Cherny
