Effects of Genetic and Environmental Factors on Trait ... in incorporating eQTL data into a systems...
Transcript of Effects of Genetic and Environmental Factors on Trait ... in incorporating eQTL data into a systems...
1
Effects of Genetic and Environmental Factors on Trait Network Predictions from
Quantitative Trait Locus Data
David L. Remington
University of North Carolina at Greensboro, Greensboro, North Carolina 27402
Genetics: Published Articles Ahead of Print, published on January 12, 2009 as 10.1534/genetics.108.092668
2
Running head: Factors affecting QTL functional prediction
Keywords: quantitative trait locus, structural equation model, functional prediction, trait
network, common environment
Corresponding author: David L. Remington, Department of Biology, 312 Eberhart Bldg.,
University of North Carolina at Greensboro, P.O. Box 26170, Greensboro, NC 27402-6170. tel:
336-334-4967. fax: 336-334-5839. e-mail: [email protected]
3
ABSTRACT
The use of high-throughput genomic techniques to map gene expression quantitative trait
loci (eQTLs) has spurred the development of path analysis approaches for predicting functional
networks linking genes and natural trait variation. The goal of this study was to test whether
potentially confounding factors, including effects of common environment and genes not
included in path models, affect predictions of cause-effect relationships among traits generated
by QTL path analyses. Structural equation modeling (SEM) was used to test simple QTL-trait
networks under different regulatory scenarios involving direct and indirect effects. SEM
identified the correct models under simple scenarios, but when common environmental effects
were simulated in conjunction with direct QTL effects on traits, they were poorly distinguished
from indirect effects, leading to false support for indirect models. Application of SEM to
loblolly pine QTL data provided support for biologically plausible a priori hypotheses of QTL
mechanisms affecting height and diameter growth. However, some biologically implausible
models were also well-supported. The results emphasize the need to include any available
functional information, including predictions for genetic and environmental correlations, to
develop plausible models if biologically useful trait network predictions are to be made.
4
The emergence of genome technologies provides unprecedented opportunities to dissect
the networks of gene regulation, expression, functions, and interactions by which genes affect
phenotypes. The field of systems biology arising from these opportunities has been devoted
primarily to understanding constitutive processes through investigations and modeling of
phenotypic effects of knockout mutants, but recent interests have expanded into understanding
how natural genetic variation affects phenotypic variability (BENFEY and MITCHELL-OLDS 2008;
FEDER and MITCHELL-OLDS 2003; KEURENTJES et al. 2008; KOONIN and WOLF 2008). Genetic
mapping of quantitative trait loci (QTLs) in segregating families has become a key tool for
uncovering the genetic architecture of natural trait variation and ultimately identifying the
underlying genes over the last two decades (MACKAY 2001; MACKAY 2004). However, the use
of high-throughput techniques has expanded the definition of “quantitative trait” to encompass
transcript levels assayed on a genome-wide scale, leading to expression QTL (eQTL) mapping
(BREM et al. 2002; KEURENTJES et al. 2007; SCHADT et al. 2005), and similar approaches are
being used to map QTLs associated with metabolomic variation (KEURENTJES et al. 2006; LISEC
et al. 2008).
Interest in incorporating eQTL data into a systems biology framework has spawned the
development of new analytical approaches for predicting functional networks linking genes and
traits using multiple-trait QTL data. Basic methods for combined analysis of traditional
multiple-trait QTLs, such as mapping QTLs for eigenvectors defined by principal components
analysis (PCA; BREWER et al. 2007; LANGLADE et al. 2005) and multiple-trait QTL analysis
(HALL et al. 2006; JIANG and ZENG 1995; RAUH et al. 2002), may be useful for detecting
pleiotropic (or closely linked) QTLs and providing insights on the nature of multiple-trait QTL
effects, but yield little information for causal inference. One proposed method for identifying
5
key regulators of gene expression variation uses average expression levels of gene regulatory
networks identified a priori as composite traits for eQTL analysis (KLIEBENSTEIN et al. 2006).
In another approach, expression patterns of genes located within eQTL regions shared among
genes in known pathways are compared to those of the genes in the pathway to identify likely
pathway regulators (KEURENTJES et al. 2007).
In contrast with the above approaches, DOSS et al. (2005) suggested using patterns of
overall vs. residual correlations of gene expression in eQTL studies, which differ among tightly
linked genes for pairs with shared vs. independent regulatory mechanisms, in combination with
path analysis to make causative predictions about gene regulation. Inference methods based on
this reasoning have been proposed for de novo prediction of causal relationships in gene
regulatory pathways (LIU et al. 2008; ZHU et al. 2004), between gene regulation processes and
disease (SCHADT et al. 2005), and among sets of functionally related complex morphological and
metabolic phenotypes (LI et al. 2006). Various models of network regulatory structure, in which
sets of QTLs control variation in multiple traits directly or indirectly through cause-effect
relationships, are tested with genetic marker and trait data to identify the models best supported
by the data. Statistical approaches include Bayesian networks (ZHU et al. 2004), likelihood
methods to evaluate conditional trait correlations for their support under alternative regulatory
model configurations (NETO et al. 2008; SCHADT et al. 2005), and structural equation modeling
to compare the fit of different regulatory path models to the multiple-trait covariance structure
(LI et al. 2006; LIU et al. 2008). These techniques use information from residual trait
correlations that is normally neglected in conventional QTL analysis, with the exception of some
multiple-trait QTL analysis techniques (JIANG and ZENG 1995) where it is used to increase the
power to detect joint QTLs but not for functional prediction.
6
Structural equation modeling (SEM) is well-suited to causal network prediction because
it can accommodate a large variety of model structures including feedback relationships (LIU et
al. 2008) and is adaptable to various modes of exploratory and confirmatory analyses of
biological models (LI et al. 2006; LIU et al. 2008; STEIN et al. 2003; TONSOR and SCHEINER
2007). Moreover, SEM is supported by software packages such as SAS/STAT (SAS INSTITUTE
INC. 2002) and AMOS (AMOS DEVELOPMENT CORPORATION 2006). Path analysis has long been
recognized as potentially valuable but probably underutilized tool for testing causative rather
than merely correlative relationships in quantitative genetic data (LYNCH and WALSH 1998;
WRIGHT 1921). Nevertheless, some important issues must be considered in evaluating the
reliability of SEM and other path analysis methodologies for inferring trait networks from QTL
data. For one thing, cause-effect relationships between traits, environmental effects, and effects
of other QTLs can all generate within-genotype trait correlations (DOSS et al. 2005; SCHADT et
al. 2005; STEIN et al. 2003). Can path analysis correctly distinguish these potentially
confounding effects? Secondly, how valid are the results of path analysis for exploring the
likelihood of different model structures (i.e. different directions of cause-effect relationships
among traits) rather than merely testing for the presence of particular connections within an
established structure or estimating path coefficients for a particular model? Third, the true
regulatory networks shaping patterns of trait variation may be much more complex than the
tested models, including feedback loops or unobserved traits and QTLs (LIU et al. 2008; SCHADT
et al. 2005). Under these circumstances, will poor support for overly simple models provide
reliable clues to the existence of a more complex network structure?
The goal of this study was to test the causal inferences from structural equation models
under a set of relatively simple QTL regulatory scenarios. Specifically, I wanted to test whether
7
models could correctly distinguish direct vs. indirect mechanisms of QTL effects on a pair of
traits, and test the effects of incomplete identification of QTLs, common environmental effects,
and direct regulation of unobserved traits on model predictions. To do this, I simulated multiple
QTL data sets using a backcross design involving three QTLs jointly affecting two traits in
various direct and indirect modes of action. SEMs representing different models were tested for
their capability to distinguish the correct functional scenario from incorrect scenarios in each
case. For the sake of consistent comparisons, similar locations for each QTL and similar sets of
effect sizes on the observed traits were simulated under all of the initial scenarios, a constraint
that was later relaxed. Finally, I used SEM to evaluate data from a loblolly pine (Pinus taeda L.)
QTL study of diameter and height growth, for which biologically reasonable predictions could be
made about models of joint genetic effects on the set of traits.
MATERIALS AND METHODS
QTL simulations and detection: All data sets were initially simulated using QTL
Cartographer version 1.17 (BASTEN et al. 1994; BASTEN et al. 2004) and modified as needed to
incorporate indirect QTL effects. The simulated genome using the Rmap routine consisted of
five chromosomes, each with 8 markers spaced 15 cM apart starting at position 0 cM, generated
using the Kosambi map function. Three QTLs were simulated using Rqtl at the following
locations: chromosome 1, 50 cM (Q1); chromosome 2, 40 cM (Q2); and chromosome 3, 60 cM
(Q3). Thus, the first two simulated QTLs were in intervals between markers and the third was at
a marker position. Environmental variance was set to keep the total phenotypic variance close to
1.0 so all effect sizes would be approximately in phenotypic standard deviation units. Effects of
8
all QTLs were strictly additive, with simulated allele substitution effects (a) after all
modifications on traits T1 and T2, respectively, of 0.7 and 0.5 for Q1, and 0.5 and 0.35 for Q2
and Q3. Thus, a for each QTL was uniformly greater on T1 than on T2 by a factor of
approximately 2 and the simulated variance explained by each QTL for T1 was very close to
double that for T2. Backcross families of 500 progeny segregating for the genetic markers and
three QTLs in the two traits were simulated in QTL Cartographer using Rcross.
Direct-effects scenario: Under a scenario of direct effects, the simulated QTLs directly
and independently affect both traits (Figure 1a). All marker and trait data were directly
simulated in QTL Cartographer as described above.
Indirect-effects scenario: Under the indirect (or cause-effect) scenario, the simulated
QTLs directly affect only trait 1, which in turn affects trait 2 (Figure 1b). Only the data for t1
was simulated in QTL Cartographer with the effect sizes as described above for the basic model.
Data for t2 were simulated using the equation:
iii ztt 22
122
2 += , (1)
where t1i and t2i are the values for T1 and T2, respectively, in progeny i, and zi is a standard
normal random variate generated using the RANNOR command in SAS version 9 (SAS
INSTITUTE INC. 2002). Thus, half the variance in T2 is explained by the value of T1, yielding
expected substitution effects for Q1, Q2, and Q3 of approximately 0.5, 0.35, and 0.35,
respectively, in T2, similar to the effect sizes under the direct effects model. A new marker-trait
input file was then created for Rcross that included the simulated values for both traits.
Latent-direct scenario: In this scenario, the three QTLs are assumed to have direct
effects on an unobserved trait T0, which in turn has independent direct effects on T1 and T2
(Figure 1c). Progeny values for T0 were simulated in Rcross using substitution effects of 1.0,
9
0.7, and 0.7 for the three QTLs. Progeny values for T1 and T2 were then simulated using the
equations:
iii ztt 122
022
1 += , (2a)
and
iii ztt 223
021
2 += , (2b)
with T0 explaining half of the variance in T1 and one quarter of the variance in T2, where z1i and
z2i are standard normal variates. This resulted in substitution effects for the three QTLs on the
two traits that were similar to those described above for the base model. The generated values
for t1 and t2 were then used as input into Rcross.
Direct-effects common-environment scenarios: In these scenarios, the simulated QTLs
directly and independently affected the two traits in similar fashion to the direct effects model,
but a common environmental effect on the two traits was included (Figure 1d). To generate
simulated data sets under this scenario, two traits T1* and T2* were first simulated using Rqtl and
Rcross, with Q1, Q2, and Q3 having substitution effects of 1.0, 0.7, and 0.7, respectively, on T1*
and 0.7, 0.5, and 0.5 on T2*. Three vectors of standard normal variates, zC, z1, and z2, were then
generated to represent common, T1-specific, and T2-specific environmental effects, respectively.
Progeny values for T1 and T2 were then generated using the equations:
iCiii czbztt 1*12
21 ++= , (3a)
and
iCiii czbztt 2*22
22 ++= . (3b)
Values of b and c were chosen so that b2 + c2 = 0.5, and were varied in different simulated data
sets to adjust the strength of the common environmental effect. Values of 22=b and c = 0 were
10
used to simulate strong environmental correlation, and 15.0=b and 35.0=c were used to
generate a weaker environmental correlation. The resulting substitution effects of QTLs on the
two traits were thus similar to those of the base model. The values for T1 and T2 were then
substituted for the original T1* and T2* values as input into QTL Cartographer.
Heterogeneous-effects common-environment scenarios: These scenarios were similar to
the direct-effects common-environment scenarios and data were generated using the same
approach, except that the assumption of a homogeneous ratio of relative QTL effects on the two
traits was relaxed (Figure 1e). Values for T1* and T2* were changed to result in substitution
effects for Q1, Q2, and Q3 of 0.7, 0.5, and 0.3 for T1, and 0.5, 0.5, and 0.5 for T2. Both strong
and weak environmental correlations were simulated in different data sets.
F2 direct-effects scenario: A model with direct and independent effects on the two traits
was also used to simulate an F2 family of 500 progeny. QTL locations and additive substitution
effects on the two traits were identical to the basic model described above, and a dominance
effect was also included such that the simulated d/a ratio for both traits was 22 for Q1, 1.0 for
Q2, and 0 for Q3.
QTL detection and structural equation modeling: QTL effects and locations were
estimated by maximum-likelihood composite interval mapping in QTL Cartographer, using
model 6 in Zmapqtl. Prior to running Zmapqtl, background markers were fit using forward plus
backward stepwise regression using a p-value threshold of 0.05 in SRmapqtl to control for
background effects including those of other QTLs when conducting the QTL scans. In Zmapqtl,
a 2 cM walking speed was used to scan the genome for QTL effects, a window size of 10 cM
around each tested position was used to control for QTL effects outside the current marker
interval, and up to 5 of the markers selected in SRmapqtl were used to control for background
11
effects. For backcross data sets, an empirical genome-wide α = 0.05 likelihood ratio test statistic
threshold of 11.28 was estimated from 1000 permutations of simulated data set 1 for the direct
effects scenario. For F2 data sets, an empirical threshold of 14.37 was similarly estimated from
permutation tests of one of the two F2 simulations. Finally, multiple-trait composite interval
mapping was done using model 6 in JZmapqtl to estimate joint QTL effects on the two traits.
The positions with the peak multiple-trait likelihood ratios when comparing the full and reduced
(a = 0) models were used as the QTL locations in the subsequent SEM analysis in place of the
“true” simulated locations. If the detected multiple-trait likelihood peaks were more than 3 cM
outside the range defined by the single-trait likelihood peaks identified using Zmapqtl, a location
closer to the range of the detected single-trait peaks was chosen to represent the QTL.
“Virtual markers” were generated for the likelihood ratio peak locations of each detected
QTL. For backcross progeny sets, the virtual marker score was the probability of inheriting a
donor parent allele from the heterozygous parent, given the flanking marker genotypes (coded as
1 for a heterozygote and 0 for a recurrent parent homozygote) and recombination frequencies
with the respective markers. For F2 progeny, additive genotypic coefficients at flanking markers
were scored as 1, 0, and -1 for parent 1 homozygotes, heterozygotes, and parent 2 homozygotes,
respectively. Dominance coefficients were scored as 0.5 and -0.5 for heterozygotes and
homozygotes, respectively. The additive and dominance scores for each virtual marker were
calculated as the expected values of the additive and dominance genotypic coefficients at the
estimated QTL position, given the flanking marker scores and recombination frequencies with
each marker. The MultiRegress module in QTL Cartographer was used to calculate the additive
and dominance scores for the F2 datasets
12
SEM was conducted using PROC CALIS in SAS version 9 (SAS INSTITUTE INC. 2002).
SEMs were implemented by solving a set of linear equations for each tested model using
maximum-likelihood estimation, with each equation describing a trait value (an endogenous
variable) as a linear function of variables directly “upstream”, which could be virtual marker
scores at QTL positions and/or another trait. In F2 designs, additive and dominance scores at
each virtual marker were included as separate variables. Virtual markers for all QTLs significant
at the empirical threshold for at least one of the two traits were included as variables in the SEM.
All simulated QTLs were detected with genome-wide significance for at least one, and generally
both, traits. In all cases in which a simulated QTL lacked genome-wide significance for a trait, a
likelihood ratio peak with at least comparisonwise significance was present. In several simulated
data sets, significant effects were also detected at loci that did not correspond to simulated QTL
locations, and these were also included in the models. The coefficients of the virtual marker
scores in the linear equations represent the maximum-likelihood estimates of direct QTL effects
on the traits. This approach differs from the mixture model maximum-likelihood estimation used
for interval mapping in QTL Cartographer. However, in models in which the only explanatory
variables were direct QTL effects on the observed traits, the estimated QTL effects from SEM
were very similar to those obtained with QTL Cartographer. When common environmental
effects were included in models, they were incorporated either as unobserved variables (known
as latent variables) affecting multiple traits or as a covariance between the error terms for the
traits, with both approaches producing the same results. The SEM methodology described above
was implemented in PROC CALIS by using Lineqs statements.
Analyses of all data sets included: (a) a full model with direct effects of all three QTLs
on both T1 and T2 and indirect effects of T1 on T2 (Fig. 1f); (b) a direct-effects model
13
corresponding to Figure 1a; (c) an indirect-effects model corresponding to Figure 1b; and (d) a
reversed indirect-effects model, corresponding to Figure 1b but with T2 downstream of T1. The
full model under both backcross and F2 scenarios could not be tested because it was fully
specified (i.e. had zero degrees of freedom), and thus the solutions resulted in a perfect fit to the
covariance structure of the marker-trait data (see also NETO et al. 2008). Alternative fully-
specified models with the same sets of observed (manifest) variables but different path structures
also have a statistically equivalent perfect fit to the data. Including the full model in the
analyses, however, provided unique estimates of the relative contributions of the direct and
indirect paths to the values of T2, given the model structure. A model corresponding to the
direct-effects common-environment scenario (Figure 1d) could not be tested under the simulated
conditions. This model has no degrees of freedom if all QTLs are included as effects for both
traits, and thus the maximum-likelihood solutions perfectly fit the data as in the full model.
Models corresponding to the latent direct scenario (Figure 1c, with the unobserved T0 included
in the model as a latent variable) also could not be validly tested. Due to the path structure in
which T0 is directly connected to all explanatory QTLs and observed traits, the latent variable
can be fit optimally to be highly correlated to the upstream trait in an indirect scenario or to
characterize the correlated QTL effects on the two traits in a direct scenario. Consequently, a
latent direct model fit the data better than both the direct and indirect models under all of the
simulated scenarios and is uninformative about the true model.
Structural equation models were evaluated and compared using three criteria. (1) The
likelihood ratio (LR) test statistic (LRT) is -2 times the log likelihood ratio for the data under the
tested model vs. a fully-specified model. All tested models are nested within a fully-specified
model (either the full model described above or a statistically equivalent model with a different
14
path structure), so the LR test statistic has an asymptotic χ2 distribution based on the number of
degrees of freedom in the tested model. A significant p-value (p < 0.05) represents a failure of
the tested model to explain the covariance structure of the data fully (i.e. insufficiency of the
model), and insignificant p-values identify models that adequately fit the data (MITCHELL 1992).
(2) Standard and adjusted goodness-of-fit indices (GFI and AGFI, respectively) estimate model
fit based on the residual errors, with the latter adjusted for degrees of freedom. Thus, GFI may
be considered analogous to the R2 in a linear model and AGFI analogous to an adjusted R2.
While both measures should produce values between 0 and 1, poorly-fitting models with small
degrees of freedom may result in negative AGFI (PROC CALIS documentation in SAS
INSTITUTE INC. 2002). (3) Akaike’s Information Criterion (AIC; AKAIKE 1974; AKAIKE 1987)
reduces the LRT statistic by twice the number of degrees of freedom in the model, allowing
comparison of non-nested models. The model with the lowest AIC is favored.
Loblolly pine QTL data collection and analysis: Year 8 height and diameter data were
collected on 146 surviving selfed progeny of a single loblolly pine parental clone from a
previously-reported genetic mapping study of inbreeding depression, growing in a South
Carolina Forestry Commission field site near Tillman, SC (REMINGTON and O'MALLEY 2000a;
REMINGTON and O'MALLEY 2000b). Malformed trees for which diameters and heights could not
be measured conventionally were assigned heights of 1.0 m and diameters of 1.0 cm when
severely dwarfed, and heights of 2.0 m and diameters of 3.0 cm. for larger bent-over or multiple-
stemmed trees, which the field measurement crew determined to be reasonable estimates.
QTL mapping was done using the 8-year field data, previously-scored AFLP
marker data (REMINGTON and O'MALLEY 2000a; REMINGTON and O'MALLEY 2000b), and a
genetic map developed from outcross progeny of the same parental clone (REMINGTON et al.
15
1999). Composite interval mapping was performed using QTL Cartographer as described above
for simulated data. Trait data for year 3 height from the study of REMINGTON and O'MALLEY
(2000A) were also included, and height growth between year 3 and year 8 (year 3-8 growth) was
calculated from the difference between the year 8 and year 3 height measurements. Permutation
tests on year 8 height data established an extremely high empirical genome-wide α = 0.05 LR test
statistic threshold of 55.76 for year 8 height, probably because the trait data were bimodally
distributed due to very large phenotypic effects of the QTL associated with the cad locus in year
8. I therefore used for all traits the empirical genome-wide α = 0.05 LR test statistic threshold of
16.42 estimated for earlier height growth measurements in REMINGTON and O'MALLEY (2000A).
I also included two suggestive QTL peaks for year 3 height with lower test statistic values
because they corresponded to significant QTL regions for annual growth in either year 2 or year
3. For the purposes of this study, using this lower threshold is actually conservative, since not
including all true QTL effects in the model introduces residual trait correlations that lead to bias
in the SEM results. Virtual markers were developed for single-trait or multiple-trait QTL
likelihood peak locations estimated from Zmapqtl or JZMapqtl and evaluated by SEM in SAS
PROC CALIS as described above for simulated data.
In evaluating SEMs for year 3 height and year 3-8 growth and for year 8 height and
diameter, QTL regions were included as direct effects for each trait unless there was no evidence
at all of contribution to one of the traits (i.e. no likelihood ratio peak in the region exceeding the
nominal comparison-wise significance level of 5.99 and with homozygous effects in the same
direction as those of the other traits). Excluding non-contributing QTL effects provided
additional degrees of freedom that allowed testing of models that included various combinations
of direct, indirect, and correlated-environment effects on traits. This opportunity did not exist
16
with the simulated data, since in nearly every case all QTLs were either significant at a genome-
wide level or suggestive at a comparisonwise level for both traits. In models with indirect
effects, QTL regions contributing only to the downstream trait were included as effects on that
trait.
RESULTS
Backcross simulations and SEM testing: Two simulated backcross data sets were
tested against structural equation models (SEMs) for each causal scenario to test the hypotheses
(a) that the correct model would have the best fit to the simulated data sets using AIC as a
criterion and (b) that the correct model would adequately explain the simulated data (likelihood
ratio p-value ≥ 0.05). All simulations consisted of three QTLs (Q1, Q2 and Q3) affecting two
observed traits (T1 and T2). The three QTLs were simulated to have homogeneous effects ratios
on the two traits except as noted. Simulation results are summarized in Table 1.
Direct-effects scenario: The first pair of simulations was generated under the direct
scenario (Figure 1a), in which the three QTLs each have direct effects on both T1 and T2. In
both simulated data sets, the direct model correctly outperformed the indirect model in which T2
was downstream of T1 (Figure 1b), with the two simulations resulting in AIC values of 2.7627
vs. 18.6995 and 3.6130 vs. 16.7255, respectively, for the direct vs. indirect models, or AIC
differences of 15.94 and 13.12 between the models. The direct model outperformed the reversed
indirect model (T1 downstream of T2) by even larger margins. In both simulations, however, the
direct model was weakly rejected as a sufficient explanation for the data (LRT p-values of 0.029
and 0.018, respectively, and modestly positive AIC values relative to a fully-specified model)
17
while the indirect model was strongly rejected (p < 0.0001). In the full model in which T2 was
affected by the QTLs via both direct and indirect paths, a substantial proportion of the effect on
T2 (0.165 and 0.196, respectively) was inferred to be indirect, exerted through T1.
Single QTL in SEM: I also tested the first of the two direct model simulations with three
SEMs that each included only one of the predicted QTLs. Not including all true QTL effects in
the model can be expected to result in residual correlations between the traits, confounding the
“signal” used by SEM to distinguish direct and indirect models. Consistent with this prediction,
the direct model was strongly rejected and the indirect model had higher AIC values in each case
when only one of the three QTLs was included (Table 2). In one case, when only Q3 was
included in the model, the indirect model was found to fit the data adequately (p = 0.1253).
Indirect scenario: When simulations were done under the indirect scenario (Figure 1b),
the indirect model adequately fit the data, with insignificant p-values and negative AIC values,
and correctly performed far better than either the direct model or the indirect model with the trait
effects reversed from the simulated scenario (T2 upstream of T1). In the full model, the
proportion of the effect on T2 inferred to be via the indirect path was greater than 0.98 in both
simulations.
Latent direct scenario: A more complex but potentially realistic scenario was
represented by a pair of simulations in which the QTLs affected an unobserved trait, which in
turn had independent direct effects on T1 and T2 (Figure 1c). The latent direct scenario could
correspond to a situation in which genetic variation in the amount or activity of a transcriptional
regulator affects expression of downstream genes or other observed traits, or any other situation
in which the direct QTL effects are on a trait that is not readily observed. In these simulations,
both the direct and indirect models were rejected, with the direct model showing much poorer fit
18
than the indirect model with T2 downstream but better than the reversed indirect model with T1
downstream. In the full model, proportions of the effects on T2 inferred to come through the
indirect pathway were 0.70 and 0.84 respectively in the two simulations. A SEM corresponding
to the latent direct model is uninformative about the correct scenario due to the model structure,
and thus could not legitimately be tested (see Materials and Methods).
Common-environment direct scenario: The next scenario was one in which T1 and T2
were directly affected by the QTLs as in the direct scenario, but in which common environmental
effects on T1 and T2 were included (CE direct scenario; Figure 1d). This corresponds to a
highly realistic situation in which an environmental variable such as nutrient availability may
jointly affect multiple aspects of growth and development. In the first of these simulations,
common environmental effects were simulated to be strong, explaining 50% of the variance in
T1 and T2. Under this scenario, not only was the indirect model incorrectly favored but it
adequately fit the data, while the direct model and the reversed indirect model were strongly
rejected. In the full model, essentially all of the effect on T2 was inferred to come through the
indirect pathway. In a second simulation, in which the common environmental effect explained
only 15% of the phenotypic variance in both traits, all three of these models were strongly
rejected but the indirect model showed by far the best overall fit. In the full model, the
proportion of the effect on T2 attributed to the indirect path was 0.73.
CE direct scenario with relaxed homogeneity assumptions: There is no particular reason
to expect multiple QTLs to have a homogeneous ratio of effects on multiple traits under a direct-
effects scenario. Thus, I generated additional data sets under the common environmental effects
scenario in which the relative effects of the three QTLs on the two traits were different, to test
whether the direct model would perform better than the indirect model under these relaxed
19
assumptions. Under the CE heterogeneous-direct scenarios, Q1, Q2, and Q3 had larger, equal,
and smaller effects, respectively, on T1 relative to T2, with all three QTLs simulated to have
equal effects on T2. Two data sets were simulated under this scenario, one with strong
correlated environmental effects and the second with weaker correlated effects.
Under both CE heterogeneous-direct scenarios, the fit of the direct model and of both
indirect models were all strongly rejected (Table 1). When the correlated environmental effects
were strong, the indirect model performed best by AIC and adjusted goodness-of-fit criteria,
followed in order by the indirect-reversed and direct models. Under the full model, a proportion
greater than 0.92 of the effects on T2 were attributed to the indirect pathway. When the
correlated environmental effects were weaker, the direct model outperformed the other models
using AIC as the criterion, but the indirect model had the highest adjusted goodness-of-fit.
Under the full model, more than two-thirds of the effect on T2 was attributed to the direct
pathway.
F2 simulations under direct scenario: I also tested SEM performance on F2 progeny
sets simulated under the direct scenario. The additive effects of the three QTLs were simulated
to be the same as those in the backcross simulations. Dominance effects were also simulated,
with d/a ratios differing among QTLs but homogeneous between the two traits. The F2 SEMs
included independent additive and dominance effects for each detected QTL. The SEM results
were similar to those obtained for the direct scenario in the backcross simulations, with the direct
model strongly outperforming the indirect models but still failing as a sufficient explanation of
the data (Table 3). Under the full model, the proportions of the effects on T2 attributed to the
indirect pathway were 0.11 and 0.13, respectively in the two simulations, slightly less than in the
direct-scenario backcross simulations.
20
SEM predictions with loblolly pine growth QTL data: The F2 models were tested on
data from loblolly pine (Pinus taeda L.) progeny of a self-pollinated parent, originally planted as
part of a QTL study of inbreeding depression (REMINGTON and O'MALLEY 2000a; REMINGTON
and O'MALLEY 2000b). Surviving eight-year-old trees were scored for height and breast-height
diameter. The F2 progeny had previously been scored for a set of mapped AFLP markers
covering the loblolly pine genome (REMINGTON and O'MALLEY 2000b; REMINGTON et al. 1999)
and for annual height growth through the first three years after germination (REMINGTON and
O'MALLEY 2000a). Applying SEM to the year 3 and year 8 QTL data provided an opportunity to
address questions about the genetic basis for juvenile-mature correlations and plant allometry,
topics of interest to plant biologists and tree breeders. One hypothesis was that QTLs affecting
year 3 height would continue to affect subsequent height growth between years 3 and 8 (year 3-8
growth). A prediction of this hypothesis is that year 3 height QTLs would also have direct
effects on year 3-8 growth. However, year 3 height QTLs might also be expected to continue to
affect subsequent growth indirectly, through a carry-over effect of the growth trajectory
established in individual plants through year 3. A second hypothesis was that QTL effects on
year 8 diameter would largely be indirect effects of genetic variation for year 8 height rather than
vice versa, under the assumption that diameter growth in trees is controlled largely as an
allometric response to height growth. A height-diameter allometric exponent close to 1 (i.e. a
linear relationship between height and diameter) has been reported for relatively young trees
such as those in this study, so indirect effects of tree height QTLs on diameter would be expected
to be more or less linear (NIKLAS 1995). Common environmental effects would also be expected
to generate positive correlations between traits in both the year 3-8 and height-diameter analyses.
21
Four QTL regions that exceeded the likelihood ratio test statistic of 16.42 were identified
for eight-year height or diameter. Two QTL regions, LG1-14 and LG8-88, were significant for
both year 8 height and diameter. The LG8-88 region corresponds to the location of the
segregating cad-n1 mutation, which results in production of chemically altered lignin and
reduced wood stiffness, apparently interfering with the trees’ ability to maintain upright form and
sustained height growth (MACKAY et al. 1997; RALPH et al. 1997; REMINGTON and O'MALLEY
2000a). A third QTL region (LG9-08) was significant for year 8 height and strongly suggestive
for diameter (LRT = 16.24). Each of these regions had large additive effects with substantial
overdominance for both traits. A fourth QTL region (LG1-80) was significant for year 8 height,
with large symmetrically overdominant effects, but had no apparent effect on diameter (Table 4).
The height effect associated with LG1-80 may be an artifact of extreme segregation distortion, as
it is flanked by near-lethal recessive alleles linked in repulsion and has substantial deficits of
both homozygous genotypes, but this locus was included in the SEMs anyway to account for all
prospective QTL effects. The LG8-88 and LG1-14 QTL regions had similar effects on height at
earlier ages (REMINGTON and O'MALLEY 2000a). The LG9-08 region was also suggestive of
effects on year 3 growth, with a peak LRT value of 9.18 and additive and dominance effects in
the same direction as in year 8. Two QTL regions on linkage groups 12 (LG12-116) and 3
(LG3-03), which were significant for year 2 and year 3 growth, respectively, had nearly
significant effects on total year 3 height but showed no evidence of effects in year 8. QTL
effects for year 3-8 growth were very similar to those for year 8 height (Table 5).
In fitting SEMs for year 3 height and year 3-8 growth, only QTLs with significant or
suggestive effects were included as direct effects; thus, LG12-116 and LG3-03 were only
included in direct effects for year 3 height, and LG1-80 was only included for year 3-8 growth.
22
Similarly, in comparing year 8 height and diameter, LG1-80 was only included in direct effects
for year 8 height. In both cases, QTL regions with direct effects only on the downstream trait
were included in models with indirect effects. The additional degrees of freedom provided by
excluding non-contributing QTLs allowed testing of models that included various combinations
of direct, indirect, and correlated-environment effects on traits.
Effects on year 3 height and year 3-8 growth were poorly explained both by models with
only direct QTL effects and with only indirect effects of year 3 height on year 3-8 growth (Table
6). A model with both direct and indirect effects on year 3-8 growth (direct + indirect model)
adequately fit the data and had the lowest AIC value. Adding environmental correlation to the
direct + indirect model only marginally improved model fit but this direct + indirect + CE model
had a higher AIC value than the simpler direct + indirect model. A model with direct effects and
environmental correlation (direct + CE model) was marginally significant for lack of fit (p =
0.0344), and had slightly poorer fit and a slightly higher AIC value than the full model,
suggesting that adding indirect effects may explain a small amount of effect not explained by
direct and common environmental effects alone but not vice versa. Even with the additional
degrees of freedom available with this data set, however, there is very little information to
distinguish whether indirect effects or common environmental effects (or both) contribute to the
relationship between the two traits.
To evaluate the utility of SEM for distinguishing between different causal structures, I
also tested a set of biologically implausible models in which year 3 height was a downstream
effect of year 3-8 growth. One implausible model with year 3 height QTLs explained only as an
indirect effect of year 3-8 growth and another with both direct and indirect effects on year 3
height had only marginally significant lack of fit to the data (p = 0.0169 and p = 0.0256,
23
respectively), and the former had an AIC value within 1.0 of the best-fitting plausible direct +
indirect model.
By contrast, effects on year 8 height and diameter were adequately explained by a model
with only indirect effects of year 8 height on year 8 diameter (Table 7). Models that included
direct QTL effects on both traits in combination with indirect effects on either height or diameter
or correlated environmental effects also adequately fit the data, but had AIC values that were
more than 4.0 larger than the indirect-only model due to the additional degrees of freedom used
in fitting the model. Models with only direct QTL effects on both traits and with height
explained only by diameter failed to fit the data and had much higher AIC values. Thus, the data
are consistent with the prediction that diameter growth is largely an allometric response to height
growth.
DISCUSSION
For this study I used simulations to examine the ability of structural equation models to
distinguish between simple but realistic genetic regulatory scenarios for multiple-trait QTL
mapping data, including those with common environmental effects, incomplete QTL
information, and to examine the sensitivity of SEM to incorrect or incomplete model structures.
The consistent results from the replicate simulations in each case suggest that the patterns they
reveal are valid, and their implications should be taken into account when interpreting the results
of QTL-based causal inference studies.
The correct models were strongly favored using p-value and AIC criteria when data sets
were simulated under direct and indirect scenarios without confounding factors and all QTLs
24
were included in the model. However, the direct model provided only marginal fit to the
simulated data when it was the correct model, which is discussed in more detail below. The
simulated QTL effects were relatively small, with Q2 and Q3 explaining only ~3% of the
phenotypic variance for T2, the trait with smaller QTL effects, in the backcross design. Success
in identifying the correct model in these circumstances is consistent with the results of a larger
set of simulations by Schadt et al. (2005), who found that the power to detect the correct model
under similar circumstances was ~80% when QTLs explained as little as 1-2% of the variation in
the less-affected trait.
Confounding effects of incomplete QTL and trait detection: These simulations
clearly show that not including all relevant QTLs in the models can lead to incorrect functional
predictions with multiple-trait QTL data. Much of the “signal” for detecting indirect QTL
effects on traits in a network, whether the traits are gene expression or functional phenotypes,
comes from the degree and pattern of residual correlations between trait values within QTL
genotypic classes (DOSS et al. 2005; SCHADT et al. 2005; STEIN et al. 2003). In the absence of
common environmental effects, only the genetic component of trait variation in a pair of traits
should be correlated if a set of genes affects both traits independently. When QTL effects on one
trait are indirect effects of variation in an upstream trait, however, both genetic and
environmental effects on the upstream trait will affect the downstream trait in the same manner,
leading to similar correlations within and among QTL classes. If all QTLs that jointly affect a
set of traits are not included in the model, effects of the unobserved or excluded QTLs will also
result in residual correlations between traits. These residual correlations may generate false
support for indirect models, with the trait having larger standardized QTL effects appearing to be
upstream. Thus, analyses that test only a single QTL region for effects on a suite of polygenic
25
traits are virtually guaranteed to place the traits in a common regulatory pathway, whether such a
pathway exists or not. Even when all experimentally detected QTL regions are included in a
model, some actual QTLs are likely to be missed due to limited experimental power (BEAVIS
1994) and lead to correlated residuals. Testing of direct vs. indirect QTL path models has been
suggested as a test of close linkage vs. pleiotropy with shared QTL regions, since the former are
by definition independently regulated (DOSS et al. 2005; SCHADT et al. 2005). For this test to
identify instances of close linkage effectively, however, these results show that all other QTL
regions that jointly affect the two traits must be included in the model.
Unobserved traits can also affect model predictions if they occur at branching points
leading to observed traits, as in the latent-direct scenario. In the two latent-direct simulations,
neither the direct nor the indirect model was well-supported, providing some indication that the
true model was more complex, but these results are not distinguishable from those expected if
the true scenario were a combination of direct and indirect effects.
Distinguishing indirect vs. common environment effects: These results show that
shared environmental effects on a pair of traits may be difficult to distinguish from indirect
effects. Models combining direct effects with either indirect effects or with common
environmental effects on a pair of traits cannot be distinguished if all QTLs affect both traits;
both will lack residual degrees of freedom and have perfect fit. It should be noted that similar
indistinguishable submodels could potentially be nested within more complex trait networks
even when the overall model has residual degrees of freedom. Even if both model structures can
be tested, however, as with the loblolly pine year 3-8 analyses, there may be little information to
distinguish common environmental effects from indirect effects because both mechanisms will
generate trait correlations within QTL genotype classes. When common environment effects
26
were included in simulations of direct-effects scenarios, indirect models were more strongly
supported in most cases. Genes with smaller standardized genetic effects tended to be placed
downstream of those with larger genetic effects even when their genetic regulation by shared
QTLs was independent.
In biologically realistic situations, shared environmental effects are probably common
because each individual organism in a genetic mapping population is shaped by the unique
environment it has experienced and by the peculiarities of developmental “noise.” These
environmental and developmental effects are likely to have correlated effects on multiple traits
(LYNCH and WALSH 1998). The influence of common environment on residual trait correlations
is well-recognized (DOSS et al. 2005; SCHADT et al. 2005; STEIN et al. 2003) but the question of
whether or not path models can distinguish environmental correlations from causative indirect
effects has received little scrutiny.
The simulations presented here probably represent some of the more challenging
scenarios for distinguishing direct, indirect, and common environmental effects. In contrast with
these simulations, there is no expectation that multiple QTLs acting independently on a set of
traits would have homogeneous effects ratios, or even the same direction of effect, on the traits.
For example, five QTL regions were found to affect both corolla length and width in a cross
between perennial and annual populations of the monkeyflower Mimulus guttatus, with ratios of
standardized effects on the two traits ranging from 0.40 to 2.25 (HALL et al. 2006). In the same
study, one QTL region with joint effects on corolla tube length and reproductive allocation, traits
that lack an obvious mechanism for a cause-effect relationship, affected both traits in the same
direction while a second affected the two traits in opposite directions (HALL et al. 2006). In the
loblolly pine data set examined in the present study, only three of six QTL regions for year 3
27
height and year 3-8 growth had detectable effects on both traits, and for these three regions the
ratio of standardized homozygous effects on the two traits ranged from 0.44 to 1.09. By contrast,
three QTL regions affecting year 8 height and diameter (ignoring LG1-80, which had negligible
homozygous effects) had standardized effects ratios in a much narrower range of 1.16 to 1.37,
consistent with the SEM inference of indirect effects on diameter. When heterogeneous QTL
effects were simulated, the false support for the indirect model relative to the direct model due to
common environmental effects was weakened or even reversed. Moreover, genetic and
environmental correlations may be in opposite directions in actual trait networks, unlike those in
our simulations. For example, traits involved in life history trade-offs are likely to have negative
genetic correlations, but shared environmental effects (e.g. variation in nutritional status) may
generate positive correlations. In such situations, confounding of common environmental effects
with indirect effects would be unlikely and in some cases SEM results might even be biased
against indirect models.
Sensitivity of the direct models: In contrast with the effects of confounding factors that
favor indirect models, these data show that direct models are highly sensitive to any residual
correlation between trait values. In all four simulations under simple direct-effects scenarios
(two backcross and two F2 simulations), the correct direct model was rejected as an adequate
explanation of the data (LR p-values < 0.05) even though it outperformed the indirect model.
Concomitantly, the full models that included both direct and indirect effects estimated as much
as 20% of the effect on T2 to occur via the indirect pathway under the direct-effects scenarios,
even though none of the actual effect was indirect. The basis for this bias toward indirect effects
is uncertain. Imprecise estimates of the true QTL locations is one possibility, although replacing
the estimated QTL locations with the actual simulated locations in one of the simulations did not
28
improve the fit (data not shown). Imprecise estimates of the true QTL effects will result in
under- or over-fitting of the model to the true sources of effects, and could also result in residual
correlations. It is also possible that the residual values generated in the QTL Cartographer
simulations were not entirely random.
Regardless of the explanation for the lack of complete model fit in the simulated data,
actual data from directly-regulated trait sets can generally be expected to show even less
independence of residuals due to the pervasiveness of common environmental effects and
undetected QTLs discussed above. Given these realities, standard tests of model fit are
inappropriate for evaluating the sufficiency of direct-effects models. Empirical tests will be
necessary but not easy to implement, as permutation tests would have to randomize trait values
within QTL classes and handling recombinants will present special problems. Parametric
bootstrap simulations may be a better option, but the algorithms required to generate hundreds of
simulated datasets, estimate QTL locations and virtual marker scores, and run multiple SEMs on
each simulation will be complex. Effects of unobserved QTLs could be incorporated as
polygenic effects in a parametric bootstrap. Crude but potentially useful estimates of polygenic
variance could be obtained from the percentage of differences between parental lines not
explained by the identified QTLs, especially if the mapping population were derived from
divergent parental lines.
In spite of the factors causing bias against the direct model, direct effects were identified
as the largest contributor to QTL effects in the experimental data on loblolly pine year 3-8
growth. These results were consistent with a priori predictions of carryover effects of earlier
QTL action and possibly common environment effect as additional contributors to year 3-8
29
growth. In contrast, QTL effects on year 8 diameter were fully explained as an indirect effect of
year 8 height, also consistent with a priori predictions based on allometry theory.
Limitations of SEM for exploratory analysis: Results of this study underscore the
need for caution in using SEM in a purely exploratory mode in the absence of functional
hypotheses to guide the choice of model structure. While SEM analyzes causal models of the
covariance structure between traits, the adequate fit of a model does not by itself imply causation
(GRACE and PUGESEK 1998; MITCHELL 1992). In data sets simulated under an indirect scenario,
reversing the cause-effect relationship between the two traits resulted in substantially poorer
model fit, as expected. With the loblolly pine year 3 – year 8 height data, however, a
biologically implausible but simple model in which year 3-8 growth alone explained year 3
height QTLs fit nearly as well as the best-fitting plausible model, which required both direct and
indirect QTL effects to explain year 3-8 growth, and fit better than any of the other plausible
models tested. The most likely explanation is that the standardized effect of QTLs on year 3-8
growth is much larger than on year 3 height due to the very large effect of the LG8-88 QTL on
year 3-8 growth. In this example, the temporal relationship between the traits made the
distinction between plausible vs. implausible model structures obvious, but this will not be the
case in other situations. For example, in explaining genetic correlations between morphological
and physiological traits, reasonable biological models might involve direct or indirect genetic
effects on either set of traits. It is therefore important to include as much prior biological
information as possible to identify plausible networks, such as prior information on co-regulated
sets of genes or structure of known regulatory pathways in eQTL analyses (KEURENTJES et al.
2007; KLIEBENSTEIN et al. 2006; ZHU et al. 2008), developmental models describing temporal
and functional relationships among morphological and physiological traits (MÜNDERMANN et al.
30
2005), and the expected direction of correlations of common environmental effects. In this
context, SEM provides a powerful tool for testing the adequacy of causal hypotheses (MITCHELL
1992). Purely exploratory analyses may be useful in systems-biology applications for reducing
the set of model structures for further experimentation (or "sparsification", per LIU et al. 2008),
but the results of such analyses should not be interpreted as support for particular causal
structures.
Summary: Results of this study confirm that the residual correlation structure in
multiple-trait QTL data, which is typically unused in QTL analyses, provides a powerful but
limited source of information on the functional basis of trait variation. Factors likely to be
pervasive in real biological data, including common environmental effects and incomplete
identification of QTLs, can masquerade as indirect causal effects and lead to incorrect
conclusions if not carefully considered. Applications of SEM or other approaches to path
analysis for functional predictions from QTL data need to incorporate as much a priori
information as possible on the traits being studied, including the expected patterns of common
environmental effects on traits, if the predictions are to be biologically useful.
Further studies should be conducted to test the predictive power of SEM under scenarios
such as the ones used here. Other studies have used large numbers of simulations to examine the
power of network modeling approaches (LIU et al. 2008; NETO et al. 2008; SCHADT et al. 2005),
but have not explored the questions addressed in the present study. Issues with statistically
indistinguishable models incorporating common environment vs. indirect effects need to be
examined in more complex trait networks, where untestable fully-specified modules may be
nested within larger models. Future research needs also include evaluating different patterns of
31
direct genetic and common environmental effects, developing methods for empirical statistical
tests, and incorporating epistatic QTL effects.
I thank Ross Whetten and Elizabeth Lacey for assistance with year 8 data collection from the
Tillman, SC loblolly pine genetic mapping population; Rongling Wu for first suggesting to me
the use of path analysis approaches for functional prediction from QTL data; Elizabeth Lacey,
Sergey Nuzhdin, and David Aylor for valuable discussions; and Malcolm Schug, Rebecca
Doerge, and two anonymous reviewers for helpful comments to improve the manuscript.
32
TABLE 1
Summary of structural equation model analyses for backcross simulations
Simulated
scenario
T2
VI /VQ 1
Tested model GFI AGFI LR p-
value
AIC
Direct (set 1) 0.165 Direct 2 0.9962 0.9432 0.0291 2.7627
Indirect 3 0.9810 0.9052 <0.0001 18.6995
Indirect reversed 4 0.9332 0.6660 <0.0001 92.3938
Direct (set 2) 0.196 Direct 0.9955 0.9332 0.0178 3.6130
Indirect 0.9825 0.9125 <0.0001 16.7255
Indirect reversed 0.9408 0.7042 <0.0001 79.3354
Indirect (set 1) 0.989 Direct 0.8440 -1.3404 <0.0001 307.5001
Indirect 0.9955 0.9776 0.1297 -0.3460
Indirect reversed 0.9553 0.7767 <0.0001 56.0307
Indirect (set 2) 0.985 Direct 0.8880 -2.1365 <0.0001 288.6861
Indirect 0.9964 0.9800 0.2796 -3.7162
Indirect reversed 0.9804 0.8904 <0.0001 26.1377
Latent direct 0.844 Direct 0.9591 0.3861 <0.0001 54.2958
(set 1) Indirect 0.9907 0.9537 0.0081 5.7892
Indirect reversed 0.9381 0.6904 <0.0001 83.9911
Latent direct 0.696 Direct 0.9594 0.3903 <0.0001 53.8670
(set 2) Indirect 0.9810 0.9048 <0.0001 18.8166
Indirect reversed 0.9477 0.7387 <0.0001 68.0184
CE (strong) 0.999 Direct 0.8987 -1.1270 <0.0001 203.9092
33
direct Indirect 0.9992 0.9957 0.8750 -6.7810
Indirect reversed 0.9496 0.7357 <0.0001 78.4443
CE (weak) 0.724 Direct 0.9698 0.3665 <0.0001 46.8858
direct Indirect 0.9872 0.9327 0.0005 11.8183
Indirect reversed 0.9524 0.7500 <0.0001 73.0995
CE (strong) 0.925 Direct 0.8766 -0.8516 <0.0001 214.5460
heterogeneous Indirect 0.9815 0.9073 <0.0001 18.1409
direct Indirect reversed 0.9698 0.8492 <0.0001 34.3724
CE (weak) 0.325 Direct 0.9880 0.8194 <0.0001 13.4429
heterogeneous Indirect 0.9747 0.8736 <0.0001 27.4442
direct Indirect reversed 0.9595 0.7975 <0.0001 49.6496
1 Proportion of model variance in T2 (sum of squared standardized model path coefficients)
attributed to T1 in a fully-specified model including both direct QTL effects and T1 effects
on T2.
2 Direct model has 1 residual degree of freedom (df) in all backcross simulations.
3 Indirect models have 1 residual df per QTL in model.
4 Indirect model with T1 downstream of T2.
34
TABLE 2
Predictions for direct scenario (set 1) with only one QTL included in structural equation
model.
QTL included in model Tested model LR p-value AIC
Q1 Direct 0.0002 12.0834
Indirect 0.0104 4.5721
Q2 Indirect 0.0001 12.6641
Direct 0.0004 10.7011
Q3 Direct <0.0001 15.1621
Indirect 0.1253 0.3499
35
TABLE 3
Summary of structural equation model analyses for F2 simulations
Simulated
scenario
T2
VI /VQ 1
Tested model GFI AGFI LR p-
value
AIC
Direct (set 1) 0.105 Direct 0.9963 0.7948 0.0021 7.4309
Indirect 0.9721 0.8082 <0.0001 61.2901
Indirect reversed 0.9306 0.5229 <0.0001 216.8199
Direct (set 2) 0.134 Direct 0.9957 0.7624 0.0009 8.9432
Indirect 0.9760 0.8351 <0.0001 49.4271
Indirect reversed 0.9398 0.5863 <0.0001 176.5202
1 Proportion of model variance in T2 (sum of squared standardized model path coefficients)
attributed to T1 in a fully-specified model including both direct QTL effects and T1 effects
on T2.
2 Direct model has 1 residual df.
3 Indirect models have 2 residual df per QTL in model.
36
TABLE 4
QTL map locations and standardized effects for year 8 height and diameter in a self-
pollinated loblolly pine family
Year 8 height (cm) 1,2 Year 8 diameter (cm) 1,2
QTL
Peak
position
(cM)
LR
statistic 2a/SD d/SD
Peak
position
(cM)
LR
statistic 2a/SD d/SD
LG1-14 17.8 21.41 -0.91 0.77 15.8 18.15 -0.79 0.74
LG1-80 79.6 21.41 0.05 1.62 NA NA NA NA
LG8-88 91.4 93.75 -1.59 1.20 83.4 71.87 -1.33 1.17
LG9-08 7.7 20.57 0.80 0.58 7.7 16.24 0.59 0.55
1 Estimates derived using Zmapqtl in QTL Cartographer version 1.17 (BASTEN et al. 1994;
BASTEN et al. 2004), using model 6 (composite interval mapping) with window size 10 cM and
fitting of up to 5 background markers. LR statistics and parameter estimates are from hypothesis
tests comparing a full model with a and d both estimated to a null model (H3:H0).
2 Phenotypic standard deviation = 246.3 cm for year 8 height and 4.62 cm for year 8 diameter.
37
TABLE 5
QTL map locations and standardized effects for year 3 height and year 3-8 growth in a self-
pollinated loblolly pine family
Year 3 height (cm) 1 Year 3-8 growth (cm) 1
QTL
Peak
position
(cM)
LR
statistic 2a/SD d/SD
Peak
position
(cM)
LR
statistic 2a/SD d/SD
LG1-14 23.8 32.34 -1.11 0.81 13.8 23.05 -1.02 0.74
LG1-80 NA NA NA NA 79.6 18.15 -0.46 1.52
LG3-03 2.2 13.54 0.64 0.20 NA NA NA NA
LG8-88 87.4 36.94 -0.76 0.57 91.4 99.57 -1.72 1.25
LG9-08 3.4 9.18 0.51 0.25 7.7 18.62 0.80 0.55
LG12-116 116.1 12.68 -0.67 0.03 NA NA NA NA
1 Estimates derived using Zmapqtl in QTL Cartographer version 1.17 (BASTEN et al. 1994;
BASTEN et al. 2004), using model 6 (composite interval mapping) with window size 10 cM and
fitting of up to 5 background markers. LR statistics and parameter estimates are from hypothesis
tests comparing a full model with a and d both estimated to a null model (H3:H0).
38
TABLE 6
Summary of structural equation model analyses on loblolly pine year 3 height and year 3-8
height growth QTL data
Tested model Year 3-8
VI /VQ 1
df
GFI AGFI LR
p-value
AIC
Direct + indirect 0.453 6 0.9886 0.8006 0.0689 -0.2958
Direct + reversed indirect 2 6 0.9861 0.7566 0.0256 2.3834
Direct 7 0.9486 0.2286 <0.0001 53.6942
Direct + CE 6 0.9867 0.7671 0.0344 1.5998
Indirect 12 0.9346 0.4276 <0.0001 65.2851
Reversed indirect 12 0.9771 0.7995 0.0169 0.5877
Direct + indirect + CE 5 0.9895 0.7792 0.0558 0.7843
1 Proportion of model variance in year 3-8 growth (sum of squared standardized model path
coefficients) attributed to year 3 height in a fully-specified model including both direct QTL
effects and year 3 height effects on year 3-8 growth.
2 Indirect model with year 3 height downstream of year 3-8 growth.
39
TABLE 7
Summary of structural equation model analyses on loblolly year 8 height and diameter
QTL data
Tested model Diam
VI /VQ 1
df
GFI AGFI LR
p-value
AIC
Direct + indirect 0.981 2 0.9993 0.9802 0.7730 -3.4851
Direct + reversed indirect 2 2 0.9951 0.8643 0.1660 -0.4087
Direct 3 0.8978 -0.8739 <0.0001 114.4400
Direct + CE 2 0.9951 0.8642 0.1658 -0.4059
Indirect 8 0.9890 0.9245 0.4171 -7.8308
Reversed indirect 8 0.9513 0.6651 <0.0001 25.0491
1 Proportion of model variance in year 8 diameter (sum of squared standardized model path
coefficients) attributed to year 8 height in a fully-specified model including both direct QTL
effects and year 8 height effects on year 8 diameter.
2 Indirect model with year 8 height downstream of year 8 diameter.
40
Figure caption:
FIGURE 1.– Path models for scenarios simulated or modeled in this study: a. direct-effects
scenario; b. indirect scenario; c. latent direct scenario; d. direct-effects scenario with common
environment effects (direct CE scenario); e. direct CE scenario with heterogeneous QTL effect
ratios; f. full model including both direct effects and indirect effects on T2. In these diagrams,
Q1, Q2, and Q3 represent simulated QTLs, T1 and T2 represent observed traits, T0 represents an
unobserved (latent) trait, and EC represents common environmental effects.
41
a.
b.
c.
d.
e.
f.
Q1Q2Q3
T1
T2 EC
Q1Q2Q3
T1
T2
T0 Q1Q2Q3
T1
T2
T1
T2 EC
Q1
Q2
Q3
Q1Q2Q3
T2 T1
Q1Q2Q3
T1
T2
42
LITERATURE CITED
AKAIKE, H., 1974 A new look at the statistical identification model. IEEE T. Automat. Contr. 19:
716-723.
AKAIKE, H., 1987 Factor analysis and AIC. Psychometrika 52: 317-332.
AMOS DEVELOPMENT CORPORATION, 2006 Amos 7.0.0, pp. Amos Development Corporation,
Spring House, PA.
BASTEN, C. J., B. S. WEIR and Z.-B. ZENG, 1994 Zmap-a QTL cartographer., pp. 65-66 in 5th
World Congress on Genetics Applied to Livestock Production: Computing Strategies and
Software., edited by C. SMITH, J. S. GAVORA, B. BENKEL, J. CHESNAIS, W. FAIRFULL et
al. Organizing Committee, 5th World Congress on Genetics Applied to Livestock
Production., Guelph, Ontario, Canada.
BASTEN, C. J., B. S. WEIR and Z.-B. ZENG, 2004 QTL Cartographer, Version 1.17, pp.
Department of Statistics, North Carolina State University, Raleigh, NC.
BEAVIS, W. D., 1994 The power and deceit of QTL experiments: lessons from comparative QTL
studies. Proceedings of the 49th Annual Corn and Sorghum Industry Research
Conference, Chicago 49: 250-266.
BENFEY, P. N., and T. MITCHELL-OLDS, 2008 From genotype to phenotype: systems biology
meets natural variation. Science 320: 495-497.
BREM, R. B., G. YVERT, R. CLINTON and L. KRUGLYAK, 2002 Genetic dissection of
transcriptional regulation in budding yeast. Science 296: 752-755.
43
BREWER, M. T., J. B. MOYSEENKO, A. J. MONFORTE and E. VAN DER KNAAP, 2007
Morphological variation in tomato: a comprehensive study of quantitative trait loci
controlling fruit shape and development. J. Exp. Bot. 58: 1339-1349.
DOSS, S., E. E. SCHADT, T. A. DRAKE and A. J. LUSIS, 2005 Cis-acting expression quantitative
trait loci in mice. Genome Res. 15: 681-691.
FEDER, M. E., and T. MITCHELL-OLDS, 2003 Evolutionary and ecological functional genomics.
Nat. Rev. Genet. 4: 651-657.
GRACE, J. B., and B. H. PUGESEK, 1998 On the use of path analysis and related procedures for the
investigation of ecological problems. Am. Nat. 152: 151-159.
HALL, M. C., C. J. BASTEN and J. H. WILLIS, 2006 Pleiotropic quantitative trait loci contribute to
population divergence in traits associated with life-history variation in Mimulus guttatus.
Genetics 172: 1829-1844.
JIANG, C.-J., and Z. B. ZENG, 1995 Multiple trait analysis of genetic mapping for quantitative
trait loci. Genetics 140: 1111-1127.
KEURENTJES, J. J. B., J. FU, C. H. RIC DE VOS, A. LOMMEN, R. D. HALL et al., 2006 The genetics
of plant metabolism. Nature Genet. 38: 842-849.
KEURENTJES, J. J. B., J. FU, I. R. TERPSTRA, J. M. GARCIA, G. VAN DEN ACKEVEKEN et al., 2007
Regulatory network construction in Arabidopsis by using genome-wide gene expression
quantitative trait loci. Proc. Natl. Acad. Sci. USA 104: 1708-1713.
KEURENTJES, J. J. B., M. KOORNNEEF and D. VREUGDENHIL, 2008 Quantitative genetics in the
age of omics. Curr. Opin. Plant Biol. 11: 123-128.
44
KLIEBENSTEIN, D. J., M. A. L. WEST, H. VAN LEEUWEN, O. LOUDET, R. W. DOERGE et al., 2006
Identification of QTLs controlling gene expression networks defined a priori. BMC
Bioinformatics 7: 308.
KOONIN, E. V., and Y. I. WOLF, 2008 Evolutionary systems biology, pp. 11-25 in Evolutionary
Genomics and Proteomics, edited by M. PAGEL and A. POMIANKOWSKI. Sinauer,
Sunderland, MA.
LANGLADE, N. B., X. FENG, T. DRANSFIELD, L. COPSEY, A. I. HANNA et al., 2005 Evolution
through genetically controlled allometry space. Proc. Natl. Acad. Sci. USA 102: 10221-
10226.
LI, R., S.-W. TSAIH, K. SHOCKLEY, I. M. STYLIANOU, J. WERGEDAL et al., 2006 Structural model
analysis of multiple quantitative traits. PLoS Genetics 2: e114.
LISEC, J., R. C. MEYER, M. STEINFATH, H. REDESTIG, M. BECHER et al., 2008 Identification of
metabolic and biomass QTL in Arabidopsis thaliana in a parallel analysis of RIL and IL
populations. Plant J. 53: 960-972.
LIU, B., A. DE LA FUENTE and I. HOESCHELE, 2008 Gene network inference via structural
equation modeling in genetical genomics experiments. Genetics 178: 1763-1776.
LYNCH, M., and B. WALSH, 1998 Genetics and Analysis of Quantitative Traits. Sinauer,
Sunderland, MA.
MACKAY, J. J., D. M. O'MALLEY, T. PRESNELL, F. L. BOOKER, M. M. CAMPBELL et al., 1997
Inheritance, gene expression, and lignin characterization in a mutant pine deficient in
cinnamyl alcohol dehydrogenase. Proc. Natl. Acad. Sci. USA 94: 8255-8260.
MACKAY, T. F. C., 2001 The genetic architecture of quantitative traits. Ann. Rev. Genet. 35: 303-
339.
45
MACKAY, T. F. C., 2004 Genetic dissection of quantitative traits., pp. 51-73 in The Evolution of
Population Biology, edited by R. S. SINGH and M. Y. UYENOYAMA. Cambridge
University Press, Cambridge, UK.
MITCHELL, R. J., 1992 Testing evolutionary and ecological hypotheses using path analysis and
structural equation modeling. Funct. Ecol. 6: 123-129.
MÜNDERMANN, L., Y. ERASMUS, B. LANE, E. COEN and P. PRUSINKIEWICZ, 2005 Quantitative
modeling of Arabidopsis development. Plant Physiol. 139: 960-968.
NETO, E. C., C. T. FERRARA, A. D. ATTIE and B. S. YANDELL, 2008 Inferring causal phenotype
networks from segregating populations. Genetics 179: 1089-1100.
NIKLAS, K. J., 1995 Size-dependent allometry of tree height, diameter and trunk-taper. Ann. Bot.
75: 217-227.
RALPH, J., J. J. MACKAY, R. D. HATFIELD, D. M. O'MALLEY, R. W. WHETTEN et al., 1997
Abnormal lignin in a loblolly pine mutant. Science 277: 235-239.
RAUH, B. L., C. BASTEN and E. S. BUCKLER, 2002 Quantitative trait loci analysis of growth
response to varying nitrogen sources in Arabiodpsis thaliana. Theor. Appl. Genet. 104:
743-750.
REMINGTON, D. L., and D. M. O'MALLEY, 2000a Evaluation of major genetic loci contributing to
inbreeding depression for survival and early growth in a selfed family of Pinus taeda.
Evolution 54: 1580-1589.
REMINGTON, D. L., and D. M. O'MALLEY, 2000b Whole-genome characterization of embryonic
stage inbreeding depression in a selfed loblolly pine family. Genetics 155: 337-348.
46
REMINGTON, D. L., R. W. WHETTEN, B.-H. LIU and D. M. O'MALLEY, 1999 Construction of an
AFLP genetic map with nearly complete genome coverage in Pinus taeda. Theor. Appl.
Genet. 98: 1279-1292.
SAS INSTITUTE INC., 2002 The SAS System for Windows, version 9.1, pp., Cary, NC.
SCHADT, E. E., J. LAMB, X. YANG, J. ZHU, S. EDWARDS et al., 2005 An integrative genomics
approach to infer causal associations between gene expression and disease. Nature Genet.
37: 710-717.
STEIN, C. M., Y. SONG, R. C. ELSTON, G. JUN, H. K. TIWARI et al., 2003 Structural equation
model-based genome scan for the metabolic syndrome. BMC Genetics 4(Suppl 1): S99.
TONSOR, S. J., and S. M. SCHEINER, 2007 Plastic trait integration across a CO2 gradient in
Arabidopsis thaliana. Am. Nat. 169: E119-E140.
WRIGHT, S., 1921 Correlation and causation. Journal of Agricultural Research 20: 557-585.
ZHU, J., P. Y. LUM, J. LAMB, D. GUHATHAKTURA, S. W. EDWARDS et al., 2004 An integrative
genomics approach to the reconstruction of gene networks in segregating populations.
Cytogenet. Genome Res. 105: 363-374.
ZHU, J., B. ZHANG, E. N. SMITH, B. DREES, R. B. BREM et al., 2008 Integrating large-scale
functional genomic data to dissect the complexity of yeast regulatory networks. Nature
Genet. 40: 854-861.