
Improving Model Selection in Structural Equation Models with New Bayes Factor Approximations

Kenneth A. Bollen∗ Surajit Ray† Jane Zavisca‡ Jeffrey J. Harden§

April 11, 2011

Abstract

Often researchers must choose among two or more structural equation models for a given set of data. Typically, selection is based on having the highest chi-square p-value or the highest fit index such as the CFI or RMSEA. Though a common situation, there is little evidence on the performance of these fit indices in choosing between models. In other statistical applications, Bayes Factor approximations such as the BIC are commonly used to select between models, but these are rarely used in SEMs. This paper examines several new and old Bayes Factor approximations along with some commonly used fit indices to assess their accuracy in choosing the true model among a broad set of false models. The results show that the Bayes Factor approximations outperform the other fit indices. Among these approximations one of the new ones, the SPBIC, is particularly promising. The commonly used chi-square p-value and the CFI, IFI, and RMSEA do much worse.

Keywords: Bayes Factor · Structural Equation Models · Empirical Bayes · Information Criterion · Model Selection

∗ Department of Sociology, University of North Carolina, Chapel Hill, NC 27599, [email protected]
† Department of Mathematics and Statistics, Boston University, Boston, MA 02446, [email protected]
‡ Department of Sociology, University of Arizona, Tucson, AZ 85721, [email protected]
§ Department of Political Science, University of North Carolina, Chapel Hill, NC 27599, [email protected]


1 Introduction

Structural equation models (SEMs) refer to a general statistical model that incorporates ANOVA, multiple regression, simultaneous equations, path analysis, factor analysis, growth curve analysis, and many other common statistical models as special cases. SEMs are well-established in sociology and the social and behavioral sciences and are beginning to spread to the health and natural sciences (e.g., Bentler and Stein 1992; Shipley 2000; Schnoll, Fang, and Manne 2004; Beran and Violato 2010). A key issue in the use of SEMs is determining the fit of the model to the data and comparing the relative fit of two or more models to the same data. An asymptotic chi-square distributed test statistic that tests whether a SEM exactly reproduces the covariance matrix of observed variables was the first test and index of model fit (Joreskog 1969, 1973). However, given the approximate nature of all models and the enhanced statistical power that generally accompanies a large sample size, the chi-square test statistic routinely rejects models in large samples regardless of their merit as approximations to the underlying process. Starting with Tucker and Lewis (1973), Bentler and Bonett (1980), and Joreskog and Sorbom (1979), a variety of fit indices have been developed to supplement the chi-square test statistic (e.g., see Bollen and Long 1993). Controversy surrounds these alternative fit indices in that often the sampling distribution of the fit index is not known, sometimes the fit index tends to increase as sample size increases, and there is disagreement about the optimal cutoff values for a "good fitting" model.

Bayes Factors play a role in comparing the fit of models in multiple regression, generalized linear models, and several other areas of statistical modeling. Yet despite this common use, Bayes Factors have received little attention in the SEM literature. Cudeck and Browne (1983) and Bollen (1989) give only brief mention to Schwarz's (1978) Bayesian Information Criterion (BIC) as an approximation to the Bayes Factor. Raftery (1993, 1995) provides more discussion of the BIC in SEMs, as do Haughton, Oud, and Jansen (1997). However, it is rare to see the BIC or Bayes Factors in applications of SEMs, despite their potential value in comparing the fit of competing models.


The primary purpose of this paper is to compare several widely used measures of SEM model fit mentioned above with several different approximations to Bayes Factors, including two new approximations: the Information Matrix-Based Information Criterion (IBIC) and the Scaled Unit Information Prior Bayesian Information Criterion (SPBIC). In the course of presenting the IBIC and SPBIC, we explain the potential use of Bayes Factors in model selection of SEMs. As evidence of this usefulness, we demonstrate in a simulation study that Bayes Factor approximations perform much better in SEM model selection than more commonly used methods, and that IBIC and SPBIC are among the best-performing fit statistics under several conditions.

The next section introduces the SEM equations and assumptions and reviews the current methods of assessing SEM model fit. Following this is a section in which we present the Bayes Factor, the BIC, and the IBIC and SPBIC. We then describe our Monte Carlo simulation study that examines the behavior of various model selection criteria for a series of models that are either correct, underspecified, or overspecified. Finally, we conclude with a discussion of the implications of the results and recommendations for researchers.

2 Structural Equation Models and Fit Indices

SEMs have two primary components: a latent variable model and a measurement model. We write the latent variable model as

\[ \eta = \alpha_\eta + B\eta + \Gamma\xi + \zeta. \tag{1} \]

The η vector is m × 1 and contains the m latent endogenous variables. The intercept terms are in the m × 1 vector αη. The m × m coefficient matrix B gives the effect of the ηs on each other. The N latent exogenous variables are in the N × 1 vector ξ. The m × N coefficient matrix Γ contains the coefficients for ξ's impact on the ηs. An m × 1 vector ζ contains the disturbances for each latent endogenous variable. We assume that E(ζ) = 0 and COV(ζ′, ξ) = 0, and we assume that the disturbance for each equation is homoscedastic and nonautocorrelated across cases, although the variances of the ζs from different equations can differ and these ζs can correlate across equations. The m × m covariance matrix Σζζ has the variances of the ζs down the main diagonal and the across-equation covariances of the ζs on the off-diagonal. The N × N covariance matrix of ξ is Σξξ.

The measurement model is

\[ y = \alpha_y + \Lambda_y \eta + \varepsilon \tag{2} \]
\[ x = \alpha_x + \Lambda_x \xi + \delta \tag{3} \]

The p × 1 vector y contains the indicators of the ηs. The p × m coefficient matrix Λy (the "factor loadings") gives the impact of the ηs on the ys. The unique factors or errors are in the p × 1 vector ε. We assume that E(ε) = 0 and COV(η′, ε) = 0. The covariance matrix for ε is Σεε. There are analogous definitions and assumptions for measurement Equation 3 for the q × 1 vector x. We assume that ζ, ε, and δ are uncorrelated with ξ, and in most models these disturbances and errors are assumed to be uncorrelated among themselves, though the latter assumption is not essential.1 We also assume that the errors are homoscedastic and nonautocorrelated across cases.

Define z′ = [y′ x′]; then the covariance matrix and mean of z as functions of the model parameters are

\[ \Sigma(\theta) = E[(z - \mu_z)(z - \mu_z)'] \tag{4} \]
\[ = \begin{bmatrix} \Lambda_y(I - B)^{-1}(\Gamma\Sigma_{\xi\xi}\Gamma' + \Sigma_{\zeta\zeta})(I - B)^{-1\prime}\Lambda_y' + \Sigma_{\varepsilon\varepsilon} & \Lambda_y(I - B)^{-1}\Gamma\Sigma_{\xi\xi}\Lambda_x' \\ \Lambda_x\Sigma_{\xi\xi}\Gamma'(I - B)^{-1\prime}\Lambda_y' & \Lambda_x\Sigma_{\xi\xi}\Lambda_x' + \Sigma_{\delta\delta} \end{bmatrix} \]

\[ \mu(\theta) = E[z] = \begin{bmatrix} \alpha_y + \Lambda_y(I - B)^{-1}(\alpha_\eta + \Gamma\kappa) \\ \alpha_x + \Lambda_x\kappa \end{bmatrix} \tag{5} \]

1 We can modify the model to permit such correlations among the errors, but we do not do so here.


where θ is a vector that contains all of the parameters (i.e., coefficients, variances, and covariances) that are estimated in a model and κ is the vector of means of ξ. Σ(θ) and µ(θ) are the implied covariance matrix and the implied mean vector of the observed variables, z. These implied matrices are written as functions of the parameters of the SEM. For a valid model, the covariance matrix (Σ) and the mean vector (µ) of the observed variables, z, are exactly reproduced by the corresponding implied matrices, so that

\[ \Sigma = \Sigma(\theta) \tag{6} \]

and

\[ \mu = \mu(\theta). \tag{7} \]

The log likelihood of the maximum likelihood estimator, derived under the assumption that z comes from a multinormal distribution, is

\[ \ln L(\theta) = K - \frac{1}{2}\ln|\Sigma(\theta)| - \frac{1}{2}[z - \mu(\theta)]'\Sigma^{-1}(\theta)[z - \mu(\theta)], \tag{8} \]

where K is a constant that has no effect on the values of θ that maximize the log likelihood.
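To make Equations 4, 5, and 8 concrete, here is a minimal sketch of how the implied moments and the per-observation log likelihood could be computed. It assumes numpy; all function and argument names are ours, chosen for illustration, not taken from any SEM package.

```python
import numpy as np

def implied_moments(Lam_y, Lam_x, B, Gamma, Sig_xi, Sig_zeta,
                    Sig_eps, Sig_del, a_y, a_x, a_eta, kappa):
    """Implied covariance Sigma(theta) and mean mu(theta) of z = [y', x']'
    per Equations 4 and 5."""
    inv = np.linalg.inv(np.eye(B.shape[0]) - B)       # (I - B)^{-1}
    cov_eta = inv @ (Gamma @ Sig_xi @ Gamma.T + Sig_zeta) @ inv.T
    yy = Lam_y @ cov_eta @ Lam_y.T + Sig_eps
    yx = Lam_y @ inv @ Gamma @ Sig_xi @ Lam_x.T
    xx = Lam_x @ Sig_xi @ Lam_x.T + Sig_del
    Sigma = np.block([[yy, yx], [yx.T, xx]])
    mu = np.concatenate([a_y + Lam_y @ inv @ (a_eta + Gamma @ kappa),
                         a_x + Lam_x @ kappa])
    return Sigma, mu

def loglik_z(z, Sigma, mu):
    """Log likelihood contribution of one observation z (Equation 8),
    dropping the constant K."""
    r = z - mu
    return -0.5 * np.log(np.linalg.det(Sigma)) - 0.5 * r @ np.linalg.solve(Sigma, r)
```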

The chi-square test statistic derives from a likelihood ratio test of the hypothesized model against a saturated model in which the covariance matrix and mean vector are exactly reproduced. The null hypothesis of the chi-square test is that Equations 6 and 7 hold exactly. In large samples, the likelihood ratio (LR) test statistic follows a chi-square distribution when the null hypothesis is true. Excess kurtosis can affect the test statistic, though there are various corrections for excess kurtosis (Satorra and Bentler 1988; Bollen and Stine 1992). In practice, the typical outcome of this test in large samples is rejection of the null hypothesis that represents the model. This often occurs even when corrections for excess kurtosis are made. At least part of the explanation is that the null hypothesis assumes exact fit, whereas in practice we anticipate at least some inaccuracies in our modeling. The significant LR test simply reflects what we know: our model is at best an approximation.
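A sketch of the LR test just described, assuming scipy is available; the degrees of freedom are supplied by the caller, and the function name is ours:

```python
from scipy import stats

def lr_chi_square(loglik_model, loglik_saturated, df):
    """Likelihood ratio test of a hypothesized SEM against the saturated
    model. T is asymptotically chi-square with df degrees of freedom when
    the null hypothesis (Equations 6 and 7) holds exactly."""
    T = 2.0 * (loglik_saturated - loglik_model)
    return T, stats.chi2.sf(T, df)  # test statistic and p-value
```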


In response to the common rejection of the null hypothesis of perfect fit, researchers have proposed and used a variety of fit indices. The Incremental Fit Index (IFI, Bollen 1989), Comparative Fit Index (CFI), Tucker-Lewis Index (TLI, Tucker and Lewis 1973), Relative Noncentrality Index (RNI, Bentler and Bonett 1980), and Root Mean Squared Error of Approximation (RMSEA, Steiger and Lind 1980) are just a few of the many indices of fit that have been proposed. Though the popularity of each has waxed and waned, there are problems characterizing all or most of them. One is that there is ambiguity as to the proper cutoff value to signify that a model has an acceptable fit. Different authors recommend different standards of good fit, and in nearly all cases the justification for the cutoff values is not well founded. Second, the distributions of most of these fit indices are unknown. Thus, with the exception of the RMSEA, confidence intervals for the fit indices are not provided. Third, the fit indices that are based on comparisons to a baseline model (e.g., TLI, IFI, RNI) can behave somewhat differently depending on the fitting function used to estimate a model. This suggests that standards of fit should be adjusted depending on the fitting function employed. Another issue is that the means of the sampling distributions of some measures tend to be larger in bigger samples than in smaller ones, even if the identical model is fitted. Thus, cutoff values for such indices might need to differ depending on the size of the sample.

Another set of issues arises when comparing competing models for the same data. If the competing models are nested, then an LR test that is asymptotically chi-square distributed is available. However, in such a case, the issue of statistical power in small samples remains. Furthermore, the LR test is not available for nonnested models. Finally, other factors such as excess kurtosis can impact the LR test. The situation is not made better by employing other fit indices. Though we can take the difference in the fit indices across two or more models, there is little guidance on what to consider a sufficiently large difference to conclude that one model's fit is better than another's. In addition, for some of the fit indices the standards might need to vary depending on the estimator employed and the size of the sample. Nonnested models further complicate these comparisons, since the rationale for direct comparisons of nonnested models is not well-developed for most of these fit indices.


This brief overview of the LR test and the other fit indices in use in SEMs reveals that there are drawbacks to the current methods of assessing the fit of models. It would be desirable to have a method of assessing fit that works for nested and nonnested models and that has a firm rationale in statistical theory. From a practical point of view, it is desirable to have a measure that is calculable using standard SEM software. In the next section, we examine the Bayes Factor and methods to approximate it that address these criteria.

3 Approximating Bayes Factors in SEMs

The Bayes Factor expresses the odds of observing a given set of data under one model versus an alternative model. For any two models 1 and 2, the Bayes Factor, B12, is the ratio of P(Y|Mk) for the two models, where P(Y|Mk) is the probability of the data Y given the model Mk. More explicitly, the Bayes Factor, B12, is

\[ B_{12} = \frac{P(Y|M_1)}{P(Y|M_2)}. \tag{9} \]

Conceptually, the Bayes Factor is a relatively straightforward way to compare models, but it can be complex to compute. The marginal likelihoods must be calculated for the competing models. By Bayes' theorem, the marginal likelihood (also known as the integrated likelihood) can be expressed as a weighted average of the likelihood over all possible parameter values under a given model:

\[ P(Y|M_k) = \int P(Y|\theta_k, M_k)\, P(\theta_k|M_k)\, d\theta_k, \tag{10} \]

where θk is the vector of parameters of Mk, P(Y|θk, Mk) is the likelihood under Mk, and P(θk|Mk) is the prior distribution of θk.
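As a concrete, if inefficient, illustration of Equation 10, the integral can be approximated by simple Monte Carlo: draw parameter vectors from the prior and average the likelihood. This is a minimal sketch assuming numpy, with function names of our own choosing:

```python
import numpy as np

def marginal_loglik_mc(loglik, prior_sampler, n_draws=10_000, seed=0):
    """Monte Carlo estimate of log P(Y|M_k): average the likelihood over
    draws from the prior P(theta_k|M_k), using log-sum-exp for stability."""
    rng = np.random.default_rng(seed)
    lls = np.array([loglik(prior_sampler(rng)) for _ in range(n_draws)])
    m = lls.max()
    return m + np.log(np.mean(np.exp(lls - m)))
```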

Closed-form analytic expressions of the marginal likelihoods are difficult to obtain, and numeric integration is computationally intensive with high-dimensional models. Approximation of this integral via the Laplace method is a common approach to solving this problem. Technical details of the Laplace method can be found in many sources, including Kass and Raftery (1995). In brief, a second-order Taylor expansion of log P(Y|Mk) around the Maximum Likelihood (ML) estimator θ̂k leads to the following expression, where l(θ̂k) is the log likelihood function (i.e., log P(Y|θ̂k, Mk)), ĪE(θ̂k) is the average expected information matrix, dk is the dimension (number of parameters) of the model, O(N⁻¹ᐟ²) is the approximation error, and N is the sample size:

\[ \log P(Y|M_k) = l(\hat\theta_k) + \log P(\hat\theta_k|M_k) + \frac{d_k}{2}\log(2\pi) - \frac{d_k}{2}\log(N) - \frac{1}{2}\log\left|\bar{I}_E(\hat\theta_k)\right| + O(N^{-1/2}). \tag{11} \]

This equation forms the basis for several approximations to the Bayes Factor. Here we briefly present the equations for several approximations, including Schwarz's BIC, Haughton's BIC, and our two new approximations, the IBIC and the SPBIC. Derivations and justifications for these equations can be found in a companion technical paper (reproduced in the appendix). Here our goal is simply to present the final equations, show their relationships to each other, and explain how they can be calculated for SEM models using standard software.

The Standard Bayesian Information Criterion (BIC). Schwarz's (1978) BIC drops the second, third, and fifth terms of Equation 11, on the basis that in large samples the first and fourth terms will dominate in the order of error. Ignoring these terms gives

\[ \log P(Y|M_k) = l(\hat\theta_k) - \frac{d_k}{2}\log(N) + O(1). \tag{12} \]

If we multiply this by −2, the BIC for, say, M1 and M2 is calculated as

\[ \mathrm{BIC} = 2[l(\hat\theta_2) - l(\hat\theta_1)] - (d_2 - d_1)\log(N). \tag{13} \]
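In code, the comparison in Equation 13 is a one-liner (a sketch with our own helper name; the log likelihoods and parameter counts come from the two fitted models):

```python
import numpy as np

def bic_diff(ll1, d1, ll2, d2, N):
    """BIC comparison of models M1 and M2 per Equation 13."""
    return 2.0 * (ll2 - ll1) - (d2 - d1) * np.log(N)
```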

An advantage of the BIC is that it does not require specification of priors, and it is readily calculable from the standard output of most statistical packages. However, the approximation does not always work well. Problematic conditions include small samples, extremely collinear explanatory variables, or a number of parameters that increases with sample size. For other critiques see Winship (1999) and Weakliem (1999).

Haughton’s BIC (HBIC). A variant on the BIC retains the third term of Equation 11

(Haughton 1988). We call this enhanced approximation the HBIC (Haughton labels it BIC*).

logP (Y |Mk) = l(θk) +dk2

log(2π)− dk2

log(n) +O(1) (14)

= l(θk)−dk2

log(n

2π) +O(1).

In comparing models M1 and M2 and multiplying by −2, this leads to

\[ \mathrm{HBIC} = 2[l(\hat\theta_2) - l(\hat\theta_1)] - (d_2 - d_1)\log\!\left(\frac{N}{2\pi}\right). \tag{15} \]

The HBIC performed better than the BIC in model selection in a simulation study for SEMs (Haughton, Oud, and Jansen 1997).
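A matching sketch for Equation 15 differs from the BIC helper above only in the base of the penalty term:

```python
import numpy as np

def hbic_diff(ll1, d1, ll2, d2, N):
    """HBIC comparison per Equation 15: log(N) becomes log(N / (2*pi))."""
    return 2.0 * (ll2 - ll1) - (d2 - d1) * np.log(N / (2.0 * np.pi))
```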

The Information Matrix-Based BIC (IBIC). Following Haughton's lead, we see no reason to throw away information that is easily calculable. The estimated expected information matrix IE(θ̂k) is output by many SEM packages (it can be computed as the inverse of the covariance matrix of the parameter estimates). Taking advantage of this, we preserve the fifth term in Equation 11.2

\[ \log P(Y|M_k) = l(\hat\theta_k) - \frac{d_k}{2}\log\!\left(\frac{N}{2\pi}\right) - \frac{1}{2}\log\left|\bar{I}_E(\hat\theta_k)\right| + O(1). \]

For models M1 and M2 and multiplying by −2, the Information Matrix-Based Bayesian Information Criterion (IBIC) is given by

\[ \mathrm{IBIC} = 2[l(\hat\theta_2) - l(\hat\theta_1)] - (d_2 - d_1)\log\!\left(\frac{N}{2\pi}\right) - \log\left|\bar{I}_E(\hat\theta_2)\right| + \log\left|\bar{I}_E(\hat\theta_1)\right|. \tag{16} \]

2 Note that this is very similar to Kashyap's (1982) approximation, which uses log(N) rather than log(N/2π).
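A sketch of Equation 16, again with names of our own choosing; cov1 and cov2 stand for the estimated covariance matrices of the parameter estimates, whose inverses estimate the expected information:

```python
import numpy as np

def ibic_diff(ll1, d1, cov1, ll2, d2, cov2, N):
    """IBIC comparison per Equation 16. Inverting the parameter covariance
    matrix estimates I_E; dividing by N gives the average information whose
    log determinant enters the formula."""
    _, logdet1 = np.linalg.slogdet(np.linalg.inv(cov1) / N)
    _, logdet2 = np.linalg.slogdet(np.linalg.inv(cov2) / N)
    return (2.0 * (ll2 - ll1)
            - (d2 - d1) * np.log(N / (2.0 * np.pi))
            - logdet2 + logdet1)
```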

The Scaled Unit Information Prior BIC (SPBIC). An alternative derivation of Schwarz's BIC makes use of the prior distribution of θk. It is possible to arrive at Equation 12 by utilizing a unit information prior, which assumes that the prior has approximately the same information as the information contained in a single data point. Critics have argued that the unit information prior is too flat to reflect a realistic prior. This prior also penalizes complex models too heavily, making the BIC overly conservative toward the null hypothesis (Weakliem 1999; Kuha 2004; Berger et al. 2006). Some researchers have proposed centering the prior around the MLE (θ̂) of Mk, which heavily favors more complex models. We propose to use a scaled information prior that is also centered around the MLE, but scaled so that the prior extends out to the likelihood and does not place all of the mass around the complex model. The final analytical expression is as follows, where IO(θ̂k) is the observed information matrix evaluated at θ̂k:

\[ \log P(Y|M_k) = l(\hat\theta_k) - \frac{d_k}{2} + \frac{d_k}{2}\left(\log d_k - \log(\hat\theta_k^T I_O(\hat\theta_k)\hat\theta_k)\right) \tag{17} \]

\[ \Rightarrow \mathrm{SPBIC} = 2(l(\hat\theta_2) - l(\hat\theta_1)) - d_2\left(1 - \log\!\left[\frac{d_2}{\hat\theta_2^T I_O(\hat\theta_2)\hat\theta_2}\right]\right) + d_1\left(1 - \log\!\left[\frac{d_1}{\hat\theta_1^T I_O(\hat\theta_1)\hat\theta_1}\right]\right). \tag{18} \]

For more details on the calculation of the scaling factor (which is model dependent and thus leads to an empirical Bayes prior) and the derivation of this analytical expression, see the appendix. The essential point for the practitioner is that the prior is implicit in the information and does not need to be explicitly calculated. All elements of the SPBIC formula are output by standard SEM software.
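A sketch of Equation 18 covering both cases derived in the appendix (Equations 31–33); the helper names are ours, and I_obs stands for the observed information matrix evaluated at the ML estimates:

```python
import numpy as np

def spbic_logml(ll, theta, I_obs):
    """Approximate log marginal likelihood under the scaled unit information
    prior: Case 1 when d < theta' I_O theta, the point-mass Case 2 otherwise."""
    d = theta.size
    q = float(theta @ I_obs @ theta)   # quadratic form theta' I_O(theta) theta
    if d < q:
        return ll - d / 2.0 + (d / 2.0) * (np.log(d) - np.log(q))
    return ll - 0.5 * q

def spbic_diff(ll1, th1, I1, ll2, th2, I2):
    """SPBIC comparison of M1 and M2: -2 times the difference of the
    approximate log marginal likelihoods, matching Equation 18."""
    return -2.0 * (spbic_logml(ll1, th1, I1) - spbic_logml(ll2, th2, I2))
```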

Theoretically, the IBIC and the SPBIC should perform at least as well as the BIC or the HBIC, and better under certain circumstances. The IBIC incorporates an additional term that makes it closer to the original Laplace approximation. To the degree that this term contributes to distinguishing between models, it holds the potential to improve accuracy. This might occur in smaller samples, where this term is not dominated by the other terms in the approximation. The SPBIC employs a more flexible prior with more realistic assumptions than the unit information prior on which the standard BIC is based.

However, despite this theoretical justification for using Bayes Factor approximations with SEMs, there is little empirical evidence as to their effectiveness. The only study of Bayes Factor approximations in selecting SEMs is the Haughton, Oud, and Jansen (1997) research mentioned above. These authors simulate a three-factor model with six indicators and compare minor variants on the true structure to determine how successful the Bayes Factor approximations and several fit indices are in selecting the true model. Their simulation results find that the Bayes Factor approximations as a group are superior to the p-value of the chi-square test and the other fit indices in correctly selecting the true generating model. They find that the HBIC does the best among the Bayes Factor approximations. Among the other fit indices and the chi-square p-value, the IFI and RFI have the best correct model selection percentages, though their success is lower than that of the Bayes Factor approximations.

The Haughton, Oud, and Jansen (1997) study is valuable in several respects, not the least of which is its demonstration of the value of the Bayes Factor approximations. As they acknowledge, it is unclear whether their results extend beyond the relatively simple model that they examined. In addition, the structures of most of the models that they considered were quite similar to each other. The performance of the Bayes Factor approximations for models with very different structures is unknown. Also, their study examined only the single model that was best fitting according to their fit measures; the distance of the best fitting model from the second best is not discussed.

Our simulation analysis seeks to test the generality of their results by looking at larger models with alternative models that range from mildly different to very different structures. In addition, we provide the first evidence on the performance of the IBIC and SPBIC in the selection of SEMs. We also include several popular fit indices, including the p-value of the chi-square test statistic.


4 Simulation Study

To test the performance of these various approximations to the Bayes Factor, we fit a variety of models to two sets of simulated data. We use models designed by Paxton et al. (2001), a design that has been the basis for several other simulation studies; this enables comparisons to results for other fit indices obtained under the same design. Standard SEM assumptions hold in these models: exogenous variables and disturbances come from normal distributions, and the disturbances follow the assumptions presented in Section 2 above. Graphical depictions of each model and the specific parameters used to generate the simulated data can be found in Paxton et al. (2001). The first simulation (hereafter designated SIM1) was based on the following underlying population model. The structural model consists of three latent variables linked in a chain:

\[
\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{bmatrix} =
\begin{bmatrix} 0 & 0 & 0 \\ \beta_{21} & 0 & 0 \\ 0 & \beta_{32} & 0 \end{bmatrix}
\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{bmatrix} +
\begin{bmatrix} \zeta_1 \\ \zeta_2 \\ \zeta_3 \end{bmatrix}
\]

The measurement model contains 9 observed variables, 3 of which crossload on more than one latent variable. Scaling is accomplished by fixing λ = 1 for a single indicator loading on each latent variable:

\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \end{bmatrix} =
\begin{bmatrix}
\lambda_{11} & 0 & 0 \\
1 & 0 & 0 \\
\lambda_{31} & 0 & 0 \\
\lambda_{41} & \lambda_{42} & 0 \\
0 & 1 & 0 \\
0 & \lambda_{62} & \lambda_{63} \\
0 & \lambda_{72} & \lambda_{73} \\
0 & 0 & 1 \\
0 & 0 & \lambda_{93}
\end{bmatrix}
\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{bmatrix} +
\begin{bmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \\ \delta_4 \\ \delta_5 \\ \delta_6 \\ \delta_7 \\ \delta_8 \\ \delta_9 \end{bmatrix}
\]
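As a hedged illustration of the SIM1 generating process (not the authors' actual simulation code), the following sketch draws samples from the chain model above; the function and argument names are ours, and the parameter values would have to be taken from Paxton et al. (2001):

```python
import numpy as np

def simulate_sim1(N, beta21, beta32, Lam, psi, theta_delta, seed=0):
    """Generate N observations on the 9 indicators of SIM1. Lam is the 9 x 3
    loading matrix above; psi holds the 3 disturbance variances and
    theta_delta the 9 measurement error variances (placeholder inputs)."""
    rng = np.random.default_rng(seed)
    B = np.array([[0.0, 0.0, 0.0], [beta21, 0.0, 0.0], [0.0, beta32, 0.0]])
    zeta = rng.normal(0.0, np.sqrt(psi), size=(N, 3))
    eta = zeta @ np.linalg.inv(np.eye(3) - B).T   # eta = (I - B)^{-1} zeta
    delta = rng.normal(0.0, np.sqrt(theta_delta), size=(N, 9))
    return eta @ Lam.T + delta                    # y = Lam eta + delta
```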


For each simulated data set, we fit a series of misspecified models, listed in Table 1. This range of models tests the ability of the different fit statistics to detect a variety of common errors. Parameters were misspecified so as to introduce error into both the measurement and latent variable components of the models. For example, factor loadings for indicators of the latent variables are incorrectly added or dropped in models 2–4, correlated errors between indicators are introduced in models 5 and 7, and latent variables are artificially added or dropped in models 11–12 (with corresponding changes in the measurement model).

[Insert Table 1 here]

The second simulation (SIM2) is based on a generating model that is identical to SIM1 in the structural model, but contains an additional two observed indicators for each latent variable:

\[
\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{bmatrix} =
\begin{bmatrix} 0 & 0 & 0 \\ \beta_{21} & 0 & 0 \\ 0 & \beta_{32} & 0 \end{bmatrix}
\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{bmatrix} +
\begin{bmatrix} \zeta_1 \\ \zeta_2 \\ \zeta_3 \end{bmatrix}
\]


\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \\ y_{13} \\ y_{14} \\ y_{15} \end{bmatrix} =
\begin{bmatrix}
\lambda_{11} & 0 & 0 \\
1 & 0 & 0 \\
\lambda_{31} & 0 & 0 \\
\lambda_{41} & 0 & 0 \\
\lambda_{51} & 0 & 0 \\
\lambda_{61} & \lambda_{62} & 0 \\
0 & 1 & 0 \\
0 & \lambda_{82} & 0 \\
0 & \lambda_{92} & 0 \\
0 & \lambda_{10,2} & \lambda_{10,3} \\
0 & \lambda_{11,2} & \lambda_{11,3} \\
0 & 0 & 1 \\
0 & 0 & \lambda_{13,3} \\
0 & 0 & \lambda_{14,3} \\
0 & 0 & \lambda_{15,3}
\end{bmatrix}
\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{bmatrix} +
\begin{bmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \\ \delta_4 \\ \delta_5 \\ \delta_6 \\ \delta_7 \\ \delta_8 \\ \delta_9 \\ \delta_{10} \\ \delta_{11} \\ \delta_{12} \\ \delta_{13} \\ \delta_{14} \\ \delta_{15} \end{bmatrix}
\]

The misspecified models used in SIM2 are given in Table 2.

[Insert Table 2 here]

For each simulation we generated 1000 data sets at each of the following sample sizes: N = 100, N = 250, N = 500, N = 1000, and N = 5000. Note that, following standard practice, samples were discarded if they led to non-converging or improper solutions, and new samples were drawn to replace the "bad" samples until a total of 1000 samples with convergent solutions for each of the 12 fitted models had been drawn.3

3 This was most problematic at small sample sizes for SIM1: at N = 100, 715 samples had to be discarded for nonconvergence and 64 for improper solutions. The respective SIM1 numbers at other sample sizes are N = 250: 236 and 7; N = 500: 96 and 0; N = 1000: 40 and 0; and N = 5000: 27 and 0. The numbers for SIM2 are N = 100: 85 and 56; N = 250: 9 and 4; N = 500: 5 and 0; N = 1000: 0 and 0; and N = 5000: 0 and 0.

In these simulations we tested the relative performance of various information criteria as well as several traditional tests of overall model fit by calculating the percentage of samples (at each sample size) for which the correct model (M1) was unambiguously selected, ambiguously contained within a set of well-fitting models, ambiguously rejected, or unambiguously rejected. The fit statistics included in the study are the BIC, IBIC, SPBIC, HBIC, AIC, chi-square p-value, IFI, CFI, TLI, RNI, and RMSEA. To assess the information criterion measures, we used the following categories:

• True/Clear: The true model has the smallest value for the information criterion, and is at least 2 points smaller than the next best model.

• True/Close: The true model has the smallest value for the information criterion, but it is within 2 points of the next best model.

• False/Close: An incorrect model has the smallest value of the information criterion, but it is within 2 points of the true model.

• False/Clear: An incorrect model has the smallest value of the information criterion, and is at least 2 points smaller than the true model.

We use the value of 2 as a measure of close fit based on the Jeffreys-Raftery recommendations for Bayes Factor approximations (Raftery 1995). In essence, this suggests that measures that differ by 2 or less do not draw sharp distinctions between two models. For the chi-square p-value, IFI, CFI, TLI, and RNI, the best fitting model is chosen as that with the highest value. For the RMSEA, the model with the smallest value is the best fitting one. For the IFI, CFI, TLI, RNI, and RMSEA, the fit of two models is considered close if it is within ±0.01. This cutoff is based on the common practice of only considering models that fall between 0.9 and 1 for the IFI, CFI, TLI, and RNI, and between 0.10 and 0 for the RMSEA; a difference of 0.01 thus represents 10% of the typical range of acceptable model fit. The chi-square p-value difference that we use to separate close fit is also 0.01: instances in which the p-value differs by less than 0.01 seem too close to justify the conclusion that the models fit the data differently.4

4 Regardless, results remain unchanged if another value, such as ±0.001, is used for all of these cutoffs.
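The four categories can be made concrete with a small sketch (our own helper, written for the information criteria, where smaller values indicate better fit):

```python
import numpy as np

def classify(ic_values, true_index, tol=2.0):
    """Assign one simulated sample to True/Clear, True/Close, False/Close,
    or False/Clear from the fitted models' information criterion values,
    using the 2-point rule described above."""
    vals = np.asarray(ic_values, dtype=float)
    best = int(np.argmin(vals))
    runner_up_gap = np.delete(vals, best).min() - vals[best]
    if best == true_index:
        return "True/Clear" if runner_up_gap >= tol else "True/Close"
    # an incorrect model won; compare its value against the true model's
    return "False/Close" if vals[true_index] - vals[best] < tol else "False/Clear"
```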


4.1 Results

Figures 1 and 2 present the model selection results. The y-axes in these graphs plot the percentage of the 1000 simulations in which a given fit statistic clearly or ambiguously selected the true model (positive bars) and clearly or ambiguously selected an incorrect model (negative bars). Each graph plots results at a different sample size, increasing from N = 100 to N = 5000 (bottom-right panel). A perfect fit statistic would produce a medium gray positive bar that extends up to 100 on the y-axis of every graph, and no bar extending downward. This would indicate that the statistic always clearly selects the true model and never selects an incorrect model.

Two main points to note in these figures are that the Bayes Factor approximations are notably superior to the conventional methods with respect to selecting the correct model, and that their performance improves as sample size increases. The graphs show that all of the measures perform poorly at N = 100 and improve as the sample size increases. However, this improvement in performance is larger for the various BIC measures compared to the chi-square-based indices. At N = 1000 in SIM1, three of the four BIC measures clearly select the true model over 90% of the time. In contrast, the best performing chi-square-based method, RNI, only selects the correct model ambiguously, and in only about 65% of the simulations. In fact, even at N = 5000 (SIM1), the IFI, CFI, TLI, RNI, and RMSEA still ambiguously select an incorrect model more than 30% of the time, and the chi-square p-value clearly selects the wrong model in almost 50% of the simulations.

Both SIM1 and SIM2 show similar patterns, though the performance of all of the measures increases slightly in SIM2, which is to be expected given that its model contains an additional two observed indicators for each latent variable. In short, the Bayes Factor approximations substantially outperform the more common fit measures in selecting the true model. These results hold for several different cutoff values with the IFI, CFI, TLI, RNI, and RMSEA (see note 4).

[Insert Figure 1 here]

[Insert Figure 2 here]

There are also noticeable differences between the various information criteria we examine in Figures 1 and 2. At small sample sizes, BIC and IBIC have the highest rate of clearly selecting an incorrect model (both over 80% at N = 100 for SIM1), while SPBIC and HBIC perform better, with SPBIC showing the best performance (over 20% True/Clear at N = 250 in SIM1 and 60% in SIM2). All of the information criteria improve as N increases, with the two new measures described here, IBIC and SPBIC, performing slightly better than the others in several instances. BIC, IBIC, and SPBIC perform the best at the larger sample sizes in both SIM1 and SIM2, with HBIC performing better than the conventional methods, but not quite at the level of the other three. For instance, in SIM2 at N = 1000, BIC, IBIC, and SPBIC all clearly select the true model more than 90% of the time (with IBIC at 100%), while HBIC falls into the True/Clear category about 85% of the time. Finally, AIC performance is between that of the various Bayesian information criteria and the conventional methods, most often falling in one of the two ambiguous categories (True/Close or False/Close).

4.2 Discussion

This simulation study suggests that there is much to gain from using Bayesian information criteria, including the two new measures shown here, in SEM model selection. The BIC, IBIC, SPBIC, and HBIC consistently outperform the conventional methods of conducting model selection. This pattern holds at several sample sizes and in SEM data with relatively less information (SIM1) and relatively more information (SIM2). While all of the measures examined here perform poorly in selecting the correct model in small samples, the Bayes Factor approximations drastically improve as N increases while the conventional methods improve only moderately or not at all.

There is also variation among the various Bayes Factor approximations. In model selection, the new IBIC and SPBIC shown here, as well as the BIC, are the top performers, with HBIC not performing as well. In several cases, either IBIC or SPBIC clearly selects the true model at the highest frequency of all the fit statistics.


5 Conclusions

As in any statistical analysis, a key issue in the use of SEMs is determining the fit of the model to the data and comparing the relative fit of two or more models to the same data. However, current practice with SEMs commonly employs methods that are problematic for several different reasons. For example, the asymptotic chi-square distributed test statistic that tests whether a SEM exactly reproduces the covariance matrix of observed variables routinely rejects models in large samples regardless of their merit as approximations to the underlying process. Several other fit indices have been developed in response, but those measures also have important problems: their sampling distributions are not commonly known, they tend to increase as sample size increases, and there is disagreement about the optimal cutoff values for a "good fitting" model.

In this paper, we have shown that Bayes Factor approximations can serve as a useful alternative to these conventional methods. Though others have noted this usefulness (e.g., Raftery 1993, 1995; Haughton, Oud, and Jansen 1997), it is rare to see the BIC or Bayes Factors in applications of SEMs. We introduce two new information criteria, the IBIC and SPBIC, to the SEM literature and show through an extensive set of comparisons that, along with some established approximations, these fit statistics outperform several conventional measures in selecting the correct model.

In sum, Bayes Factor approximations are a useful tool for the applied researcher in comparing SEM models. Their significant improvement in model selection compared to the conventional methods can substantially improve the current practice of SEM analysis. In addition, among the various options for approximating a Bayes Factor, we show that our new fit indices, the IBIC and SPBIC, perform well when compared to other information criteria. Both of these measures are easily calculated from standard SEM software output, and thus are easy for the applied researcher to use. We therefore recommend the IBIC and SPBIC to researchers seeking to compare SEM models in their work.


A Appendix: Bayes Factor Technical Details

This paper is based on research done in an unpublished companion technical paper. Below we reproduce key details from that paper on the Bayes Factor, methods of approximation, and specifics of the BIC, SPBIC, and IBIC.

A.1 Bayes Factor

In this section we briefly describe the Bayes Factor and the Laplace approximation. Readers familiar with this derivation may skip this section. For a strict definition of the Bayes Factor we have to appeal to the Bayesian model selection literature. Thorough discussions of Bayesian statistics in general and model selection in particular are in Gelman et al. (1995), Carlin and Louis (1996), and Kuha (2004). Here we provide a very simple description.

We start with data Y and hypothesize that it was generated by either of two competing models M1 and M2.5 Moreover, let us assume that we have prior beliefs about the data generating models M1 and M2, so that we can specify the probability of the data coming from model M1, P(M1) = 1 − P(M2). Now, using Bayes' theorem, the ratio of the posterior probabilities is given by

\[ \frac{P(M_1|Y)}{P(M_2|Y)} = \frac{P(Y|M_1)}{P(Y|M_2)} \cdot \frac{P(M_1)}{P(M_2)}. \]

The left side of the equation can be interpreted as the posterior odds of M1 vs. M2, and it represents how the prior odds P(M1)/P(M2) are updated by the observed data Y. The factor P(Y|M1)/P(Y|M2) that is multiplied to obtain the posterior from the prior is known as the Bayes Factor, which we will denote as

\[ B_{12} = \frac{P(Y|M_1)}{P(Y|M_2)}. \tag{19} \]

Typically, we will deal with parametric models Mk which are described through model parameters, say θk. So the marginal likelihoods P(Y|Mk) are evaluated using

\[ P(Y|M_k) = \int P(Y|\theta_k, M_k)\, P(\theta_k|M_k)\, d\theta_k, \tag{20} \]

where P(Y|θk, Mk) is the likelihood under Mk and P(θk|Mk) is the prior distribution of θk. A closed-form analytical expression of the marginal likelihoods is difficult to obtain, even with completely specified priors (unless they are conjugate priors). Moreover, in most cases θk will be high-dimensional and direct numerical integration will be computationally intensive if not impossible. For this reason we turn to a way to approximate the integral given in (20).

5 Generalization to more than two competing models is achieved by comparing the Bayes Factor of all models to a "base" model and choosing the model with the highest Bayes Factor.

A.1.1 Laplace Method of Approximation

The Laplace method of approximating Bayes Factors is a common device (see Raftery 1993; Weakliem 1999; Kuha 2004). Since we also make use of the Laplace method for our proposed approximations, we briefly review it here.

The Laplace method of approximating ∫P(Y|θk, Mk)P(θk|Mk)dθk = P(Y|Mk) uses a second-order Taylor series expansion of log P(Y|Mk) around θ̃k, the posterior mode. It can be shown that

\[ \log P(Y|M_k) = \log P(Y|\tilde\theta_k, M_k) + \log P(\tilde\theta_k|M_k) + \frac{d_k}{2}\log(2\pi) - \frac{1}{2}\log|{-H(\tilde\theta_k)}| + O(N^{-1}), \tag{21} \]

where dk is the number of distinct parameters estimated in model Mk, π is the usual mathematical constant, H(θ̃k) is the Hessian matrix

\[ \frac{\partial^2 \log[P(Y|\theta_k, M_k)\, P(\theta_k|M_k)]}{\partial\theta_k\, \partial\theta_k'} \]

evaluated at θk = θ̃k, and O(N⁻¹) is "big-O" notation indicating that the last term is bounded in probability from above by a constant times N⁻¹ (see, for example, Tierney and Kadane 1986; Kass and Vaidyanathan 1992; Kass and Raftery 1995).

Instead of the Hessian matrix we will be working with different forms of information matrices. Specifically, the observed information matrix is denoted by

\[ I_O(\theta) = -\frac{\partial^2 \log[P(Y|\theta_k, M_k)]}{\partial\theta_k\, \partial\theta_k'}, \]

whereas the expected information matrix IE is given by

\[ I_E(\theta) = -E\left[\frac{\partial^2 \log[P(Y|\theta_k, M_k)]}{\partial\theta_k\, \partial\theta_k'}\right]. \]

Let us also define the average information matrices ĪE = IE/N and ĪO = IO/N. If the observations come from an i.i.d. distribution, we have the identity

\[ \bar{I}_E = -E\left[\frac{\partial^2 \log[P(Y_i|\theta_k, M_k)]}{\partial\theta_k\, \partial\theta_k'}\right], \]

the expected information for a single observation. We will make use of this property later.

In large samples the Maximum Likelihood (ML) estimator, θ̂k, will generally be a reasonable approximation of the posterior mode, θ̃k. If the prior is not that informative relative to the likelihood, we can write

\[ \log P(Y|M_k) = \log P(Y|\hat\theta_k, M_k) + \log P(\hat\theta_k|M_k) + \frac{d_k}{2}\log(2\pi) - \frac{1}{2}\log|I_O(\hat\theta_k)| + O(N^{-1}), \tag{22} \]

where log P(Y|θ̂k, Mk) is the log likelihood for Mk evaluated at θ̂k, IO(θ̂k) is the observed information matrix evaluated at θ̂k (Kass and Raftery 1995), and O(N⁻¹) is "big-O" notation referring to a term bounded in probability by some constant multiplied by N⁻¹. If the expected information matrix, IE(θ̂k), is used in place of IO(θ̂k), then the approximation error is O(N⁻¹ᐟ²) and we can write

\[ \log P(Y|M_k) = \log P(Y|\hat\theta_k, M_k) + \log P(\hat\theta_k|M_k) + \frac{d_k}{2}\log(2\pi) - \frac{1}{2}\log\left|I_E(\hat\theta_k)\right| + O(N^{-1/2}). \tag{23} \]

If we now make use of the definition ĪE = IE/N and substitute it into (23), we have

\[ \log P(Y|M_k) = \log P(Y|\hat\theta_k, M_k) + \log P(\hat\theta_k|M_k) + \frac{d_k}{2}\log(2\pi) - \frac{d_k}{2}\log(N) - \frac{1}{2}\log\left|\bar{I}_E(\hat\theta_k)\right| + O(N^{-1/2}). \tag{24} \]

This equation forms the basis of several approximations to Bayes Factors, which we discuss below.

A.2 Approximation through Special Prior Distributions

The BIC has been justified by making use of a unit information prior distribution for θk. Our first new approximation, the SPBIC, builds on this rationale using a more flexible prior. We first derive the BIC, then present our proposed modification.

A.2.1 BIC: The Unit Information Prior Distribution

We assume that for model MK (the full model, or the largest model under consideration) the prior of θK is given by a multivariate normal density

\[ P(\theta_K|M_K) \sim \mathcal{N}(\theta_K^*, V_K^*), \tag{25} \]

where θ∗K and V∗K are the prior mean and variance of θK. In the case of the BIC we take V∗K = ĪO⁻¹, the average information defined in Section A.1.1. This choice of variance can be interpreted as setting the prior to have approximately the same information as the information contained in a single data point.6 This prior is thus referred to as the Unit Information Prior and is given by

\[ P(\theta_K|M_K) \sim \mathcal{N}\!\left(\theta_K^*, \left[\bar{I}_O(\hat\theta_K)\right]^{-1}\right). \tag{26} \]

For any model Mk nested in the full model MK, the corresponding prior P(θk|Mk) is simply the marginal distribution obtained by integrating out the parameters that do not appear in Mk. Using this prior in (22), we can show that

\[ \log P(Y|M_k) \approx \log(P(Y|\hat\theta_k, M_k)) - \frac{d_k}{2}\log(N) - \frac{1}{2N}(\hat\theta - \theta^*)^T I_O(\hat\theta_k)(\hat\theta - \theta^*) \approx \log(P(Y|\hat\theta_k, M_k)) - \frac{d_k}{2}\log(N). \tag{27} \]

Note that the error of approximation is still of the order O(N⁻¹). Comparing models M1 and M2 and multiplying by −2 gives the usual BIC,

\[ \mathrm{BIC} = 2(l(\hat\theta_2) - l(\hat\theta_1)) - (d_2 - d_1)\log(N), \tag{28} \]

where we use the more compact notation l(θ̂k) ≡ log(P(Y|θ̂k, Mk)) for the log likelihood function at θ̂k.

An advantage of this unit information prior rationale for the BIC is that the error is of order O(N⁻¹) instead of the O(1) obtained when we make no explicit assumption about the prior distribution. However, some critics argue that the unit information prior is too flat to reflect a realistic prior (e.g., Weakliem 1999).

A.2.2 SPBIC: The Scaled Unit Information Prior Distribution

The SPBIC, our alternative approximation to the Bayes Factor, is similar to the BIC derivation except that instead of the unit information prior, we use a scaled unit information prior that allows the variance to differ and permits more flexibility in the prior probability specification. The scale is chosen by maximizing the marginal likelihood over the class of normal priors. Berger (1994), Fougere (1990), and others have used related concepts of partially informative priors obtained by maximizing entropy. In general, these maximizations involve tedious numerical calculations. We ease the calculation by employing components in the approximation that are available from standard computer output. We first provide a Bayesian interpretation of our choice of the scaled unit information prior and then provide an analytical expression for calculating this empirical Bayes prior.

6 Note that unlike the expected information, even if the observations are i.i.d., the average observed information ĪO(θ̂k) may not be equal to the observed information of any single observation.

We have already shown how the BIC can be interpreted as using a unit information prior for calculating the marginal probability in (22). This very prior leads to the common criticism of the BIC as being too conservative toward the null, as the unit information prior penalizes complex models too heavily. The prior of the BIC is centered at θ∗ (the parameter value at the null model) and has a variance scaled at the level of unit information. As a result, this prior may put extremely low probability around alternative complex models (for discussion, see Weakliem 1999; Kuha 2004; Berger et al. 2006). One way to overcome this conservative nature of the BIC is to use the ML estimate, θ̂, of the complex model Mk as the center of the prior P(θk|Mk). But this prior specification puts all the mass around the complex model and thus favors the complex model heavily. A scaled unit information prior (henceforth denoted SUIP) is a less aggressive way to be more favorable to a complex model. The prior is still centered around θ∗, so that the highest density is at θ∗, but we choose its scale, ck, so that the prior flattens out to put more mass on the alternative models. This places higher prior probability than the unit information prior on the space of complex models, but at the same time prevents complete concentration of probability on the complex model. This effectively leads to choosing a prior in our class that is most favorable to the model under consideration, placing the SUIP in the class of empirical Bayes priors.

To obtain the analytical expression for the appropriate scale, we start from the marginal probability given in (20). Instead of using the more common form of the Laplace approximation applied jointly to the likelihood and the prior, we use the approximate normality of the likelihood, but choose the scale of the variance of the multivariate normal prior so as to maximize the marginal likelihood. As a result, our prior distribution is not fully specified by the information matrix; rather, the variance is a scaled version of the variance of the unit information prior defined in (26). This prior has the ability to adjust itself through a scaling factor, which may differ significantly from unity in several situations (e.g., highly collinear covariates, dependent observations).

The main steps in constructing the SPBIC are as follows. First we define the scaled unit information prior as

\[ P_k(\theta_k) \sim \mathcal{N}\!\left(\theta_k^*, \frac{\left[\bar{I}_O(\hat\theta_k)\right]^{-1}}{c_k}\right), \tag{29} \]

where ĪO = IO/N and ck is the scale factor, which we will evaluate from the data. Note that ck may vary across different models. Using the prior in (29) and the generalized Laplace integral approximation described in the companion technical paper, we have

\[ \log P(Y|M_k) \approx l(\hat\theta_k) - \frac{d_k}{2}\log\!\left(1 + \frac{N}{c_k}\right) - \frac{1}{2}\,\frac{c_k}{N + c_k}\left[(\hat\theta_k - \theta^*)^T I_O(\hat\theta_k)(\hat\theta_k - \theta^*)\right]. \tag{30} \]

Now we estimate ck. Note that we can view (30) as a likelihood function in ck, and we can write L(ck|Y, Mk) = ∫L(θk)π(θk|ck)dθk = L(Y|Mk). So the optimum ck is given by its ML estimate, which can be obtained as follows:

\[ 2l(c_k) = -d_k\log\!\left(1 + \frac{N}{c_k}\right) + 2l(\hat\theta_k) - \frac{c_k}{N + c_k}\,\hat\theta_k^T I_O(\hat\theta_k)\hat\theta_k \]
\[ \Rightarrow 2l'(c_k) = \frac{d_k N}{c_k(N + c_k)} - \frac{\hat\theta_k^T I_O(\hat\theta_k)\hat\theta_k\, N}{(N + c_k)^2} \]
\[ \Rightarrow 0 = N(N + c_k)d_k - N c_k\, \hat\theta_k^T I_O(\hat\theta_k)\hat\theta_k \]
\[ \Rightarrow \frac{c_k}{N} = \begin{cases} \dfrac{d_k}{\hat\theta_k^T I_O(\hat\theta_k)\hat\theta_k - d_k} & \text{if } d_k < \hat\theta_k^T I_O(\hat\theta_k)\hat\theta_k, \\[2ex] \infty & \text{if } d_k \geq \hat\theta_k^T I_O(\hat\theta_k)\hat\theta_k. \end{cases} \]

Using this ck we arrive at the SPBIC measures under the two cases.


Case 1: When dk < θ̂kᵀIO(θ̂k)θ̂k, we have

\[ \log P(Y|M_k) = l(\hat\theta_k) - \frac{d_k}{2} + \frac{d_k}{2}\left(\log d_k - \log(\hat\theta_k^T I_O(\hat\theta_k)\hat\theta_k)\right) \tag{31} \]
\[ \Rightarrow \mathrm{SPBIC} = 2(l(\hat\theta_2) - l(\hat\theta_1)) - d_2\left(1 - \log\!\left[\frac{d_2}{\hat\theta_2^T I_O(\hat\theta_2)\hat\theta_2}\right]\right) + d_1\left(1 - \log\!\left[\frac{d_1}{\hat\theta_1^T I_O(\hat\theta_1)\hat\theta_1}\right]\right). \tag{32} \]

Case 2: Alternatively, when dk ≥ θ̂kᵀIO(θ̂k)θ̂k, we have ck = ∞, the prior variance goes to 0, and the prior distribution is a point mass at the mean, θ∗. Thus we have

\[ \log P(Y|M_k) = l(\hat\theta_k) - \frac{1}{2}\hat\theta_k^T I_O(\hat\theta_k)\hat\theta_k, \]

which gives

\[ \mathrm{SPBIC} = 2(l(\hat\theta_2) - l(\hat\theta_1)) - \hat\theta_2^T I_O(\hat\theta_2)\hat\theta_2 + \hat\theta_1^T I_O(\hat\theta_1)\hat\theta_1. \tag{33} \]
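For reference, the ML estimate of the scale factor derived above reduces to a one-line computation (a sketch with our own naming, where q is the quadratic form θ̂kᵀIO(θ̂k)θ̂k):

```python
def scale_factor_ratio(d_k, q):
    """ML estimate of c_k / N for the scaled unit information prior:
    finite in Case 1 (d_k < q), infinite in Case 2 (point-mass prior)."""
    return d_k / (q - d_k) if d_k < q else float("inf")
```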

A.3 Approximation through Eliminating O(1) Terms

An alternative derivation of the BIC does not require the specification of priors. Rather, it is justified by eliminating smaller order terms in the expansion of the marginal likelihood with arbitrary priors (see (24)). After providing the mathematical steps of deriving the BIC using (24), we show how approximations of the marginal likelihood with fewer assumptions can be obtained using readily available software output.

A.3.1 BIC: Elimination of Smaller Order Terms

One justification for the Schwarz (1978) BIC is based on the observation that the first and fourth terms in (24) have orders of approximation of O(N) and O(log(N)), respectively, while the second, third, and fifth terms are O(1) or less. This suggests that the former terms will dominate the latter terms when N is large. Ignoring these latter terms, we have

\[ \log P(Y|M_k) = \log P(Y|\hat\theta_k, M_k) - \frac{d_k}{2}\log(N) + O(1). \tag{34} \]

If we multiply this by −2, the BIC for, say, M1 and M2 is calculated as

\[ \mathrm{BIC} = 2[\log P(Y|\hat\theta_2, M_2) - \log P(Y|\hat\theta_1, M_1)] - (d_2 - d_1)\log(N). \tag{35} \]

Equivalently, using l(θ̂k), the log likelihood function, in place of the log P(Y|θ̂k, Mk) terms in (35), we have

\[ \mathrm{BIC} = 2[l(\hat\theta_2) - l(\hat\theta_1)] - (d_2 - d_1)\log(N). \tag{36} \]

The relative error of the BIC in approximating a Bayes Factor is O(1). This approximation will not always work well, for example when N is small, but also when sample size does not accurately summarize the amount of available information. This assumption breaks down, for example, when the explanatory variables are extremely collinear or have little variance, or when the number of parameters increases with sample size (Winship 1999; Weakliem 1999).

A.3.2 IBIC and Related Variants: Retaining Smaller Order Terms

Some of the terms in the approximation of the marginal likelihood that are dropped by the BIC can be easily calculated and retained. One variant, called the HBIC, retains the third term in equation (24) (Haughton 1988; Haughton, Oud, and Jansen 1997). A simulation study by Haughton, Oud, and Jansen (1997) found that this approximation performs better in model selection for structural equation models than does the usual BIC.

The fifth term can also be retained by calculating the estimated expected information matrix, ĪE(θ̂k), which is often available in statistical software for a variety of statistical models. Taking advantage of this and building on Haughton, Oud, and Jansen (1997), we propose a new approximation for the marginal likelihood, omitting only the second term in (24).

\[ \log P(Y|M_k) = \log P(Y|\hat\theta_k, M_k) - \frac{d_k}{2}\log\!\left(\frac{N}{2\pi}\right) - \frac{1}{2}\log\left|\bar{I}_E(\hat\theta_k)\right| + O(1). \]

For models M1 and M2 and multiplying by −2, this leads to a new approximation of a Bayes Factor, the Information Matrix-Based Bayesian Information Criterion (IBIC).7 This is given by

\[ \mathrm{IBIC} = 2[l(\hat\theta_2) - l(\hat\theta_1)] - (d_2 - d_1)\log\!\left(\frac{N}{2\pi}\right) - \log\left|\bar{I}_E(\hat\theta_2)\right| + \log\left|\bar{I}_E(\hat\theta_1)\right|. \tag{38} \]

Of the three approximations that we have discussed, the IBIC includes the most terms from equation (24), leaving out just the prior distribution term log P(θ̂k|Mk).

7 Another approximation, due to Kashyap (1982), is given by

\[ 2[l(\hat\theta_2) - l(\hat\theta_1)] - (d_2 - d_1)\log(N) - \log\left|\bar{I}_E(\hat\theta_2)\right| + \log\left|\bar{I}_E(\hat\theta_1)\right| \tag{37} \]

and is very similar to our proposed IBIC. The IBIC incorporates the estimated expected information matrix at the parameter estimates for the two models being compared.


References

Bentler, P. M. and Douglas G. Bonett. 1980. "Significance Tests and Goodness of Fit in the Analysis of Covariance Structures." Psychological Bulletin 88:588–606.

Bentler, P. M. and J. A. Stein. 1992. "Structural Equation Models in Medical Research." Statistical Methods in Medical Research 1(2):159–181.

Beran, Tanya N. and Claudio Violato. 2010. "Structural Equation Modeling in Medical Research: A Primer." BMC Research Notes 3(1):267–277.

Berger, James O. 1994. "An Overview of Robust Bayesian Analysis." Test 3(1):5–59.

Berger, James O., Surajit Ray, Ingmar Visser, M. J. Bayarri and W. Jang. 2006. Generalization of BIC. Technical report, University of North Carolina, Duke University, and SAMSI.

Bollen, Kenneth A. 1989. Structural Equations with Latent Variables. New York: John Wiley & Sons.

Bollen, Kenneth A. and J. Scott Long. 1993. Testing Structural Equation Models. Sage.

Bollen, Kenneth A. and Robert A. Stine. 1992. "Bootstrapping Goodness of Fit Measures in Structural Equation Models." Sociological Methods & Research 21:205–229.

Carlin, Bradley P. and Thomas A. Louis. 1996. Bayes and Empirical Bayes Methods for Data Analysis. New York: Chapman and Hall.

Cudeck, Robert and Michael W. Browne. 1983. "Cross-validation of Covariance Structures." Multivariate Behavioral Research 18(2):147–167.

Fougere, P., ed. 1990. Maximum Entropy and Bayesian Methods. Dordrecht, NL: Kluwer Academic Publishers.

Gelman, Andrew, John B. Carlin, Hal S. Stern and Donald B. Rubin. 1995. Bayesian Data Analysis. London: Chapman & Hall.

Haughton, Dominique M. A. 1988. "On the Choice of a Model to Fit Data From an Exponential Family." The Annals of Statistics 16(1):342–355.

Haughton, Dominique M. A., Johan H. L. Oud and Robert A. R. G. Jansen. 1997. "Information and Other Criteria in Structural Equation Model Selection." Communications in Statistics, Part B – Simulation and Computation 26(4):1477–1516.

Jöreskog, Karl. 1969. "A General Approach to Confirmatory Maximum Likelihood Factor Analysis." Psychometrika 34(2):183–202.

Jöreskog, Karl. 1973. Analysis of Covariance Structures. New York: Academic Press.

Jöreskog, Karl and Dag Sörbom. 1979. Advances in Factor Analysis and Structural Equation Models. Cambridge, MA: Abt Books.

Kashyap, Rangasami L. 1982. "Optimal Choice of AR and MA Parts in Autoregressive Moving Average Models." IEEE Transactions on Pattern Analysis and Machine Intelligence 4(2):99–104.

Kass, Robert E. and Adrian E. Raftery. 1995. "Bayes Factors." Journal of the American Statistical Association 90(430):773–795.

Kass, Robert E. and Suresh K. Vaidyanathan. 1992. "Approximate Bayes Factors and Orthogonal Parameters, with Application to Testing Equality of Two Binomial Proportions." Journal of the Royal Statistical Society, Series B: Methodological 54(1):129–144.

Kuha, Jouni. 2004. "AIC and BIC: Comparisons of Assumptions and Performance." Sociological Methods & Research 33(2):188–229.

Paxton, Pamela, Patrick J. Curran, Kenneth A. Bollen, Jim Kirby and Feinian Chen. 2001. "Monte Carlo Experiments: Design and Implementation." Structural Equation Modeling 8(2):287–312.

Raftery, Adrian E. 1993. Bayesian Model Selection in Structural Equation Models. In Testing Structural Equation Models, ed. Kenneth A. Bollen and J. Scott Long. Newbury Park, CA: Sage, pp. 163–180.

Raftery, Adrian E. 1995. Bayesian Model Selection in Social Research (with Discussion). In Sociological Methodology. Cambridge, MA: Blackwell, pp. 111–163.

Satorra, Albert and P. M. Bentler. 1988. Scaling Corrections for Chi-Square Statistics in Covariance Structure Analysis. In ASA 1988 Proceedings of the Business and Economic Statistics Section. Alexandria, VA: American Statistical Association, pp. 308–313.

Schnoll, R. A., C. Y. Fang and S. L. Manne. 2004. "The Application of SEM to Behavioral Research in Oncology: Past Accomplishments and Future Opportunities." Structural Equation Modeling 11(4):583–614.

Schwarz, Gideon. 1978. "Estimating the Dimension of a Model." Annals of Statistics 6(2):461–464.

Shipley, Bill. 2000. Cause and Correlation in Biology: A User's Guide to Path Analysis, Structural Equations, and Causal Inference. New York: Cambridge University Press.

Steiger, J. H. and J. C. Lind. 1980. Statistically Based Tests for the Number of Common Factors. Presented at the annual meeting of the Psychometric Society, Iowa City, IA.

Tierney, Luke and Joseph B. Kadane. 1986. "Accurate Approximations for Posterior Moments and Marginal Densities." Journal of the American Statistical Association 81(393):82–86.

Tucker, Ledyard R. and Charles Lewis. 1973. "A Reliability Coefficient for Maximum Likelihood Factor Analysis." Psychometrika 38(1):1–10.

Weakliem, David L. 1999. "A Critique of the Bayesian Information Criterion for Model Selection." Sociological Methods & Research 27(3):359–397.

Winship, Christopher. 1999. "Editor's Introduction to the Special Issue on the Bayesian Information Criterion." Sociological Methods & Research 27(3):355–358.

Table 1: Fitted Models for SIM1

Model  Description                                     # Par.  Dropped Parameters              Extra Parameters

Correct Model
M1     Correct model                                   23

Misspecified Measurement Models
M2     Drop 1 cross loading                            22      λ41
M3     Drop 2 cross loadings                           21      λ41, λ72
M4     Drop 3 cross loadings                           20      λ41, λ72, λ63
M5     Drop cross-loadings; add correlated errors      22      λ72, λ63                        θ67
M6     Correlated errors                               24                                      θ67
M7     Switch factor loadings                          23      λ21, λ52                        λ51, λ22

Misspecified Structural Models
M8     Add double-lagged effect                        24                                      γ31
M9     Add correlated errors and double-lagged effect  25                                      θ67, β31
M10    No relations among η's                          21      β21, β32
M11    Two latent variables (remove η3)                19      λ42, λ72, λ73, λ93, β32, ψ33    λ82, λ92
M12    Four latent variables (add η4)                  24      λ41, λ62, λ72, λ83, λ93         λ32, λ53, λ74, λ94, β43, ψ44

Table 2: Fitted Models for SIM2

Model  Description                                     # Par.  Dropped Parameters                                   Extra Parameters

Correct Model
M1     Correct model                                   35

Misspecified Measurement Models
M2     Drop 1 cross loading                            34      λ61
M3     Drop 2 cross loadings                           33      λ61, λ11,2
M4     Drop 3 cross loadings                           32      λ61, λ11,2, λ10,3
M5     Drop cross-loadings; add correlated errors      34      λ11,2, λ10,3                                         θ10,11
M6     Correlated errors                               36                                                           θ10,11
M7     Switch factor loadings                          35      λ21, λ72                                             λ71, λ22

Misspecified Structural Models
M8     Add double-lagged effect                        36                                                           γ31
M9     Add correlated errors and double-lagged effect  37                                                           θ10,11, γ31
M10    No relations among η's                          33      β21, β32
M11    Two latent variables (remove η3)                31      λ62, λ10,3, λ11,3, λ13,3, λ14,3, λ15,3, β32, ψ33     λ12,2, λ13,2, λ14,2, λ15,2
M12    Four latent variables (add η4)                  37      λ61, λ92, λ10,2, λ11,2, λ13,3, λ14,3, λ15,3          λ42, λ52, λ83, λ12,3, λ11,4, λ12,4, λ14,4, λ15,4, β43, ψ44

Figure 1: SIM1 Model Selection Results

[Figure: bar charts arranged by sample size (N = 100, 250, 500, 1000, 5000) and condition (True/Clear, True/Close, False/Clear, False/Close). For each criterion (BIC, IBIC, SPBIC, HBIC, AIC, p-value, IFI, CFI, TLI, RNI, RMSEA), the bars give the percentage of replications, 0–100, in which the criterion selected the false model versus the true model.]

Figure 2: SIM2 Model Selection Results

[Figure: bar charts arranged by sample size (N = 100, 250, 500, 1000, 5000) and condition (True/Clear, True/Close, False/Clear, False/Close). For each criterion (BIC, IBIC, SPBIC, HBIC, AIC, p-value, IFI, CFI, TLI, RNI, RMSEA), the bars give the percentage of replications, 0–100, in which the criterion selected the false model versus the true model.]