

GROWTH CURVES AND EXTENSIONS USING MPLUS

Alan C. Acock

[email protected]

Department of HDFS

322 Milam Hall

Oregon State University

Corvallis, OR 97331

This document and selected references, data, and programs can be downloaded from

http://oregonstate.edu/~acock/growth-curves/

A Note to Readers

These are lecture notes for a presentation at Academia Sinica in December of 2005. This is not a self-contained, systematic treatment of the topic. It is not intended for publication and has not been carefully edited for publication purposes. Instead, these notes are intended to complement a two-day workshop presentation. The workshop will expand and clarify many of the points presented in this document. The intention of this document is to help workshop participants follow the presentation. The notes are much more detailed than a usual PowerPoint set of slides, but much less detailed than a self-contained treatment of the topics. Others may find these notes useful, but they are not intended to be complete, nor as a substitute for participation in the workshop.

GROWTH CURVES AND EXTENSIONS USING MPLUS

Outline

1. Preparing data for Mplus

2. Basic analysis using Mplus

3. A basic growth curve

a. Conceptual Model of a Growth Curve

b. The Mplus program

c. Interpreting Output

d. Interpreting Graphic Output

4. Quadratic terms in growth curves

a. Conceptual model of a growth curve

b. The Mplus program

c. Interpreting output

d. Interpreting graphic output

5. Working with missing values in growth models

a. Introduction

b. The Mplus Program

c. Interpreting output

6. Multiple group models with growth curves

a. Simultaneous estimation in multiple groups

b. Including categorical predictors to show group differences

i. Conceptual model of a growth curve

ii. The Mplus program

iii. Interpreting output

iv. Interpreting graphic output

7. Inclusion of covariates to explain variation in level and trend

a. Conceptual model of a growth curve

b. The Mplus program

c. Interpreting output

d. Interpreting graphic output

8. Growth curves with binary variables

a. Conceptual model of a growth curve

b. The Mplus program

c. Interpreting output

d. Interpreting graphic output

9. Growth curves with counts and zero inflated counts

a. Conceptual model of a growth curve

b. The Mplus program

c. Interpreting output

d. Interpreting graphic output

10. Growth mixture models

a. Variable centered vs. person centered research

b. Conceptual model of a growth curve

c. The Mplus program

d. Interpreting output

e. Interpreting graphic output

Goal of the Workshop

The goal of this workshop is to explore a variety of applications of latent growth curve models using the Mplus program. Because we will cover a wide variety of applications and extensions of growth curve modeling, we will not cover each of them in great detail. A reading list is provided for those who want more extensive treatments of the topics we cover. At the end of this workshop it is hoped that participants will be able to run Mplus programs to execute a variety of growth curve modeling applications and to correctly interpret the results.

Assumed Background

It will be assumed that participants in the workshop have some background in Structural Equation Modeling. Background in multilevel analysis will also be useful. It is possible to learn how to estimate the specific models we will cover without a comprehensive knowledge of Mplus, but some background using an SEM program is useful.

Recommended Readings (selected readings can be downloaded from http://oregonstate.edu/~acock/growth-curves/)

1. Preparatory readings

a. Kline, R. B. (2005). Principles and Practice of Structural Equation Modeling, 2nd ed. New York: Guilford Press.

This is a general introduction to structural equation modeling that is more accessible than others. It does not cover growth curve modeling but does provide a solid background for what will be covered in the workshop.

b. Muthén, L., & Muthén, B. (2004). Mplus Statistical Analysis with Latent Variables: User’s Guide. Los Angeles, CA: Statmodel.

Participants who plan to use Mplus need a copy of the manual. The tentative target date for release of a new version of Mplus is the end of this year.

c. Acock, A. C. (2006). A Gentle Introduction to Stata. Stata Press. (www.stata-press.com). For those not already familiar with Stata, this is a basic introduction.

2. Basic growth curve modeling

a. Curran, P. J., & Hussong, A. M. (2003). The use of latent trajectory models in psychopathology research. Journal of Abnormal Psychology, 112:526-544. This is a general introduction to growth curves that is accessible.

b. Duncan, T. E., Duncan, S. C., Strycker, L. A., Li, F., & Alpert, A. (1999). An Introduction to Latent Variable Growth Curve Modeling: Concepts, Issues, and Applications. Mahwah, NJ: Lawrence Erlbaum Associates.

Classic text on growth curve modeling.

c. Kaplan, D. (2000). Chapter 8: Latent Growth Curve Modeling. In D. Kaplan, Structural Equation Modeling: Foundations and Extensions (pp 149-170). Thousand Oaks, CA: Sage. This is a short overview.

3. Working with missing values

a. Acock, A. (2005). Working with missing values. Journal of Marriage and Family 67:1012-1028.

b. Davey, A. Savla, J., & Luo, Z. (2005). Issues in Evaluating Model Fit with Missing Data. Structural Equation Modeling 12:578-597.

c. Royston, P. (2005). Multiple Imputation of Missing Values: Update. The Stata Journal 2:1-14.

4. Limited Outcome Variables: Binary and count variables

a. Muthén, B. (1996). Growth modeling with binary responses. In A. von Eye & C. Clogg (Eds.), Categorical Variables in Developmental Research: Methods of Analysis (pp. 37-54). San Diego, CA: Academic Press.

b. Long, J. S., & Freese, J. (2006). Regression Models for Categorical Dependent Variables Using Stata, 2nd ed. Stata Press (www.stata-press.com). This provides the most accessible and still rigorous treatment of how to use and interpret limited dependent variables.

c. Rabe-Hesketh, S., & Skrondal, A. (2005). Multilevel and Longitudinal Modeling Using Stata. Stata Press (www.stata-press.com). This discusses a free set of commands that can be added to Stata that will do most of what Mplus can do.

5. Growth mixture modeling

a. Muthén, B., & Muthén, L. K. (2000). Integrating person-centered and variable-centered analysis: Growth mixture modeling with latent trajectory classes. Alcoholism: Clinical and Experimental Research, 24:882-891.

This is an excellent and accessible conceptual introduction.

b. Muthén, B. (2001). Latent variable mixture modeling. In G. Marcoulides, & R. Schumacker (Eds.) New Developments and Techniques in Structural Equation Modeling (pp. 1-34). Mahwah, NJ: Lawrence Erlbaum.

c. Muthén, B., Brown, C. H., Jo, B., Khoo, S., Yang, C., Wang, C., Kellam, S., Carlin, J., & Liao, J. (2002). General growth mixture modeling for randomized preventive interventions. Biostatistics, 3:459-475.

d. Muthén, B. (2004). Latent variable analysis: Growth mixture modeling and related techniques for longitudinal data. In D. Kaplan (Ed.), Handbook of Quantitative Methodology for the Social Sciences (pp. 345-368). Newbury Park, CA: Sage Publications.


Brief Summary of Topics Covered in the Two Day Workshop

Introduction to Growth Curve Modeling

Growth Curves are a new way of thinking that is ideal for longitudinal studies. Instead of predicting a person’s score on a variable (e.g., mean comparison among scores at different time points or relationships among variables at different time points), we predict their growth trajectory—what is their level on the variable AND how is this changing. We will present a conceptual model, show how to apply the Mplus program, and interpret the results.

1. Working with Missing Values

Missing values are a problem in most social science research, but they are a special issue in longitudinal studies. In a 5-wave study a participant may have no data for some waves and may have incomplete data for the waves in which they were interviewed. We will discuss strategies for working with missing values (FIML and multiple imputation), show how to apply Mplus using FIML (Mplus can also analyze multiple data files from MI), and interpret the results.

2. Multiple Groups with Growth Curves

Comparing known groups (men vs. women, married vs. single parent) to assess how their growth trajectories differ. We will show how to do this using the Mplus program and how to interpret the results.

3. Predicting Patterns of growth

When we have established a growth trajectory, this raises the question of how to explain it. Why do some individuals increase or decrease on a characteristic while other individuals show little change? What predicts the level (initial or starting level) and the trajectory? We will show how to do this using Mplus and interpret the results.

4. Growth Curves with Limited Outcome Variables

Sometimes a researcher is interested in growth on a binary variable (ever drinking alcohol, for adolescents). Sometimes a researcher is interested in a count variable that involves a relatively rare event (number of days an adolescent had 5+ drinks of alcohol in the last 30 days). Sometimes we are interested in both types of variables, and different variables may predict the binary outcome than predict the count outcome. We will show how to do this using Mplus and interpret the results.

5. Growth Mixture Models

It is possible to use Mplus to do an exploratory growth curve analysis where our focus is on the person and not the variable. We can locate clusters of people who share similar growth trajectories. This is exploratory research and the standards for it are still evolving. An example would be a study of alcohol consumption from age 15 to 30. It is possible to empirically identify different clusters of people. One cluster may never drink or never drink very much. A second cluster may have increasing alcohol consumption up to about 22 or 23 and then a gradual decline. A third cluster may be very similar to the second cluster but not decline after 23. After deriving these clusters of people who share growth trajectories, it is possible to compare them to find what differentiates membership in the different clusters. We will show how to do these analyses using Mplus and interpret the results.

Creation of Dataset and Screening Program

Our initial example will look at the BMI (Body Mass Index) of adolescents as the BMI changes between the age of 12 and the age of 18. This data is from NLSY97 (National Longitudinal Survey of Youth, 1997), using the first 7 years of data.

Before we can do anything, we need to get the data into a format that Mplus can read. At this time, Mplus cannot read datasets in proprietary formats designed for other packages (Stata, SAS, SPSS). It needs an ASCII data file in which the values are separated (delimited) by a space, a comma, or arranged in a fixed format. There are many ways to do this, and whatever program you use for your standard data management/analysis can write a file in one of these formats. Some people put the file in Excel and then save it as a comma-delimited file (.csv). The file extension you should use for the data file Mplus will read is .dat.
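The conversion step can be sketched in Python (a minimal illustration, not part of the workshop materials; the records and file name here are hypothetical). The only real requirements are plain ASCII, a delimiter Mplus recognizes, and a numeric code such as -9999 for missing values:

```python
import csv

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "bmi97": 20.3, "bmi98": None},
    {"id": 2, "bmi97": None, "bmi98": 22.1},
]
fields = ["id", "bmi97", "bmi98"]

with open("toy_bmi.dat", "w", newline="") as f:
    writer = csv.writer(f)
    for rec in records:
        # Recode missing values to -9999, a value that is never legitimate here.
        writer.writerow([-9999 if rec[c] is None else rec[c] for c in fields])

# The result is a comma-delimited ASCII file with no header row,
# the layout the Names are subcommand will later describe to Mplus.
```

Whatever tool produces the file, the point is the same: no variable names in the file itself, one numeric missing-value code throughout.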

I use Stata and there is a close and developing relationship between Stata and Mplus. Michael Mitchell at UCLA wrote a Stata command that not only creates a dataset for Mplus, but even writes the initial program Mplus uses for basic analysis. If you have access to Stata, I recommend this command. It is called stata2mplus. If you have Stata, the command, findit stata2mplus, will locate this command and show you how to install it. An advantage of using the stata2mplus command is that it also creates a basic Mplus program that includes variable names and value labels as part of the title.

First, I open the Stata dataset within Stata and

· Keep only those items that I think might be useful for doing the growth curves. Within Mplus you have the option to select variables for each analysis, so it makes sense to keep all the variables you think you might use.

· If you have variables with long names, rename them so each variable is limited to 8 characters.

· Once I dropped the irrelevant variables using Stata, I saved the file to my flash drive. I gave it the name bmi_stata.dta (.dta is the file type Stata uses for a Stata dataset).

· Finally, I entered the following Stata command:

stata2mplus using "F:\flash\academica\bmi_stata"

which produced the following output:

Looks like this was a success.

To convert the file to mplus, start mplus and run

the file F:\flash\academica\bmi_stata.inp

What this program does is create two files:

· bmi_stata.dat which is a data file Mplus can read, and

· bmi_stata.inp which is a program file Mplus can run to do basic analysis. The .inp is the file type Mplus uses for its own programs.

Here are the first five cases in my dataset, bmi_stata.dat. The assumed file extension is *.dat. Mplus can read a file in this format. The stata2mplus command recoded all missing values into a -9999. You can override this if you want a -9999 to be a real value.

7935,-9999,4,-9999,2,1,2,28.67739,28.33963,26.62286,27.24928,23.62529,25.84016,26.57845,0,1,0,0,0

5526,3,-9999,2,-9999,0,1,39.29696,-9999,44.28067,39.85261,44.28067,46.0519,44.80619,1,0,0,0,0

5369,1,-9999,1,-9999,0,1,18.28824,19.7513,20.5957,19.30637,19.30637,21.03148,20.36911,1,0,0,0,0

919,0,-9999,2,-9999,0,4,17.93367,18.17581,19.18558,19.52778,19.52778,20.50417,20.30889,0,0,0,1,0

7429,4,-9999,4,-9999,0,3,17.56995,21.25472,21.1316,21.28223,21.96355,22.12994,21.96355,0,0,1,0,0
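As a quick sanity check on the missing-value convention, a short Python sketch (my illustration, not part of the original materials) parses two of the rows above, mapping -9999 to a true missing value before computing a mean:

```python
# Two of the rows shown above; -9999 is the missing-value code.
rows = [
    "7935,-9999,4,-9999,2,1,2,28.67739,28.33963,26.62286,27.24928,"
    "23.62529,25.84016,26.57845,0,1,0,0,0",
    "5526,3,-9999,2,-9999,0,1,39.29696,-9999,44.28067,39.85261,"
    "44.28067,46.0519,44.80619,1,0,0,0,0",
]

def parse(line):
    # Convert each field to a float, mapping the -9999 code to None (missing).
    return [None if v == "-9999" else float(v) for v in line.split(",")]

parsed = [parse(r) for r in rows]

# Column 8 (index 7) is bmi97 in the variable list given below.
bmi97 = [row[7] for row in parsed if row[7] is not None]
print(sum(bmi97) / len(bmi97))
```

This is essentially what the Missing are all (-9999); subcommand tells Mplus to do internally.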

When I open the Mplus Editor I can then open the file bmi_stata.inp. I’ve made some minor changes in the file by deleting some lines and adding one subcommand I will explain in a minute.

Title:

bmi_stata.inp

Stata2Mplus convertsion for F:\flash\academica\bmi_stata.dta

id : PUBID - YTH ID CODE 1997

grlprb_y : GIRLS BEHAVE/EMOT SCALE, YTH RPT 1997

boyprb_y : BOYS BEHAVE/EMOT SCALE, YTH RPT 1997

grlprb_p : GIRLS BEHAVE/EMOT SCALE, PAR RPT 1997

boyprb_p : BOYS BEHAVE/EMOT SCALE, PAR RPT 1997

male :

race_eth :

1: white

2: black

3: hispanic

4: asian

5: other

black :

hispanic :

asian :

other :

Data:

File is F:\flash\academica\bmi_stata.dat ;

Variable:

Names are

id grlprb_y boyprb_y grlprb_p boyprb_p male race_eth bmi97 bmi98 bmi99

bmi00 bmi01 bmi02 bmi03 white black hispanic asian other;

Missing are all (-9999) ;

! usevariables excludes grlprb and boyprob variables

! because these are sex specific.

Usevariables are race_eth bmi97 bmi98 bmi99

bmi00 bmi01 bmi02 bmi03 white black hispanic asian other;

Analysis:

Type = basic ;

This basic program, bmi_stata.inp, produces the

· Means,

· Variances, and

· Covariances of the variables

It is useful to run this basic analysis, regardless of how you get the data into Mplus, and compare the results to the corresponding values from your standard statistics package to make sure the transfer was successful.

First, we will go over the command structure of this basic Mplus program. It is surprising how few commands we need to add as we move on to more complex analysis.

· LISREL users will find the command structure of Mplus remarkably simple

· AMOS users should appreciate how much more efficient this code is than drawing a complex model on a computer screen

Mplus programs are divided into a series of sections. Each major section of the program begins with a keyword at the start of a line. The major keywords in this example are

· Title:

· Data:

· Variable:

· Analysis:

The colon is part of the keyword name. These will be highlighted in blue (automatically) in the actual program. Mplus uses a “;” to mark the end of a command or subcommand (similar to SAS).

The Title: section

Everything after Title: is part of the title until a line beginning with Data: appears. It is helpful to include a description of the purpose of the program as well as a name for the program as part of the title. In addition to what the stata2mplus program generates, I’ve added a line with the name of the file, bmi_stata.inp, so I can link a printed copy to the actual file at a later date. The stata2mplus command we ran in Stata puts a lot in the title including the value labels where they are available. You might edit these out of the file to make the file shorter.

The Data: section

This section tells Mplus where to find the file containing the data. The full path is provided and I think it is a good idea to have no spaces in the path. If you do have spaces, put quotation marks around the path as in

File is "F:\my flash\academica sinica\bmi_stata.dat" ;

Notice the semicolon at the end of the statement. Statements can continue for several lines, but each ends with a semicolon. This is the way SAS does it, for those familiar with SAS.

The Variable: section

This section consists of a series of subcommands that tell Mplus the names of the variables, what values are missing, and a subset of variables to be included in the current program. Variable names are case sensitive. The names “hispanic” and “Hispanic” are different variable names.

The subcommand, Names are, is followed by a list of variable names whose order matches the order in the data file; the list can run over several lines, ending with a semicolon. Putting the subcommand keyword, Names are, on a separate line is unnecessary but helps readers of a program. Limiting names to 8 characters with no spaces simplifies things. The next subcommand, Missing are all (-9999);, tells Mplus that all variables have a missing value of -9999. You can use any value here, and it is possible to have different values for different variables. I recommend that you replace all missing values in your dataset with some value, such as -9999, that is never a legitimate value. Mplus can incorporate missing values in the analysis using a FIML approach or multiple imputation, which we will discuss later, so if there are observations that definitely should be excluded from analysis, drop those cases before transferring the data to an Mplus data file.

I’ve inserted a comment by putting an exclamation mark, “!”, at the start of a line. Then I’ve inserted the Usevariables are subcommand to select a subset of variables. This is a useful subcommand if you have a larger file that will be used for a variety of separate Mplus analyses. Notice that I’ve dropped the items about problems for girls and boys. Without this deletion, the program would have no observations with complete data.

The Analysis: section

The last section of the program is the Analysis: section, and it has a single subcommand, Type = basic ;. This section is often omitted because the type of analysis is often a default for a particular model.

There are two major sections that are not in this program because they are not applicable here.

· Model: that includes the model we are estimating and

· Output: that lists the specific statistical and graphic output we want.

The following is selected output from the basic analysis. I’ve put key values in bold and preceded comments I inserted with an “!”.

SUMMARY OF ANALYSIS

Number of groups 1

Number of observations 1098! listwise

! deletion is the default

Number of dependent variables 13

Number of independent variables 0

Number of continuous latent variables 0

Observed dependent variables

Continuous ! default treats variables as continuous.

RACE_ETH BMI97 BMI98 BMI99 BMI00 BMI01

BMI02 BMI03 WHITE BLACK HISPANIC ASIAN

OTHER

Estimator ML

Information matrix EXPECTED

Maximum number of iterations 1000

Convergence criterion 0.500D-04

Maximum number of steepest descent iterations 20

! SAMPLE STATISTICS should be compared to original data

Means

RACE_ETH BMI97 BMI98 BMI99 BMI00

________ ________ ________ ________ ________

1 1.762 20.279 21.513 22.315 22.997

Means

BMI01 BMI02 BMI03 WHITE BLACK

________ ________ ________ ________ ________

1 23.445 23.991 24.486 0.542 0.231

Means

HISPANIC ASIAN OTHER

________ ________ ________

1 0.179 0.017 0.030

. . .

Correlations

RACE_ETH BMI97 BMI98 BMI99 BMI00

________ ________ ________ ________ ________

RACE_ETH 1.000

BMI97 0.128 1.000

BMI98 0.105 0.762 1.000

BMI99 0.091 0.761 0.852 1.000

BMI00 0.103 0.731 0.816 0.862 1.000

BMI01 0.123 0.714 0.809 0.861 0.874

BMI02 0.116 0.638 0.705 0.739 0.744

BMI03 0.107 0.664 0.715 0.759 0.775

WHITE -0.827 -0.175 -0.168 -0.151 -0.152

BLACK 0.130 0.117 0.143 0.136 0.128

HISPANIC 0.577 0.094 0.069 0.051 0.052

ASIAN 0.296 -0.006 -0.034 -0.016 -0.017

OTHER 0.569 0.012 0.009 0.001 0.023

Correlations

BMI01 BMI02 BMI03 WHITE BLACK

________ ________ ________ ________ ________

BMI01 1.000

BMI02 0.802 1.000

BMI03 0.820 0.753 1.000

WHITE -0.167 -0.162 -0.149 1.000

BLACK 0.126 0.108 0.118 -0.597 1.000

HISPANIC 0.066 0.090 0.053 -0.509 -0.257

ASIAN 0.003 -0.003 0.003 -0.144 -0.073

OTHER 0.027 0.004 0.024 -0.191 -0.097

Correlations

HISPANIC ASIAN OTHER

________ ________ ________

HISPANIC 1.000

ASIAN -0.062 1.000

OTHER -0.082 -0.023 1.000

It is always important to compare these values to those you had using your standard statistical package.

A Growth Curve

Estimating a basic growth curve using Mplus is quite easy. When developing a complex model it is best to start simple and gradually build complexity. Starting simple should include data screening to evaluate the distributions of the variables, patterns of missing values, and possible outliers. We will start by fitting a basic growth curve. Even if you have a theoretically specified model that is complex, always start with the simplest model and gradually add the complexity. Here we will show how structural equation modeling conceptualizes a latent growth curve, show the Mplus program, explain the new program features, and interpret the output.

Before showing a figure to represent a growth curve, we will examine a small sample of our observations:

A BMI value of 25 is considered overweight and a BMI of 30 is considered obese. With just 10 observations it is hard to see much of a trend, but it looks like people are getting a bigger BMI score as they get older. The X-axis value of 0 is when the adolescent was 12 years old, the 1 is when the adolescent was 13 years old, etc. We are using seven waves of data (labeled 0 to 6) from the panel study. We will see how to create these graphs shortly.

A growth curve requires us to have a model and we should draw this before writing the Mplus program. Figure 1 shows a model for our simple growth curve:

This figure is much simpler than it first appears.

· The key variables are the two latent variables labeled the Intercept and the Slope.

· The intercept represents the initial level and is sometimes called the initial level for this reason. It is the estimated initial level and its value may differ from the actual mean for BMI97 because in this case we have a linear growth model. It may differ from the mean of BMI97 by a lot when covariates are added because of the adjustments for the covariates.

· Unless the covariates are centered, it usually makes sense to just call it an intercept rather than the initial level. The intercept is identified by the constant loadings of 1.0 going to each BMI score. Some programs call the intercept the constant, representing the constant effect.

· The slope is identified by fixing the values of the paths to each BMI variable. In a publication you normally would not show the path to BMI97, since this is fixed at 0.0.

· We fix the other paths at 1.0, 2.0, 3.0, 4.0, 5.0, and 6.0. Where did we get these values? The first year is the base year, or year zero. The BMI was measured each subsequent year, so these are scored 1.0 through 6.0. Other values are possible. Suppose the survey was not done in 2000 or 2001, so that we had 5 time points rather than 7. We would use paths of 0.0, 1.0, 2.0, 5.0, and 6.0 for years 1997, 1998, 1999, 2002, and 2003.

· It is also possible to fix the first couple years and then allow the subsequent waves to be free. This might make sense for a developmental process where the yearly intervals may not reflect the developmental rate. Developmental time may be quite different than chronological time. This has the effect of “stretching” or “shrinking” time to the pattern of the data (Curran & Hussong, 2003). An advantage of this approach is that it uses fewer degrees of freedom than adding a quadratic slope.
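The role of these fixed loadings can be seen in a small Python sketch (my illustration; the intercept and slope values are invented, not estimates from these data). Each wave's model-implied mean is 1 × intercept plus the wave's time score × slope:

```python
# Time scores for the seven annual waves, 1997-2003 (1997 is year zero).
time_scores = [0, 1, 2, 3, 4, 5, 6]

# Loading matrix: one row per wave, columns are [intercept, slope].
loadings = [[1, t] for t in time_scores]

def implied_means(intercept_mean, slope_mean):
    # Model-implied mean of each wave: 1*intercept + t*slope.
    return [i * intercept_mean + t * slope_mean for i, t in loadings]

# With illustrative values intercept = 20 and slope = 0.5:
print(implied_means(20, 0.5))  # a straight line from 20.0 up to 23.0

# Had the 2000 and 2001 waves been skipped, the time scores would be
# 0, 1, 2, 5, 6 and the loading rows would change accordingly.
loadings_5wave = [[1, t] for t in [0, 1, 2, 5, 6]]
```

Freeing some of the later loadings, as described above, amounts to letting the data estimate those time scores instead of fixing them.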

The individuals in our sample will each have their own

· BMI score for each year,

· Intercept, and

· Slope.

The mean intercept and mean slope represent the overall trend.

Features to notice in the figure:

· The individual variation around the Intercept and Slope is represented in Figure 1 by RI and RS. These are the variances of the intercept and slope around their respective means.

· We expect there would be substantial variance in both of these as some individuals have a higher or lower starting BMI and some individuals will increase (or decrease) their BMI at a different rate than the average growth rate.

· In addition to the mean intercept and slope, each individual will have their own intercept and slope. We say the intercept and the slope are random effects.

a. They are random in the sense that each individual may have a steeper or flatter slope than the mean slope and

b. Each individual may have a higher or lower initial level than the mean intercept.

c. In our sample of 10 individuals shown above, notice one adolescent starts with a BMI around 12 and three adolescents start with a BMI around 30. Some have a BMI that increases and others do not.

· The variances, RI and RS are critical if we are going to explore more complex models with covariates (e.g., gender, psychological problems, race) that might explain why some individuals have a steeper or less steep growth rate than the average.

· The ei terms represent individual error terms for each year. Some years may move above or below the growth trend described by our Intercept and Slope. Sometimes it might be important to allow error terms to be correlated, especially subsequent pairs such as e97-e98, e98-e99, etc.

This is all there is to conceptualizing a growth model within an SEM framework. This is an equivalent conceptualization to studying growth curves using a multilevel approach.

Here is the Mplus program:

Title:

bmi_growth.inp

Stata2Mplus convertsion for F:\flash\academica\bmi_stata.dta

Data:

File is "F:\flash\academica\bmi_stata.dat" ;

Variable:

Names are

id grlprb_y boyprb_y grlprb_p boyprb_p male race_eth bmi97 bmi98 bmi99

bmi00 bmi01 bmi02 bmi03 white black hispanic asian other;

Missing are all (-9999) ;

! usevariables is limited to bmi variables

Usevariables are bmi97 bmi98 bmi99

bmi00 bmi01 bmi02 bmi03 ;

Model:

i s | bmi97@0 bmi98@1 bmi99@2 bmi00@3 bmi01@4 bmi02@5 bmi03@6;

Output:

Sampstat Mod(3.84);

Plot:

Type is Plot3;

Series = bmi97 bmi98 bmi99 bmi00 bmi01 bmi02 bmi03(*);

What is new in this program?

· The first change is that we modify the Usevariables are subcommand to include only the bmi variables, since we are doing a growth curve for these variables.

· We drop the Analysis: section because we are doing a basic growth curve and can use the default options.

· We have added a Model: section because we need to describe the model. Because Mplus was a late arrival among SEM software, it was designed after growth curves were well understood.

· Instead of tricking Mplus into doing a growth curve, Mplus has a simple built-in way of specifying one that matches the assumptions of our model. There is a single line to describe our model:

i s | bmi97@0 bmi98@1 bmi99@2 bmi00@3 bmi01@4 bmi02@5 bmi03@6;

a. In this line the “i” and “s” stand for intercept and slope. We could have called these anything, such as intercept and slope or initial and trend. The vertical line, | , tells Mplus that it is about to define an intercept and slope.

b. There are defaults that we do not need to state. For example, the intercept is defined by a constant loading of 1.0 for each bmi variable. This is normally the case, so it is a default.

c. The slope is defined by fixing the path from the slope to bmi97 at 0, the path to bmi98 at 1, etc. The @ sign is used for “at.” Don’t forget the semi-colon to end the command.

· Mplus assumes there is random error, ei for each variable and that these are uncorrelated.

· If we wanted to allow e97 and e98 to be correlated we would need to add a line saying bmi97 with bmi98; . This may seem strange because we are not really correlating bmi97 with bmi98, but e97 with e98. Mplus knows this and we do not need to generate a separate set of names for the error terms.

· Mplus also assumes that there is a residual variance for both the intercept and slope (RI and RS) and that these covary. Therefore, we do not need to mention this.

The last additional section in our Mplus program is for selecting what output we want Mplus to provide. There are many optional outputs of the program and we will only illustrate a few of these. The Output: section has the following lines

Output:

Sampstat Mod(3.84);

Plot:

Type is Plot3;

Series = bmi97 bmi98 bmi99 bmi00 bmi01 bmi02 bmi03(*);

· The first line, Sampstat Mod(3.84), asks for sample statistics and for modification indices for parameters we might free, as long as freeing them would reduce chi-square by at least 3.84 (corresponding to the .05 level). We do not bother with parameter estimates that would have less effect than this.

· Next comes the Plot: subcommand, and we say that we want Type is Plot3; for our output. This gives us the descriptive statistics and graphs for the growth curve.

· The last line of the program specifies the series to plot. By entering the variables with an (*) at the end we are assigning time scores of 0.0 for bmi97, 1.0 for bmi98, etc.
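The 3.84 cutoff is simply the .05 critical value of a chi-square distribution with one degree of freedom, i.e., the square of the familiar two-sided z value 1.96. A quick check using only Python's standard library:

```python
from statistics import NormalDist

# A chi-square(1) variable is the square of a standard normal, so its
# 95th percentile equals the squared two-sided .05 critical z value.
critical = NormalDist().inv_cdf(0.975) ** 2
print(round(critical, 2))  # 3.84
```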

Annotated Selected Growth Curve Output

The following is selected output with comments:

Number of observations 1102 ! listwise, an alternative is FIML estimation

Number of dependent variables 7 !these are the bmi scores

Number of independent variables 0

Number of continuous latent variables 2 !these are the intercept and slope

Continuous latent variables

I S

!These are the only latent variables

Estimator ML

TESTS OF MODEL FIT

!These have the standard interpretations. It is okay if the fit is not perfect here because when we add the covariates we may get a better fit. The chi-square is significant, as it usually is for a large sample, because any model is unlikely to be a perfect fit to the data. However, the CFI = .977 and TLI = .979 are both in the very good range (i.e., over .96 is very good). The RMSEA is .088, and this is not very good. Ideally, this should be below .06, and a value that is not below .08 is considered problematic. The SRMR = .048 is acceptable (less than .05)

Chi-Square Test of Model Fit

Value 220.570

Degrees of Freedom 23

P-Value 0.0000

Chi-Square Test of Model Fit for the Baseline Model

Value 8568.499

Degrees of Freedom 21

P-Value 0.0000

CFI/TLI

CFI 0.977

TLI 0.979

RMSEA (Root Mean Square Error Of Approximation)

Estimate 0.088

90 Percent C.I. 0.078 0.099

Probability RMSEA <= .05 0.000

SRMR (Standardized Root Mean Square Residual)

Value 0.048
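The fit indices reported above can be recomputed by hand from the two chi-square tests. This is a minimal sketch (assuming the common N - 1 convention in the RMSEA denominator), handy for checking output or comparing runs:

```python
# Recompute CFI, TLI, and RMSEA from the fitted-model and baseline-model
# chi-square values that Mplus reports.
from math import sqrt

def fit_indices(chi2_m, df_m, chi2_b, df_b, n):
    # CFI compares model misfit (chi-square minus df) to baseline misfit.
    cfi = 1 - max(chi2_m - df_m, 0) / max(chi2_b - df_b, chi2_m - df_m, 0)
    # TLI penalizes per degree of freedom.
    tli = ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1)
    # RMSEA: misfit per degree of freedom, per observation.
    rmsea = sqrt(max(chi2_m - df_m, 0) / (df_m * (n - 1)))
    return cfi, tli, rmsea

cfi, tli, rmsea = fit_indices(220.570, 23, 8568.499, 21, 1102)
print(round(cfi, 3), round(tli, 3), round(rmsea, 3))  # 0.977 0.979 0.088
```

Running this on the values above reproduces CFI = .977, TLI = .979, and RMSEA = .088.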

MODEL RESULTS

Estimates S.E. Est./S.E.

! The loadings for I and S are all fixed, so there are no significance tests for them.

I |

BMI97 1.000 0.000 0.000

BMI98 1.000 0.000 0.000

BMI99 1.000 0.000 0.000

BMI00 1.000 0.000 0.000

BMI01 1.000 0.000 0.000

BMI02 1.000 0.000 0.000

BMI03 1.000 0.000 0.000

S |

BMI97 0.000 0.000 0.000

BMI98 1.000 0.000 0.000

BMI99 2.000 0.000 0.000

BMI00 3.000 0.000 0.000

BMI01 4.000 0.000 0.000

BMI02 5.000 0.000 0.000

BMI03 6.000 0.000 0.000

! The slope and intercept are correlated, the covariance is

! .416, z = 5.551, p < .001 (WITH means covariance in Mplus)

S WITH

I 0.416 0.075 5.551

Means

I 20.798 0.117 178.026

!Initial level, intercept = 20.798, (BMI starts at 20.798) z = 178.026; p < .001

!Slope = .668 (BMI goes up .668 each year), z = 35.183; p < .001

S 0.668 0.019 35.183

Intercepts

BMI97 0.000 0.000 0.000

BMI98 0.000 0.000 0.000

BMI99 0.000 0.000 0.000

BMI00 0.000 0.000 0.000

BMI01 0.000 0.000 0.000

BMI02 0.000 0.000 0.000

BMI03 0.000 0.000 0.000

! Variances, Ri and Rs in the figure, are both significant. This is what covariates will try to explain—why do some youth start higher/lower and have a different trend, i.e., slope, for the BMI?

Variances

I 13.184 0.643 20.504

S 0.213 0.018 12.147

! Following are the residual variances for the observed variables, hence they are the errors, ei’s in our figure.

Residual Variances

BMI97 5.391 0.290 18.583

BMI98 2.729 0.159 17.124

BMI99 2.697 0.144 18.752

BMI00 3.529 0.178 19.860

BMI01 2.334 0.144 16.187

BMI02 9.533 0.457 20.837

BMI03 7.134 0.397 17.956
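The intercept and slope means above imply a straight-line prediction, 20.798 + .668(Time), for the average BMI at each wave. A quick check in a few lines:

```python
# Model-implied mean BMI at each of the seven waves (Time = 0 for 1997,
# ..., 6 for 2003), from the estimated intercept and slope means.
intercept_mean, slope_mean = 20.798, 0.668
implied = [round(intercept_mean + slope_mean * t, 3) for t in range(7)]
print(implied)  # [20.798, 21.466, 22.134, 22.802, 23.47, 24.138, 24.806]
```

So the model predicts the average adolescent moves from a BMI of about 20.8 to about 24.8 over the seven waves.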

MODEL MODIFICATION INDICES

Minimum M.I. value for printing the modification index 3.840

M.I. E.P.C. Std E.P.C. StdYX E.P.C.

! Many of these changes make no sense. We could let the path of the slope to BMI03 be free and chi-square would drop by about 45 points.

BY Statements

I BY BMI97 87.808 -0.038 -0.139 -0.032

I BY BMI99 25.404 0.013 0.049 0.011

I BY BMI00 21.840 0.014 0.050 0.011

I BY BMI03 29.103 -0.026 -0.093 -0.016

S BY BMI97 55.850 -0.870 -0.402 -0.093

S BY BMI99 17.773 0.315 0.145 0.034

S BY BMI00 18.572 0.352 0.162 0.035

S BY BMI03 44.611 -0.915 -0.423 -0.074

! When Mplus cannot compute a value, it prints 999.000. Normally, ignore these.

ON/BY Statements

S ON I /

I BY S 999.000 0.000 0.000 0.000

! These “with” statements are for correlated errors. Some make sense, some don’t.

WITH Statements

BMI99 WITH BMI97 4.993 -0.349 -0.349 -0.019

BMI99 WITH BMI98 8.669 0.362 0.362 0.020

BMI00 WITH BMI97 3.912 -0.322 -0.322 -0.016

BMI00 WITH BMI99 17.357 0.503 0.503 0.026

BMI01 WITH BMI97 8.255 -0.421 -0.421 -0.021

BMI01 WITH BMI98 7.032 -0.300 -0.300 -0.015

BMI01 WITH BMI00 12.398 0.447 0.447 0.021

BMI02 WITH BMI97 4.707 0.560 0.560 0.023

BMI02 WITH BMI99 5.455 -0.431 -0.431 -0.018

BMI02 WITH BMI00 9.829 -0.649 -0.649 -0.025

BMI02 WITH BMI01 4.305 0.413 0.413 0.015

BMI03 WITH BMI97 36.224 1.488 1.488 0.060

BMI03 WITH BMI99 9.296 -0.525 -0.525 -0.021

BMI03 WITH BMI00 8.824 -0.583 -0.583 -0.022

BMI03 WITH BMI02 8.242 0.931 0.931 0.029

! We do not pay much attention to these intercepts because Mplus automatically fixes them at zero. Before freeing these, it would make more sense to free some of the coefficients for the slope, e.g., 0, 1, *, *, *, *, *, or to try a quadratic slope as discussed in a later section.

Means/Intercepts/Thresholds

[ BMI97 ] 79.520 -0.770 -0.770 -0.179

[ BMI99 ] 19.737 0.250 0.250 0.058

[ BMI00 ] 17.444 0.257 0.257 0.056

[ BMI03 ] 23.066 -0.483 -0.483 -0.084

PLOT INFORMATION

The following plots are available:

Histograms (sample values, estimated factor scores, estimated values)

Scatterplots (sample values, estimated factor scores, estimated values)

Sample means

Estimated means

Sample and estimated means

Observed individual values

Estimated individual values

Here are Some of the Available Plots

It is often useful to show the actual BMI values for a small random sample of participants. These are the observed individual values.

· Click on Graphs

· Observed Individual Values

This gives you a menu where you can make some selections. I used the clock to seed the random selection of cases.

Here I selected Random Order and for 20 cases. This results in the following graph:

This shows one person who started at an obese BMI = 30 and then dropped down. However, most people increased gradually.

Next, let's look at a plot of the actual means and the estimated means using our linear growth model. Click on

· Graphs and then select

· Sample and estimated means.

You can improve this graph. You might click on the legend and move it so it is not over the trend lines. You can right click inside the graph and add labels for the X axis and Y axis. You can change the labels, and you can adjust the range for each axis.

Notice that there is a clear growth trend in BMI. A BMI of 15-20 is considered healthy and a BMI of 25 is considered overweight. Notice what happens to American youth between the age of 12 and the age of 18.

A Growth Curve with a Quadratic Term

This graph is useful for seeing whether there is a nonlinear trend. It is simple to add a quadratic term if the curve departs from linearity. Looking at the graph, it may seem that the linear trend works very well, but our RMSEA was a bit large and the estimated initial BMI is higher than the observed mean. A quadratic might pick this up by having a curve that drops slightly to pick up the BMI97 mean.

The conceptual model in Figure 1 will be unchanged except that a third latent variable is added. We will have the Intercept, the Slope (now called the Linear trend), and a new latent variable called the Quadratic trend. Like the first two, the Quadratic trend will have a residual variance (RQ) that is free to covary with RI and RL. The paths from the Quadratic trend to the individual BMI variables are the squares of the paths from the Linear trend to the BMI variables. Hence the values for the linear trend will remain 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, and 6.0. For the quadratic trend these values will be 0.0, 1.0, 4.0, 9.0, 16.0, 25.0, and 36.0.

You really appreciate the defaults in Mplus when you see what we need to change in the Mplus program when we add a quadratic slope. Here is the only change we need to make:

Model:

i s q| bmi97@0 bmi98@1 bmi99@2 bmi00@3 bmi01@4 bmi02@5 bmi03@6;

Mplus will know that the quadratic, q (we could use any name), will have values that are the squares of the values for the slope, s.

Here is selected output:

TESTS OF MODEL FIT

! We have lost 4 degrees of freedom, one for each of these new parameters:

· mean for the quadratic slope,

· variance for the quadratic slope,

· covariance of the Rq with Ri

· covariance of Rq with Rs

! The fit is excellent.

Chi-Square Test of Model Fit

Value 61.791 ! Was 220.570

Degrees of Freedom 19 ! Was 23

P-Value 0.0000

Chi-Square Test of Model Fit for the Baseline Model

Value 8568.499

Degrees of Freedom 21

P-Value 0.0000

CFI/TLI

CFI 0.995 ! Was .977

TLI 0.994 ! Was .979

RMSEA (Root Mean Square Error Of Approximation)

Estimate 0.045 ! Was .088

90 Percent C.I. 0.033 0.058

Probability RMSEA <= .05 0.715

SRMR (Standardized Root Mean Square Residual)

Value 0.022

MODEL RESULTS

! The loadings for I and S are the same as above. The paths for Q are simply the squared values.

Q |

BMI97 0.000 0.000 0.000

BMI98 1.000 0.000 0.000

BMI99 4.000 0.000 0.000

BMI00 9.000 0.000 0.000

BMI01 16.000 0.000 0.000

BMI02 25.000 0.000 0.000

BMI03 36.000 0.000 0.000

S WITH

I 0.575 0.220 2.616

Q WITH

I -0.038 0.034 -1.116

S -0.130 0.021 -6.324

! The negative mean, -.064, for the quadratic term suggests a leveling off of the growth curve.

Means

I 20.439 0.118 173.266

S 1.045 0.049 21.108

Q -0.064 0.008 -8.183

Variances

I 12.381 0.671 18.462

S 0.984 0.134 7.357

Q 0.023 0.004 6.412

Residual Variances

BMI97 4.318 0.316 13.660

BMI98 2.789 0.158 17.613

BMI99 2.442 0.141 17.357

BMI00 3.187 0.173 18.418

BMI01 2.354 0.147 16.022

BMI02 9.521 0.454 20.948

BMI03 4.989 0.491 10.157

The fit is so good because the estimated means and observed means are so close. However, there is still significant variance among individual adolescents that needs to be explained. Here are 20 estimated individual growth curves. Notice that each of these is a curve, but they start at different initial levels and have different trajectories. Next, we want to use covariates to explain these differences in the initial levels and growth trajectories.
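The implied mean curve, 20.439 + 1.045(Time) - .064(Time)², and the point where it would level off, can be checked quickly from the estimates above:

```python
# Model-implied mean BMI curve from the quadratic growth estimates, plus
# the vertex of the parabola (the Time at which mean BMI would peak).
i_mean, s_mean, q_mean = 20.439, 1.045, -0.064

implied = [round(i_mean + s_mean * t + q_mean * t**2, 3) for t in range(7)]
print(implied)  # [20.439, 21.42, 22.273, 22.998, 23.595, 24.064, 24.405]

peak = -s_mean / (2 * q_mean)  # years after 1997 at which the curve peaks
print(round(peak, 2))          # 8.16 -- just beyond the observed range
```

The vertex falls just past the last wave, which is consistent with the gradual leveling off rather than an actual decline within the observed ages.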

An Alternative to Use of a Quadratic Slope

An alternative to adding a quadratic slope is to allow some of the time loadings to be free. We have used loadings of 0, 1, 2, 3, 4, 5, 6 for the linear slope and 0, 1, 4, 9, 16, 25, 36 for the quadratic slope. Alternatively, we could allow all but two of the loadings to be free. We might use loadings of 0, 1, *, *, *, *, *. It is necessary to have the 0 and 1 fixed, but the 1 does not have to be second; we could use 0, *, *, *, *, *, 1.

You may ask how you could justify allowing some of the time loadings to be free when there was a one-month or one-year difference between waves of data. The answer is that developmental time may be different from chronological time. Allowing these loadings to be free costs about the same number of degrees of freedom as the quadratic model but still allows for growth spurts. This model is not nested under the quadratic model, but a linear growth model with fixed values for each year (0, 1, 2, 3, 4, 5, 6) is nested within the free model that uses 0, 1, *, *, *, *, *. If the free model fits much better than the fixed linear model, you might use it instead of the quadratic model.
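The degrees-of-freedom bookkeeping behind these competing specifications can be sketched as follows (counting the sample means plus variances and covariances of the seven waves as the observed moments):

```python
# Degrees of freedom for growth models on p = 7 repeated measures.
# Observed moments: p means plus p(p+1)/2 variances and covariances.
p = 7
moments = p + p * (p + 1) // 2  # 7 + 28 = 35

def model_df(n_factors, n_free_loadings=0):
    covs = n_factors * (n_factors - 1) // 2          # growth-factor covariances
    params = (2 * n_factors                          # factor means + variances
              + covs + p + n_free_loadings)          # + residuals + freed loadings
    return moments - params

print(model_df(2))     # 23 -- linear model (matches the df reported earlier)
print(model_df(3))     # 19 -- quadratic model (matches the df above)
print(model_df(2, 5))  # 18 -- linear with time scores 0, 1, *, *, *, *, *
```

With seven waves, freeing five loadings costs five parameters while the quadratic factor costs four, so the two are nearly equally parsimonious here.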

Working with Missing Values

Mplus has two ways of working with missing values. The simplest is to use full information maximum likelihood (FIML) estimation with missing values. This uses all available data. For example, some adolescents were interviewed all seven years, but others may have skipped one, two, or even more years. We use all available information with this approach. The second approach is to use multiple imputation.

· Multiple imputation should not be confused with the single imputation available in SPSS's optional missing values module, which gives incorrect standard errors.

· Multiple imputation involves

a. Imputing multiple datasets (usually 5-10) using appropriate procedures,

b. Estimating the model for each of these datasets, and

c. Then pooling the estimates and standard errors.

When the standard errors are pooled this way, they incorporate the variability across the 5-10 solutions and thereby produce unbiased estimates of the standard errors. Multiple imputation can be done with:

· Norm, a freeware program that works for normally distributed, continuous variables and is often used even on dichotomized variables.

· A Stata user has written a program called ICE, an implementation of the S-Plus program MICE, which has advantages over Norm. It does the imputation using different estimation models for outcome variables that are continuous, counts, or categorical. See Royston (2005).

· Mplus can read these multiple datasets, estimate the model for each dataset, and pool the estimates and their standard errors.
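Step (c) above, pooling the estimates and standard errors, follows Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and the total variance combines the average within-imputation variance with the between-imputation variance. A minimal sketch with hypothetical numbers:

```python
# Rubin's rules for pooling one parameter across m imputed datasets.
from math import sqrt
from statistics import mean, variance

def pool(estimates, std_errors):
    m = len(estimates)
    q_bar = mean(estimates)               # pooled point estimate
    w = mean(se**2 for se in std_errors)  # within-imputation variance
    b = variance(estimates)               # between-imputation variance
    total = w + (1 + 1/m) * b             # total variance
    return q_bar, sqrt(total)

# Hypothetical slope estimates and standard errors from five imputations:
est, se = pool([0.66, 0.70, 0.68, 0.67, 0.69],
               [0.020, 0.021, 0.019, 0.020, 0.022])
print(round(est, 3), round(se, 3))  # 0.68 0.027
```

Note that the pooled standard error (.027) is larger than any single-imputation standard error; that extra width is exactly the imputation uncertainty that single imputation ignores.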

We will not illustrate the multiple imputation approach because that involves working with other programs to impute the datasets. However, the Mplus User's Guide discusses how you specify the datasets in the Data: section. We will illustrate the FIML approach because it is widely used and easily implemented, and it doesn't require explaining another software package.

The conceptual model does not change with missing values. The programming for implementing the FIML solution changes very little. You will recall that we did not need an Analysis: section in our program for doing a growth curve. However, we do need one when we are doing a growth curve with missing values and using FIML estimation. Directly above the Model command we insert

Analysis:

Type = General Missing H1 ;

Estimator = MLR ;

· Type = General Missing H1; this line is the key change.

· The Missing keyword tells Mplus to use full information maximum likelihood estimation.

· The H1 is necessary to get sample statistics in our output.

· We could do this with maximum likelihood estimation, but we will use a robust maximum likelihood estimator, Estimator = MLR, instead. This is optional, but generally conservative when you have substantial missing values.

In the Output: section, we also add a single word, patterns. This will give us a lot of information about patterns of missing values. We will see just what patterns there are, the frequency of occurrence of each pattern, and the percentage of data present for each covariance estimate.

Output:

Sampstat Mod(3.84) patterns ;

Plot:

Type is Plot3;

Series = bmi97 bmi98 bmi99 bmi00 bmi01 bmi02 bmi03(*);

Also, to simplify our presentation we will take out the quadratic term (the fit is better with the quadratic term, but it takes more space to present and interpret the results).

Here are selected, annotated results:

*** WARNING

Data set contains cases with missing on all variables.

These cases were not included in the analysis.

Number of cases with missing on all variables: 3

1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS

SUMMARY OF ANALYSIS

Number of groups 1

Number of observations 1768 ! We had 1102 observations using listwise deletion.

Number of dependent variables 7

Number of independent variables 0

Number of continuous latent variables 2

Observed dependent variables

Continuous

BMI97 BMI98 BMI99 BMI00 BMI01 BMI02 BMI03

Continuous latent variables

I S

Estimator MLR

! Robust ML estimator

Information matrix OBSERVED

Maximum number of iterations 1000

Convergence criterion 0.500D-04

Maximum number of steepest descent iterations 20

Maximum number of iterations for H1 2000

Convergence criterion for H1 0.100D-03

! An ‘x’ means the data are present. Pattern 1 -- no missing values

! Pattern 2 – missing BMI03

SUMMARY OF MISSING DATA PATTERNS

MISSING DATA PATTERNS

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

BMI97 x x x x x x x x x x x x x x x x x x x x

BMI98 x x x x x x x x x x x x x x x x x x x x

BMI99 x x x x x x x x x x x x x x x

BMI00 x x x x x x x x x x x x x

BMI01 x x x x x x x x x x x x

BMI02 x x x x x x x x x x x

BMI03 x x x x x x x x x x

21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

BMI97 x x x x x x x x x x x x x x x x x x x x

BMI98 x x x x x x x x x

BMI99 x x x x x x x x x x x

BMI00 x x x x x x x x x x

BMI01 x x x x x x x x x

BMI02 x x x x x x x x x

BMI03 x x x x x x x x x x

41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

BMI97 x x x x x x x x x x x x x x x

BMI98 x x x x x

BMI99 x x x x x x

BMI00 x x x x x x x x x x x

BMI01 x x x x x x x x x x x x

BMI02 x x x x x x x x x x x

BMI03 x x x x x x x x x x

61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80

BMI97

BMI98 x x x x x x x x x x x

BMI99 x x x x x x x x x x

BMI00 x x x x x x x x x x x x

BMI01 x x x x x x x x x x x x

BMI02 x x x x x x x x x x x x x x

BMI03 x x x x x x x x x x x x

81

BMI97

BMI98

BMI99

BMI00

BMI01 x

BMI02 x

BMI03 x

MISSING DATA PATTERN FREQUENCIES

Pattern Frequency Pattern Frequency Pattern Frequency

1 1102 28 2 55 26

2 97 29 10 56 53

3 73 30 51 57 9

4 38 31 4 58 9

5 21 32 3 59 2

6 11 33 1 60 4

7 5 34 1 61 1

8 20 35 1 62 4

9 23 36 3 63 1

10 4 37 6 64 3

11 8 38 1 65 5

12 3 39 1 66 1

13 8 40 1 67 1

14 3 41 3 68 1

15 11 42 6 69 1

16 25 43 3 70 2

17 6 44 1 71 1

18 3 45 1 72 14

19 2 46 2 73 1

20 3 47 1 74 1

21 1 48 6 75 2

22 1 49 3 76 1

23 2 50 2 77 1

24 7 51 3 78 7

25 1 52 3 79 1

26 1 53 3 80 2

27 6 54 3 81 4

! We might want to set some minimum standard and drop observations that do not meet that. For example, we might drop people who are missing their BMI for more than 3 waves.

COVARIANCE COVERAGE OF DATA

Minimum covariance coverage value 0.100

PROPORTION OF DATA PRESENT

Covariance Coverage

BMI97 BMI98 BMI99 BMI00 BMI01

________ ________ ________ ________ ________

BMI97 0.925

BMI98 0.847 0.902

BMI99 0.850 0.856 0.910

BMI00 0.842 0.846 0.864 0.906

BMI01 0.839 0.837 0.854 0.859 0.904

BMI02 0.796 0.794 0.805 0.811 0.817

BMI03 0.777 0.775 0.788 0.788 0.801

Covariance Coverage

BMI02 BMI03

________ ________

BMI02 0.861

BMI03 0.774 0.840

! We have 77.4% of the 1768 observations answering both BMI02 and BMI03

SAMPLE STATISTICS

! Notice that the means are not dramatically different from the results of the “basic” analysis that had 1102 observations using listwise deletion. This is reassuring that our missing values are not creating a systematic bias.

Means

BMI97 BMI98 BMI99 BMI00 BMI01

________ ________ ________ ________ ________

1 20.572 21.839 22.651 23.305 23.846

Means

BMI02 BMI03

________ ________

1 24.390 24.935

TESTS OF MODEL FIT

! If you compare nested models with MLR estimation you need to use the scaling correction factor as discussed on their web page. We are not doing that here, so this is okay.

Chi-Square Test of Model Fit

Value 116.426*

Degrees of Freedom 23

P-Value 0.0000

Scaling Correction Factor 2.302

for MLR

* The chi-square value for MLM, MLMV, MLR, ULS, WLSM and WLSMV cannot be used

for chi-square difference tests. MLM, MLR and WLSM chi-square difference

testing is described in the Mplus Technical Appendices at www.statmodel.com.

See chi-square difference testing in the index of the Mplus User's Guide.
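The scaled difference test the footnote refers to is the Satorra-Bentler scaled chi-square difference described on statmodel.com: recover the uncorrected chi-squares by multiplying each printed value by its scaling correction factor, then divide the difference by a difference-test scaling factor. A sketch with hypothetical values (the real test requires both nested models' correction factors from the output):

```python
# Satorra-Bentler scaled chi-square difference test for MLR estimation.
# t0, t1: printed chi-squares of the nested and comparison models;
# c0, c1: their scaling correction factors; d0, d1: their degrees of freedom.
def scaled_diff(t0, c0, d0, t1, c1, d1):
    cd = (d0 * c0 - d1 * c1) / (d0 - d1)  # difference-test scaling factor
    trd = (t0 * c0 - t1 * c1) / cd        # scaled difference chi-square
    return trd, d0 - d1

# Hypothetical values for two nested MLR models:
trd, dof = scaled_diff(116.4, 2.30, 23, 61.8, 2.10, 19)
print(round(trd, 2), dof)  # 42.44 4
```

Because the correction factors enter the formula, simply subtracting the two printed MLR chi-squares gives the wrong test statistic.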

! The chi-square is much bigger when we use FIML estimation with missing values, in part because the sample is so much bigger. Still, there are some fit problems without the quadratic term. Both the CFI and TLI are a bit low to be ideal (under .96). However, the RMSEA is good, and that is the most widely used measure of fit.

Chi-Square Test of Model Fit for the Baseline Model

Value 1279.431

Degrees of Freedom 21

P-Value 0.0000

CFI/TLI

CFI 0.926

TLI 0.932

RMSEA (Root Mean Square Error Of Approximation)

Estimate 0.048

SRMR (Standardized Root Mean Square Residual)

Value 0.051

! The results are similar to the linear model solution with listwise deletion, but our z-scores are bigger due to having more observations.

S WITH

I 0.408 0.112 3.658

Means

I 21.035 0.105 200.935

S 0.701 0.022 32.311

Variances

I 15.051 0.958 15.714

S 0.255 0.031 8.340

Residual Variances

BMI97 5.730 0.638 8.981

BMI98 3.276 0.414 7.907

BMI99 3.223 0.351 9.175

BMI00 4.361 0.973 4.483

BMI01 2.845 0.355 8.005

BMI02 9.380 3.384 2.772

BMI03 8.589 2.736 3.139

PLOT INFORMATION

The following plots are available:

Histograms (sample values, estimated factor scores, estimated values)

Scatterplots (sample values, estimated factor scores, estimated values)

Sample means

Estimated means

Sample and estimated means

Observed individual values

Estimated individual values

Multiple Cohort Growth Model with Missing Waves

Major datasets often have multiple cohorts. NLSY97 has youth who were 12-18 in 1997. Seven years later, they are 19-25. It is quite likely that many growth processes that involve going from the age of 12 to the age of 19 are different than going from 19-25. For example, involvement in minor crimes (petty theft, etc.) may increase from 12 to 19, but then decrease from there to 25. Here is what we might have for our NLSY97 data

Individual  Cohort  1997  1998  1999  2000  2001  2002  2003
    1        1985     3     4     5     6     7     7     8
    2        1985     2     4     3     5     6     7     7
    3        1984     4     5     6     7     6     6     5
    4        1982     6     7     5     4     3     2     2
    5        1982     5     5     6     4     2     2     1

We can rearrange this data:

Case  Cohort  HD12  HD13  HD14  HD15  HD16  HD17  HD18  HD19  HD20  HD21
 1     1985     3     4     5     6     7     7     8     *     *     *
 2     1985     2     4     3     5     6     7     7     *     *     *
 3     1984     *     4     5     6     7     6     6     5     *     *
 4     1982     *     *     *     6     7     5     4     3     2     2
 5     1982     *     *     *     5     5     6     4     2     2     1

In this table, HD is the age at which the data were collected. To capture everybody we would need to extend the table to HD25, because the youth who were 18 in 1997 are 25 seven years later.

This table would have massive amounts of missing data, but the missingness depends only on the cohort, not on other variables. The data would be missing at random.

We could develop a growth curve that covered the full range from age 12 to age 25. We would have 14 waves of data even though each participant was only measured 7 times. Each participant would have data for 7 of the years and have missing values for the other 7 years.
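The rearrangement shown in the table above, from calendar-year columns to age columns, can be sketched in a few lines ('*' marks a value that is missing by design):

```python
# Rearrange one case's cohort-by-year records into age-based columns
# (HD12..HD25), as in the table above.
YEARS = range(1997, 2004)  # the seven NLSY97 waves used here

def to_age_columns(cohort, values):
    """Map values collected in 1997-2003 onto age columns HD12..HD25."""
    row = {f"HD{age}": "*" for age in range(12, 26)}
    for year, value in zip(YEARS, values):
        row[f"HD{year - cohort}"] = value  # age at that wave = year - cohort
    return row

# Case 1 from the table: born in 1985, so age 12 at the 1997 wave.
row = to_age_columns(1985, [3, 4, 5, 6, 7, 7, 8])
print([row[f"HD{a}"] for a in range(12, 19)])  # [3, 4, 5, 6, 7, 7, 8]
print(row["HD19"])                             # *
```

Each case fills seven consecutive age columns and leaves the other seven as missing, which is exactly the pattern the growth model then handles with FIML.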

We would want to estimate a growth model with a quadratic term and expect the linear slope to be positive (growth from 12-18) and the quadratic term to be negative (decline from 18-25).

Mplus has a special Analysis: type called MCOHORT. There is an example on the Mplus web page, and we will not cover it here. This is an extraordinary way to deal with missing values.

Here is an example from data Muthén analyzed:

Multiple group growth curves

Multiple group analysis using SEM is extremely flexible; some would say it is too flexible because there are so many possibilities. We use gender for our grouping variable because we are interested in the trend in BMI for girls compared to boys. We think adolescent girls are more concerned about their weight and are therefore more likely to have a lower BMI than boys and a flatter trajectory.

There are several ways of comparing a model across multiple groups.

One approach is to see if the same model fits each group, allowing all of the estimated parameters to be different.

· Here we are saying that a linear growth model fits the data for both boys and girls, but

· We are not constraining girls and boys to have the same values on any of the parameters

· intercept mean

· slope mean

· intercept variance

· slope variance

· covariance of intercept and slope

· residual errors

We can then put increasing invariance constraints on the model. At a minimum, we want to test whether the two groups have a different intercept (level) and slope. If this constraint is acceptable we can add additional constraints on the variances, covariances, and error terms.

First, we will estimate the model simultaneously for girls and boys with no constraints on the parameters. Here is the program with new commands highlighted:

Title:

bmi_growth_gender.inp

Data:

File is "F:\flash\academica\bmi_stata.dat" ;

Variable:

Names are

id grlprb_y boyprb_y grlprb_p boyprb_p male race_eth bmi97 bmi98 bmi99

bmi00 bmi01 bmi02 bmi03 white black hispanic asian other;

Missing are all (-9999) ;

! usevariables keeps bmi variables and gender

Usevariables are male bmi97 bmi98 bmi99

bmi00 bmi01 bmi02 bmi03 ;

Grouping is male (0=female 1=male);

Model:

i s | bmi97@0 bmi98@1 bmi99@2 bmi00@3 bmi01@4 bmi02@5 bmi03@6;

Output:

Sampstat Mod(3.84) ;

Plot:

Type is Plot3;

Series = bmi97 bmi98 bmi99 bmi00 bmi01 bmi02 bmi03(*);

The only changes we need are in the Variable: section. We have a binary variable, male, coded 0 for females and 1 for males. We need to add it to the list of variables we are using. Then we add a subcommand that says we have a grouping variable, names it, and defines what the values are so the output will be labeled nicely. The command Grouping is male (0=female 1=male); is going to give us a separate set of estimates of the parameters for girls (labeled female) and boys (labeled male).

Here is selected, annotated output:

SUMMARY OF ANALYSIS

Number of groups 2

Number of observations

Group FEMALE 528

Group MALE 574

Number of dependent variables 7

Number of independent variables 0

Number of continuous latent variables 2

Variables with special functions

Grouping variable MALE

SAMPLE STATISTICS FOR FEMALE

Means

BMI97 BMI98 BMI99 BMI00 BMI01

________ ________ ________ ________ ________

1 19.904 21.198 21.752 22.349 22.805

Means

BMI02 BMI03

________ ________

1 23.606 23.961

SAMPLE STATISTICS FOR MALE

Means

BMI97 BMI98 BMI99 BMI00 BMI01

________ ________ ________ ________ ________

1 20.652 21.835 22.858 23.638 24.063

Means

BMI02 BMI03

________ ________

1 24.370 24.994

TESTS OF MODEL FIT

Chi-Square Test of Model Fit

Value 320.535

Degrees of Freedom 46 ! Notice we have twice the degrees of freedom

P-Value 0.0000

Chi-Square Test of Model Fit for the Baseline Model

Value 8906.678

Degrees of Freedom 42

P-Value 0.0000

CFI/TLI

CFI 0.969

TLI 0.972

RMSEA (Root Mean Square Error Of Approximation)

Estimate 0.104

90 Percent C.I. 0.093 0.115

SRMR (Standardized Root Mean Square Residual)

Value 0.063

MODEL RESULTS

Estimates S.E. Est./S.E.

Group FEMALE

I |

S WITH

I 0.465 0.090 5.187

Means

I 20.421 0.157 130.261

S 0.610 0.024 24.975

Variances

I 11.579 0.801 14.457

S 0.183 0.020 8.920

Residual Variances

BMI97 4.632 0.351 13.183

BMI98 2.033 0.177 11.463

BMI99 1.896 0.153 12.367

BMI00 4.567 0.312 14.644

BMI01 2.298 0.192 11.984

BMI02 15.204 0.991 15.342

BMI03 3.400 0.349 9.730

Group MALE

S WITH

I 0.337 0.114 2.956

Means

I 21.215 0.171 124.278

S 0.697 0.027 25.551

Variances

I 14.528 0.991 14.660

S 0.232 0.026 8.918

Residual Variances

BMI97 6.306 0.471 13.391

BMI98 3.445 0.269 12.800

BMI99 3.405 0.241 14.108

BMI00 2.651 0.195 13.612

BMI01 2.132 0.183 11.671

BMI02 4.304 0.332 12.960

BMI03 10.570 0.730 14.484

Here is the graph of the two growth curves. It appears that the girls have a lower initial level and a flatter rate of growth of BMI.

We can re-estimate the model with the intercept and slope invariant. To do this we make the following modifications to the model:

Model:

i s | bmi97@0 bmi98@1 bmi99@2 bmi00@3 bmi01@4 bmi02@5 bmi03@6;

[i] (1);

[s] (2);

Model male:

[i] (1);

[s] (2);

Output:

Sampstat Mod(3.84) ;

Plot:

Type is Plot3;

Series = bmi97 bmi98 bmi99 bmi00 bmi01 bmi02 bmi03(*);

Notice that we added two lines to the Model: section,

· [i] (1); and

· [s] (2);.

Then we added a subsection called Model male:, where males are the second group, and put the same two lines there. The first Model: command is understood to apply to the group coded zero on the male variable. These changes force the intercepts to be equal in both groups, because both are assigned parameter label (1), and the slopes to be equal, because both are assigned label (2). Any parameters that share the same label are constrained to be equal across groups.

When we run the revised program we obtain a chi-square that has two extra degrees of freedom because of the two constraints.

TESTS OF MODEL FIT

Chi-Square Test of Model Fit

Value 338.157 ! Was 320.535

Degrees of Freedom 48 ! Was 46

P-Value 0.0000

Chi-Square Test of Model Fit for the Baseline Model

Value 8906.678

Degrees of Freedom 42

P-Value 0.0000

CFI/TLI

CFI 0.967 ! Was .969

TLI 0.971 ! Was .972

RMSEA (Root Mean Square Error Of Approximation)

Estimate 0.105 ! Was .104

90 Percent C.I. 0.094 0.115

SRMR (Standardized Root Mean Square Residual)

Value 0.081

We can test the difference between

· the chi-square(48) = 338.157.

· the chi-square(46) = 320.535.

· This difference, 17.622, has 48 - 46 = 2 degrees of freedom and is significant at the p < .001 level.

· Although we can say there is a highly significant difference between the level and trend for girls and boys, we need to be cautious because this difference of chi-square has the same problem with a large sample size that the original chi-squares have.

· In fact, the measures of fit are hardly changed whether we constrain the intercept and slope to be equal or not. Moreover, the visual difference in the graph is not dramatic.
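The chi-square difference computation above can be verified directly; with 2 degrees of freedom the chi-square p-value has the closed form exp(-x/2), so no table is needed:

```python
# Chi-square difference test for the nested (constrained) vs. free model.
from math import exp

diff = 338.157 - 320.535  # constrained minus unconstrained chi-square
dof = 48 - 46
p = exp(-diff / 2)        # chi-square survival function, valid for 2 df only
print(round(diff, 3), dof, p < 0.001)  # 17.622 2 True
```

(With MLR estimation this simple subtraction would not be valid; here both models were fit with ordinary ML, so it is.)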

We could also put other constraints on the two solutions such as equal variances and covariances, and even equal residual error variances, but we will not.

Alternative to Multiple Group Analysis

An alternative way of doing this, where there are two groups, is to enter the grouping variable as a predictor. This requires re-conceptualizing our model. We can think of the indicator variable Male having a direct path to both the intercept and the slope. Because the indicator variable is coded as 1 for male and 0 for female,

· If the path from Male to the Intercept is positive this means that boys have a higher initial level on BMI.

· Similarly, if there is a positive path from Male to the Slope, this indicates that boys have a steeper slope than girls on BMI.

· Such results would be consistent with our expectation that boys both start higher and gain more fat than girls during adolescence.

· This approach does not let us test for other types of invariances such as the variances, covariances, and error terms.

The following figure shows these two paths. We have omitted the residual variances, RI and RS, and their covariance to simplify the figure. However, it is important to remember that it is these two variances we are explaining. We are explaining why some people have a higher or lower initial level and why some have a steeper or flatter slope by whether they are a girl or a boy. Here is the figure:

Here is part of the program:

Variable:

Names are

id grlprb_y boyprb_y grlprb_p boyprb_p male race_eth bmi97 bmi98 bmi99 bmi00 bmi01 bmi02 bmi03 white black hispanic asian other;

Missing are all (-9999) ;

! usevariables is limited to bmi variables and male

Usevariables are male bmi97 bmi98 bmi99

bmi00 bmi01 bmi02 bmi03 ;

Model:

i s | bmi97@0 bmi98@1 bmi99@2 bmi00@3 bmi01@4 bmi02@5 bmi03@6;

i on male ;

s on male ;

Output:

Sampstat Mod(3.84) ;

Plot:

Type is Plot3;

Series = bmi97 bmi98 bmi99 bmi00 bmi01 bmi02 bmi03(*);

Here is selected, annotated output:

TESTS OF MODEL FIT

Chi-Square Test of Model Fit

Value 237.517 ! We cannot compare this to the chi-square for the two-group design because this model is not nested in that one.

Degrees of Freedom 28

P-Value 0.0000

Chi-Square Test of Model Fit for the Baseline Model

Value 8602.391

Degrees of Freedom 28

P-Value 0.0000

CFI/TLI

CFI 0.976

TLI 0.976

Loglikelihood

H0 Value -19515.302

H1 Value -19396.543

Information Criteria

Number of Free Parameters 14

Akaike (AIC) 39058.603

Bayesian (BIC) 39128.672

Sample-Size Adjusted BIC 39084.204

(n* = (n + 2) / 24)

RMSEA (Root Mean Square Error Of Approximation)

Estimate 0.082

90 Percent C.I. 0.073 0.092

Probability RMSEA <= .05 0.000

SRMR (Standardized Root Mean Square Residual)

Value 0.044

MODEL RESULTS

I ON

MALE 0.793 0.233 3.409 ! Males higher

S ON

MALE 0.084 0.038 2.203 ! Males steeper

S WITH

I 0.400 0.075 5.371

Intercepts

BMI97 0.000 0.000 0.000

BMI98 0.000 0.000 0.000

BMI99 0.000 0.000 0.000

BMI00 0.000 0.000 0.000

BMI01 0.000 0.000 0.000

BMI02 0.000 0.000 0.000

BMI03 0.000 0.000 0.000

I 20.385 0.168 121.416

S 0.625 0.027 22.816

! When we add one or more predictors of the intercept and slope, the intercept and slope means are not reported under a section called “means” but are now under “intercepts”

Residual Variances

BMI97 5.391 0.290 18.583

BMI98 2.731 0.159 17.129

BMI99 2.696 0.144 18.752

BMI00 3.524 0.177 19.858

BMI01 2.327 0.144 16.175

BMI02 9.552 0.458 20.846

BMI03 7.148 0.398 17.974

I 13.027 0.636 20.471

S 0.212 0.017 12.095

!Both the intercept and slope still have variance to explain

We see that the intercept is 20.385 and the slope is .625. How is gender related to this? For girls the equation is:

Est. BMI = 20.385 + .625(Time) + .793(Male) + .084(Male)(Time)

= 20.385 + .625(Time) + .793(0) + .084(0)(Time)

= 20.385 + .625(Time)

For boys the equation is:

Est BMI = 20.385 + .625(Time) + .793(1) + .084(1)(Time)

= (20.385 + .793) + (.625 + .084)(Time)

= 21.178 + .709(Time)

Where Time is coded as 0, 1, 2, 3, 4, 5, 6
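The two equations can be checked numerically; with these rounded estimates the boys' value at Time = 6 works out to 21.178 + .709(6) = 25.432:

```python
# Predicted BMI from the conditional growth model estimates above.
# Time runs 0 (age 12) to 6 (age 18); Male is 0 for girls, 1 for boys.
def est_bmi(time, male):
    return 20.385 + 0.625 * time + 0.793 * male + 0.084 * male * time

print(round(est_bmi(6, 0), 3))  # 24.135 -- girls at age 18
print(round(est_bmi(6, 1), 3))  # 25.432 -- boys at age 18
```

The .084 coefficient is the gender difference in the slope, so the gap between boys and girls widens by about .08 BMI units each year.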

Using these we estimate the BMI for girls is initially 20.385. By the seventh year (Time = 6) it will be 20.385 + .625(6) or 24.135

Using these results, we estimate the BMI for boys is initially 21.178. By the seventh year it will be 21.178 + .709(6) or 25.432. Since a BMI of 25 is considered overweight, by the age of 18 we estimate the average boy will be classified as overweight.

We could use the plots provided by Mplus, but if we wanted a nicer looking plot we could use another program. I used Stata getting this graph.

The Stata command is twoway (connected Girls Age, lcolor(black) lpattern(dash) lwidth(medthick)) (connected Boys Age, lcolor(black) lpattern(solid) lwidth(medthick)), ytitle(Body Mass Index) xtitle(Age of Adolescent) caption(NLSY97 Data)

and the data is

+-----------------------+

| Age Girls Boys |

|-----------------------|

1. | 12 20.385 21.178 |

2. | 18 24.135 26.034 |

+-----------------------+
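The fitted equations above can be verified with a few lines of Python; the coefficients come from the MODEL RESULTS section, and Time = 0 corresponds to age 12.

```python
# Predicted BMI from the fitted growth model:
# Est. BMI = 20.385 + .625(Time) + .793(Male) + .084(Male)(Time)
def est_bmi(time, male):
    intercept = 20.385 + 0.793 * male   # I intercept plus the effect of MALE
    slope = 0.625 + 0.084 * male        # S mean plus the effect of MALE
    return intercept + slope * time

print(round(est_bmi(0, male=0), 3))   # girls at age 12 -> 20.385
print(round(est_bmi(6, male=0), 3))   # girls at age 18 -> 24.135
print(round(est_bmi(0, male=1), 3))   # boys at age 12 -> 21.178
print(round(est_bmi(6, male=1), 3))   # boys at age 18 -> 25.432
```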

[Figure: “Body Mass Index by Age of Adolescent: Comparison of Girls with Boys.” Estimated BMI (vertical axis, 20 to 26) plotted against age (horizontal axis, 12 to 18); dashed line for girls, solid line for boys. NLSY97 data.]

When we treat a categorical variable as a grouping variable and do a multiple-group comparison, we can test the equality of all the parameters. When we treat it as a predictor, as in this example, we only test whether the intercept and slope differ between the two groups. Here we do not allow the other parameters, such as the residual variances, to differ for boys and girls, and this might be a problem in some applications.

Growth Curves with Time Invariant Covariates

An extension of having a single categorical predictor is to have a series of covariates that explain variance in the intercept and slope. In this example we use what are known as time-invariant covariates: covariates that either remain constant (gender) or that were measured only at the start of the study. It is possible to add time-varying covariates as well.

This has been called Conditional Latent Trajectory Modeling (Curran & Hussong, 2003) because your initial level and trajectory (slope) are conditional on other variables.

This is equivalent to the multilevel approach that calls the intercept and slope random effects. With programs such as HLM we use what they call a two-level approach. Here are the parallels, using a slide adapted from Muthén, who has said this is the most critical thing to understand for these procedures.

Level 1 is defined as the measurement model with an intercept (level) and slope (trend/trajectory):

(1) yit = η0i + η1ixt + εit

Level 2, represented by equations (2a) and (2b), treats the intercept and slope as random variables that are explained by a vector of covariates:

(2a) η0i = α0 + γ0wi + ζ0i

(2b) η1i = α1 + γ1wi + ζ1i

· The yit is the outcome. In our example it is the score on BMI for individual “i” at time “t”.

a. In our figures we show them as yt and the “i” is implicit.

b. That is each individual can have a different y value at each time.

· The xt is the time score. In our example of BMI we use 0, 1, 2, 3, 4, 5, 6

· The η0i is the intercept for individual “i”.

a. The graph just below equation (1) shows three individuals who each have a different intercept.

b. Individual “1” has a higher starting value than individuals 2 or 3.

c. In the figure we show η0 because this represents the mean of η0i.

d. The paths from η0 to each yt are fixed at 1 because the effect is constant.

· The η1ixt is the slope for individual “i” times his or her score on time.

a. With our BMI example, we score time as 0, 1, 2, 3, 4, 5, 6.

b. In the figure we use η1 because this represents the mean of η1i.

c. The paths from η1 to each yt for BMI are 0, 1, 2, 3, 4, 5, 6. Other codings of time are possible.

· If we had a quadratic term, we would add η2ixt². For BMI the xt² values would be 0, 1, 4, 9, 16, 25, 36.

· The εit is the residual error on y for individual “i” at time “t”.

a. With BMI you can imagine many factors that could have a temporary influence on a person’s BMI score on the day it was measured.

b. The figure shows et (t = 1, 2, etc.) and the “i” is implicit.

An important distinction that some make between HLM and SEM programs is that SEM programs cannot let the time of measurement vary between individuals. If the youth are measured each year, it is important that all of them are measured at the same time so they are all one year apart. Mplus has a way of eliminating this limitation of SEM by allowing each individual to have a different time between measurements. For example, Li might be measured at 12-month intervals, while Jones might be measured at intervals of 11 months, then 13 months, then 9 months, etc. We are not discussing these extensions at this point (see TSCORES in the User’s Manual).

Equations (2a) and (2b) are the level two equations. Here we are explaining the individual variance in the intercept and the slope.

· η0i is the random intercept that varies from one individual to another

· η1i is the random slope that varies from one individual to another

· The wi is a vector of covariates. This can be generalized to include any number of categorical or continuous factors that predict the random intercept and random slope from equation (1). In the last section we had gender as the only w predictor.

· The α0 is the fixed intercept: the value of η0i for a person who has a value of zero on all the covariates.

· The γ0 is the fixed slope (notice that there is no “i” subscript, so the same slope applies to all individuals) for the effect of the covariate when predicting η0i. In the previous section this was the slope for gender. Because it was positive, we said males had a higher intercept than females.

· The α1 is the fixed intercept: the value of η1i for those who have a value of zero on all the wi variables.

· The γ1wi is the effect of the wi variable on the slope, η1i. In our last example, because males had a positive γ1, we can say that males gain weight (BMI) more quickly than females between the ages of 12 and 18.

· The ζ0i and ζ1i are the residuals. These are very important. A significant residual variance means that our vector of wi does not completely explain the intercept or slope.
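To make the two levels concrete, here is a minimal simulation of equations (1), (2a), and (2b) in Python. The parameter values are illustrative round numbers in the spirit of the BMI example, not the fitted estimates.

```python
import random

random.seed(42)

# Level-2 fixed effects (illustrative values, not the fitted estimates)
alpha0, gamma0 = 20.4, 0.8    # intercept equation: eta0i = alpha0 + gamma0*w + zeta0i
alpha1, gamma1 = 0.6, 0.08    # slope equation:     eta1i = alpha1 + gamma1*w + zeta1i

def simulate_person(w):
    """Generate one person's seven yearly outcomes from the growth model."""
    eta0 = alpha0 + gamma0 * w + random.gauss(0, 3.6)   # zeta0i: residual on the intercept
    eta1 = alpha1 + gamma1 * w + random.gauss(0, 0.46)  # zeta1i: residual on the slope
    # Level 1: y_it = eta0i + eta1i * x_t + eps_it, with x_t = 0, 1, ..., 6
    return [eta0 + eta1 * t + random.gauss(0, 2.0) for t in range(7)]

boy = simulate_person(w=1)    # w = 1 might code male, as in the earlier example
print(len(boy))               # 7 measurement occasions
```

Because the residual terms are drawn anew for every person, each simulated individual gets a different intercept and slope, which is exactly what the level-2 equations describe.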

*The variable White (Whites = 1; nonwhites = 0) compares Whites to the combination of African Americans and Hispanics. Asian & Pacific Islander and Other respondents have been deleted from this analysis because of their small sample sizes.

In this figure we have two covariates. One is whether the adolescent is white versus African American or Hispanic and the other is a latent variable reflecting the level of emotional problems a youth has.

· A researcher may predict that Whites have a lower initial BMI (intercept) which persists during adolescence, but the White advantage does not increase (same slope as nonwhites).

· Alternatively, a researcher may predict that being White predicts a lower initial BMI (intercept) and less increase in BMI (smaller slope) during adolescence. This suggests that minorities start with a disadvantage (higher BMI) and this disadvantage grows even greater across adolescence.

· A researcher may argue that emotional problems are associated with both a higher initial BMI (intercept) and a more rapid increase in BMI over time (slope).

By including a covariate that is a latent variable itself, emotional problems, we will show how these are handled by Mplus.

We estimated this model for boys only; girls were excluded.

The following is our Mplus program:

Title:

bmi_growth_covariatesb.inp

Stata2Mplus conversion for F:\flash\academica\bmi_stata.dta

Data:

File is "F:\flash\academica\bmi_stata.dat" ;

Variable:

Names are

id grlprb_y boyprb_y grlprb_p boyprb_p male race_eth bmi97 bmi98 bmi99

bmi00 bmi01 bmi02 b