Cross-sectional LCA

Patterns of first response to cigarettes

First smoking experience

• Have you ever tried a cigarette (including roll-ups), even a puff?

• How old were you when you first tried a cigarette?

• When you FIRST ever tried a cigarette can you remember how it made you feel? (tick as many as you want)

– It made me cough

– I felt ill

– It tasted awful

– I liked it

– It made me feel dizzy

Aim

• To categorise the subjects based on their pattern of responses

• To assess the relationship between first-response and current smoking behaviour

• To try not to think too much about the possibility of recall bias

Step 1

Look at your data!!!

Examine your data structure

• LCA converts a large number of response patterns into a small number of ‘homogeneous’ groups

• If the responses in your data are fairly mutually exclusive then there’s no point doing LCA

• Don’t just dive in

How many items endorsed?

    numresp |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         69        2.75        2.75
          1 |      1,597       63.70       66.45
          2 |        569       22.70       89.15
          3 |        202        8.06       97.21
          4 |         68        2.71       99.92
          5 |          2        0.08      100.00
------------+-----------------------------------
      Total |      2,507      100.00

Frequency of each item (n ~ 2500)

[Bar chart: number of positive responses (y-axis, 0 to 1,400) for each item (x-axis: cough, ill, taste, liked, dizzy)]

Examine pattern frequency

     +------------------------------------------+
     | cough   ill   taste   liked   dizzy  num |
     |------------------------------------------|
  1. |     0     0       1       0       0  468 |
  2. |     0     0       0       1       0  452 |
  3. |     1     0       0       0       0  449 |
  4. |     1     0       1       0       0  279 |
  5. |     0     0       0       0       1  194 |
     |------------------------------------------|
  6. |     1     1       1       0       0   94 |
  7. |     1     0       0       1       0   87 |
  8. |     1     0       0       0       1   76 |
  9. |     0     0       0       0       0   69 |
 10. |     1     1       1       0       1   59 |
     |------------------------------------------|
 11. |     0     0       0       1       1   56 |
 12. |     1     0       1       0       1   47 |
 13. |     1     0       0       1       1   35 |
 14. |     0     1       0       0       0   34 |
 15. |     0     0       1       0       1   27 |
     |------------------------------------------|
 16. |     0     1       1       0       0   17 |
 17. |     0     0       1       1       0   13 |
 18. |     1     1       0       0       1    9 |
 19. |     1     1       0       0       0    8 |
 20. |     0     1       1       0       1    7 |
     |------------------------------------------|
 21. |     1     0       1       1       1    7 |
 22. |     1     0       1       1       0    6 |
 23. |     0     1       0       0       1    5 |
 24. |     1     1       1       1       1    2 |
 25. |     0     1       0       1       1    2 |
     |------------------------------------------|
 26. |     0     1       0       1       0    1 |
 27. |     1     1       1       1       0    1 |
 28. |     1     1       0       1       1    1 |
 29. |     0     0       1       1       1    1 |
 30. |     1     1       0       1       0    1 |
     +------------------------------------------+

Examine correlation structure

Polychoric correlation matrix

          cough     ill   taste   liked   dizzy
cough     1
ill       0.371   1
taste     0.049   0.468   1
liked    -0.510  -0.542  -0.786   1
dizzy    -0.030   0.246  -0.241  -0.158   1

Step 2

Now you can fit a latent class model

Latent Class models

• Work with observations at the pattern level rather than the individual (person) level

     +------------------------------------------+
     | cough   ill   taste   liked   dizzy  num |
     |------------------------------------------|
  1. |     0     0       1       0       0  468 |
  2. |     0     0       0       1       0  452 |
  3. |     1     0       0       0       0  449 |
  4. |     1     0       1       0       0  279 |
  5. |     0     0       0       0       1  194 |
     |------------------------------------------|

Latent Class models

• For a given number of latent classes, Bayes’ rule plus an assumption of conditional independence gives the probability that each pattern falls into each class

• Derive the likelihood of the observed data under each model (i.e. assuming different numbers of classes) and use this, plus other fit statistics, to determine the optimal model, i.e. the optimal number of classes

Latent Class models

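• Bayes’ rule: P(class = i | pattern) = P(pattern | class = i) * P(class = i) / P(pattern), where P(pattern) is the sum of P(pattern | class = j) * P(class = j) over all classes j

• Conditional independence: P(pattern = ‘01’ | class = i)

= P(pat(1) = ‘0’ | class = i) * P(pat(2) = ‘1’ | class = i)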
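The two ingredients above can be put together in a few lines. The following is a minimal Python sketch, not Mplus's implementation: the function names and the two-class, two-item parameter values are invented for illustration only.

```python
from math import prod


def pattern_likelihood(pattern, item_probs):
    """P(pattern | class) under conditional independence.

    pattern    -- string of '0'/'1' responses, e.g. '01'
    item_probs -- P(item = '1' | class) for each item, in the same order
    """
    return prod(p if r == "1" else 1.0 - p for r, p in zip(pattern, item_probs))


def posterior(pattern, class_sizes, item_probs_by_class):
    """P(class | pattern) for every class, via Bayes' rule."""
    joint = [size * pattern_likelihood(pattern, probs)
             for size, probs in zip(class_sizes, item_probs_by_class)]
    total = sum(joint)  # P(pattern), the denominator in Bayes' rule
    return [j / total for j in joint]


# Hypothetical two-class, two-item model (values invented for illustration).
class_sizes = [0.6, 0.4]
item_probs_by_class = [[0.9, 0.2], [0.1, 0.7]]

post = posterior("01", class_sizes, item_probs_by_class)
print([round(p, 3) for p in post])
```

With these made-up values, pattern '01' is far more probable in the second class, so almost all of its posterior weight goes there.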

How many classes can I have? ~ degrees of freedom

• 32 possible patterns

• Each additional class requires

– 5 df to estimate the prevalence of each of the 5 items in that class (i.e. 5 thresholds)

– 1 df for an additional cut of the latent variable defining the class distribution

• Hence a 5-class model uses up 5*5 + 4 = 29 degrees of freedom. The 32 pattern probabilities must sum to 1, so only 31 are free, leaving 31 − 29 = 2 df to test the model
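The parameter count can be checked with a short Python sketch. It assumes the usual convention that the pattern probabilities sum to 1, so a table of 32 patterns has 31 free cells; that convention reproduces the 14 df Mplus reports for the 3-class model (17 free parameters). The function names are illustrative.

```python
def lca_free_parameters(n_items, n_classes):
    """Free parameters for an unrestricted LCA with binary items."""
    thresholds = n_items * n_classes   # one threshold per item per class
    class_means = n_classes - 1        # class proportions sum to 1
    return thresholds + class_means


def lca_degrees_of_freedom(n_items, n_classes):
    """Residual df: free cell probabilities minus free parameters."""
    free_cells = 2 ** n_items - 1      # pattern probabilities sum to 1
    return free_cells - lca_free_parameters(n_items, n_classes)


# Five binary items, one to five classes.
for k in range(1, 6):
    print(k, lca_free_parameters(5, k), lca_degrees_of_freedom(5, k))
```

A model is only testable while the residual df stays positive, which is why five binary items cannot support a 6-class model.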

Standard thresholds

• Mplus thinks of binary variables as being a dichotomised continuous latent variable

• The point at which a continuous N(0,1) variable must be cut to create a binary variable is called a threshold

• A binary variable with 50% cases corresponds to a threshold of zero

• A binary variable with 2.5% cases corresponds to a threshold of 1.96
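The two examples above can be reproduced with the inverse normal CDF. This is a small Python sketch using the standard library; `probit_threshold` is an illustrative name, not an Mplus function.

```python
from statistics import NormalDist


def probit_threshold(prevalence):
    """Cut point on a N(0,1) variable so that P(Z > cut) = prevalence."""
    return NormalDist().inv_cdf(1.0 - prevalence)


print(round(probit_threshold(0.5), 2))    # 50% of cases
print(round(probit_threshold(0.025), 2))  # 2.5% of cases
```

Rarer responses sit further out in the upper tail, so their thresholds are larger.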

Standard thresholds

Figure from Uebersax webpage

Data:
  File is "..\smoking_experience.dta.dat";
  listwise is on;

Variable:
  Names are sex cough ill taste liked dizzy numresp less_12 less_13;
  categorical are cough ill taste liked dizzy;
  usevariables are cough ill taste liked dizzy;
  Missing are all (-9999);
  classes = c(3);

Analysis:
  proc = 2 (starts);
  type = mixture;
  starts = 1000 500;
  stiterations = 20;

Output:
  tech10;

What you’re actually doing

model:
  %OVERALL%
  [c#1 c#2];
  %c#1%
  [cough$1]; [ill$1]; [taste$1]; [liked$1]; [dizzy$1];

+ five more threshold parameters each for %c#2% and %c#3%

• [c#1 c#2] defines the latent class variable

• The $1 terms define the within-class thresholds, i.e. the prevalence of endorsement of each item

SUMMARY OF CATEGORICAL DATA PROPORTIONS

COUGH
  Category 1    0.537
  Category 2    0.463
ILL
  Category 1    0.904
  Category 2    0.096
TASTE
  Category 1    0.590
  Category 2    0.410
LIKED
  Category 1    0.735
  Category 2    0.265
DIZZY
  Category 1    0.789
  Category 2    0.211

RANDOM STARTS RESULTS RANKED FROM THE BEST TO THE WORST LOGLIKELIHOOD VALUES

Final stage loglikelihood values at local maxima, seeds, and initial stage start numbers:

-6343.937  685561  9973
-6343.937  172907  9395
-6343.937  497824  9464
-6343.937  770684  7725
-6343.937  584663  5193
-6343.937  872295  2899
-6343.937  116150  3570
-6343.937  271339  4768
-6343.937  472383  9650
-6343.937  707126  3683
Etc.

How many random starts?

• Depends on

– Sample size

– Complexity of model

  • Number of manifest variables

  • Number of classes

• Aim to consistently find the model with the highest loglikelihood (the best local maximum), within each run

Success

Loglikelihood values at local maxima, seeds, and initial stage start numbers:

-10148.718  987174  1689
-10148.718  777300  2522
-10148.718  406118  3827
-10148.718   51296  3485
-10148.718  997836  1208
-10148.718  119680  4434
-10148.718  338892  1432
-10148.718  765744  4617
-10148.718  636396   168
-10148.718  189568  3651
-10148.718  469158  1145
-10148.718   90078  4008
-10148.718  373592  4396
-10148.718   73484  4058
-10148.718  154192  3972
-10148.718  203018  3813
-10148.718  785278  1603
-10148.718  235356  2878
-10148.718  681680  3557
-10148.718   92764  2064

Not there yet

Loglikelihood values at local maxima, seeds, and initial stage start numbers:

-10153.627   23688  4596
-10153.678  150818  1050
-10154.388  584226  4481
-10155.122  735928   916
-10155.373  309852  2802
-10155.437  925994  1386
-10155.482  370560  3292
-10155.482  662718   460
-10155.630  320864  2078
-10155.833  873488  2965
-10156.017  212934   568
-10156.231   98352  3636
-10156.339   12814  4104
-10156.497  557806  4321
-10156.644  134830   780
-10156.741   80226  3041
-10156.793  276392  2927
-10156.819  304762  4712
-10156.950  468300  4176
-10157.011   83306  2432
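The difference between the two runs above can be checked mechanically: count how many final-stage solutions reach the best loglikelihood. A minimal Python sketch, assuming the values have already been read out of the Mplus output into a list; the tolerance and function name are arbitrary choices for illustration.

```python
def best_loglik_replications(logliks, tol=1e-3):
    """Return the best (highest) loglikelihood and how many starts reached it."""
    best = max(logliks)
    hits = sum(1 for ll in logliks if best - ll <= tol)
    return best, hits


# Final-stage values copied from the two runs shown above (second list truncated).
success = [-10148.718] * 20
not_there_yet = [-10153.627, -10153.678, -10154.388, -10155.122, -10155.373]

print(best_loglik_replications(success))
print(best_loglik_replications(not_there_yet))
```

If the best value is reached only once, the solution has not been replicated and more random starts are needed.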

Scary “warnings”

IN THE OPTIMIZATION, ONE OR MORE LOGIT THRESHOLDS APPROACHED AND WERE SET AT THE EXTREME VALUES. EXTREME VALUES ARE -15.000 AND 15.000.

THE FOLLOWING THRESHOLDS WERE SET AT THESE VALUES:

* THRESHOLD 1 OF CLASS INDICATOR TASTE FOR CLASS 3 AT ITERATION 11
* THRESHOLD 1 OF CLASS INDICATOR DIZZY FOR CLASS 3 AT ITERATION 12
* THRESHOLD 1 OF CLASS INDICATOR ILL FOR CLASS 3 AT ITERATION 16
* THRESHOLD 1 OF CLASS INDICATOR LIKED FOR CLASS 1 AT ITERATION 34
* THRESHOLD 1 OF CLASS INDICATOR TASTE FOR CLASS 1 AT ITERATION 93

WARNING: WHEN ESTIMATING A MODEL WITH MORE THAN TWO CLASSES, IT MAY BE NECESSARY TO INCREASE THE NUMBER OF RANDOM STARTS USING THE STARTS OPTION TO AVOID LOCAL MAXIMA.

THE MODEL ESTIMATION TERMINATED NORMALLY

TESTS OF MODEL FIT

Loglikelihood
  H0 Value                            -6343.937
  H0 Scaling Correction Factor            1.006
    for MLR

Information Criteria
  Number of Free Parameters                  17
  Akaike (AIC)                        12721.873
  Bayesian (BIC)                      12820.930
  Sample-Size Adjusted BIC            12766.916
    (n* = (n + 2) / 24)

Chi-Square Test of Model Fit for the Binary and Ordered Categorical (Ordinal) Outcomes

Pearson Chi-Square

Value 623.040

Degrees of Freedom 14

P-Value 0.0000

Likelihood Ratio Chi-Square

Value 563.869

Degrees of Freedom 14

P-Value 0.0000
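The information criteria in this output follow directly from the loglikelihood, the number of free parameters and the sample size (n = 2,507). A short Python sketch using the standard definitions; the function name is illustrative, and small differences in the last decimal place come from the rounded loglikelihood.

```python
from math import log


def information_criteria(loglik, n_params, n_obs):
    """AIC, BIC and sample-size adjusted BIC for a fitted model."""
    deviance = -2.0 * loglik
    return {
        "AIC": deviance + 2 * n_params,
        "BIC": deviance + n_params * log(n_obs),
        # Adjusted BIC replaces n with n* = (n + 2) / 24
        "aBIC": deviance + n_params * log((n_obs + 2) / 24),
    }


ic = information_criteria(loglik=-6343.937, n_params=17, n_obs=2507)
for name, value in ic.items():
    print(f"{name}: {value:.3f}")
```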

FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES BASED ON THE ESTIMATED MODEL

Latent Classes
  1     600.41143    0.23949
  2    1517.83320    0.60544
  3     388.75538    0.15507

CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP

Latent Classes
  1     630    0.25130
  2    1396    0.55684
  3     481    0.19186

Entropy (fuzziness)

CLASSIFICATION QUALITY

Entropy 0.832

Average Latent Class Probabilities for Most Likely Latent Class Membership (Row) by Latent Class (Column)

         1      2      3
1    0.952  0.048  0.000
2    0.000  0.979  0.021
3    0.000  0.252  0.748
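For intuition, the entropy statistic can be computed from the individual posterior probabilities. The sketch below assumes the relative-entropy definition commonly used for mixture models, 1 − Σ(−p ln p) / (n ln K); it is an illustration, not a copy of Mplus's code, and the toy posteriors are invented.

```python
from math import log


def relative_entropy(posteriors):
    """Relative entropy in [0, 1]; 1 means every case is classified with certainty.

    posteriors -- one row per respondent, one posterior probability per class
    """
    n = len(posteriors)
    k = len(posteriors[0])
    # Treat 0 * ln(0) as 0 by skipping zero probabilities.
    uncertainty = -sum(p * log(p) for row in posteriors for p in row if p > 0)
    return 1.0 - uncertainty / (n * log(k))


crisp = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # no classification uncertainty
fuzzy = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]  # maximal uncertainty
print(relative_entropy(crisp), relative_entropy(fuzzy))
```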

Model results

Two-Tailed

Estimate S.E. Est./S.E. P-Value

Latent Class 1

Thresholds

COUGH$1 1.604 0.133 12.103 0.000

ILL$1 7.371 4.945 1.490 0.136

TASTE$1 15.000 0.000 999.000 999.000

LIKED$1 -15.000 0.000 999.000 999.000

DIZZY$1 1.890 0.139 13.604 0.000

Categorical Latent Variables

Two-Tailed

Estimate S.E. Est./S.E. P-Value

Means

C#1 0.435 0.124 3.500 0.000

C#2 1.362 0.135 10.058 0.000

RESULTS IN PROBABILITY SCALE

Latent Class 1

COUGH
  Category 1    0.833   0.018   45.072   0.000
  Category 2    0.167   0.018    9.059   0.000
ILL
  Category 1    0.999   0.003  321.448   0.000
  Category 2    0.001   0.003    0.202   0.840
TASTE
  Category 1    1.000   0.000    0.000   1.000
  Category 2    0.000   0.000    0.000   1.000
LIKED
  Category 1    0.000   0.000    0.000   1.000
  Category 2    1.000   0.000    0.000   1.000
DIZZY
  Category 1    0.869   0.016   54.848   0.000
  Category 2    0.131   0.016    8.284   0.000
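The probability-scale results can be recovered from the Class 1 thresholds. The earlier warning describes these as logit thresholds, so the sketch below assumes P(Category 2) = 1 / (1 + exp(threshold)); the function name is illustrative.

```python
from math import exp


def prob_category2(threshold):
    """P(item endorsed | class) from a logit threshold."""
    return 1.0 / (1.0 + exp(threshold))


# Class 1 thresholds from the 3-class model results.
class1_thresholds = [("COUGH", 1.604), ("ILL", 7.371), ("TASTE", 15.0),
                     ("LIKED", -15.0), ("DIZZY", 1.890)]
for item, tau in class1_thresholds:
    print(item, round(prob_category2(tau), 3))
```

Thresholds fixed at ±15 correspond to probabilities that are effectively 0 or 1, which is why the warning about extreme values is less alarming than it looks.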

Class 1 from 3-class model

[Bar chart: probability of endorsing each item (y-axis, 0 to 1) for COUGH, ILL, TASTE, LIKED, DIZZY]

Conditional independence

• The latent class variable accounts for the covariance structure in your dataset

• Conditional on C, any pair of manifest variables should be uncorrelated

• Harder to achieve for a cross-sectional LCA

• With a longitudinal LCA there tends to be a more ordered pattern of correlations based on proximity in time

Tech10 – response patterns

MODEL FIT INFORMATION FOR THE LATENT CLASS INDICATOR MODEL PART

RESPONSE PATTERNS

No. Pattern    No. Pattern    No. Pattern    No. Pattern
  1 10000        2 00100        3 00010        4 11100
  5 11101        6 00001        7 10101        8 10010
  9 10100       10 00101       11 10001       12 00000
 13 00011       14 01101       15 10011       16 00110
 17 11000       18 10111       19 11011       20 01100
 21 10110       22 01000       23 01001       24 11111
 25 01010       26 11001       27 01011       28 11010
 29 00111       30 11110

Tech10 – Bivariate model fit

• 5 manifest variables → number of pairs = $\binom{5}{2} = \frac{5!}{3!\,2!} = \frac{5 \times 4}{2} = 10$

Overall Bivariate Pearson Chi-Square 215.353

Overall Bivariate Log-Likelihood Chi-Square 214.695

Compare with χ² (10 df) = 18.307

Not bad:-

                               Estimated Probabilities     Standardized
Variable      Variable             H1         H0        Residual (z-score)
COUGH         ILL
  Category 1    Category 1      0.511      0.506            0.457
  Category 1    Category 2      0.026      0.031           -1.321
  Category 2    Category 1      0.393      0.398           -0.467
  Category 2    Category 2      0.070      0.065            0.925

Bivariate Pearson Chi-Square           2.726
Bivariate Log-Likelihood Chi-Square    2.798


Terrible:-

                               Estimated Probabilities     Standardized
Variable      Variable             H1         H0        Residual (z-score)
ILL           TASTE
  Category 1    Category 1      0.566      0.534            3.149
  Category 1    Category 2      0.338      0.370           -3.255
  Category 2    Category 1      0.024      0.056           -6.850
  Category 2    Category 2      0.072      0.040            7.977

Bivariate Pearson Chi-Square         116.657
Bivariate Log-Likelihood Chi-Square  117.162

(The slide labels this pair COUGH and ILL, but the H1 margins, 0.096 and 0.410, match the observed proportions for ILL and TASTE.)

Conditional Independence violated

Need more classes
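To see where the bivariate statistics come from, the Pearson chi-square for one pair can be approximated from the observed (H1) and model-implied (H0) proportions in the two tables above. This Python sketch uses the standard Pearson formula with n = 2,507; because the tabulated proportions are rounded to three decimals, the results only approximate the values Mplus reports (2.726 and 116.657).

```python
def bivariate_pearson(observed, expected, n_obs):
    """Pearson chi-square for one 2x2 table given cell proportions."""
    return n_obs * sum((o - e) ** 2 / e for o, e in zip(observed, expected))


n_obs = 2507
# Cell order: (1,1), (1,2), (2,1), (2,2); H1 = observed, H0 = model-implied.
not_bad = bivariate_pearson([0.511, 0.026, 0.393, 0.070],
                            [0.506, 0.031, 0.398, 0.065], n_obs)
terrible = bivariate_pearson([0.566, 0.338, 0.024, 0.072],
                             [0.534, 0.370, 0.056, 0.040], n_obs)
print(round(not_bad, 1), round(terrible, 1))
```

A large value means the model fails to reproduce the association between that pair of items, i.e. conditional independence does not hold within classes.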

Obtain the ‘optimal’ model

Assess the following for models with increasing classes

• aBIC

• Entropy

• BLRT (Bootstrap LRT)

• Conditional Independence (Tech10)

• Ease of interpretation

• Consistency with previous work / theory

Model fit stats

                    1 class   2 class   3 class   4 class   5 class
Estimated params          5        11        17        23        29
H0 Likelihood       -6962.1   -6458.7   -6343.9   -6200.1   -6100.8
aBIC                13947.4   12968.5   12766.9   12507.1   12336.5
Entropy                   -     0.944     0.832     0.894     0.844
Tech 10               625.2     228.1     214.7     135.9      17.6
BLRT statistic            -    1006.8     229.5     287.8     198.4
BLRT p-value              -  < 0.0001  < 0.0001  < 0.0001  < 0.0001
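The BLRT statistic in this table is the usual likelihood-ratio statistic, twice the gain in loglikelihood from adding one class. A small Python sketch using the rounded H0 likelihoods, so the last digit can differ slightly from the table; the p-values themselves come from Mplus's bootstrap and are not reproduced here.

```python
# H0 loglikelihoods from the model fit table, keyed by number of classes.
logliks = {1: -6962.1, 2: -6458.7, 3: -6343.9, 4: -6200.1, 5: -6100.8}


def lrt_statistic(ll_smaller, ll_larger):
    """Likelihood-ratio statistic comparing a k-1 class model with a k class model."""
    return 2.0 * (ll_larger - ll_smaller)


lrt = {k: lrt_statistic(logliks[k - 1], logliks[k]) for k in range(2, 6)}
for k, stat in lrt.items():
    print(f"{k - 1} vs {k} classes: {stat:.1f}")
```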

5-class model

• aBIC values are still decreasing

• Tech 10 is still quite high – residual correlations between ill and both liked and dizzy

• BLRT rejects 4-class model

• Not enough df to fit 6-class model so we cannot assess fit of 5-class

• It seems unlikely that 5 classes are enough, as the BLRT values are decreasing only slowly

Cross-sectional LCA

Patterns of first response to cigarettes

Attempt 2

What to do?

• We need more degrees of freedom

• There were only 5 questions on response to smoking

• Add something else:

– How old were you when you first tried a cigarette?

– Split into pre-teen / teen

• 6 binary variables means 64 possible patterns, i.e. 63 d.f. to play with

Model fit stats – attempt 2

                    3 class   4 class   5 class   6 class   7 class
Estimated params         20        27        34        41        48
H0 Likelihood       -7866.3   -7720.2   -7616.0   -7582.4   -7576.2
aBIC                15825.6   15565.7   15389.9   15355.1   15375.2
Entropy               0.823     0.893     0.812     0.876     0.850
Tech 10               228.9     144.6      16.8       1.2      0.29
BLRT statistic        123.3     146.1     104.2      67.3      12.4
BLRT p-value       < 0.0001  < 0.0001  < 0.0001  < 0.0001    0.2100

6-class model results

CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES BASED ON THE ESTIMATED MODEL

Latent classes

1 53.23894 2.1%

2 541.96140 21.7%

3 396.04196 15.9%

4 454.89294 18.2%

5 750.87470 30.1%

6 295.99007 11.9%

CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP

Latent classes

1 34 1.4%

2 540 21.7%

3 403 16.2%

4 447 17.9%

5 840 33.7%

6 229 9.2%

Examine entropy in more detail

• Model-level entropy = 0.876

• Class level entropy:

         1      2      3      4      5      6
1    0.953  0.000  0.000  0.000  0.026  0.020
2    0.000  0.997  0.000  0.000  0.002  0.001
3    0.000  0.000  0.958  0.000  0.017  0.025
4    0.000  0.000  0.000  0.949  0.041  0.011
5    0.025  0.005  0.000  0.036  0.851  0.083
6    0.000  0.000  0.043  0.003  0.036  0.918

Pattern level entropy

• Save out the model-based probabilities

• Open in another stats package

• Collapse over response patterns

Save out the model-based probabilities

savedata:

file is "6-class-results.dat";

save cprobabilities;

Varnames shown at end of output

SAVEDATA INFORMATION

Order and format of variables

COUGH     F10.3
ILL       F10.3
TASTE     F10.3
LIKED     F10.3
DIZZY     F10.3
LESS_13   F10.3
ALN       F10.3
QLET      F10.3
SEX       F10.3
CPROB1    F10.3
CPROB2    F10.3
CPROB3    F10.3
CPROB4    F10.3
CPROB5    F10.3
CPROB6    F10.3
C         F10.3

Open / process in Stata

Remove excess spaces from data file, then:

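* Read the space-delimited file saved by Mplus (columns arrive as v1, v2, ...)
insheet using "6-class-results.dat", delim(" ")

* Rename the columns in the order Mplus lists them at the end of its output
local i = 1
local varnames "COUGH ILL TASTE LIKED DIZZY LESS_13 ALN QLET SEX CPROB1 CPROB2 CPROB3 CPROB4 CPROB5 CPROB6 C"
foreach x of local varnames {
    rename v`i' `x'
    local i = `i' + 1
}

* One row per response pattern: posterior probabilities, modal class and count
gen num = 1
collapse (mean) CPROB* C (count) num, by(COUGH ILL TASTE LIKED DIZZY LESS_13)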

Check the assignment probabilities for each class

cough  ill  taste  liked  dizzy  <13    P_c1   P_c2   P_c3   P_c4   P_c5   P_c6  Modclass    n
    1    1      1      0      0    0       0      0      0      0  0.052  0.948         6   64
    1    1      1      0      1    0       0      0  0.003      0  0.001  0.996         6   34
    1    1      1      0      0    1       0      0      0      0  0.027  0.973         6   30
    1    0      1      0      1    0       0      0  0.135      0  0.062  0.803         6   29
    1    1      1      0      1    1       0      0  0.003      0  0.001  0.996         6   25
    1    0      1      0      1    1       0      0  0.154      0  0.032  0.815         6   18
    1    1      0      0      0    0       0      0      0  0.071  0.054  0.874         6    6
    0    1      1      0      1    1       0      0  0.073      0  0.012  0.915         6    4
    1    1      0      0      1    0       0      0  0.303      0  0.001  0.696         6    4
    1    1      0      0      1    1       0      0  0.329      0      0  0.671         6    4
    0    1      1      0      0    1       0      0      0      0  0.411  0.589         6    3
    0    1      1      0      1    0       0      0  0.065      0  0.024  0.912         6    3
    1    1      0      0      0    1       0      0      0  0.055  0.029  0.917         6    2
    1    1      1      1      0    1       0  0.001      0      0  0.023  0.977         6    1
    1    1      1      1      1    0       0      0  0.039      0  0.001   0.96         6    1
    1    1      1      1      1    1       0      0  0.044      0      0  0.955         6    1

[Six bar charts, one per class, showing the probability of endorsing each item (y-axis, 0 to 1) for COUGH, ILL, TASTE, LIKED, DIZZY, LESS_13]

• Bad taste (30.1%)

• Positive experience (21.7%)

• Coughed (18.2%)

• Dizziness (15.9%)

• V negative experience (11.9%)

• Felt ill (2.1%)

Well that was a complete waste of time!

• You might think that those resulting classes could have been derived just looking at the response patterns and making some arbitrary decisions e.g.

– Group all of those who had >1 negative experience

– Keep separate each group who had 1 experience

• You would have ended up with a bunch of weird patterns with no clue of what to do with them

• Strange patterns likely to be measurement error?

• LCA incorporates ALL patterns and deals with uncertainty through the posterior probabilities

Conclusions / warning

• Like EFA, LCA is an exploratory tool with the aim of summarising the variability in the dataset in a simple/interpretable way

• These results do not prove that there are 6 groups of young people in real life.

• LCA will find groupings in the data even if there is no reason to think such groups might exist. It’s just mathematics and it knows no better

Remember, we are dealing with probabilities

               Model-based          “Modal assignment”
Ill             53.24    2.1%           34     1.4%
Positive       541.96   21.7%          540    21.7%
Dizzy          396.04   15.9%          403    16.2%
Coughed        454.89   18.2%          447    17.9%
Bad taste      750.87   30.1%          840    33.7%
V negative     295.99   11.9%          229     9.2%

• Working with modal assignment is easy – chuck each pattern into its most likely class and pretend everything is OK

– Equivalent to doing a single imputation for missing data – shudder!

• Unless entropy is very high, stick with the probabilities

Covariates and outcomes

Merging the classes with other data

• In the “olden days”, you could pass your ID variable through Mplus so when you saved your class probabilities you could merge this with other data.

• Now you can pass other data through Mplus as well – hurrah!

Variable:

<snip>

auxiliary are aln qlet sex;

Reshaping the dataset

• To account for the uncertainty in our class variable we will need to weight by the posterior probabilities obtained from Mplus

• Weighted model requires a reshaping of the dataset so that each respondent has n-rows (for an n-class model) rather than just 1

Pre-shaped – first 20 kids

   ID    sex   dev_18  dev_42  pclass1  pclass2  pclass3  pclass4  pclass5  modclass
30004   male        3       .     .001        0     .803        0     .197         3
30008   male        2       1     .908        0        0     .007     .085         1
30010   male        2       2     .053     .001     .052        0     .894         5
30023   male        1       3     .115        0     .596     .001     .288         3
30031   male        3       4        0        0     .983        0     .016         3
30033   male        4       4     .392        0     .397        0     .211         3
30042   male        1       3        0        0     .983        0     .016         3
30050   male        3       2        0        0     .983        0     .016         3
30051   male        2       2        0        0        0        1        0         4
30057   male        1       3     .135        0     .002        0     .864         5
30058   male        1       4        0        0     .958        0     .041         3
30064   male        2       4        0        0     .983        0     .016         3
30068   male        4       3     .001        0     .803        0     .197         3
30070   male        3       4        0        0     .983        0     .016         3
30072   male        1       1        0        0     .983        0     .016         3
30075   male        3       3        0        0     .982        0     .018         3
30088   male        3       4      .03     .002     .889     .003     .076         3
30095   male        3       .        0        0     .983        0     .016         3
30098   male        3       .     .068     .158     .173     .018     .583         5
30104   male        4       1     .008        0     .775        0     .217         3

Column groups: covariates (sex, dev_18, dev_42), posterior probs (pclass1 to pclass5), modal class (modclass)

The reshaping

. reshape long pclass, i(id) j(class)
(note: j = 1 2 3 4 5)

Data                               wide   ->   long
---------------------------------------------------------
Number of obs.                     5584   ->   27920
Number of variables                  66   ->      63
j variable (5 values)                     ->   class
xij variables:
            pclass1 pclass2 ... pclass5   ->   pclass
---------------------------------------------------------

Re-shaped – first 3 kids

     +--------------------------------------------------+
     |    id    sex   dev_18   dev_42   pclass   class  |
     |--------------------------------------------------|
  1. | 30004   male        3        .     .001       1  |
  2. | 30004   male        3        .        0       2  |
  3. | 30004   male        3        .     .803       3  |
  4. | 30004   male        3        .        0       4  |
  5. | 30004   male        3        .     .197       5  |
     |--------------------------------------------------|
  6. | 30008   male        2        1     .908       1  |
  7. | 30008   male        2        1        0       2  |
  8. | 30008   male        2        1        0       3  |
  9. | 30008   male        2        1     .007       4  |
 10. | 30008   male        2        1     .085       5  |
     |--------------------------------------------------|
 11. | 30010   male        2        2     .053       1  |
 12. | 30010   male        2        2     .001       2  |
 13. | 30010   male        2        2     .052       3  |
 14. | 30010   male        2        2        0       4  |
 15. | 30010   male        2        2     .894       5  |
     +--------------------------------------------------+

Rows 1 to 5 are the first kid, rows 6 to 10 the second and rows 11 to 15 the third. Within each child the pclass values sum to 1 and the covariates are constant.
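The same wide-to-long reshape can be sketched outside Stata. The Python below is a minimal illustration on two of the rows shown above (covariates trimmed to `sex`); `reshape_long` is a hypothetical helper, not part of any package.

```python
def reshape_long(rows, prob_prefix="pclass"):
    """Turn one row per child into one row per child per class."""
    long_rows = []
    for row in rows:
        # Everything that is not a posterior probability is repeated on each row.
        fixed = {k: v for k, v in row.items() if not k.startswith(prob_prefix)}
        probs = {k: v for k, v in row.items() if k.startswith(prob_prefix)}
        for name in sorted(probs, key=lambda s: int(s[len(prob_prefix):])):
            long_rows.append({**fixed,
                              "class": int(name[len(prob_prefix):]),
                              "pclass": probs[name]})
    return long_rows


wide = [
    {"id": 30004, "sex": "male", "pclass1": 0.001, "pclass2": 0.0,
     "pclass3": 0.803, "pclass4": 0.0, "pclass5": 0.197},
    {"id": 30008, "sex": "male", "pclass1": 0.908, "pclass2": 0.0,
     "pclass3": 0.0, "pclass4": 0.007, "pclass5": 0.085},
]
long_data = reshape_long(wide)
print(len(long_data))  # two children x five classes
```

Each child now contributes one row per class, and the `pclass` column is the weight that row should carry in later analyses.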

Similar with our data:

. list id SEX CPROB class C in 1/12

     +---------------------------------+
     |    id   SEX   CPROB   class   C |
     |---------------------------------|
  1. | 30012     2       0       1   4 |
  2. | 30012     2       0       2   4 |
  3. | 30012     2       0       3   4 |
  4. | 30012     2    .945       4   4 |
  5. | 30012     2    .045       5   4 |
  6. | 30012     2     .01       6   4 |
     |---------------------------------|
  7. | 30024     2       0       1   5 |
  8. | 30024     2       0       2   5 |
  9. | 30024     2       0       3   5 |
 10. | 30024     2       0       4   5 |
 11. | 30024     2    .991       5   5 |
 12. | 30024     2    .009       6   5 |
     |---------------------------------|

Rows 1 to 6 are the first respondent, rows 7 to 12 the second.

Simple crosstab

. tab class SEX , row nofreq

           |         SEX
     class |         1          2 |     Total
-----------+----------------------+----------
       Ill |     40.87      59.13 |    100.00
  Positive |     40.87      59.13 |    100.00
     Dizzy |     40.87      59.13 |    100.00
   Coughed |     40.87      59.13 |    100.00
 Bad taste |     40.87      59.13 |    100.00
V negative |     40.87      59.13 |    100.00
-----------+----------------------+----------
     Total |     40.87      59.13 |    100.00

• Oops!

Simple crosstab – take 2

. tab class SEX [iw = CPROB], row nofreq

           |        SEX
     class |    Male   Female |   Total
-----------+------------------+--------
       Ill |   52.9%    47.1% |    100%
  Positive |   32.9%    67.1% |    100%
     Dizzy |   43.2%    56.8% |    100%
   Coughed |   40.8%    59.2% |    100%
 Bad taste |   45.2%    54.8% |    100%
V negative |   39.3%    60.7% |    100%
-----------+------------------+--------
     Total |   40.9%    59.1% |    100%
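Why the unweighted table fails and the weighted one works can be seen in a toy example. The Python sketch below uses three invented respondents in a hypothetical two-class model: without weights every class row contains every respondent once, so all rows show the same sex split; weighting by the posterior probability restores the differences.

```python
from collections import defaultdict


def row_percentages(long_rows, weighted=True):
    """Row percentages of sex within class from long-format data."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in long_rows:
        weight = row["pclass"] if weighted else 1.0
        totals[row["class"]][row["sex"]] += weight
    table = {}
    for cls, by_sex in totals.items():
        row_total = sum(by_sex.values())
        table[cls] = {sex: 100.0 * w / row_total for sex, w in by_sex.items()}
    return table


# Hypothetical data: three respondents, one row per respondent per class.
long_rows = [
    {"id": 1, "sex": "male",   "class": 1, "pclass": 0.9},
    {"id": 1, "sex": "male",   "class": 2, "pclass": 0.1},
    {"id": 2, "sex": "female", "class": 1, "pclass": 0.2},
    {"id": 2, "sex": "female", "class": 2, "pclass": 0.8},
    {"id": 3, "sex": "female", "class": 1, "pclass": 0.1},
    {"id": 3, "sex": "female", "class": 2, "pclass": 0.9},
]
unweighted = row_percentages(long_rows, weighted=False)
weighted = row_percentages(long_rows, weighted=True)
print(unweighted)
print(weighted)
```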

Compare with modal class assignment

. tab C SEX if (class==1), row nofreq

           |       SEX
         C |    Male   Female |
-----------+------------------+
       Ill |   50.0%    50.0% |
  Positive |   33.0%    67.0% |
     Dizzy |   43.4%    56.6% |
   Coughed |   40.7%    59.3% |
 Bad taste |   45.4%    54.6% |
V negative |   37.6%    62.4% |
-----------+------------------+
     Total |   40.9%    59.1% |

. tab class SEX [iw = CPROB], row nofreq

           |       SEX
     class |    Male   Female |
-----------+------------------+
       Ill |   52.9%    47.1% |
  Positive |   32.9%    67.1% |
     Dizzy |   43.2%    56.8% |
   Coughed |   40.8%    59.2% |
 Bad taste |   45.2%    54.8% |
V negative |   39.3%    60.7% |
-----------+------------------+
     Total |   40.9%    59.1% |

Multinomial logistic

. xi: mlogit class i.SEX [iw = CPROB], rrr

Multinomial logistic regression                   Number of obs   =       2493
                                                  LR chi2(5)      =      24.52
                                                  Prob > chi2     =     0.0002
Log likelihood = -4053.3746                       Pseudo R2       =     0.0030

------------------------------------------------------------------------------
       class |        RRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Ill          |
     _ISEX_2 |   .7322787   .2081189    -1.10   0.273     .4195259    1.278186
-------------+----------------------------------------------------------------
Positive     |
     _ISEX_2 |   1.677364   .1965463     4.41   0.000     1.333175    2.110413
-------------+----------------------------------------------------------------
Dizzy        |
     _ISEX_2 |   1.082775   .1355213     0.64   0.525     .8472297    1.383807
-------------+----------------------------------------------------------------
Coughed      |
     _ISEX_2 |   1.194885   .1437877     1.48   0.139     .9438344    1.512712
-------------+----------------------------------------------------------------
V negative   |
     _ISEX_2 |   1.274734   .1782148     1.74   0.083     .9692081    1.676572
------------------------------------------------------------------------------
(class==Bad taste is the base outcome)

Class predicts binary outcome

• Outcome = weekly smoker at age of 15

. char class[omit] 5
. xi: logistic sm1100 i.class [iw = CPROB]

Logistic regression                               Number of obs   =       2493
                                                  LR chi2(5)      =     229.03
                                                  Prob > chi2     =     0.0000
Log likelihood = -1168.697                        Pseudo R2       =     0.0892

------------------------------------------------------------------------------
      sm1100 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         Ill |   2.132652   .9125838     1.77   0.077     .9218961    4.933531
    Positive |   7.190203   1.231216    11.52   0.000     5.140265    10.05766
       Dizzy |   7.899915   1.413907    11.55   0.000     5.562583    11.21937
     Coughed |   3.686492   .6831946     7.04   0.000     2.563689    5.301041
  V negative |   2.243034    .497619     3.64   0.000     1.452099     3.46478
------------------------------------------------------------------------------

Compare with modal class

Posterior probabilities

------------------------------------------------------------------------------
      sm1100 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         Ill |   2.132652   .9125838     1.77   0.077     .9218961    4.933531
    Positive |   7.190203   1.231216    11.52   0.000     5.140265    10.05766
       Dizzy |   7.899915   1.413907    11.55   0.000     5.562583    11.21937
     Coughed |   3.686492   .6831946     7.04   0.000     2.563689    5.301041
  V negative |   2.243034    .497619     3.64   0.000     1.452099     3.46478
------------------------------------------------------------------------------

Modal assignment

------------------------------------------------------------------------------
      sm1100 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         Ill |   2.560182   1.291868     1.86   0.062     .9522577     6.88315
    Positive |   7.802047   1.313428    12.20   0.000     5.609367    10.85184
       Dizzy |     8.3454   1.467249    12.07   0.000     5.912796    11.77881
     Coughed |   4.224301   .7686958     7.92   0.000     2.957071    6.034592
  V negative |   2.861537   .6548723     4.59   0.000     1.827254    4.481255
------------------------------------------------------------------------------

Conclusions

• Young people at 15yrs can report a variety of responses to their first cigarette

• Certain responses are associated with current regular smoking behaviour

• 15 year-old girls are more likely to retrospectively report a positive experience

• Recall bias is likely to play a part in these associations

Conclusions

• LCA is an exploratory tool which can be used to simplify a set of binary responses

• Extension to ordinal responses is straightforward

• The use of ordinal data is an alternative way to boost degrees of freedom

• Resulting probabilities can be used to model the latent class variable as a risk factor or outcome

• A modal class variable should be used with caution
