Using Stata for Subpopulation Analysis of Complex Sample Survey Data

Brady T. West

PhD Student

Michigan Program in Survey Methodology

July 30, 2009

2009 Stata Conference

Presentation Outline

1. Introduction: Subclass Analysis Issues

2. Kish’s Taxonomy of Subclasses

3. Two Alternative Approaches to Inference

4. Variance Estimation and Methods for ‘Singletons’

5. Examples using NHANES and NHAMCS Data

6. Suggestions for Practice

7. Directions for Future Research

Subclass Analysis Issues

• Analysts of large, complex sample survey data sets are often interested in making inferences about subpopulations of the original population that the sample was selected from (e.g., Caucasian Females)

• These subpopulations are referred to interchangeably in various literatures as subgroups, subclasses, subpopulations, domains, and subdomains, leading to confusion among analysts of survey data

Subclass Analysis Issues, cont’d

• Software procedures for analysis of complex sample survey data are becoming more powerful, flexible, and widely available, offering analysts several options

• Analysts need to be careful when analyzing subclasses, and be aware of the alternative approaches to subclass analysis that are possible and their implications for inference

Kish’s Taxonomy of Subclasses

• Design Domains: Restricted to specific strata according to the complex sample design (usually geographically, e.g., Texas)

• Cross-Classes: Broadly distributed (in theory) across the strata and primary sampling units defining a complex sample (e.g., African-Americans over age 50)

• Mixed Classes: Disproportionately distributed across the complex sample design (e.g., Hispanics in a sample including Los Angeles as a stratum)

• See Kish (1987), Statistical Design for Research

Design Domains

X = Sample Element in Subclass

Stratum   PSU 1             PSU 2
1         XXXXXXXXXXX       XXXXXXXXX
2         XXXXXXXXXX        XXXXXXXXXXXX
3
4
5

Cross-Classes

Stratum   PSU 1             PSU 2
1         XXXXXXXXXXXX      XXXXX
2         XXXX              XXXXXXX
3         XXXXXXXXXXX       XXXXXXXXX
4         XXXXXX            XXXXX
5         XXXXXXXXXX        XXXXXXXXXXXX

Mixed Classes

Stratum   PSU 1             PSU 2
1         XXXXXXXXXXXXXX    XXXXXXXXXXXXX
2         X
3         XXXXXXXXXXXXX     XXXXXXXXXX
4         XX
5         XXXXXXXXXXXXXX    XXXXXXXXXXXX

Applying Kish’s Taxonomy

• The type of subclass is critical for determining an appropriate analysis approach

• Two possible approaches to inference motivated by the taxonomy:

1. Unconditional approach (cross-classes, mixed classes)

2. Conditional approach (design domains)

The Unconditional Approach

• Appropriate for Cross-Classes, and in some cases Mixed Classes; the subclass of interest theoretically can appear in all design strata and primary sampling units (PSUs)

• KEY POINT: Allow the software to process the entire survey data set, and recognize all possible design strata and PSUs; DO NOT delete sample cases not in the subclass!

The Unconditional Approach

• Rationale: estimated variances for sample estimates of subclass parameters (based on within-stratum variance between PSUs) need to reflect sample-to-sample variability based on the full complex design

• In other words, if a particular subclass does not appear in a PSU in any given sample (although in theory it could have), that PSU should contribute 0 to variance estimates, rather than be ignored completely!

The Unconditional Approach

• Further, the subclass sample size in each stratum is going to be a random variable, and theoretical sample-to-sample variance in realizations of this random variable should be incorporated into any variance estimation procedures
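The rationale above can be sketched with the standard with-replacement (ultimate cluster) linearization estimator for a weighted subclass total. The notation is assumed here for illustration and does not appear in the original slides: h indexes strata, α = 1, ..., a_h indexes the PSUs in stratum h, w is the sampling weight, δ is a 0/1 subclass indicator, and finite population corrections are ignored.

```latex
\[
\hat{Y}_S = \sum_{h=1}^{H}\sum_{\alpha=1}^{a_h} z_{h\alpha},
\qquad
z_{h\alpha} = \sum_{i \in (h,\alpha)} w_{h\alpha i}\,\delta_{h\alpha i}\,y_{h\alpha i}
\]
\[
v(\hat{Y}_S) = \sum_{h=1}^{H} \frac{a_h}{a_h - 1}
\sum_{\alpha=1}^{a_h}\left(z_{h\alpha} - \bar{z}_h\right)^2,
\qquad
\bar{z}_h = \frac{1}{a_h}\sum_{\alpha=1}^{a_h} z_{h\alpha}
\]
```

Under this sketch, a PSU with no subclass members has a total z of 0 but still counts toward a_h and the stratum mean of the PSU totals. Deleting that PSU changes both quantities, which is why the full data set should be processed. For a subclass mean, the same form would apply with each z replaced by the corresponding linearized value for the ratio estimator.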

The Unconditional Approach

• If cross-classes (or in some cases mixed classes) are being analyzed, and PSUs where the subclass does not appear (by random chance) are deleted, problems arise

• Some strata may appear to have only one PSU by design (preventing variance estimation unless an ad hoc approach is used)

• Entire design strata may be dropped, impacting variance estimates and calculations of degrees of freedom

The Unconditional Approach: General Stata Code

• svy, subpop(indicator): command varlist, options

• indicator = an indicator variable for the subpop or an if condition, e.g., if male == 1

• svy: mean varlist, over(groupvar)

• svy: proportion varlist, over(groupvar)

• Stata drops strata* with no subpopulation observations from degrees of freedom calculations

* Exercise: repeat 10 times really fast

The Conditional Approach

• Appropriate for Design Domains, where a subclass cannot appear outside of specific design strata

• The rationale behind the unconditional approach no longer applies

• Certain design strata should not contribute to variance estimation or calculation of degrees of freedom

The Conditional Approach

• Restrict the analysis to only those design strata where the subclass of interest exists

• Variance estimates reflecting sample-to-sample variability should only be based on those design strata where the subclass can appear (unlike the unconditional approach)

• Subclass sample sizes in design domains are assumed to be fixed, by design

The Conditional Approach: General Stata Code

• svy: command varlist if (condition), options

• (condition) might be male == 1, or a more complex combination of conditions (e.g., male == 1 & age >= 50 & age <= 90)
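A minimal sketch contrasting the two general forms on a made-up data set. The variable names (stratum, psu, wt, sub, y) and all values are hypothetical and chosen only so that one PSU in stratum 2 contains no subclass members; they are not taken from the examples in this talk.

```stata
* Hypothetical toy design: 3 strata, 2 PSUs per stratum (all values invented)
clear
input stratum psu wt sub y
1 1 10 1 140
1 1 10 0 120
1 2 10 1 150
1 2 10 0 118
2 1 20 1 135
2 1 20 1 145
2 2 20 0 125
2 2 20 0 122
3 1 15 1 160
3 1 15 0 130
3 2 15 1 138
3 2 15 0 119
end

* Declare the design once; strata with a single PSU yield missing SEs
svyset psu [pweight = wt], strata(stratum) singleunit(missing)

* Conditional approach: cases outside the subclass are excluded first,
* so stratum 2 is left with only one PSU
svy: mean y if sub == 1

* Unconditional approach: all 6 PSUs are retained for variance estimation;
* PSU 2 of stratum 2 simply has no subclass members
svy, subpop(sub): mean y
```

Both commands should return the same weighted mean. Under these assumptions the conditional run is expected to report a missing standard error, because stratum 2 appears to have a single PSU, while the unconditional run uses all six PSUs and three strata.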

Variance Estimation Methods

• All of these issues are only relevant when using Taylor Series Linearization, which is the default for variance estimation in Stata

• Conditional analyses are OK to perform when using replication methods, such as Balanced Repeated Replication or Jackknife Repeated Replication (Rust and Rao, 1996)

Ad-hoc Fixes for ‘Singleton’ Clusters in Stata 10.1

• Stata 10.1 provides users with four ad-hoc fixes for the problem where strata are identified with only a single ultimate cluster for variance estimation in a subpopulation analysis:

1. Report Missing Standard Errors (not really a fix)

2. Treat Units as Certainty Units, which contribute nothing to the standard error

3. Scale Variance using Certainty Units, which uses the average variance from each stratum with multiple PSUs for each stratum with only a single PSU

4. Center at the Grand Mean, where the variance contribution comes from a deviation from the grand mean instead of the stratum mean
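The four fixes listed above correspond to the singleunit() option of svyset. The following is a rough sketch of how one might compare them on a conditional analysis; the toy data set and its variable names (stratum, psu, wt, sub, y) are hypothetical and constructed so that stratum 2 has only one PSU once non-subclass cases are excluded.

```stata
* Hypothetical toy design (all values invented for illustration)
clear
input stratum psu wt sub y
1 1 10 1 140
1 1 10 0 120
1 2 10 1 150
1 2 10 0 118
2 1 20 1 135
2 1 20 1 145
2 2 20 0 125
2 2 20 0 122
3 1 15 1 160
3 1 15 0 130
3 2 15 1 138
3 2 15 0 119
end

* Re-declare the design under each ad hoc fix, then rerun the same
* conditional analysis so the reported standard errors can be compared
foreach opt in missing certainty scaled centered {
    display as text _newline "singleunit(`opt')"
    svyset psu [pweight = wt], strata(stratum) singleunit(`opt')
    svy: mean y if sub == 1
}
```

The point estimate should not change across the four runs; only the reported standard error is expected to differ, which is the pattern the NHANES results later in the talk illustrate with real data.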

Example: The NHANES Data

• We first consider examples based on the NHANES II data set, collected from a nationally representative multistage probability sample of the U.S. population from 1976-1980 (oldie but a goodie)

• Briefly, a sample of the U.S. population was given medical examinations in an effort to assess the health of the U.S. population

Example NHANES Analysis

• Analysis Subclass: African-Americans ages 50 and above (this is a cross-class of the U.S. population, which can theoretically appear in all design strata and PSUs)

• Analysis Objective: Estimate the mean systolic blood pressure of this subclass and an appropriate standard error

• See West et al. (2007) for more details

Conditional Approach: Stata Code for NHANES Analysis

• svyset ppsu [pweight = fwgtexam], strata(stratum) singleunit(missing)

• svyset ppsu [pweight = fwgtexam], strata(stratum) singleunit(centered)

• Also singleunit(certainty), singleunit(scaled)

• gen b50subp = (race == 2 & ager >= 50)

• svy: mean bpsyst if b50subp == 1

Conditional Approach: Results

Method        Est. Mean   TSL SE   Design DF
Missing SE    144.09      .        50-29 = 21
Centered      144.09      1.66     50-29 = 21
Certainty     144.09      1.62     50-29 = 21
Scaled        144.09      1.90     50-29 = 21

Conditional Approach?

• This approach would not be appropriate for this particular subclass

• Computed standard errors would generally be biased downward, because additional sources of sample-to-sample variability are ignored when following this approach

• Same issues apply for analytic models

• Evidence that the “scaled” ad-hoc fix may be overly conservative!

Unconditional Approach: Stata Code for NHANES Analysis

• svyset ppsu [pweight = fwgtexam], strata(stratum) singleunit(missing)

• Note: choice of single unit option does not matter when following this approach!

• gen b50subp = (race == 2 & ager >= 50)

• svy, subpop(b50subp): mean bpsyst

Unconditional Approach: Results

Method        Est. Mean   TSL SE   Des. DF*
Missing SE    144.09      1.66     58-29 = 29
Centered      144.09      1.66     58-29 = 29
Certainty     144.09      1.66     58-29 = 29
Scaled        144.09      1.66     58-29 = 29

* Note: Stata dropped three strata with no sample units in the subpopulation.

Unconditional Approach?

• This approach would be the appropriate choice for a cross-class such as African-Americans over the age of 50

• Inferences are theoretically appropriate

• Same idea for analytic models

• Results suggest that the “centered” and “certainty” ad-hoc fixes for conditional analyses are reasonable

Example: The NHAMCS Data

• Analysis Subclass: Visits to Emergency Departments (ED) by African-American men ages 60 and above (this is another cross-class of the U.S. population, which can theoretically appear in all NHAMCS design strata and PSUs)

• Analysis Objective: Estimate the percentage of all ED visits by members of this subclass for dizziness and/or vertigo in 2004

• See West et al. (2008) for more details

Stata Code for NHAMCS Analyses

• svyset cpsum [pweight = patwt], strata(cstratm) singleunit(…)

• generate subc = (settype == 3 & sex == 2 & agecat == 5 & race == 2)

• svy: tabulate dizzyrfv if subc == 1, se ci percent // conditional

• svy, subpop(subc): tabulate dizzyrfv, se ci percent // unconditional

NHAMCS Analysis Results

Method          Est. %   TSL SE   Design DF
Missing SE      4.82     1.576    106
Centered        4.82     1.576    106
Certainty       4.82     1.576    106
Scaled          4.82     1.576    106
Unconditional   4.82     1.590    286

NHAMCS Analysis Implications

• No problems with strata having only a single ultimate cluster: ad-hoc fixes all give the same results

• Weighted point estimates are identical

• Substantially fewer design-based degrees of freedom when following the conditional approach; the full complex design will not be reflected in estimation of sample-to-sample variance (many ultimate clusters are lost)

• Conditional analysis assumes that each sample will be of fixed size n = 397 for variance estimation purposes; no random variance!

• Conditional analysis results in overly liberal inferences

Suggestions for Practice

• Consider Kish’s Taxonomy when determining an appropriate subclass analysis approach

• Utilize the appropriate software options for unconditional analyses when analyzing cross-classes

• Be careful with missing values when creating the subpopulation indicator

• The unconditional analysis approach generally works fine for both cases (when in doubt, use this approach)
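The caution about missing values can be illustrated with a small sketch. Stata treats numeric missing values as larger than any nonmissing number, so a condition such as age >= 50 is true when age is missing. The data and variable names below (race, age, sub_naive, sub_safe) are hypothetical.

```stata
* Hypothetical data: rows 2 and 5 have a missing age or race
clear
input race age
2 55
2 .
1 60
2 40
. 70
end

* Naive indicator: the missing age in row 2 satisfies age >= 50,
* so that case is coded as a subclass member
generate byte sub_naive = (race == 2 & age >= 50)

* Safer indicator: left missing whenever race or age is missing
generate byte sub_safe = (race == 2 & age >= 50) if !missing(race, age)

list, clean
```

Whether cases with unknown membership should be coded 0 or left missing is a decision for the analyst; the sketch only shows how the naive expression can silently place them in the subclass.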

Directions for Future Research

• More appropriate calculation / estimation of design-based and effective degrees of freedom for sparse subclasses or mixed classes

• Development of analytic theory for interval estimation when working with small subclasses, which does not rely on asymptotic results

References

• Kish, L. 1987. Statistical Design for Research. New York: Wiley.

• Rust, K. F., and J. N. K. Rao. 1996. Variance estimation for complex surveys using replication. Statistical Methods in Medical Research 5: 283–310.

• West, B. T., P. Berglund, and S. G. Heeringa. 2007. Alternative Approaches to Subclass Analysis of Complex Sample Survey Data. Proceedings of the 2007 Joint Statistical Meetings.

• West, B. T., P. Berglund, and S. G. Heeringa. 2008. A Closer Examination of Subpopulation Analysis of Complex Sample Survey Data. The Stata Journal 8(3): 1–12.

Questions / Thank You!

• For additional questions, comments, or electronic copies of these slides or the papers, please send an email to [email protected]