Variance Estimation


Optimal Number of Replicates for Variance Estimation

Mansour Fahimi, Darryl Creel, Peter Siegel, Matt Westlake, Ruby Johnson, and Jim Chromy

Third International Conference on Establishment Surveys (ICES-III)

June 21, 2007


Variance Estimation

Two general approaches to variance estimation with weighted data obtained under complex designs:

Linearization

Replication


Linearization

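Approximate a complex statistic $\hat{\theta}$ in terms of $L$ linear statistics $\hat{\theta}_l$

Estimate the variance of $\hat{\theta}$ from:

$$\mathrm{Var}(\hat{\theta}) \approx E\left[\sum_{l=1}^{L} \frac{\partial \theta}{\partial \theta_l}\left(\hat{\theta}_l - E\,\hat{\theta}_l\right)\right]^{2}$$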
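As an illustration of the linearization idea, the sketch below treats a ratio of two weighted totals, so that the L = 2 linear statistics are the numerator and denominator totals. It is a minimal example under stated assumptions, not the SUDAAN implementation: the function name `ratio_linearization_variance` is hypothetical, and the variance step assumes a single-stage, with-replacement design with no strata.

```python
import numpy as np

def ratio_linearization_variance(y, x, w):
    """Ratio estimate Y_hat / X_hat and its first-order (Taylor) variance.

    Assumes single-stage, with-replacement sampling and no strata.
    """
    y, x, w = (np.asarray(a, dtype=float) for a in (y, x, w))
    n = y.size
    y_hat = np.sum(w * y)          # linear statistic 1: weighted total of y
    x_hat = np.sum(w * x)          # linear statistic 2: weighted total of x
    theta_hat = y_hat / x_hat
    # Linearized contribution of each unit: the partial derivatives of Y/X
    # (1/X and -theta/X) applied to the unit's weighted y and x values.
    z = w * (y - theta_hat * x) / x_hat
    # With-replacement variance estimate for the total of z.
    var_hat = n / (n - 1) * np.sum((z - z.mean()) ** 2)
    return float(theta_hat), float(var_hat)

# Made-up data: y is exactly twice x, so the ratio has no sampling variability.
theta_hat, var_hat = ratio_linearization_variance([2, 4, 6], [1, 2, 3], [1, 1, 1])
```

For a stratified multistage design, the same linearized values would be summed to the primary sampling unit level within strata before the variance step.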


Replication

Partition the full sample into R subsamples (replicates)

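Obtain separate estimates of $\theta$ from each replicate: $\hat{\theta}_r, \; r = 1, \ldots, R$

Estimate the variance of $\hat{\theta}$ by:

$$\widehat{\mathrm{Var}}(\hat{\theta}) = \frac{\sum_{r=1}^{R} \left(\hat{\theta}_r - \hat{\theta}\right)^{2}}{R}$$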
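The replication steps on this slide can be sketched as follows. This is a minimal illustration, assuming the variance is the mean squared deviation of the R replicate estimates from the full-sample estimate. The data, the weights, and the way the replicate weights are built (plain bootstrap multiplicities that ignore strata and clusters) are hypothetical and are not the method used for the surveys discussed here.

```python
import numpy as np

def weighted_mean(y, w):
    """Weighted mean of y with weights w."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    return float(np.sum(w * y) / np.sum(w))

def replicate_variance(theta_full, theta_reps):
    """Mean squared deviation of the replicate estimates from the full-sample estimate."""
    theta_reps = np.asarray(theta_reps, dtype=float)
    return float(np.mean((theta_reps - theta_full) ** 2))

# Hypothetical data: 500 respondents with made-up outcomes and weights.
rng = np.random.default_rng(0)
n, R = 500, 64
y = rng.normal(50.0, 10.0, size=n)
w_full = rng.uniform(1.0, 3.0, size=n)

# Hypothetical replicate weights: full-sample weights times bootstrap
# multiplicities (how often each unit is drawn in a resample of size n).
multiplicities = rng.multinomial(n, np.full(n, 1.0 / n), size=R)  # shape (R, n)
w_reps = w_full * multiplicities

theta_full = weighted_mean(y, w_full)
theta_reps = [weighted_mean(y, w_r) for w_r in w_reps]
var_hat = replicate_variance(theta_full, theta_reps)
se_hat = var_hat ** 0.5
```

The same `replicate_variance` call applies to any statistic (a median, a regression coefficient) once it has been recomputed with each set of replicate weights.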


How Many Replicates?

Recommendations regarding the optimal number of replicates for variance estimation are at variance:

Computational resources required can be intensive

For certain statistics a larger number of replicates might be needed to produce stable estimates of variance

What is the point of diminishing returns?


Research Methodology

Relying on two complex establishment surveys, this work presents an array of empirical results regarding the number of bootstrap replicates for variance estimation:

National Study of Postsecondary Faculty (NSOPF:04)

National Postsecondary Student Aid Study (NPSAS:04)

General Design Specifications: National Study of Postsecondary Faculty (NSOPF:04)

Survey of about 35,000 faculty and instructional staff

Across a sample of 1,080 institutions

In the 50 States and the District of Columbia


Sampling Methodology

Institutions selected with probability proportional to a measure of size to over-represent:

Hispanic

Non-Hispanic Black

Asian and Pacific Islander

Full-time other female

Used RTI’s cost/variance optimization procedure for sample allocation
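Selection with probability proportional to a measure of size can be sketched with a systematic draw along the cumulated size measures. This is a generic illustration only: the function `pps_systematic_sample` and the size measures are hypothetical, and the composite size measure and allocation procedure actually used for NSOPF:04 are not reproduced here.

```python
import numpy as np

def pps_systematic_sample(sizes, n, seed=0):
    """Systematic PPS draw: returns selected indices and all selection probabilities."""
    sizes = np.asarray(sizes, dtype=float)
    total = sizes.sum()
    step = total / n                      # sampling interval on the size scale
    if sizes.max() > step:
        raise ValueError("unit exceeds the sampling interval; handle it as a certainty selection")
    rng = np.random.default_rng(seed)
    start = rng.uniform(0.0, step)        # random start within the first interval
    points = start + step * np.arange(n)  # n equally spaced selection points
    upper = np.cumsum(sizes)              # upper bound of each unit's segment
    idx = np.searchsorted(upper, points, side="right")
    idx = np.minimum(idx, sizes.size - 1) # guard against floating-point edge cases
    probs = n * sizes / total             # selection probability of every unit
    return idx, probs

sizes = np.arange(1, 21)                  # hypothetical size measures for 20 institutions
selected, probs = pps_systematic_sample(sizes, n=5)
weights = 1.0 / probs[selected]           # base weights of the selected units
```

Giving a group of interest a larger size measure raises its selection probability, which is how a size measure can be used to over-represent that group.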


Institution Sampling Frame

Degree Granting Carnegie Code              Public  Private   Total
Doctor’s        15, 16, 52                    190      110     300
Master’s        21, 22                        270      320     590
Bachelor’s      31, 32, 33                     90      480     570
Associate’s     40, 60                      1,030      150   1,180
Other/Unknown   51, 53 – 59, unclassified     110      620     730
Total                                       1,700    1,680   3,380


Institution Sample

Degree Granting   Public  Private   Total
Doctor’s             190      110     300
Master’s             120       80     200
Bachelor’s            30      130     160
Associate’s          340       10     350
Other                 10       60      70
Total                680      400   1,080

Expected Faculty Counts From Sampled Institutions by Strata

NSOPF stratum                            Black  Hispanic    Asian     OFTF     OFTM      OPT    Total
Public, doctor’s                        10,720     8,660   32,630   58,870  115,830   51,110  277,820
Public, master’s                         4,670     3,150    4,950   14,120   20,440   22,130   69,460
Public, bachelor’s                         810       340      520    1,430    2,110    3,880    9,090
Public, associate’s                     12,250     9,240    6,100   21,100   21,700   82,570  152,960
Public, other                              150        80      170      290      630      830    2,150
Private not-for-profit, doctor’s         6,060     3,760   13,110   21,490   47,370   33,280  125,080
Private not-for-profit, master’s         1,110       950    1,020    4,930    7,020   12,530   27,550
Private not-for-profit, bachelor’s       1,360       390      670    3,920    6,270    5,440   18,050
Private not-for-profit, associate’s         20        20       40      180      450      480    1,180
Private not-for-profit, other              330       120      250      790    1,680    2,700    5,880
Total                                   37,480    26,710   59,460  127,120  223,500  214,940  689,210

Target Number of Respondents by Institution and Faculty Strata

Institution stratum                  Respondents   Faculty stratum          Respondents
Public doctor’s                            6,200   Non-Hispanic Black             1,600
Public master’s                            2,700   Hispanic                       1,300
Public bachelor’s                            600   Asian                            900
Public associate’s                         7,500   Other full-time female         4,600
Public other                                 500   Other full-time male           8,300
Private not-for-profit doctor’s            2,600   Other part-time                7,800
Private not-for-profit master’s            1,900
Private not-for-profit bachelor’s          1,700
Private not-for-profit associate’s           100
Private not-for-profit other                 700
Total                                     24,500   Total                         24,500

Distribution of Respondents (by institution and faculty strata)

Institution stratum                  Respondents   Faculty stratum          Respondents
Public doctor’s                            7,460   Non-Hispanic Black             2,060
Public master’s                            2,680   Hispanic                       1,700
Public bachelor’s                            450   Asian                          1,610
Public associate’s                         6,410   Other full-time female         5,850
Public other                                 110   Other full-time male           8,500
Private not-for-profit doctor’s            3,160   Other part-time                6,380
Private not-for-profit master’s            2,270
Private not-for-profit bachelor’s          2,520
Private not-for-profit associate’s           190
Private not-for-profit other                 850
Total                                     26,110   Total                         26,110

Variance Estimation Methodology (NSOPF:04)

Used methodology developed by Kaufman (2004) to create bootstrap replicate weights:

Reflected finite population correction adjustment for the first stage (institution) selection.

Second stage (faculty selection) finite population correction factors were close to one and not reflected.

Produced 65 bootstrap replicates to meet Data Analysis System (DAS) requirements of NCES.

Calculated standard errors of several statistics using both the above bootstrap replicates and the Taylor linearization method in SUDAAN.

Comparisons of Variance Estimates: SE of Percent Teaching as Principal Activity by Rank (Bootstrap vs. Linearization)

[Figure: standard error (0.0 to 1.0) by rank: Total, Professor, Associate professor, Assistant professor, Instructor, Lecturer, Other title.]

Comparisons of Variance Estimates: SE of Percent Research as Principal Activity by Rank

[Figure: standard error (0.0 to 1.0) by rank, including Assistant professor, Instructor, Lecturer, and Other title.]

Comparisons of Variance Estimates: SE of Percent Administration as Principal Activity by Rank

[Figure: standard error (0.0 to 1.0) by rank, including Assistant professor, Instructor, Lecturer, and Other title.]

Comparisons of Variance Estimates: SE of Percent Full-time by Institution Type

[Figure: standard error (0.0 to 20.0) by institution type: Public Ph.D., Public MS, Public BA, Public Asso., Public Other, Private Ph.D., Private MS, Private BA, Private Asso., Private Other.]

Revised Variance Estimation Methodology (NSOPF:04)

Used methodology developed by Kaufman (2004) to create 200 bootstrap replicate weights.

Used 10, 11, …, 200 replicates to estimate the relative standard error (RSE) of different statistics.

Repeated the above using 9 random permutations of the replicates to estimate the RSE of the same statistics.

Used Taylor linearization to estimate relative standard error of estimates via SUDAAN.
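The procedure on this slide can be sketched as below: compute the RSE from the first 10, 11, …, 200 replicates, then repeat after randomly permuting the order of the replicates. This is a minimal sketch, assuming the replication variance is the mean squared deviation of replicate estimates from the full-sample estimate. The function names and the simulated replicate estimates are hypothetical stand-ins for estimates computed from actual replicate weights.

```python
import numpy as np

def rse_by_replicate_count(theta_full, theta_reps, r_min=10):
    """RSE computed from the first r replicates, for r = r_min, ..., R."""
    theta_reps = np.asarray(theta_reps, dtype=float)
    sq_dev = (theta_reps - theta_full) ** 2
    counts = np.arange(r_min, theta_reps.size + 1)
    # Running mean of squared deviations = variance from the first r replicates.
    var_r = np.cumsum(sq_dev)[counts - 1] / counts
    return counts, np.sqrt(var_r) / abs(theta_full)

def rse_over_permutations(theta_full, theta_reps, n_perm=9, seed=0, r_min=10):
    """Repeat the RSE curve for n_perm random orderings of the replicates."""
    rng = np.random.default_rng(seed)
    curves = []
    for _ in range(n_perm):
        shuffled = rng.permutation(theta_reps)
        counts, rse = rse_by_replicate_count(theta_full, shuffled, r_min)
        curves.append(rse)
    return counts, np.vstack(curves)

# Hypothetical replicate estimates of a proportion near 0.25.
rng = np.random.default_rng(1)
theta_full = 0.25
theta_reps = theta_full + rng.normal(0.0, 0.005, size=200)

counts, rse = rse_by_replicate_count(theta_full, theta_reps)
counts_p, curves = rse_over_permutations(theta_full, theta_reps)
```

Every permutation uses the same 200 replicates, so all curves meet at the right-hand end. The spread between curves at smaller replicate counts shows how much the RSE depends on which replicates happen to be included.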

RSE of Percent Asians by Number of Replicates

[Figure: relative standard error (0.60% to 1.00%) versus number of bootstrap replicates (0 to 200).]

RSE of Percent Asians by Number of Replicates (Taylor Linearization and Permutations of Replicates)

[Figure: relative standard error (0.40% to 3.40%) versus number of replicates (0 to 200).]

RSE of Percent Age < 35 by Number of Replicates

[Figure: relative standard error (2.00% to 2.60%) versus number of replicates (0 to 200).]

RSE of Percent Age < 35 by Number of Replicates (Taylor Linearization and Permutations of Replicates)

[Figure: relative standard error (1.2% to 3.6%) versus number of replicates (0 to 200).]

RSE of Percent Citizen by Number of Replicates

[Figure: relative standard error (0.15% to 0.25%) versus number of replicates (0 to 200).]

RSE of Percent Citizen by Number of Replicates (Taylor Linearization and Permutations of Replicates)

[Figure: relative standard error (0.10% to 0.25%) versus number of replicates (0 to 200).]

RSE of Percent Full-time by Number of Replicates

[Figure: relative standard error (0.0000001% to 0.0000008%) versus number of replicates (0 to 200).]

RSE of Percent Full-time by Number of Replicates (Taylor Linearization and Permutations of Replicates)

[Figure: relative standard error (0.00% to 1.20%) versus number of replicates (0 to 200).]

RSE of Percent Master’s by Number of Replicates

[Figure: relative standard error (0.80% to 1.60%) versus number of replicates (0 to 200).]

RSE of Percent Master’s by Number of Replicates (Taylor Linearization and Permutations of Replicates)

[Figure: relative standard error (0.80% to 1.80%) versus number of replicates (0 to 200).]

RSE of Percent Teaching as Principal Activity by Number of Replicates

[Figure: relative standard error (0.35% to 0.55%) versus number of replicates (0 to 200).]

RSE of Percent Teaching as Principal Activity by Number of Replicates (Taylor Linearization and Permutations of Replicates)

[Figure: relative standard error (0.25% to 0.55%) versus number of replicates (0 to 200).]

RSE of Mean Income by Number of Replicates

[Figure: relative standard error (0.30% to 0.60%) versus number of replicates (0 to 200).]

RSE of Mean Income by Number of Replicates (Taylor Linearization and Permutations of Replicates)

[Figure: relative standard error (0.20% to 1.20%) versus number of replicates (0 to 200).]

RSE of Median Income by Number of Replicates

[Figure: relative standard error (1.00% to 1.50%) versus number of replicates (0 to 200).]

RSE of Median Income by Number of Replicates (Taylor Linearization and Permutations of Replicates)

[Figure: relative standard error (0.10% to 2.00%) versus number of replicates (0 to 200).]

RSE of Regression Intercept: Income = Hours + Race + Hours × Race

[Figure: relative standard error (0.60% to 1.00%) versus number of replicates (0 to 200).]

RSE of Regression Intercept: Income = Hours + Race + Hours × Race

[Figure: relative standard error (0.60% to 1.10%) versus number of replicates (0 to 200).]

RSE of Regression Slope (Hours): Income = Hours + Race + Hours × Race

[Figure: relative standard error (6.00% to 10.00%) versus number of replicates (0 to 200).]

RSE of Regression Slope (Hours): Income = Hours + Race + Hours × Race

[Figure: relative standard error (5.0% to 13.0%) versus number of replicates (0 to 200).]

RSE of Regression Slope (Race): Income = Hours + Race + Hours × Race

[Figure: relative standard error (18.0% to 28.0%) versus number of replicates (0 to 200).]

RSE of Regression Slope (Race): Income = Hours + Race + Hours × Race

[Figure: relative standard error (15.0% to 30.0%) versus number of replicates (0 to 200).]

RSE of Regression Slope (Hours × Race): Income = Hours + Race + Hours × Race

[Figure: relative standard error (30.0% to 60.0%) versus number of replicates (0 to 200).]

RSE of Regression Slope (Hours × Race): Income = Hours + Race + Hours × Race

[Figure: relative standard error (20.0% to 110.0%) versus number of replicates (0 to 200).]

Conclusions (Rough & Interim)

Complex statistics do require more replicates for stable variance estimation

It seems that:

64 replicates might be inadequate

200 replicates seem to be overkill

Somewhere between 100 and 200 replicates might be sufficient
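A rough way to gauge the diminishing returns, offered as a back-of-the-envelope sketch rather than a result from this study: if the R replicate estimates are treated as approximately independent and normally distributed about a common center (an assumption not made in the slides), the replication variance estimator behaves like a scaled chi-square variable with about R degrees of freedom, so its own coefficient of variation is approximately

$$\mathrm{CV}\left[\widehat{\mathrm{Var}}(\hat{\theta})\right] \approx \sqrt{\frac{2}{R}}$$

Under that assumption the value is about 0.18 for R = 64, about 0.14 for R = 100, and 0.10 for R = 200, so each additional replicate buys less stability as R grows.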
