Working with the ECLS-K Datasets Weights and other issues. Information is courtesy of the Institute...

Post on 28-Mar-2015


Working with the ECLS-K Datasets

Weights and other issues.

Information is courtesy of the Institute of Education Sciences, National Center for Education Statistics, and is used in their training seminars.

Sampling Weights

• What are sampling weights and why are they important?

• How are weights used?

• What weights are on the ECLS-K data files and when should they be used?

What is a “Weight”?

• A weight is used to indicate the relative strength of an observation.

• In the simplest case, each observation is counted equally.

• For example, if we have five observations, and wish to calculate the mean, we just add up the values and divide by 5.

How are Weights Used?

• Dataset with 5 cases.

• Value 4 2 1 5 2

• Weight 1 2 4 1 2

• Sample mean = (4 + 2 + 1 + 5 + 2)/5 = 14/5 = 2.8

• Weighted mean = [(4*1) + (2*2) + (1*4) + (5*1) + (2*2)]/(sum of weights) = (4 + 4 + 4 + 5 + 4)/10 = 21/10 = 2.1
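The two means in the five-case example can be checked with a few lines of Python. This is only an illustrative sketch; the function names are made up for this example and are not part of any ECLS-K tooling.

```python
# Values and weights from the five-case example.
values = [4, 2, 1, 5, 2]
weights = [1, 2, 4, 1, 2]


def unweighted_mean(xs):
    """Each case counts equally."""
    return sum(xs) / len(xs)


def weighted_mean(xs, ws):
    """Each case counts in proportion to its weight."""
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)


sample_mean = unweighted_mean(values)
wmean = weighted_mean(values, weights)
print(round(sample_mean, 2), round(wmean, 2))
```

The case with value 1 carries a weight of 4, which is why the weighted mean falls below the unweighted one.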

What is the Difference Between Weighted and Unweighted Data?

• With unweighted data, each case is counted equally.

• Unweighted data represent only those in the sample who provide data.

• With weighted data, each case is counted relative to its representation in the population.

• Weights allow analyses that represent the target population.

ECLS-K and Weights

• The ECLS-K is a sample, i.e. the entire population was not surveyed.

• The ECLS-K is not a simple random sample (SRS). That is, not all schools, teachers, and children had an equal probability of selection.

• Not all schools, teachers, and children participated.

Why Use Weights in the ECLS-K?

• The ECLS-K weights allow you to make statements about the population of U.S. children who were in kindergarten in 1998-99 or in first grade in 1999-2000. Without using weights, estimates are not nationally representative.

• Weights adjust for differential selection probabilities and reduce bias associated with non-response by adjusting for differential nonresponse.

Examples of Weighted vs. Unweighted Data

Base-year characteristics. Columns: (1) unweighted %; (2) weighted % for sampling design (base weight); (3) weighted % for sampling design and non-response (C1CW0).

Characteristic      (1)   (2)   (3)
Race/Ethnicity
  White              57    56    58
  Black              15    16    16
  Hispanic           18    20    19
  Asian               6     3     3
School Type
  Public             78    87    85
  Private            22    13    15

Examples of Weighted vs. Unweighted Data

First-grade characteristics. Columns: (1) unweighted %; (2) weighted % for sampling design (base weight); (3) weighted % for sampling design and non-response (C4PW0).

Characteristic      (1)   (2)   (3)
Household SES
  Bottom 20%         17    19    20
  Middle 60%         59    60    60
  Highest 20%        24    21    20
Family Type
  Two parents        78    76    73
  Single parent      20    22    24
  Other               2     2     2

Types of Weights on the ECLS-K

• Weights vary according to:

Level of analysis: child, teacher, or school (only child-level after base year).

Round(s) of data: cross-sectional or longitudinal.

Source(s) of data: child assessment, parent interview, and/or teacher questionnaires.

Level of Analysis – Base Year

• Weights for School-level analyses begin with “S”.

• Weights for Teacher-level analyses begin with “B”.

• Weights for Child-level analyses begin with “C” (cross-sectional).

• Weights for Child-level analyses begin with “BY” (longitudinal).

The first element in a weight variable name indicates the level of analysis.

Level of Analysis – 1st, 3rd and 5th Grades

• Weights for Child-level analyses (cross sectional and longitudinal) begin with “C”.

• One exception: weight Y2COMW0 is for child-level analyses of assessment data from rounds 1, 2 and 4 and parent and/or teacher data from spring of first grade, and one or more base year rounds of parent and/or teacher data.

Data Round(s)

• Weights for cross-sectional analyses have a single round number: 1,2,3,4,5 or 6.

• Weights for longitudinal analyses have 2 or more numbers, for example:

• “45” for rounds 4 and 5.

• “124” for rounds 1, 2 and 4 (exception in Y2COMW0).

• “1_4” for rounds 1, 2, 3 and 4.

• “1_6F” for rounds 1, 2, 4, 5, 6 (F=full sample).

• “1_5S” for rounds 1, 2, 3, 4, 5 (S=subsample).

The second element in a weight variable name indicates the round(s) of data.

Source of the Data

• Child assessments (alone or in conjunction with any combination of a limited set of child characteristics, e.g. age, sex, race/ethnicity) have a “C”.

• Parent interview (with or without child data) have a “P”.

• Child AND parent AND teacher have a “CPT”.

• In 5th grade, the “CPT” is followed by either “R”, “M” or “S” for reading, math or science teacher.

The third element in a weight variable name indicates the source(s) of data.

Weights for analyses using data from:

Sources of the Data

Two exceptions:

• BYCOMW0: Child assessment data from fall AND spring kindergarten in conjunction with one or more rounds of parent and/or teacher base year data.

• Y2COMW0: Child direct assessment data from fall AND spring kindergarten AND spring first grade, in conjunction with parent and/or teacher data from spring first grade, AND one or more base year rounds of parent and/or teacher data.

Source of the Data

Sources that do not affect choice of weight:

• School administrator questionnaire

• Facilities checklist

• Teacher questionnaire C

• Special education questionnaires

• Student record abstract data

• Head Start data

• Salary and benefits data

Example

C23PW0

• “C” for child level analysis.

• “23” for analysis of data from rounds 2 and 3.

• “P” for analysis of parent interview data.

Example

C6CPTM0

• “C” for child level analysis.

• “6” for analysis of data from round 6.

• “CPTM” for analysis of child, parent, and math teacher.
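The three-part naming convention can be illustrated with a small parser. This is a hedged sketch, not an official tool: the regular expression and the function name `parse_weight_name` are assumptions made for this example, and it only handles names with a plain run of round digits (for example C23PW0 or C6CPTM0). Exceptions such as BYCOMW0, Y2COMW0, and range-style names such as “1_4” are deliberately left unparsed.

```python
import re

# First element: level of analysis.
LEVELS = {"S": "school", "B": "teacher", "C": "child"}

# Assumed pattern: level letter, round digits, source letters,
# an optional "W", and a final "0".
PATTERN = re.compile(r"^(?P<level>[SBC])(?P<rounds>\d+)(?P<source>[A-Z]+?)W?0$")


def parse_weight_name(name):
    """Split a weight name into level, rounds and source; None if unsupported."""
    match = PATTERN.match(name.upper())
    if match is None:
        return None
    return {
        "level": LEVELS[match.group("level")],
        "rounds": [int(d) for d in match.group("rounds")],
        "source": match.group("source"),
    }


for example in ("C23PW0", "C6CPTM0", "S2SAQW0", "BYCOMW0"):
    print(example, parse_weight_name(example))
```

Returning `None` for the exceptions keeps the sketch from guessing at names that do not follow the simple pattern.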

Cross-sectional Examples:

• C1PW0 -- Child-level analyses from round 1, parent interview data (with or without child assessment data).

• B1TW0 -- Teacher level analyses (teacher data) from round 1.

• S2SAQW0 -- School-level analysis (SAQ data) from round 2.

• C6CW0 -- Child assessment data from round 6.

• C5CPTW0 -- Child-level analyses from round 5 with child, parent AND teacher data.

Longitudinal Examples

• BYPW0 – Round 1 and 2 parent interview data.

• BYCOMW0 – Round 1 and 2 assessment data and some other parent and teacher data.

• C24PW0 – Round 2 and 4 parent interview data.

• C245CW0 – Round 2, 4 and 5 assessment data.

• C1_6FC0 – Round 1, 2, 4, 5 and 6 assessment data.

All longitudinal weights are for child-level analyses.

Third and Fifth-Grade Weights

• Unlike the first grade sample, the ECLS-K sample was not freshened in third and fifth grade.

• The ECLS-K sample does not represent all third graders in 2001-02 or fifth graders in 2003-04. These samples represent all children who began kindergarten in 1998 or began first grade in 1999.

How to Use Weights

• In SAS, use the “WEIGHT” statement.

• In SPSS, use the “WEIGHT BY” statement.

• Key Fact: All ECLS-K weights sum to population totals.

Weights in SAS

• SAS uses the WEIGHT statement in various PROCedures.

    /* Weighted frequency tables for Age, Gender and Score */
    proc freq data = test;
      tables Age Gender Score;
      weight weightvar;
    run;

Weights in SPSS

    * List the cases, then run unweighted frequencies.
    LIST VARIABLES = age TO weightvar.
    FREQUENCIES VARIABLES = age, score /STATISTICS = DEFAULT.
    * Turn weighting on and rerun the frequencies.
    WEIGHT BY weightvar.
    FREQUENCIES VARIABLES = age, score /STATISTICS = DEFAULT.

Weights in STATA

    clear
    use "c:\temp\test1.dta"
    * tabulate takes one or two variables and no pweights; use aweights, one variable at a time
    foreach v of varlist score age gender {
        tabulate `v' [aweight=weightvar]
    }

Weights for HLM Users

• ECLS-K weights are adjusted for nonresponse.

• ECLS-K weights are not normalized (they sum to the population N rather than the sample n).

• A within-school child-level weight can be approximated by dividing a regular child-level weight by the school-level weight.

• If the analysis includes children that stayed in the same school at each round of the analysis, the school weight (S2SAQW0) can be used as a school-level weight.

Other Frequently Asked Questions

• When selecting a weight, do I have to subset my dataset?

• What happens to cases where there is no positive weight?

• What weights do I use if analyzing a subsample of cases?

• What if I’m running a regression – what weights do I use?

Summary about Weights

• Weights should be used when analyzing data from the ECLS-K.

• The appropriate weight should be selected based on: Level of analysis, Round(s) of data, and Source(s) of data.

• There may not be a “perfect” weight for some analyses. The best weight can be determined with some descriptive analysis.

Variance, Calculating Standard Errors

• Why are standard errors important?

• Why not use standard errors that assume a simple random sample (SRS)?

• How to use “exact” methods for estimating standard errors.

• How to use approximation methods for estimating standard errors.

Why are Standard Errors Important?

• Standard errors are produced for estimates from sample surveys. They are a measure of the variance in the estimates associated with the selected sample being one of many possible samples.

• Standard errors are used to test hypotheses and to study group differences when making inferences to a population.

• Using inaccurate standard errors can lead to identification of statistically significant results where none are present and vice versa.

Important Considerations

• All weights on the ECLS-K data files sum to population totals and not sample totals.

• The ECLS-K has a complex sample design and is not a simple random sample.

The ECLS-K Sample Design: Oversampling

• The ECLS-K includes oversamples of private schools and private school children.

• The ECLS-K also oversamples Asian and Pacific Islander children.

The ECLS-K Sample Design: Clustering

• Sample children were clustered within primary sampling units (PSUs) to reduce field costs.

• Children were in closer geographical proximity than would occur in a simple random sample.

• Children in a clustered sample tend to be more alike than those in a simple random sample.

Complex Samples and Standard Errors

• The usual standard error formula assumes a simple random sample.

• Standard errors for estimates from a complex sample must account for the within cluster/across cluster variation.

• Special software can make the adjustment, or this adjustment can be approximated using the design effect.

Options

• Exact Methods such as the TAYLOR series and REPLICATION techniques.

• Approximation Method

Exact Methods

• Taylor series

• Extract PSU and strata IDs from the data file.

• Software available: SUDAAN, STATA (using SVY commands), and SAS (using PROC SURVEY commands).

Exact Methods

Replication Techniques

• Extract replication weights (90 of them).

• ECLS-K replication weights use jackknife 2 (JK2) methods.

• Software: WESVAR replication series (JK2), AM (JK2), and SAS callable SUDAAN.
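To make the replication idea concrete, the sketch below computes a JK2-style variance for a weighted mean. It rests on assumptions: the variance formula used (the sum of squared differences between each replicate estimate and the full-sample estimate) is the form commonly documented for JK2 and is not stated in the slides, and the data and two replicate weights are invented for illustration. The ECLS-K files carry 90 replicate weights, and dedicated software such as WesVar or AM should be preferred for real analyses.

```python
import math


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jk2_variance(values, full_weights, replicate_weights):
    """Sum of squared deviations of replicate estimates from the full-sample estimate."""
    theta = weighted_mean(values, full_weights)
    return sum(
        (weighted_mean(values, rw) - theta) ** 2 for rw in replicate_weights
    )


# Hypothetical data: four cases forming two pairs.
values = [1.0, 2.0, 3.0, 4.0]
full_weights = [1.0, 1.0, 1.0, 1.0]

# Each replicate doubles one member of a pair and zeroes the other.
replicate_weights = [
    [2.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0, 0.0],
]

variance = jk2_variance(values, full_weights, replicate_weights)
standard_error = math.sqrt(variance)
print(variance, standard_error)
```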

Approximation Method

• Two stages:

• First, normalize weights so standard error is based on actual sample size rather than population size.

• Then, use design effect (DEFF) to account for complex sampling design.

1) Normalizing Weights*

• Weights on the ECLS-K sum to the population totals.

• Calculate a new weight that sums to the sample size.

• Normalized weights = (ECLS-K weight) * (sample n/population N).

• *SAS users do not need this step since estimates are produced based on the actual sample size.

Example – Normalizing Weights

• Weight to be normalized: C2PW0

• Sum of weights: 3,865,946

• Total number of cases with a positive weight: 18,950

• Normalized weight = C2PW0 * (18,950 / 3,865,946)
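The normalization step can be sketched in Python as follows. The sum of weights and case count are the C2PW0 figures from the example; the three-case weight list and the function name `normalize_weights` are hypothetical and exist only to show that normalized weights sum to the number of cases with a positive weight.

```python
# Figures from the C2PW0 example.
SUM_OF_WEIGHTS = 3_865_946
N_POSITIVE_CASES = 18_950

# Every C2PW0 value would be multiplied by this factor.
factor = N_POSITIVE_CASES / SUM_OF_WEIGHTS


def normalize_weights(weights):
    """Rescale weights so they sum to the number of positive-weight cases."""
    positive = [w for w in weights if w > 0]
    scale = len(positive) / sum(positive)
    return [w * scale for w in weights]


# Hypothetical three-case file, for illustration only.
toy_weights = [150.0, 250.0, 100.0]
normalized = normalize_weights(toy_weights)
print(factor, normalized, sum(normalized))
```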

2) Adjusting for Complex Design

• The ECLS-K has a complex sample design; it is not a simple random sample.

• Software packages designed for simple random samples tend to underestimate the standard errors for complex sample designs.

• Special methods are required for complex designs.

Using Design Effects (DEFF)

• What is a design effect (DEFF)?

• It’s the ratio of the variance found in the actual (complex) sample design to the variance expected in a simple random sample of the same sample size.

Using Design Effects (DEFF)

• DEFT = the square root of DEFF = (design standard error / simple random sample standard error).

• Example for fall-kindergarten reading scores

• SE (SRS) = 0.063

• SE (Design) = 0.156

• DEFF = 0.156^2/0.063^2 = 6.15 (the rounded standard errors shown give about 6.13; 6.15 matches the rounded DEFT below)

• DEFT = 0.156/0.063 = square root of 6.15 = 2.48
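The arithmetic for the fall-kindergarten reading example can be reproduced directly. This sketch uses only the two rounded standard errors quoted above; with those inputs the DEFF comes out near 6.13 rather than 6.15, presumably because the slide's figure was derived from the rounded DEFT or from unrounded standard errors (an assumption, since the slides do not say).

```python
# Standard errors quoted in the fall-kindergarten reading example.
se_srs = 0.063      # assumes a simple random sample
se_design = 0.156   # accounts for the complex design

# DEFT is the ratio of standard errors; DEFF is the ratio of variances.
deft = se_design / se_srs
deff = deft ** 2

print(round(deft, 2), round(deff, 2))
```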

3 Ways of Using the DEFF

• Multiply the SRS (simple random sample) standard error produced by statistical software (when using normalized weights) by the square root of the DEFF (DEFT).

• Or

• Adjust the t-statistic by dividing it by the square root of the design effect (DEFT) or adjust the F-statistic by dividing it by the DEFF.

• Or

• Adjust the weight such that an adjusted standard error is produced.
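The first two options amount to simple rescaling, sketched below. The function names and the t and F values are hypothetical; the DEFF of 6.15 and the SRS standard error of 0.063 come from the reading-score example.

```python
import math


def adjust_standard_error(se_srs, deff):
    """Option 1: inflate the SRS standard error by DEFT."""
    return se_srs * math.sqrt(deff)


def adjust_t_statistic(t_srs, deff):
    """Option 2a: deflate a t-statistic by DEFT."""
    return t_srs / math.sqrt(deff)


def adjust_f_statistic(f_srs, deff):
    """Option 2b: deflate an F-statistic by DEFF."""
    return f_srs / deff


deff = 6.15
adjusted_se = adjust_standard_error(0.063, deff)
adjusted_t = adjust_t_statistic(4.96, deff)   # hypothetical t value
adjusted_f = adjust_f_statistic(12.3, deff)   # hypothetical F value
print(round(adjusted_se, 3), round(adjusted_t, 2), round(adjusted_f, 2))
```

A t-statistic that looks large under SRS assumptions (4.96 here) shrinks to roughly 2 once the design effect is taken into account.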

Using a DEFF-Adjusted Weight

• First step, create a weight that sums to the sample size (normalized weight).

• Second, divide this normalized weight by the DEFF.

• Third, use this weight for analyses. The standard errors produced will approximate the standard errors obtained using “exact” methods.
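The three steps can be sketched as a single function. The weights and the DEFF of 2.0 below are invented for illustration, and `deff_adjusted_weights` is a hypothetical name. The adjusted weights sum to n/DEFF, so software that treats the sum of weights as the sample size behaves as if the sample were smaller and reports correspondingly larger standard errors.

```python
def deff_adjusted_weights(weights, deff):
    """Normalize weights to sum to n, then divide each by the design effect."""
    positive = [w for w in weights if w > 0]
    n = len(positive)
    total = sum(positive)
    return [(w * n / total) / deff for w in weights]


# Hypothetical weights and design effect.
toy_weights = [150.0, 250.0, 100.0]
adjusted = deff_adjusted_weights(toy_weights, deff=2.0)
print(adjusted, sum(adjusted))
```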

Where to find ECLS-K DEFFs

• Training material: “ECLS-K Specifications for Computing Standard Errors”

• ECLS-K users’ manuals:

• Base Year (Kindergarten): Table 4.12

• First Grade: Tables 4.13 and 9.4

• Third Grade: Tables 4.14 and 9.2

• Fifth Grade: Tables 4.19 and 9.2

For SAS Users

• SAS base procedures such as PROC REG, PROC FREQ, and PROC MEANS do account for the actual sample size but not for complex sampling.

• SAS procedures such as PROC SURVEYMEANS and PROC SURVEYREG (and other procedures that begin with “SURVEY”) use the Taylor series method to account for complex sampling and provide exact estimates of the standard errors.

PROC SURVEYREG Example

• Example using ECLS-K data, spring kindergarten and spring first grade variables.

    proc surveyreg data = fscores;
      model c4r3mscl = c2r3mscl lowkread t4learn;
      strata c24cstr;   /* stratum identifier */
      cluster c24cpsu;  /* PSU identifier */
      weight c24cw0;
      where lowkmath = 0;
    run;

PROC SURVEYLOGISTIC Example

• Example using ECLS-K data, spring kindergarten and spring first grade variables.

    proc surveylogistic data = fscores;
      model lowkread (desc) = c2r3mscl t4learn;
      strata c24cstr;   /* stratum identifier */
      cluster c24cpsu;  /* PSU identifier */
      weight c24cw0;
      where lowkmath = 0;
    run;

PROC SURVEYFREQ Example

• Example using ECLS-K data, spring kindergarten and spring first grade variables.

    proc surveyfreq data = fscores;
      tables lowkread c2r3mscl t4learn;
      strata c24cstr;   /* stratum identifier */
      cluster c24cpsu;  /* PSU identifier */
      weight c24cw0;
    run;

STATA Code for Complex Design

• Logistic Regression Example, 3rd Grade Data

    svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR)
    svy, subpop(male): logit highbmi white

STATA Code for Complex Design

• Regression Example, 3rd Grade Data

    svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR)
    svy, subpop(male): regress highbmi white

STATA Code for Complex Design

• Means Example, 3rd Grade Data

    svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR)
    svy, subpop(male): mean highbmi female

SPSS for Complex Sample Design

• Use the SPSS add-on called SPSS Complex Samples™.

• Complex Samples Logistic Regression (CSLOGISTIC)—Performs binary logistic regression analysis, as well as multiple logistic regression (MLR) analysis, for samples drawn by complex sampling methods. The procedure estimates variances by taking into account the sample design used to select the sample, including equal probability and PPS methods, and WR and WOR sampling procedures. Optionally, CSLOGISTIC performs analyses for subpopulations.

• Courtesy of SPSS

Regression Analysis

• Use appropriate software such as AM, WESVAR, SUDAAN or SAS (SURVEYREG procedure).

• For SAS (PROC REG procedure), use DEFF-adjusted weights.

• For SPSS, use normalized, DEFF-adjusted weights.

Summary

• Preferred: Use software that incorporates JK2 replication methods, or

• Use software that incorporates Taylor series method, or

• Last resort: Make approximate adjustments based on design effects.

All statistical tests should be based on standard errors that are calculated to account for the complex sample design of the ECLS-K.

ECLS-K Data Availability

• Base Year (Kindergarten) through 5th Grade restricted use and Public Use datasets have been released.

• 8th Grade restricted use dataset should be released in the winter of 2008 and the public datasets should be released in March 2009.

Differences in Restricted Use and Public Use ECLS-K Datasets.

• Here’s a short explanation from the NCES: http://nces.ed.gov/ecls/kinderfaq.asp?faq=1

• Chapter 7 in the ECLS-K, 5th Grade User’s Guide has Tables 7-15 and 7-16 that describe the differences in the public and restricted datasets. The User’s Guide can be found online at: http://sodapop.pop.psu.edu/codebooks/ecls/k5userpart2.pdf