Course

Advanced sampling theory

3

Statistics Netherlands

Division Research and Development

Department of Statistical Methods

P.O. box 4481

6401 CZ Heerlen

The Netherlands

A Course in Sampling Theory

by Robbert H. Renssen*

* The views expressed in this paper are those of the author and do not necessarily reflect the policies of Statistics Netherlands.

Project no.: RSM-50351

BPA no.: 2138-98-RSM-1

Date: 16 March 1998

A course in sampling theory

by

Robbert H. Renssen

Section Statistical Methods

Division Research and Development

1. Introduction

The purpose of sample surveys is to gather information about a certain finite population by estimating finite population parameters such as means, totals, or fractions. In sampling theory, observations obtained from the sampling units are regarded as fixed. The randomness is introduced because a probability sample is observed instead of the whole population. The population to be sampled (the sampled population) should coincide with the population about which information is wanted (the target population). Sometimes, for reasons of practicability or convenience, the sampled population is more restricted than the target population. If so, it should be remembered that the conclusions drawn from the sample only apply to the sampled population. Before selecting the sample, the population must be divided into parts that are called sampling units. In principle, these units must cover the whole population and they must not overlap, in the sense that each element in the population belongs to exactly one sampling unit. The construction of a list of sampling units, called a sampling frame, is often one of the major practical problems. Sampling frames are often found to be incomplete, or partly illegible, or to contain an unknown amount of duplication.

A common starting point in a survey design is to concentrate first on the mere effects of sampling and to ignore any frame imperfections or any other errors that may occur. When developing a sample survey the following issues are important, assuming that the survey is based on a probability sample:

the sampling design and the sampling selection scheme, both applying before data collection,

the estimator by which a particular parameter will be estimated, applying after the data collection.

The sampling design is a set of specifications, which defines the target population, the sampling units, and the probabilities attached to the possible samples. The sample selection scheme describes the mechanical selection of a sample according to the chosen design. Which design fits best for a particular survey depends on the auxiliary information present in the frame. The more information is available before sampling the better the sampling design can be tailored to the survey objectives. The estimator is the mathematical function by which the estimate for a particular parameter is computed. The form of the parameter often induces the choice of an estimator. An estimator may contain auxiliary information, either from the sampling frame or from external sources. The combination of a sampling design and an estimator is called a sampling strategy.

2. The design based approach; definitions and notations

We consider a finite population U of N elements/units and associate with each element k a value $y_k$ of a scalar target variable and a p-vector $x_k$ with values of p auxiliary variables. Note that both Y and X may be interesting for publication purposes. By means of some (complex) design a sample S of fixed size n is drawn from U. Given the sampling design we consider the set $\Omega$ of all possible samples. Denote for each $S \in \Omega$

$p(S)$: probability of a specific sample S, and

$\hat\theta(S)$: estimator of a (finite) population parameter $\theta$ by means of S.

According to the design based approach, inference proceeds with respect to the sampling distribution of statistics over repeated samples S generated by the sampling design (Skinner et al., 1989, chap. 1). For example, the design expectation and design variance are defined as

$$E(\hat\theta) = \sum_{S \in \Omega} p(S)\,\hat\theta(S)$$

and

$$\mathrm{var}(\hat\theta) = \sum_{S \in \Omega} p(S)\,\{\hat\theta(S) - E(\hat\theta)\}^2 .$$

Let $t_k(S)$ denote the number of times the element k is drawn for a specific S, so that $\sum_{k \in U} t_k(S) = n$. Define the first order inclusion expectation of k as

$$\pi_k = E(t_k) = \sum_{S \in \Omega} p(S)\,t_k(S), \quad (k = 1,\dots,N).$$

The second order inclusion expectation of elements k and l is defined as

$$\pi_{kl} = E(t_k t_l) = \sum_{S \in \Omega} p(S)\,t_k(S)\,t_l(S) .$$

In the following we will consider a number of well-known sampling designs and corresponding estimators.
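The definitions of this section can be checked numerically on a toy population. The sketch below is only an illustration under stated assumptions: the population values are made up, and the design is assumed to give every subset of fixed size n the same probability (simple random sampling without replacement). The function names are hypothetical and not part of the course material.

```python
from fractions import Fraction
from itertools import combinations


def equal_probability_design(N, n):
    """Return [(sample, probability)] for all subsets of size n of {0,...,N-1}."""
    samples = list(combinations(range(N), n))
    p = Fraction(1, len(samples))
    return [(S, p) for S in samples]


def design_expectation(design, estimator):
    """Sum of p(S) * estimator(S) over all possible samples."""
    return sum(p * estimator(S) for S, p in design)


def design_variance(design, estimator):
    """Sum of p(S) * (estimator(S) - expectation)^2 over all possible samples."""
    mean = design_expectation(design, estimator)
    return sum(p * (estimator(S) - mean) ** 2 for S, p in design)


def inclusion_expectation(design, k, l=None):
    """First order (l is None) or second order inclusion expectation."""
    if l is None:
        return sum(p for S, p in design if k in S)
    return sum(p for S, p in design if k in S and l in S)


# Hypothetical population of N = 5 values; samples of size n = 2.
y = [Fraction(v) for v in (3, 7, 8, 12, 20)]
design = equal_probability_design(N=5, n=2)


def sample_mean(S):
    return sum(y[k] for k in S) / len(S)


exp_mean = design_expectation(design, sample_mean)
var_mean = design_variance(design, sample_mean)
pi_0 = inclusion_expectation(design, 0)
pi_01 = inclusion_expectation(design, 0, 1)
```

Exact rational arithmetic is used so that the enumerated quantities can be compared with closed-form expressions without rounding error.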

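3. Simple random sampling, model assisted approach, and simplified form

Under simple random sampling each sample has the same probability of being drawn. We consider only simple random sampling without replacement. The population mean and sample mean of Y are denoted by, respectively,

$$\bar Y = \frac{1}{N}\sum_{k \in U} y_k$$

and

$$\bar y = \frac{1}{n}\sum_{k \in S} y_k ,$$

where S is a simple random sample without replacement. The population and sample variance with respect to Y are

$$S_y^2 = \frac{1}{N-1}\sum_{k \in U} (y_k - \bar Y)^2$$

and

$$s_y^2 = \frac{1}{n-1}\sum_{k \in S} (y_k - \bar y)^2 .$$

The set $\Omega$ consists of $\binom{N}{n}$ samples, of which $\binom{N-1}{n-1}$ contain element k and $\binom{N-2}{n-2}$ contain element k as well as element l, $k \neq l$. So,

$$\pi_k = \binom{N-1}{n-1} \Big/ \binom{N}{n} = \frac{n}{N}$$

and

$$\pi_{kl} = \binom{N-2}{n-2} \Big/ \binom{N}{n} = \frac{n(n-1)}{N(N-1)} \quad \text{for } k \neq l.$$

Note that $\pi_{kl} = \pi_k$ for k = l. A direct estimator for the population mean in case of simple random sampling (without use of auxiliary information) is the sample mean. The following proofs are illustrative:

$$E(\bar y) = E\Big(\frac{1}{n}\sum_{k \in U} t_k y_k\Big) = \frac{1}{n}\sum_{k \in U} \pi_k y_k = \frac{1}{N}\sum_{k \in U} y_k = \bar Y$$

and

$$\mathrm{var}(\bar y) = \frac{1}{n^2} E\Big\{\sum_{k \in U} t_k (y_k - \bar Y)\Big\}^2 = \frac{1}{n^2}\Big\{\sum_{k \in U} \pi_k (y_k - \bar Y)^2 + \sum_{k \in U}\sum_{l \neq k} \pi_{kl} (y_k - \bar Y)(y_l - \bar Y)\Big\} = \frac{N-n}{N}\,\frac{S_y^2}{n},$$

where we have made use of the following identity:

$$\sum_{k \in U}\sum_{l \neq k} (y_k - \bar Y)(y_l - \bar Y) = -\sum_{k \in U} (y_k - \bar Y)^2 .$$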

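The factor (N-n)/N is called the finite population correction factor. For simple random sampling, the general regression estimator (with use of auxiliary information) is defined as

$$\bar y_R = \bar y + b^t(\bar X - \bar x), \quad \text{where} \quad b = \Big(\sum_{k \in S} \lambda_k x_k x_k^t\Big)^{-1} \sum_{k \in S} \lambda_k x_k y_k, \qquad (3.1)$$

with $\lambda_k > 0$ a scalar, and with $\bar X$ and $\bar x$ the population mean and sample mean of the auxiliary vector. The second term of the general regression estimator can be considered as a correction for the sample mean. An important issue is the choice of $\lambda_k$. It is required that all $\lambda_k$ are known. Särndal et al. (1992) suggested taking $\lambda_k = 1/\sigma_k^2$, where the $\sigma_k^2$ can be interpreted as the variances of independent random variables $Y_k$ defined in a superpopulation model $\xi$, of which the $y_k$ are supposed to be the outcomes. More precisely, the model $\xi$ has the following features:

$y_1,\dots,y_N$ are assumed to be realized values of independent random variables $Y_1,\dots,Y_N$,

$E_\xi(Y_k) = \beta^t x_k$, and

$\mathrm{var}_\xi(Y_k) = \sigma_k^2$.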

This model is only used to determine a specific choice of $x_k$ and $\lambda_k$. In other words, the model serves as a vehicle for finding an appropriate general regression estimator. Once the estimator is found, the model is no longer of use. The properties of the general regression estimator (expectation and variance) are still derived from a design based point of view. Finding a suitable general regression estimator by means of a superpopulation model within the framework of the design based approach is called the model assisted approach.

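Note that $\bar y_R = f(\bar y, \bar x, b)$. By means of a first order Taylor series expansion the general regression estimator can be linearized. The partial derivatives are

$$\partial f / \partial \bar y = 1,$$

$$\partial f / \partial \bar x = -b, \quad \text{and}$$

$$\partial f / \partial b = \bar X - \bar x .$$

The first order Taylor series expansion of f at $(\bar Y, \bar X, B)$, where B is the population regression coefficient, i.e.

$$B = \Big(\sum_{k \in U} \lambda_k x_k x_k^t\Big)^{-1} \sum_{k \in U} \lambda_k x_k y_k ,$$

equals

$$\bar y_R \approx \bar Y + (\bar y - \bar Y) - B^t(\bar x - \bar X) = \bar y + B^t(\bar X - \bar x) .$$

The general regression estimator is Approximately Design Unbiased (ADU):

$$E(\bar y_R) \approx \bar Y + B^t\{\bar X - E(\bar x)\} = \bar Y .$$

Furthermore, the design variance of the general regression estimator can be approximated by

$$\mathrm{var}(\bar y_R) \approx \frac{N-n}{N}\,\frac{S_e^2}{n},$$

where

$$S_e^2 = \frac{1}{N-1}\sum_{k \in U} (e_k - \bar E)^2$$

and

$$e_k = y_k - B^t x_k .$$

Note that $\bar e = \frac{1}{n}\sum_{k \in S} e_k$ and $\bar E = \frac{1}{N}\sum_{k \in U} e_k$ are respectively the sample mean and population mean of the residuals $e_k$. An important simplified form of the general regression estimator can be derived from the first of the following results. If there is a constant p-vector c such that $\lambda_k c^t x_k = 1$ for all $k \in U$, then

$$\bar y = b^t \bar x \quad \text{and hence} \quad \bar y_R = b^t \bar X ,$$

and

$$\bar Y = B^t \bar X \quad \text{and hence} \quad \bar E = 0 .$$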

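We only prove the first result; the second result can be proved analogously. Under the stated assumption we have

$$\sum_{k \in S} (y_k - b^t x_k) = \sum_{k \in S} \lambda_k c^t x_k (y_k - x_k^t b) = c^t\Big(\sum_{k \in S} \lambda_k x_k y_k - \sum_{k \in S} \lambda_k x_k x_k^t\, b\Big) = 0,$$

which gives the first result. We may distinguish several special cases of the general regression estimator:

the ratio estimator: p = 1, $x_k$ a positive scalar, and $\lambda_k = 1/x_k$,

the regression estimator: p = 2, $x_k = (1, z_k)^t$ with $z_k$ a scalar auxiliary variable, and $\lambda_k = 1$,

post-stratification: p = A (number of post-strata), $x_k = \delta_k$, and $\lambda_k = 1$. Note that $\delta_k$ represents A dummy variables. Each dummy variable corresponds to a post-stratum. It equals 1 if the k-th element belongs to that post-stratum, otherwise it equals 0.

These estimators can all be written in the simplified form. Take $c = 1$ in case of the ratio estimator, $c = (1, 0)^t$ in case of the regression estimator, and $c = (1,\dots,1)^t$ in case of post-stratification.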
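The simplified form can be illustrated for the ratio estimator. The sketch below assumes that the coefficient in (3.1) is the weighted least squares coefficient with weights $\lambda_k$ and that the ratio estimator corresponds to $\lambda_k = 1/x_k$; the sample values, the population mean of x and the function name are made up for illustration.

```python
def greg_mean_scalar(y_s, x_s, lam_s, X_bar):
    """General regression estimator of the mean for one auxiliary variable (p = 1).

    Returns the estimate and the weighted least squares coefficient b.
    """
    n = len(y_s)
    num = sum(l * x * y for l, x, y in zip(lam_s, x_s, y_s))
    den = sum(l * x * x for l, x in zip(lam_s, x_s))
    b = num / den
    y_bar = sum(y_s) / n
    x_bar = sum(x_s) / n
    return y_bar + b * (X_bar - x_bar), b


# Hypothetical simple random sample with a positive auxiliary variable.
y_s = [12.0, 15.0, 30.0, 41.0]
x_s = [5.0, 6.0, 11.0, 16.0]
X_bar = 10.0                        # assumed known population mean of x
lam_s = [1.0 / x for x in x_s]      # choice that yields the ratio estimator

greg, b = greg_mean_scalar(y_s, x_s, lam_s, X_bar)
classical_ratio = X_bar * sum(y_s) / sum(x_s)   # textbook ratio estimator
simplified = b * X_bar                          # simplified form b * X_bar
```

With this choice of $\lambda_k$ the coefficient b reduces to the ratio of the sample totals, so the three quantities agree up to rounding.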

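4. Stratified simple random sampling

For stratified sampling designs the population U is divided into H mutually exclusive strata, denoted by $U_1,\dots,U_H$, of sizes $N_1,\dots,N_H$. In each stratum a simple random sample $S_h$ of size $n_h$ is drawn without replacement. The population mean and sample mean of Y in stratum h are denoted by, respectively,

$$\bar Y_h = \frac{1}{N_h}\sum_{k \in U_h} y_k$$

and

$$\bar y_h = \frac{1}{n_h}\sum_{k \in S_h} y_k .$$

Furthermore, the population and sample variance with respect to Y in stratum h are

$$S_h^2 = \frac{1}{N_h - 1}\sum_{k \in U_h} (y_k - \bar Y_h)^2$$

and

$$s_h^2 = \frac{1}{n_h - 1}\sum_{k \in S_h} (y_k - \bar y_h)^2 .$$

The set $\Omega_h$ consists of $\binom{N_h}{n_h}$ samples $S_h$, and so $\Omega$ consists of $\prod_{h=1}^{H}\binom{N_h}{n_h}$ samples S. A direct estimator for the population mean of Y is obtained by

$$\bar y_{st} = \sum_{h=1}^{H} \frac{N_h}{N}\,\bar y_h .$$

Note that the known stratum sizes $N_h$ are used in this estimator. The design expectation and design variance are easily derived:

$$E(\bar y_{st}) = \sum_{h=1}^{H} \frac{N_h}{N}\,E(\bar y_h) = \bar Y$$

and

$$\mathrm{var}(\bar y_{st}) = \sum_{h=1}^{H} \Big(\frac{N_h}{N}\Big)^2 \frac{N_h - n_h}{N_h}\,\frac{S_h^2}{n_h} .$$

An important issue in stratified designs is the allocation scheme, i.e. the allocation of the sample over the strata. Two well-known allocation schemes are proportional allocation and the so-called Neyman allocation:

$$n_h = n\,\frac{N_h}{N}$$

and

$$n_h = n\,\frac{N_h S_h}{\sum_{g=1}^{H} N_g S_g} .$$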

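The Neyman allocation minimizes the design variance of the direct estimator for fixed sample size n. For stratified designs we distinguish two important estimators, namely

the separated general regression estimator:

$$\bar y_{R,sep} = \sum_{h=1}^{H} \frac{N_h}{N}\,\{\bar y_h + b_h^t(\bar X_h - \bar x_h)\},$$

with $b_h$ the regression coefficient (3.1) computed within stratum h,

the combined general regression estimator:

$$\bar y_{R,com} = \bar y_{st} + b^t(\bar X - \bar x_{st}),$$

where

$$\bar x_{st} = \sum_{h=1}^{H} \frac{N_h}{N}\,\bar x_h$$

and

$$b = \Big(\sum_{h=1}^{H} \frac{N_h}{n_h}\sum_{k \in S_h} \lambda_k x_k x_k^t\Big)^{-1} \sum_{h=1}^{H} \frac{N_h}{n_h}\sum_{k \in S_h} \lambda_k x_k y_k .$$

Special cases of the separated general regression estimator are the separated ratio estimator and the separated simple regression estimator, and special cases of the combined general regression estimator are the combined ratio estimator and the combined regression estimator. Note that the separated general regression estimator can also be considered as a special case of the combined general regression estimator. Namely, if $\delta_k \otimes x_k$, with $\delta_k$ the H-vector of stratum dummies, is taken as an auxiliary variable for the combined general regression estimator, then we obtain the separated general regression estimator with $x_k$ as auxiliary variable. Here, the operator $\otimes$ denotes the Kronecker product, see e.g. Zeelenberg (1993).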
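The two allocation schemes and their effect on the design variance of the direct estimator can be compared numerically. The sketch below uses the standard textbook formulas for proportional allocation, Neyman allocation and the variance of the stratified sample mean; the stratum sizes and standard deviations are made up, and the allocations are left as real numbers (in practice they would be rounded).

```python
def proportional_allocation(n, N_h):
    """n_h proportional to the stratum size N_h."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]


def neyman_allocation(n, N_h, S_h):
    """n_h proportional to N_h * S_h (stratum size times standard deviation)."""
    total = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
    return [n * Nh * Sh / total for Nh, Sh in zip(N_h, S_h)]


def stratified_variance(N_h, S_h, n_h):
    """Design variance of the stratified sample mean under SRS within strata."""
    N = sum(N_h)
    return sum(
        (Nh / N) ** 2 * (1.0 - nh / Nh) * Sh ** 2 / nh
        for Nh, Sh, nh in zip(N_h, S_h, n_h)
    )


# Hypothetical strata: sizes and population standard deviations.
N_h = [500, 300, 200]
S_h = [2.0, 5.0, 10.0]
n = 100

n_prop = proportional_allocation(n, N_h)
n_neyman = neyman_allocation(n, N_h, S_h)
var_prop = stratified_variance(N_h, S_h, n_prop)
var_neyman = stratified_variance(N_h, S_h, n_neyman)
```

In this example the Neyman allocation shifts sample size towards the stratum with the largest spread and yields the smaller variance of the two.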

5. Cluster sampling

For cluster sampling, the population U is divided into M mutually exclusive clusters (also called primary units): $c_1,\dots,c_M$. Cluster i contains $M_i$ units/elements (called secondary units). We will discuss two sampling designs:

design 1: A simple random sample (or a stratified sample) of m clusters is drawn without replacement; the complete cluster is observed,

design 2: A simple random sample (or a stratified sample) of m secondary units is drawn without replacement; the complete corresponding cluster is observed.

The choice of one of these designs may depend on, e.g., the information available in the sampling frames. We note that the second design has a practical advantage for panel surveys with households as observational units. Namely, once the panel has been drawn it is easier to follow persons (according to the second design) than households (according to the first design), since the composition of a household may change after the first wave. If a household composition has changed, then its first order inclusion expectation should be adjusted accordingly. In case of the second design only the current household composition is needed to do so, while in case of the first design the complete history of the household composition, starting from the first wave, has to be taken into account.

Since in both designs a cluster is observed completely, the Y-variables and the X-variables can be calculated at the cluster level as well as at the level of the secondary units. To be specific, we consider household sampling, where each household member is observed. We may distinguish between household characteristics (composition of household, size of households, region), and person characteristics (sex, age, marital status, region). Study variables (target or auxiliary) may concern both types of characteristics. This implies that some study variables are defined at the level of persons (indicated by two indices) while others are defined at the level of households (indicated by one index).

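Design 1. The direct estimator for the population total is defined as (Y concerns person characteristics):

$$\hat Y = \frac{M}{m}\sum_{i \in S} y_i^* \quad \text{with} \quad y_i^* = \sum_{j=1}^{M_i} y_{ij} . \qquad (5.1)$$

If Y concerns household characteristics then the direct estimator is

$$\hat Y = \frac{M}{m}\sum_{i \in S}\sum_{j=1}^{M_i} y_{ij}^* \quad \text{with} \quad y_{ij}^* = y_i / M_i . \qquad (5.2)$$

The star-notation indicates that a characteristic is derived (inherited) from the value of the other observation unit. Obviously, both estimators can be formulated at the level of persons as well as at the level of households. Note that the design expectations and the design variances of these estimators can be derived easily at the level of households. The general regression estimator is

$$\hat Y_R = \hat Y + b^t(X - \hat X), \qquad (5.3)$$

where X is the known population total of the auxiliary variables, $\hat X$ its direct estimator, and b can be defined at the cluster level or at the level of secondary units. At the cluster level we have

$$b = \Big(\sum_{i \in S} \lambda_i x_i x_i^t\Big)^{-1} \sum_{i \in S} \lambda_i x_i y_i ,$$

e.g. post-stratification with respect to households. Naturally, $x_i$ should concern household characteristics. For example, according to the model assisted approach, $1/\lambda_i$ can be viewed as a model variance defined for clusters. At the person level we have

$$b = \Big(\sum_{i \in S}\sum_{j=1}^{M_i} \lambda_{ij} x_{ij} x_{ij}^t\Big)^{-1} \sum_{i \in S}\sum_{j=1}^{M_i} \lambda_{ij} x_{ij} y_{ij} ,$$

e.g. post-stratification with respect to persons. Here, $1/\lambda_{ij}$ can be interpreted as a model variance for persons. We will discuss the second design in the next section.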

6. The Horvitz-Thompson estimator

The Horvitz-Thompson estimator can be used as a general tool to construct unbiased estimators for many sampling designs. Let $\pi_k$ denote the first order inclusion expectation of the k-th sampled unit and let $y_k$ denote the observation which corresponds to this unit, then the Horvitz-Thompson estimator for the population mean of Y is defined as

$$\bar y_{HT} = \frac{1}{N}\sum_{k \in S} \frac{y_k}{\pi_k} .$$

It follows that

$$E(\bar y_{HT}) = \frac{1}{N}\sum_{k \in U} E(t_k)\,\frac{y_k}{\pi_k} = \bar Y .$$

If the units are sampled without replacement then the following expression can be derived for the design variance:

$$\mathrm{var}(\bar y_{HT}) = \frac{1}{N^2}\sum_{k \in U}\sum_{l \in U} (\pi_{kl} - \pi_k \pi_l)\,\frac{y_k}{\pi_k}\,\frac{y_l}{\pi_l} .$$

It is important to note that all direct estimators discussed so far are in fact Horvitz-Thompson estimators. It remains to discuss a direct estimator for the second design in case of cluster sampling: m secondary units are drawn by simple random sampling without replacement and the complete cluster (primary unit) is observed.

Design 2. In order to calculate the first order inclusion expectation of a secondary unit $u_{ij}$ we divide the population U into two parts, namely the secondary units belonging to the i-th primary unit and the remaining secondary units. If m secondary units are drawn by simple random sampling without replacement and for each drawn secondary unit the complete cluster is observed, i.e. is in the sample, then according to the hypergeometric distribution we have

$$\pi_{ij} = \frac{m M_i}{N}, \quad \text{with } N = \sum_{i=1}^{M} M_i . \qquad (6.1)$$

Now, we construct a sample of m clusters $S_c$, such that each drawn secondary unit corresponds to precisely one cluster in $S_c$, namely, the cluster it belongs to. If two distinct secondary units belonging to the same cluster are drawn, then this cluster is duplicated in $S_c$. By construction (6.1) is also the first order inclusion expectation of cluster $c_i$ with respect to $S_c$. Based on $S_c$ and the Horvitz-Thompson formalism, we may construct the following unbiased estimators for the population total of Y (if Y concerns person characteristics)

$$\hat Y = \frac{N}{m}\sum_{i \in S_c} \frac{y_i^*}{M_i}$$

or (if Y concerns household characteristics)

$$\hat Y = \frac{N}{m}\sum_{i \in S_c} \frac{y_i}{M_i} .$$

The design variance of both estimators can be derived easily, since the elements (clusters) in $S_c$ can be considered as a simple random sample without replacement. If Y concerns person characteristics, then $y_i^*/M_i$ is observed at the i-th element; otherwise, if Y concerns household characteristics, $y_i/M_i$ is observed.
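The unbiasedness of the design 2 estimator can be verified by enumerating all samples of secondary units for a very small population. The sketch below is an illustration under stated assumptions: the household compositions and person values are made up, the inclusion expectation of a cluster is taken to be $m M_i / N$ (the mean of the hypergeometric count), and the estimator is taken to be the Horvitz-Thompson form $(N/m)\sum_{i \in S_c} y_i^*/M_i$.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population: three households with person-level values.
clusters = [[4], [1, 5], [2, 2, 8]]
units = [(i, j) for i, c in enumerate(clusters) for j in range(len(c))]
N = len(units)          # number of secondary units (persons)
m = 2                   # number of secondary units drawn
M = [len(c) for c in clusters]
y_star = [sum(c) for c in clusters]   # household totals of the person values


def estimate_total(sample):
    """Horvitz-Thompson type estimate based on the (possibly duplicated) clusters."""
    S_c = [i for i, _ in sample]      # one cluster per drawn person, duplicates kept
    return Fraction(N, m) * sum(Fraction(y_star[i], M[i]) for i in S_c)


def cluster_count(sample, c):
    """Number of times cluster c occurs in S_c for this sample."""
    return sum(1 for i, _ in sample if i == c)


samples = list(combinations(units, m))          # all equally likely samples
expectation = sum(estimate_total(S) for S in samples) / len(samples)
pi = [
    Fraction(sum(cluster_count(S, c) for S in samples), len(samples))
    for c in range(len(clusters))
]
```

The enumerated inclusion expectations can exceed the share of samples that contain a cluster, because a cluster drawn through two of its members is counted twice.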

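7. The general regression estimator

Based on Horvitz-Thompson estimators the general regression estimator for the population total is defined as:

$$\hat Y_R = \hat Y_{HT} + b^t(X - \hat X_{HT})$$

with

$$b = \Big(\sum_{k \in S} \frac{\lambda_k x_k x_k^t}{\pi_k}\Big)^{-1} \sum_{k \in S} \frac{\lambda_k x_k y_k}{\pi_k} , \qquad (7.1)$$

where $\hat Y_{HT} = \sum_{k \in S} y_k/\pi_k$ and $\hat X_{HT} = \sum_{k \in S} x_k/\pi_k$ are the Horvitz-Thompson estimators of the population totals of Y and X, and X is the known population total of the auxiliary variables.

For simple random sampling (7.1) corresponds to (3.1) and for stratified simple random sampling (7.1) corresponds to the combined general regression estimator. For cluster sampling the auxiliary information in (7.1) may be used at the cluster level or at the level of secondary units, see the next section.

It is convenient and common practice to present the general regression estimator in terms of weights:

$$\hat Y_R = \sum_{k \in S} w_k y_k , \qquad (7.2)$$

with

$$w_k = \frac{g_k}{\pi_k}, \quad g_k = 1 + (X - \hat X_{HT})^t \Big(\sum_{l \in S} \frac{\lambda_l x_l x_l^t}{\pi_l}\Big)^{-1} \lambda_k x_k . \qquad (7.3)$$

Often, the $w_k$ are called final weights and the $1/\pi_k$ inclusion weights. The correction weights $g_k$ are often called g-weights. Weighting offers a way to estimate population means for study variables without needing to know the study variable in advance. One only has to determine the vector of auxiliary variables (determine the weighting model) to calculate the weights according to (7.3). Afterwards one may calculate (7.2) for any variable that is observed in the sample. For each variable the result is a general regression estimate with a predetermined weighting model.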
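The weighting view can be made concrete for the simple regression model, i.e. an intercept plus one auxiliary variable with all $\lambda_k$ equal to 1. The sketch below assumes the usual form of the regression weights, $w_k = g_k/\pi_k$ with $g_k = 1 + (X - \hat X_{HT})^t T^{-1} x_k$ and $T = \sum_S x_k x_k^t/\pi_k$; the sample, the inclusion probabilities and the known totals are made up, and the function name is hypothetical.

```python
def regression_weights(x_s, pi_s, N_total, X_total):
    """Final weights and g-weights for the auxiliary vector (1, x_k), lambda_k = 1."""
    d = [1.0 / p for p in pi_s]                     # inclusion weights
    # Elements of the symmetric 2x2 matrix T = sum_k d_k (1, x_k)(1, x_k)^t.
    t11 = sum(d)
    t12 = sum(dk * x for dk, x in zip(d, x_s))
    t22 = sum(dk * x * x for dk, x in zip(d, x_s))
    det = t11 * t22 - t12 * t12
    # Differences between known totals and their Horvitz-Thompson estimates.
    r1 = N_total - t11
    r2 = X_total - t12
    # a = T^{-1} r, using the closed-form inverse of a 2x2 matrix.
    a1 = (t22 * r1 - t12 * r2) / det
    a2 = (-t12 * r1 + t11 * r2) / det
    g = [1.0 + a1 + a2 * x for x in x_s]
    w = [dk * gk for dk, gk in zip(d, g)]
    return w, g


# Hypothetical simple random sample of n = 4 from N = 40.
x_s = [2.0, 4.0, 5.0, 9.0]
y_s = [7.0, 9.0, 14.0, 20.0]
pi_s = [0.1] * 4
w, g = regression_weights(x_s, pi_s, N_total=40.0, X_total=220.0)

weighted_count = sum(w)
weighted_x = sum(wk * x for wk, x in zip(w, x_s))
y_total_estimate = sum(wk * y for wk, y in zip(w, y_s))
```

The weights are computed once from the auxiliary information and can then be applied to any study variable; applied to the auxiliary variables themselves they reproduce the known totals.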

8. Consistent weighting between persons and households

Cluster designs need special attention because one can weight at two levels: weighting at the level of persons (the index k in (7.3) stands for persons) or at the level of households (the index k in (7.3) stands for households). Again we distinguish between person characteristics and household characteristics. Study variables (target or auxiliary) may concern both types of characteristics. This implies that some study variables are defined at the level of persons while others are defined at the level of households. The main issue of consistent weighting between persons and households is to translate person characteristics into household scores (or vice versa), such that both types of characteristics can be used for either weighting procedure. In addition, all person weights within a household should be the same and equal to the household weight, i.e. $w_{ij} = w_i$ for all $j = 1,\dots,M_i$. We only discuss the method of Lemaître and Dufour (1987).

First note that $\pi_{ij} = \pi_i$ for all $j = 1,\dots,M_i$ (by design). Let $z_i = x_i$ if X concerns a household characteristic and $z_i = \sum_{j=1}^{M_i} x_{ij}$ if X concerns a person characteristic. Define the following weights at the household level:

$$w_i = \frac{1}{\pi_i}\Big\{1 + (X - \hat X_{HT})^t \Big(\sum_{l \in S} \frac{\lambda_l z_l z_l^t}{\pi_l}\Big)^{-1} \lambda_i z_i\Big\}, \qquad (8.1)$$

where $\hat X_{HT} = \sum_{i \in S} z_i/\pi_i$ is the Horvitz-Thompson estimator for the population total of X (defined at the household level). Furthermore, let $z_{ij} = x_i/M_i$ for all $j = 1,\dots,M_i$ if X concerns a household characteristic and $z_{ij} = \frac{1}{M_i}\sum_{j'=1}^{M_i} x_{ij'}$ for all $j = 1,\dots,M_i$ if X concerns a person characteristic, and define the following weights at the person level:

$$w_{ij} = \frac{1}{\pi_{ij}}\Big\{1 + (X - \hat X_{HT})^t \Big(\sum_{l \in S}\sum_{j'=1}^{M_l} \frac{\lambda_{lj'} z_{lj'} z_{lj'}^t}{\pi_{lj'}}\Big)^{-1} \lambda_{ij} z_{ij}\Big\} . \qquad (8.2)$$

By construction we have $z_{ij} = z_i/M_i$ for all $j = 1,\dots,M_i$, so it follows that $\sum_{i \in S}\sum_{j=1}^{M_i} z_{ij}/\pi_{ij} = \hat X_{HT}$ is a Horvitz-Thompson estimator for the population total of X also (defined at the level of persons). Since the first order inclusion expectations and the auxiliary variables are equal within households, the weights $w_{ij}$ given by (8.2) must be equal within households if $\lambda_{ij}$ is taken equal within households. Furthermore, person weights should also represent household weights, i.e. (8.1) should equal (8.2). Now, these demands are fulfilled if $\lambda_i^*$ and $\lambda_{ij}^* = M_i \lambda_i^*$ for all $j = 1,\dots,M_i$ are inserted in (8.1) and (8.2) respectively. Two choices for $\lambda_i^*$ are $\lambda_i^* = 1$ and $\lambda_i^* = 1/M_i$.

If X concerns purely household characteristics, then the first choice is interesting, while the second choice should be considered if X concerns purely person characteristics. In Nieuwenbroek (1993) these choices are motivated from a model assisted point of view. For example, suppose that X concerns person characteristics. If $\lambda_{ij} = 1$ is a suitable choice to weight at the level of persons, then one should take $\lambda_{ij}^* = 1$ and $\lambda_i^* = 1/M_i$ for consistent weighting.
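The consistency between person weights and household weights can be checked numerically for a single auxiliary variable. The sketch below is an illustration under explicit assumptions: person-level auxiliary values are replaced by the household mean, household-level values by the household total, the person-level factor is taken as $M_i$ times the household-level factor, and the regression weights have the usual form $w = d\,\{1 + (X - \hat X_{HT})\,\lambda z / \sum d \lambda z^2\}$. All data and function names are hypothetical.

```python
def scalar_regression_weights(z, lam, pi, X_total):
    """Regression weights for one auxiliary variable z with factors lam."""
    d = [1.0 / p for p in pi]
    X_ht = sum(dk * zk for dk, zk in zip(d, z))
    T = sum(dk * lk * zk * zk for dk, lk, zk in zip(d, lam, z))
    return [dk * (1.0 + (X_total - X_ht) * lk * zk / T)
            for dk, lk, zk in zip(d, lam, z)]


# Hypothetical sample of three households; x is a person characteristic (0/1).
person_x = [[1], [1, 0], [1, 1, 1]]
pi_household = [0.2, 0.25, 0.5]
X_total = 12.0                                   # assumed known population total

M = [len(h) for h in person_x]
z_household = [float(sum(h)) for h in person_x]  # household totals
lam_household = [1.0 / Mi for Mi in M]           # second choice: 1 / M_i

w_household = scalar_regression_weights(z_household, lam_household,
                                         pi_household, X_total)

# Person level: household mean as score, factor M_i times the household factor.
z_person, lam_person, pi_person, owner = [], [], [], []
for i, h in enumerate(person_x):
    for _ in h:
        z_person.append(z_household[i] / M[i])
        lam_person.append(M[i] * lam_household[i])
        pi_person.append(pi_household[i])
        owner.append(i)

w_person = scalar_regression_weights(z_person, lam_person, pi_person, X_total)
max_gap = max(abs(wp - w_household[i]) for wp, i in zip(w_person, owner))
calibrated_total = sum(wp * zp for wp, zp in zip(w_person, z_person))
```

Every person receives the weight of the household he or she belongs to, and the weighted person-level scores reproduce the known total.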

9. Bounding g-weights; the Huang and Fuller algorithm

There is no guarantee that the general regression weights given by (7.3) are strictly positive. Apart from the fact that negative weights may induce negative estimates of population totals, many users of statistics are reluctant to work with negative weights. The problem of unacceptable weights tends to increase when the weighting model is very extensive in comparison with the sample size. Several techniques have been developed to force the weights within a certain interval. The use of calibration estimators to prevent extreme weights will be discussed in the next section. In this section an algorithm largely based on Huang and Fuller (1978) is given. According to this algorithm the correction weights $g_k$ are forced within a certain interval [L,U], with 0 < L < 1 < U, by an iterative process (see Nieuwenbroek (1997)):

Step 1. Choose the lower and upper bounds L and U for the g-weights, and the maximum number of iterations $\nu_{max}$.

Step 2. Set $\nu = 0$ and initialize $q_k^{(0)} = 1$ for all $k \in S$.

Step 3. Calculate the g-weights:

$$g_k^{(\nu)} = 1 + (X - \hat X_{HT})^t \Big(\sum_{l \in S} \frac{q_l^{(\nu)} \lambda_l x_l x_l^t}{\pi_l}\Big)^{-1} q_k^{(\nu)} \lambda_k x_k .$$

Step 4. If all $g^{(\nu)}$-weights are within the interval or if $\nu = \nu_{max}$, the process stops; otherwise continue with step 5.

Step 5. Set $\nu = \nu + 1$ and calculate the distance

if

and

if

Note that

.

Step 6. Set

if

if

if

Note that if

falls outside the interval, i.e.

or

, then

Step 7. Repeat from step 3.

Clearly, with the help of the q-factors the g-weights are adjusted such that, hopefully, they fall inside the interval after the last iteration. Nieuwenbroek (1997) strongly advises care in specifying the acceptance interval. In particular, tight bounds may cause problems. It is illustrative to show the g-weights for some particular weighting models in case of simple random sampling:

post-stratification: $g_k = \dfrac{N_h/N}{n_h/n}$ if k belongs to the h-th post-stratum,

ratio estimator: $g_k = \bar X / \bar x$,

simple regression: $g_k = 1 + \dfrac{(\bar X - \bar x)(x_k - \bar x)}{\hat s_x^2}$ with $\hat s_x^2 = \frac{1}{n}\sum_{l \in S} (x_l - \bar x)^2$.

For the weighting models that correspond to post-stratification and the ratio estimator, a restriction on the g-weights is not sensible. For example, in case of post-stratification the starting g-weights are constant within post-strata. So, within a post-stratum all starting g-weights fall either inside or outside the interval. If they fall outside the interval, then according to steps 5 and 6 the corresponding q-factors will be constant within post-strata. The elaboration of step 3 in case of post-stratification shows that the g-weights are not affected by such q-factors; they remain the same after each iteration.

The Huang-Fuller algorithm can be motivated from the model assisted point of view. After convergence the resulting estimates can be considered as generalized regression estimates with modified $\lambda$-factors, namely $q_k \lambda_k$ (in the strict sense the resulting estimates are not generalized regression estimates, because the modified $\lambda$-factors are sample dependent). From the model assisted point of view the original $\lambda$-factors are interpreted as inverse values of the model variances, see section 3. The fact that some regression weights are negative suggests that the model is misspecified. The Huang-Fuller algorithm tries to fit the model (after data collection) via a modification of the model variances.

10. Calibration estimation

Use of auxiliary information by means of the general regression estimator can be justified by a regression relationship between the target variable on the one hand and the auxiliary variables on the other hand. It has been shown that the general regression estimator implicitly defines weights by means of which population totals of study variables can be estimated. In this section we show that the general regression estimator can also be obtained by a different route, namely by focusing on the weights instead of the linear regression relationship. This route offers us 1) a way to generalize the general regression estimator and 2) an alternative tool to restrict the g-weights.

Denote $d_k = 1/\pi_k$. The general regression weights given by (7.3) can also be obtained by minimizing

$$\sum_{k \in S} \frac{(w_k - d_k)^2}{2 d_k \lambda_k}$$

subject to

$$\sum_{k \in S} w_k x_k = X$$

with respect to $w_1,\dots,w_n$, or equivalently, by minimizing

$$\sum_{k \in S} \frac{(w_k - d_k)^2}{2 d_k \lambda_k} - \mu^t\Big(\sum_{k \in S} w_k x_k - X\Big), \qquad (10.1)$$

with respect to $w_1,\dots,w_n$ and $\mu_1,\dots,\mu_p$. Here, $\mu = (\mu_1,\dots,\mu_p)^t$ is a p-vector of Lagrange multipliers. By differentiating (10.1) with respect to $w_1,\dots,w_n$ and setting the derivative at 0, we obtain $(w_k - d_k)/(d_k \lambda_k) - x_k^t \mu = 0$, from which it follows that

$$w_k = d_k (1 + \lambda_k x_k^t \mu) . \qquad (10.2)$$

Differentiating (10.1) with respect to $\mu$, setting the derivative at 0, and inserting (10.2) we obtain

$$\sum_{k \in S} d_k (1 + \lambda_k x_k^t \mu)\, x_k = X ,$$

which gives

$$\mu = \Big(\sum_{k \in S} d_k \lambda_k x_k x_k^t\Big)^{-1} (X - \hat X_{HT}), \qquad (10.3)$$

provided the inverse exists. The resulting weights can be obtained by inserting (10.3) into (10.2). Indeed, the resulting weights coincide with (7.3), so the resulting estimator is just the general regression estimator. Now, the minimization problem (10.1) and hence the general regression estimator can be generalized as follows.

Let G be a real valued function with the properties: G is positive, strictly convex, $G(1) = G'(1) = 0$ and $G''(1) = 1$. Extending (7.3), a calibration estimator for the population total of Y is defined as

$$\hat Y_{cal} = \sum_{k \in S} w_k y_k ,$$

where the calibration weights are obtained by minimizing

$$\sum_{k \in S} \frac{d_k\, G(w_k/d_k)}{\lambda_k} - \mu^t\Big(\sum_{k \in S} w_k x_k - X\Big) \qquad (10.4)$$

with respect to $w_1,\dots,w_n$, $\mu_1,\dots,\mu_p$. Roughly, a calibration estimator uses calibration weights which are as close as possible, according to a certain distance measure, to the original sampling weights $d_k$. For the specific distance function $G(u) = \frac{1}{2}(u-1)^2$ the calibration estimator reduces to the general regression estimator.

Differentiating (10.4) with respect to $w_k$ we obtain $G'(w_k/d_k)/\lambda_k - x_k^t \mu = 0$, and solving for $w_k$ we obtain

$$w_k = d_k\, F(\lambda_k x_k^t \mu), \qquad (10.5)$$

where F is the inverse function of $G'$. Note that the existence of F is guaranteed since G is strictly convex, and hence $G'$ is strictly increasing. Differentiating (10.4) with respect to $\mu$ and inserting (10.5) we obtain

$$\sum_{k \in S} d_k\, F(\lambda_k x_k^t \mu)\, x_k = X . \qquad (10.6)$$

This is a system of p equations in p unknowns, which should be solved for $\mu$. Let

$$\phi(\mu) = \sum_{k \in S} d_k\, F(\lambda_k x_k^t \mu)\, x_k - X$$

and

$$\phi'(\mu) = \sum_{k \in S} d_k \lambda_k\, F'(\lambda_k x_k^t \mu)\, x_k x_k^t .$$

Then, according to the Newton-Raphson algorithm, a solution may be found by iterating

$$\mu^{(\nu+1)} = \mu^{(\nu)} - \{\phi'(\mu^{(\nu)})\}^{-1} \phi(\mu^{(\nu)}) .$$

Often, (10.3) is taken as a starting value. According to (10.5) the g-weights (obtained by calibration) are proportional to F. Therefore, the range of the g-weights is restricted by the range of F. Now, instead of (10.4) one could define calibration weights by means of (10.5) and (10.6) with an appropriate F-function. (F should be monotone increasing, $F(0) = 1$ and $F'(0) = 1$, and (10.6) should have a solution, i.e. the range of F should not be too tight.)

Besides $F(u) = 1 + u$, which corresponds to the general regression estimator, we will give two more F-functions, namely

$$F(u) = \begin{cases} L & \text{if } u < L - 1, \\ 1 + u & \text{if } L - 1 \le u \le U - 1, \\ U & \text{if } u > U - 1, \end{cases}$$

and

$$F(u) = \exp(u),$$

where L and U are defined as in section 9. The first F-function is motivated by the desire to restrict the regression weights. It is called the truncated linear method. The second F-function corresponds to the multiplicative method (or raking method) as will be shown in section 12 for two-way tables. Note that the second F-function is bounded by zero from below.

We close this section with an important property of calibration estimators: under certain regularity conditions the calibration estimator is asymptotically equivalent to the general regression estimator. In particular, they have the same asymptotic design expectations and design variances. A heuristic argument is the following. For large sample sizes $\hat X_{HT}$ is close to $X$ (since $\hat X_{HT}$ is a consistent estimator for $X$). Then, by (10.6) the F-value should be close to 1, and $\mu$ should be close to 0. But, since $F(0) = F'(0) = 1$ for all F-functions, they have the same behavior in the neighborhood of 0. It follows that all F-functions can be approximated by $F(u) = 1 + u$, i.e. the F-function which corresponds to the general regression estimator.
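The Newton-Raphson iteration for the calibration equations can be sketched for the simplest case: one auxiliary variable, all $\lambda_k$ equal to 1, and the multiplicative method $F(u) = \exp(u)$. This is a minimal illustration, assuming the calibration equation has the form $\sum_S d_k F(x_k \mu) x_k = X$; the data are made up, the function name is hypothetical, and the iteration is started at $\mu = 0$ rather than at the general regression solution.

```python
import math


def calibrate_multiplicative(x_s, d_s, X_total, tol=1e-9, max_iter=50):
    """Solve sum_k d_k exp(x_k mu) x_k = X_total for mu by Newton-Raphson.

    Returns the calibration weights d_k exp(x_k mu) and the multiplier mu.
    """
    mu = 0.0
    for _ in range(max_iter):
        phi = sum(d * math.exp(x * mu) * x for d, x in zip(d_s, x_s)) - X_total
        if abs(phi) < tol:
            break
        # Derivative of phi with respect to mu; positive unless all x_k are zero.
        dphi = sum(d * math.exp(x * mu) * x * x for d, x in zip(d_s, x_s))
        mu -= phi / dphi
    weights = [d * math.exp(x * mu) for d, x in zip(d_s, x_s)]
    return weights, mu


# Hypothetical sample with sampling weights d_k = 10 and known total X = 220.
x_s = [2.0, 4.0, 5.0, 9.0]
d_s = [10.0] * 4
w_cal, mu_hat = calibrate_multiplicative(x_s, d_s, X_total=220.0)
calibrated_total = sum(w * x for w, x in zip(w_cal, x_s))
```

Because the exponential function is positive, the resulting weights cannot become negative, in contrast to the linear case.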

11. Calibration estimators for post-stratification

An important special case to consider is the calibration estimator which corresponds to (complete) post-stratification: p = A, $x_k = \delta_k$, and $\lambda_k = 1$, see section 3. Then $x_k^t \mu = \mu_h$ if element k belongs to the h-th post-stratum, and (10.6) can be elaborated as

$$F(\mu_h) \sum_{k \in S_h} d_k = N_h , \quad h = 1,\dots,A,$$

where $S_h$ denotes the part of the sample in post-stratum h and $N_h$ the known population count of post-stratum h. It follows that $F(\mu_h) = N_h/\hat N_h$, where $\hat N_h = \sum_{k \in S_h} d_k$, $h = 1,\dots,A$. So, in case of post-stratification the calibration weights are

$$w_k = d_k\, \frac{N_h}{\hat N_h} \quad \text{if } k \in S_h ,$$

regardless of the function F. The resulting calibration estimator corresponds to the well-known post-stratification estimator. Note that, strictly speaking, the calibration estimator is not defined for the post-stratification model if the upper bound of F is smaller than $N_h/\hat N_h$ for some h. However, as already said, for post-stratification a restriction on the g-weights and hence on the F-function is not sensible.

12. Iterative proportional fitting for two-way tables

In this section we consider estimating a two-way table with calibration on the marginal counts: $x_k = (\delta_{1k}^t, \delta_{2k}^t)^t$ and $\lambda_k = 1$, where $\delta_{1k}$ is an r-vector with dummies denoting to which row element k belongs and $\delta_{2k}$ is a c-vector with dummies denoting to which column element k belongs. Let $u = (u_1,\dots,u_r)^t$ denote a vector of order r and $v = (v_1,\dots,v_c)^t$ a vector of order c. By letting $\mu = (u^t, v^t)^t$ we have $x_k^t \mu = u_i + v_j$ whenever k belongs to the (i,j)-th cell. Let $N_{i+}$ denote the marginal row counts and $N_{+j}$ the marginal column counts. Denote further

$$\hat N_{ij} = \sum_{k \in S_{ij}} d_k ,$$

i.e. the Horvitz-Thompson estimator for the population total of the (i,j)-th cell, where $S_{ij}$ is the part of the sample in that cell. Then, the calibration equations given by (10.6) are

$$\sum_{j=1}^{c} F(u_i + v_j)\, \hat N_{ij} = N_{i+} , \quad i = 1,\dots,r \qquad (12.1)$$

and

$$\sum_{i=1}^{r} F(u_i + v_j)\, \hat N_{ij} = N_{+j} , \quad j = 1,\dots,c. \qquad (12.2)$$

For the multiplicative method we have $F(u_i + v_j) = \exp(u_i)\exp(v_j)$. In this case (12.1) and (12.2) can be written as

$$\exp(u_i) = \frac{N_{i+}}{\sum_{j=1}^{c} \exp(v_j)\, \hat N_{ij}} , \quad i = 1,\dots,r \qquad (12.3)$$

and

$$\exp(v_j) = \frac{N_{+j}}{\sum_{i=1}^{r} \exp(u_i)\, \hat N_{ij}} , \quad j = 1,\dots,c \qquad (12.4)$$

respectively. A solution of (12.3) and (12.4) is obtained by carrying out until convergence the classical raking algorithm, often called iterative proportional fitting. First set $\exp(v_j) = 1$ and calculate $\exp(u_i)$ according to (12.3). Then, inserting this value in (12.4), we calculate a new value for $\exp(v_j)$, which in turn can be used to calculate a new value for $\exp(u_i)$, etc. After convergence, the population cell counts are estimated by

$$\exp(u_i)\exp(v_j)\, \hat N_{ij} .$$

According to (10.5) the corresponding calibration weights are

$$w_k = d_k \exp(u_i)\exp(v_j) \quad \text{if k belongs to the (i,j)-th cell.}$$
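The raking steps described above can be sketched as follows. The sketch alternates between the row update and the column update until the fitted table reproduces both sets of margins; the estimated cell counts, the known margins, the tolerance and the function name are all made up for illustration.

```python
def raking(N_hat, row_totals, col_totals, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting of a two-way table to known margins.

    N_hat is the table of estimated cell counts; returns the fitted table and
    the row and column factors (playing the role of exp(u_i) and exp(v_j)).
    """
    r, c = len(row_totals), len(col_totals)
    eu = [1.0] * r
    ev = [1.0] * c
    fitted = [row[:] for row in N_hat]
    for _ in range(max_iter):
        # Row step: match the row margins given the current column factors.
        eu = [row_totals[i] / sum(ev[j] * N_hat[i][j] for j in range(c))
              for i in range(r)]
        # Column step: match the column margins given the new row factors.
        ev = [col_totals[j] / sum(eu[i] * N_hat[i][j] for i in range(r))
              for j in range(c)]
        fitted = [[eu[i] * ev[j] * N_hat[i][j] for j in range(c)]
                  for i in range(r)]
        # Columns fit exactly after the column step; check the rows.
        row_error = max(abs(sum(fitted[i]) - row_totals[i]) for i in range(r))
        if row_error < tol:
            break
    return fitted, eu, ev


# Hypothetical estimated cell counts and known population margins.
N_hat = [[30.0, 20.0], [10.0, 40.0]]
row_totals = [60.0, 40.0]
col_totals = [45.0, 55.0]
fitted, eu, ev = raking(N_hat, row_totals, col_totals)

fitted_rows = [sum(row) for row in fitted]
fitted_cols = [sum(fitted[i][j] for i in range(2)) for j in range(2)]
```

The calibration weight of a sampled element in cell (i,j) is then its sampling weight multiplied by `eu[i] * ev[j]`.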

13. Consistent calibration weights in cluster sampling

Again cluster sampling needs some special attention, because then (10.5) and (10.6) can be defined at two levels. We extend the method of Lemaître and Dufour to obtain consistent calibration weights between persons and households. Define $z_i$, $z_{ij}$, $\lambda_i^*$, and $\lambda_{ij}^*$ similarly as in section 8, and note that

$$\lambda_{ij}^* z_{ij} = \lambda_i^* z_i \quad \text{for all } j = 1,\dots,M_i . \qquad (13.1)$$

For calibration weights defined at the level of persons we insert $z_{ij}$ and $\lambda_{ij}^*$ in (10.5) instead of $x_{ij}$ and $\lambda_{ij}$, and for calibration weights defined at the household level we insert $z_i$ and $\lambda_i^*$ instead of $x_i$ and $\lambda_i$. It follows from (10.5) and (13.1) that all calibration weights defined at the person level are the same within a cluster, and equal to the corresponding calibration weight defined at the cluster level. It follows from (10.6) that both the person weights and the household weights reproduce the known population totals if these weights are applied to the X-variables.

14. Double sampling/two-phase sampling

The sampling strategies discussed so far depend heavily on the use of auxiliary information. When such information is not available, one could consider taking a large preliminary sample in which only auxiliary variables are observed. We distinguish between double sampling for stratification and double sampling for the (general) regression estimator.

Double sampling for stratification.

The population is to be stratified into L strata. The first (preliminary) sample is a simple random sample (without replacement) of size $n_1$. Let $W_h = N_h/N$ denote the proportion of the population falling in stratum h, and $w_h = n_{1h}/n_1$ the proportion of the first sample falling in stratum h. Then $w_h$ is a design unbiased estimator of $W_h$. This estimator is used as auxiliary information for the second sample, which is a stratified simple random (sub)sample from the first sample. In the following we assume that $n_{2h} = v_h n_{1h}$, where $0 < v_h \le 1$, and we assume that the $v_h$ are chosen in advance, i.e. they are fixed. The population mean of Y is estimated by

$$\bar y_{ds} = \sum_{h=1}^{L} w_h\, \bar y_{2h} ,$$

where $\bar y_{2h}$ is the estimated population mean of stratum h based on the second sample. Given the first sample, $\bar y_{2h}$ is an unbiased estimate of $\bar y_{1h}$, i.e. the (unobservable) estimate for the population mean of stratum h based on the first sample. We have

$$E(\bar y_{ds}) = E_1\Big\{E_2\Big(\sum_{h=1}^{L} w_h \bar y_{2h} \,\Big|\, S_1\Big)\Big\} = E_1\Big(\sum_{h=1}^{L} w_h \bar y_{1h}\Big) = E_1(\bar y_1) = \bar Y , \qquad (14.1)$$

where $\bar y_1$ is the sample mean of the first sample. So, $\bar y_{ds}$ is design unbiased. Note that we have conditioned on the first sample. In order to derive the design variance, we partition the set of all possible preliminary samples $S_1$, denoted by $\Omega_{sr}$, into a set of all possible samples which would have been obtained by stratified simple random sampling where $n_{1h}$ elements are drawn in stratum h, denoted by $\Omega_{strat}$, and a set of remaining samples. The design variance of $\bar y_{ds}$ is obtained by

$$\mathrm{var}(\bar y_{ds}) = E_1\{\mathrm{var}_2(\bar y_{ds} \mid S_1)\} + \mathrm{var}_1\{E_2(\bar y_{ds} \mid S_1)\} .$$

The second term equals

$$\mathrm{var}_1(\bar y_1) = \Big(\frac{1}{n_1} - \frac{1}{N}\Big) S_y^2 ,$$

and the first term can be elaborated as

$$E_1\Big\{\sum_{h=1}^{L} w_h^2 \Big(\frac{1}{n_{2h}} - \frac{1}{n_{1h}}\Big) s_{1h}^2\Big\} = \frac{1}{n_1}\sum_{h=1}^{L} \Big(\frac{1}{v_h} - 1\Big) E_1(w_h s_{1h}^2) ,$$

where $s_{1h}^2$ is the sample variance of Y in stratum h of the first sample.

Note that $S_1$ given $\Omega_{strat}$ can be considered (by construction) as a stratified simple random sample where $n_{1h}$ and hence also $n_{2h}$ and $w_h$ are constant, and $s_{1h}^2$ are unbiased estimates for $S_h^2$. So,

$$\mathrm{var}(\bar y_{ds}) = \Big(\frac{1}{n_1} - \frac{1}{N}\Big) S_y^2 + \frac{1}{n_1}\sum_{h=1}^{L} W_h S_h^2 \Big(\frac{1}{v_h} - 1\Big) . \qquad (14.2)$$
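The point estimator for double sampling for stratification is simple to compute once the first-phase stratum counts and the second-phase observations are available. The sketch below assumes the estimator is the sum over strata of the first-phase stratum proportion times the second-phase stratum mean; the counts, observations and function name are hypothetical.

```python
def double_sampling_mean(n1_h, y2_h):
    """Estimate the population mean from a two-phase stratified sample.

    n1_h: first-phase sample counts per stratum.
    y2_h: second-phase observations of the target variable per stratum.
    """
    n1 = sum(n1_h)
    estimate = 0.0
    for n1h, ys in zip(n1_h, y2_h):
        w_h = n1h / n1                 # first-phase proportion in stratum h
        y_bar_2h = sum(ys) / len(ys)   # second-phase stratum mean
        estimate += w_h * y_bar_2h
    return estimate


# Hypothetical first phase: 100 units classified into two strata.
n1_h = [60, 40]
# Hypothetical second phase: target variable observed in a subsample per stratum.
y2_h = [[10.0, 12.0, 14.0], [20.0, 22.0]]

y_ds = double_sampling_mean(n1_h, y2_h)
```

Only the stratum membership has to be observed for the full first-phase sample; the target variable is needed for the subsample alone.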

Double sampling for the general regression estimator

In some applications of double sampling the preliminary sample is used to provide auxiliary information for the general regression estimator based on the second sample. The estimate of the population mean is

$$\bar y_R = \bar y_2 + b^t(\bar x_1 - \bar x_2),$$

where the multiple regression coefficient b is estimated from the second sample. Note that $\bar y_R = f(\bar y_2, \bar x_1, \bar x_2, b)$. A first order Taylor series expansion in $(\bar Y, \bar X, \bar X, B)$ gives

$$\bar y_R \approx \bar y_2 + B^t(\bar x_1 - \bar x_2) .$$

We have $E(\bar y_2) = \bar Y$ and $E(\bar x_1) = E(\bar x_2) = \bar X$, so $E(\bar y_R) \approx \bar Y$. It follows that the general regression estimator in case of double sampling is ADU. The design variance of this estimator can be approximated by

.
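A minimal sketch of the regression estimator under double sampling, for a single auxiliary variable and assuming simple random sampling in both phases so that all estimates are plain sample means. The function name and argument layout are hypothetical.

```python
def double_sampling_regression_mean(x_first, x_second, y_second):
    """Return ybar_2 + b * (xbar_1 - xbar_2).

    x_first holds the auxiliary values of the preliminary sample,
    x_second and y_second the values observed in the second sample.
    The slope b is the least squares slope fitted on the second sample.
    """
    n2 = len(x_second)
    xbar1 = sum(x_first) / len(x_first)
    xbar2 = sum(x_second) / n2
    ybar2 = sum(y_second) / n2
    sxy = sum((x - xbar2) * (y - ybar2) for x, y in zip(x_second, y_second))
    sxx = sum((x - xbar2) ** 2 for x in x_second)
    b = sxy / sxx  # requires at least two distinct x values in the second sample
    return ybar2 + b * (xbar1 - xbar2)
```

When y is exactly proportional to x in the second sample, the estimate equals the slope times the first-sample mean of x, which illustrates how the preliminary sample corrects an unlucky sub-sample.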

There is a relationship between double sampling on the one hand and samples with non-response on the other hand. In both cases a sample S is drawn according to some (known) design, but the target variables are observed in a sub-sample. However, in case of double sampling, the sub-sample is drawn with known inclusion probabilities (they may depend on S), while in case of non-response the inclusion probabilities are unknown. The latter are called response probabilities. These probabilities may depend on personal circumstances, but also on the field work organization and the data collection method.

15. Dealing with (unit) non-response

The greater the non-response rate, the more reason one has to worry about its harmful effect on the survey estimates. Strategies for dealing with non-response can be classified as follows (see Särndal et al., 1992):

Before and during data collection, effective measures are taken to reduce the non-response to insignificant levels,

Special, perhaps costly techniques for data collection and estimation are used that induce unbiased estimators,

Model assumptions about the non-response mechanism and about relations between variables are used to construct estimators that adjust for non-response.

We will only discuss a) sub-sampling of non-respondents and b) one specific response model. Both can be linked to double sampling.

Sub-sampling of non-respondents

One approach to deal with non-response is to take a sub-sample of the non-respondents, and make every possible effort to obtain responses from all elements in this sub-sample. This idea was developed by Hansen and Hurwitz (1946). Assume that a simple random sample of size $n_1$ is drawn without replacement in the first trial. Let $w_r = n_r / n_1$ and $w_{nr} = n_{nr} / n_1$ denote the responding and non-responding sampling fractions, respectively, where $n_r$ and $n_{nr}$ are the numbers of respondents and non-respondents in the first trial. In the sub-sampling phase a simple random sample (without replacement) of size $n_2$ is drawn from the non-respondents. This procedure resembles double sampling for stratification, with a subdivision of the initial sample into two strata of which one is completely observed and the other is sub-sampled. However, there is an important difference, because the division of the population into the strata is not fixed (unless the fixed response model is used), but can be considered as a realization of Poisson sampling; whether unit k belongs to the responding stratum or not depends on the realization of a Bernoulli experiment with its personal response probability as success probability. Let

$$\bar{y}_r$$

denote the sample mean of the respondents in the first trial and

$$\bar{y}_{2,nr}$$

the sample mean of the sub-sample among the non-respondents. Then

$$\bar{y}_{HH} = w_r \bar{y}_r + w_{nr} \bar{y}_{2,nr} \qquad (15.1)$$

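is an unbiased estimate for the population mean. This is easily seen as follows. Given the realization of the Poisson sampling, the population is divided into two fixed strata, the first consisting of all units for which measurements would be obtained after the first trial, the second of all units for which no measurements would be obtained after the first trial. So, given the realization of the Poisson sampling, the complete sample, i.e. the first sample plus the sub-sample of non-respondents, can be considered as double sampling for stratification with two strata (L = 2), $v_1 = 1$, and $v_2$ the sub-sampling fraction. Writing $\bar{y}_{HH}$ for the estimator (15.1) and R for the realization of the Poisson sampling, it follows from (14.1) that

$$E(\bar{y}_{HH} \mid R) = \bar{Y}.$$

Given the realization of the Poisson sampling, the variance can easily be obtained from (14.2):

$$\mathrm{var}(\bar{y}_{HH} \mid R) = \left( \frac{1}{n_1} - \frac{1}{N} \right) S^2 + \frac{1}{n_1} W_{nr} \left( \frac{1}{v_2} - 1 \right) S_{nr}^2,$$

where $W_{nr}$ is the realized proportion of non-respondents in the population and $S_{nr}^2$ the variance of Y among them. Since

$$\mathrm{var}\left[ E(\bar{y}_{HH} \mid R) \right] = 0,$$

the unconditional variance of

$$\bar{y}_{HH}$$

is

$$\mathrm{var}(\bar{y}_{HH}) = \left( \frac{1}{n_1} - \frac{1}{N} \right) S^2 + \frac{1}{n_1} \left( \frac{1}{v_2} - 1 \right) E\left( W_{nr} S_{nr}^2 \right).$$

Note that this variance cannot be evaluated without exact knowledge of the response behavior; however, it is possible to obtain an unbiased estimator for this expression.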
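The sub-sampling estimator can be sketched in a few lines. This is a minimal illustration, assuming the usual Hansen and Hurwitz form in which the respondent mean and the sub-sample mean are weighted by the responding and non-responding fractions of the first sample; the function name and arguments are hypothetical.

```python
def hansen_hurwitz_mean(resp_values, n_nonresp, subsample_values):
    """Weighted mean w_r * ybar_r + w_nr * ybar_2nr.

    resp_values: y values of the first-trial respondents (non-empty).
    n_nonresp: number of non-respondents in the first trial.
    subsample_values: y values obtained from the sub-sample of
    non-respondents (non-empty, full response assumed within it).
    """
    n_r = len(resp_values)
    n1 = n_r + n_nonresp
    ybar_r = sum(resp_values) / n_r
    ybar_2nr = sum(subsample_values) / len(subsample_values)
    return (n_r / n1) * ybar_r + (n_nonresp / n1) * ybar_2nr
```

For instance, with three respondents averaging 2, two non-respondents, and a sub-sample of one non-respondent with value 10, the estimate is 0.6 * 2 + 0.4 * 10.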

The response homogeneity group model

If there is full response, unbiased or nearly unbiased estimators can be constructed for a given sampling design. When the non-response is not negligible we should distinguish between the intended sampling design (developed by the statistician) and the realized sampling design, which may differ from the intended design due to non-response. By means of a response model both designs can be linked. A response model is a set of assumptions about the true unknown response behavior. According to the response homogeneity group model, the population is divided into G groups and it is assumed that within each group each individual has the same probability to respond if he/she falls into the sample. Furthermore it is assumed that different potential respondents respond independently of each other. In the following we distinguish between 1) net and gross sample, 2) net and gross sample size, and 3) net and gross inclusion probabilities. It follows from the assumptions that

$$P(k \text{ responds} \mid k \in s_{gross}) = \theta_g$$

if k belongs to group g.

Let $\pi_k$ denote the first order gross inclusion probabilities, then the first order net inclusion probabilities are defined as

$$\pi_k^{net} = \pi_k \theta_g$$

if k belongs to group g.

The groups are to be chosen so that the response homogeneity group model describes the response behavior as accurately as possible. If the response probabilities are known, then the complete theory discussed above can be applied straightforwardly. However, in general the response probabilities are unknown and have to be estimated. Let $s_g$ denote the part of the gross sample that falls in group g, with size $n_g$. We consider two estimators for the g-th group

$$\hat{\theta}_g = \frac{1}{n_g} \sum_{k \in s_g} r_k \qquad (15.2)$$

and

$$\hat{\theta}_g = \frac{\sum_{k \in s_g} r_k / \pi_k}{\sum_{k \in s_g} 1 / \pi_k}, \qquad (15.3)$$

where $r_k$ is the realization of a Bernoulli experiment with $E(r_k) = \theta_g$ if k belongs to the g-th group. Clearly, under the model assumptions, both estimators are (model) unbiased for $\theta_g$. The first estimator is the maximum likelihood estimator for $\theta_g$ (the ordinary sample mean). It is the response fraction of the g-th group in the sample. The second estimator is a ratio estimator for the realized response fraction of the g-th group in the finite population. If the $\pi_k$ are constant within a group (simple random sampling or stratified simple random sampling with groups as strata) then both estimators coincide. For convenience we will use the estimator given by (15.2). Based on the estimated net inclusion probabilities, we may define the Horvitz-Thompson estimator for the net sample

$$\hat{t}_{y,net} = \sum_{g=1}^{G} \frac{1}{\hat{\theta}_g} \sum_{k \in s_{net,g}} \frac{y_k}{\pi_k}, \qquad (15.4)$$

where $s_{net,g}$ denotes the respondents in group g.

Write $\hat{t}_{y,net}$ for the estimator (15.4), $n_g$ and $m_g$ for the gross and net sample sizes in group g, and $\bar{y}_{net,g}$ for the mean of Y among the respondents in group g. For simple random sampling and for stratified simple random sampling where each group corresponds to a stratum, (15.4) reduces to

$$\hat{t}_{y,net} = N \sum_{g=1}^{G} \frac{n_g}{n} \bar{y}_{net,g}$$

and

$$\hat{t}_{y,net} = \sum_{g=1}^{G} N_g \bar{y}_{net,g}$$

respectively. The design expectation and design variance of (15.4) can be obtained by the following reasoning. The net sample can be obtained in two phases. In the first phase the gross sample is drawn according to the intended sampling design. In the second phase the net sample is drawn from the gross sample according to Poisson sampling with (conditional) inclusion probabilities the unknown $\theta_g$. Noting that

$$P(s_{net} \mid s_{gross}, m_1, \ldots, m_G) = \prod_{g=1}^{G} \binom{n_g}{m_g}^{-1},$$

the Poisson sampling can be interpreted as stratified simple random sampling with stratum sample sizes $m_1, \ldots, m_G$, where the $m_g$ are independent binomially distributed random variables with parameters $\theta_g$ and $n_g$, so that $E(m_g) = \theta_g n_g$. So, given the gross sample and the realization of the net sample sizes per group, the net sample can be regarded as a stratified simple random sample from the gross sample (see section 14). For example,

$$E(\hat{t}_{y,net} \mid s_{gross}, m_1, \ldots, m_G) = \sum_{k \in s_{gross}} \frac{y_k}{\pi_k} = \hat{t}_{y,gross}.$$

Note that this expectation is independent of the realization of the net sample sizes per group. It follows that

$$\hat{t}_{y,net}$$

is an unbiased estimator for the population total. The variance can be expressed as

$$\mathrm{var}(\hat{t}_{y,net}) = \mathrm{var}(\hat{t}_{y,gross}) + E\left[ \mathrm{var}(\hat{t}_{y,net} \mid s_{gross}, m_1, \ldots, m_G) \right].$$

Note that the conditional variance in the second term can be derived easily by considering the net sample as a stratified simple random sample from the gross sample. This second term is the increase in variance due to non-response. The first term is the ordinary design variance of the Horvitz-Thompson estimator in case of full response.
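The estimation steps under the response homogeneity group model can be sketched as follows. This is an illustration only: it assumes the sample-mean and ratio forms for the estimated response probabilities described in the text, and a Horvitz-Thompson estimator that divides each respondent's value by the product of its gross inclusion probability and the estimated response probability of its group. All function names and the data layout are hypothetical.

```python
def response_prob_ml(r):
    """Response fraction in a group: the mean of the 0/1 response indicators."""
    return sum(r) / len(r)


def response_prob_ratio(r, pi):
    """Ratio estimator: inverse-probability weighted response fraction."""
    weighted_resp = sum(rk / pk for rk, pk in zip(r, pi))
    weighted_size = sum(1.0 / pk for pk in pi)
    return weighted_resp / weighted_size


def net_ht_total(y, pi, group, responded):
    """Horvitz-Thompson total based on estimated net inclusion probabilities.

    All lists run over the gross sample; y[k] is only used if responded[k].
    Every group is assumed to contain at least one respondent.
    """
    theta = {}
    for g in set(group):
        r_g = [1 if resp else 0 for resp, gk in zip(responded, group) if gk == g]
        theta[g] = response_prob_ml(r_g)
    return sum(yk / (pk * theta[gk])
               for yk, pk, gk, resp in zip(y, pi, group, responded) if resp)
```

With constant inclusion probabilities inside a group the two response probability estimators return the same value, in line with the remark in the text.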

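By means of the net inclusion probabilities, it is also possible to formulate the general regression estimator:

$$\hat{t}_{y,reg} = \hat{t}_{y,net} + \hat{B}' \left( t_x - \hat{t}_{x,net} \right), \qquad \hat{B} = \left( \sum_{k \in s_{net}} \frac{x_k x_k'}{\pi_k \hat{\theta}_g} \right)^{-1} \sum_{k \in s_{net}} \frac{x_k y_k}{\pi_k \hat{\theta}_g},$$

where $t_x$ denotes the known population totals of the auxiliary variables, $\hat{t}_{y,net}$ and $\hat{t}_{x,net}$ are the Horvitz-Thompson estimators based on the estimated net inclusion probabilities $\pi_k \hat{\theta}_g$, and g is the group of element k.

Obviously, this estimator is ADU under the response homogeneity group model. In case of post-stratification, where each post-stratum corresponds to a group (it is assumed that the group population totals $N_g$ are known), we obtain

$$\hat{t}_{y,post} = \sum_{g=1}^{G} N_g \frac{\sum_{k \in s_{net,g}} y_k / \pi_k}{\sum_{k \in s_{net,g}} 1 / \pi_k},$$

in which the estimated response probabilities cancel.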

Obviously, this post-stratification estimator corrects for bias due to non-response without using the net inclusion probabilities. This result justifies the general regression estimator as a tool to correct for bias due to non-response without formulating the response homogeneity group model explicitly. It is hoped that the auxiliary variables that are incorporated in the general regression estimator will also reflect the response homogeneity group model.
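The post-stratification estimator with groups as post-strata can be sketched as below. The sketch assumes the estimator is the sum over groups of the known group size times the inverse-probability weighted mean of the respondents in that group; the function name and arguments are hypothetical. No response probabilities enter the computation, which is the point made above.

```python
def poststratified_total(y, pi, group, group_sizes):
    """Sum over groups of N_g times the weighted respondent mean.

    y, pi, group run over the respondents only; group_sizes maps each
    group label to its known population size N_g. Each group is assumed
    to contain at least one respondent.
    """
    total = 0.0
    for g, n_pop in group_sizes.items():
        num = sum(yk / pk for yk, pk, gk in zip(y, pi, group) if gk == g)
        den = sum(1.0 / pk for pk, gk in zip(pi, group) if gk == g)
        total += n_pop * num / den
    return total
```

Usage: `poststratified_total([5, 2, 4], [0.1, 0.1, 0.1], ['a', 'b', 'b'], {'a': 20, 'b': 20})` weights the respondent mean of each group by its population size.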

16. Aligning estimates between two sample surveys

So far, we have discussed a number of sampling designs and estimation procedures for situations where just one sample survey is involved. In this section we extend the weighting technique to weight two (or more) samples simultaneously. The sampling designs of both samples may differ. We will discuss two techniques; one is based on minimizing a distance function under a set of constraints, the other on the general regression estimator. First we need some terminology. We use the term target variables (denoted by Y) for those variables observed in one of the surveys (but not in both) for which the population totals are unknown, common variables (denoted by Z) for those variables observed in both surveys for which the population totals are unknown, and control variables (denoted by X) for those variables observed in both surveys for which the population totals are known. All kinds of variables may be of interest for publication purposes. The purpose of aligning estimates is to weight both samples such that the weights reproduce the known population totals of the control variables and yield identical estimates for the population totals of the common variables.

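The composite constraints method

Letting the subscripts refer to sample 1 and 2 respectively, we consider the following minimization problem (compare 10.1): Minimize

$$\sum_{k \in s_1} \frac{(w_{1k} - d_{1k})^2}{d_{1k}} + \sum_{k \in s_2} \frac{(w_{2k} - d_{2k})^2}{d_{2k}}$$

subject to

$$\sum_{k \in s_1} w_{1k} x_k = t_x,$$

$$\sum_{k \in s_2} w_{2k} x_k = t_x,$$ and

$$\sum_{k \in s_1} w_{1k} z_k = \sum_{k \in s_2} w_{2k} z_k$$

with respect to $w_{1k}$ and $w_{2k}$, where $d_{ik} = 1 / \pi_{ik}$ are the design weights and $t_x$ denotes the known population totals of the control variables. As with (10.1), this minimization problem can be solved with Lagrange multipliers. The resulting weights reproduce the known population totals of the control variables by the first and second set of constraints. They are mutually consistent with respect to the common variables according to the third set of constraints.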

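The adjusted general regression estimator

As already known, the general regression estimator implicitly defines weights which reproduce the known population totals of control variables if these control variables are used as auxiliary variables in the regression estimation. So, the consistency requirement with respect to the control variables is always fulfilled if one takes these control variables as auxiliary variables in the general regression estimator. For the common variables it is proposed to estimate the unknown population totals by pooling the two sample surveys, and then simultaneously using these common variables as additional regressors in the general regression estimator. Let

$$\hat{t}_z$$

denote the (pooled) estimates of the population totals of the common variables. Writing $t_x$ for the known totals of the control variables and $\hat{t}_{yi}$, $\hat{t}_{xi}$, $\hat{t}_{zi}$ for the Horvitz-Thompson estimators based on sample i, the adjusted general regression estimator for the population total of Y is defined as

$$\hat{t}_{yi}^{AR} = \hat{t}_{yi} + b_i' \left( t_x - \hat{t}_{xi} \right) + c_i' \left( \hat{t}_z - \hat{t}_{zi} \right), \quad i = 1, 2, \qquad (16.1)$$

where the partial regression coefficients are simultaneously obtained from

$$\begin{pmatrix} b_i \\ c_i \end{pmatrix} = \left[ \sum_{k \in s_i} \frac{1}{\pi_{ik}} \begin{pmatrix} x_k \\ z_k \end{pmatrix} \begin{pmatrix} x_k \\ z_k \end{pmatrix}' \right]^{-1} \sum_{k \in s_i} \frac{1}{\pi_{ik}} \begin{pmatrix} x_k \\ z_k \end{pmatrix} y_k, \quad i = 1, 2. \qquad (16.2)$$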

Both adjusted general regression estimators implicitly define a set of weights that reproduce the known totals of the control variables and are mutually consistent with respect to the common variables. Note that the definition of the adjusted general regression estimator is very similar to the definition of the general regression estimator for double sampling. However, there is an important conceptual difference. The purpose of the adjusted general regression estimator is to obtain mutually consistent weights between two separate sample surveys with some variables in common, while the purpose of double sampling is to reduce the design variance of an estimator within a particular (second phase) sample survey by gathering extra auxiliary information in a corresponding first phase sample.

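In order to derive some properties of the adjusted general regression estimator it is convenient to introduce some matrix notation. We only consider the first sample; the other sample can be treated analogously. Denote

$$\mathbf{X}_1 = (x_1, \ldots, x_{n_1})',$$

$$\mathbf{Z}_1 = (z_1, \ldots, z_{n_1})',$$

$$\mathbf{y}_1 = (y_1, \ldots, y_{n_1})',$$ and

$$\Pi_1 = \mathrm{diag}(\pi_{11}, \ldots, \pi_{1 n_1}).$$

The partial regression coefficients given by (16.2) can be written as

$$\begin{pmatrix} b_1 \\ c_1 \end{pmatrix} = \left[ (\mathbf{X}_1, \mathbf{Z}_1)' \Pi_1^{-1} (\mathbf{X}_1, \mathbf{Z}_1) \right]^{-1} (\mathbf{X}_1, \mathbf{Z}_1)' \Pi_1^{-1} \mathbf{y}_1.$$

Using well-known theory about partitioned matrices (it is assumed that all inverse matrices exist), it can be shown that

$$b_1 = b_{1y} - B_{1z} c_1$$

and

$$c_1 = \left( \mathbf{Z}_1' \Pi_1^{-1} M_1 \mathbf{Z}_1 \right)^{-1} \mathbf{Z}_1' \Pi_1^{-1} M_1 \mathbf{y}_1,$$

where

$$b_{1y} = \left( \mathbf{X}_1' \Pi_1^{-1} \mathbf{X}_1 \right)^{-1} \mathbf{X}_1' \Pi_1^{-1} \mathbf{y}_1,$$

$$B_{1z} = \left( \mathbf{X}_1' \Pi_1^{-1} \mathbf{X}_1 \right)^{-1} \mathbf{X}_1' \Pi_1^{-1} \mathbf{Z}_1,$$

and

$$M_1 = I - \mathbf{X}_1 \left( \mathbf{X}_1' \Pi_1^{-1} \mathbf{X}_1 \right)^{-1} \mathbf{X}_1' \Pi_1^{-1}.$$

It follows that the adjusted general regression estimator given by (16.1) can be rewritten as

$$\hat{t}_{y1}^{AR} = \hat{t}_{y1}^{R} + c_1' \left( \hat{t}_z - \hat{t}_{z1}^{R} \right), \qquad (16.3)$$

where

$$\hat{t}_{y1}^{R} = \hat{t}_{y1} + b_{1y}' \left( t_x - \hat{t}_{x1} \right)$$

and

$$\hat{t}_{z1}^{R} = \hat{t}_{z1} + B_{1z}' \left( t_x - \hat{t}_{x1} \right)$$

are the ordinary general regression estimators for the population total of Y and Z respectively. Apparently, the adjusted general regression estimator is equal to the ordinary regression estimator plus an adjustment term. This adjustment term can be viewed as an attempt to further improve the ordinary general regression estimator. However, and probably more important, it is a means to achieve consistent estimates between the two samples with respect to the common variables. The adjusted regression estimator given by (16.3) implicitly defines the following vector of weights:

$$w_1^{AR} = w_1^{R} + \Pi_1^{-1} M_1 \mathbf{Z}_1 \left( \mathbf{Z}_1' \Pi_1^{-1} M_1 \mathbf{Z}_1 \right)^{-1} \left( \hat{t}_z - \hat{t}_{z1}^{R} \right),$$

with

$$w_1^{R} = \Pi_1^{-1} \left[ l + \mathbf{X}_1 \left( \mathbf{X}_1' \Pi_1^{-1} \mathbf{X}_1 \right)^{-1} \left( t_x - \hat{t}_{x1} \right) \right],$$

where

$$\hat{t}_{x1} = \mathbf{X}_1' \Pi_1^{-1} l$$

and l a vector of 1s. Noting that

$$M_1 \mathbf{X}_1 = 0,$$

it is readily seen that indeed

$$\mathbf{X}_1' w_1^{AR} = t_x$$

and

$$\mathbf{Z}_1' w_1^{AR} = \hat{t}_z.$$

An important issue is the estimation of the population total of Z. A natural choice is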

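$$\hat{t}_z = P \hat{t}_{z1}^{R} + Q \hat{t}_{z2}^{R},$$

where

$$\hat{t}_{z1}^{R}$$

and

$$\hat{t}_{z2}^{R}$$

are the ordinary general regression estimators of sample 1 and 2 respectively, and P and Q are matrices such that P + Q = I. We give three interesting choices:

$P = \lambda I$ and $Q = (1 - \lambda) I$, where $\lambda$, $0 \le \lambda \le 1$, is a crude measure of the amount of confidence in one estimator compared to the other,

the choice $P = V_2 (V_1 + V_2)^{-1}$ and $Q = V_1 (V_1 + V_2)^{-1}$, with $V_i$ the design covariance matrix of $\hat{t}_{zi}^{R}$ and the two samples drawn independently, gives minimal design variance of $a' \hat{t}_z$ for arbitrary vector a,

the choice $P = (T_1^{-1} + T_2^{-1})^{-1} T_1^{-1}$ and $Q = (T_1^{-1} + T_2^{-1})^{-1} T_2^{-1}$, with $T_i = \mathbf{Z}_i' \Pi_i^{-1} M_i \mathbf{Z}_i$ and $M_i = I - \mathbf{X}_i (\mathbf{X}_i' \Pi_i^{-1} \mathbf{X}_i)^{-1} \mathbf{X}_i' \Pi_i^{-1}$, induces the same weights as the use of the composite constraints method. Here $\mathbf{X}_i$ and $\mathbf{Z}_i$ are the matrices of sample values of the control and common variables in sample i and $\Pi_i$ is the diagonal matrix of inclusion probabilities.

The first choice is easy to implement. If we take

$$\lambda = \frac{n_1}{n_1 + n_2}$$

then this choice takes into account the difference in sample size. More generally, $\lambda$ may depend on indicators for several survey errors, such as frame errors, sampling errors, non-response errors, and measurement errors. The second choice only deals with sampling errors, but in an optimal way. It takes into account sample size, sampling design, and use of auxiliary information. The third choice shows that the class of weights defined by the adjusted general regression estimator includes the weights defined by the composite constraints method. Furthermore, it reveals a weakness in the composite constraints method. Namely, it can be argued that P and Q according to the third choice both converge to

$$\tfrac{1}{2} I$$

for large sample sizes of both samples (assuming that in both sample surveys the same set of control variables is used). Obviously such a choice, and hence the composite constraints method, does not account properly for differences in sample sizes.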
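The alignment of two samples can be sketched numerically. The example below is a minimal illustration, not the implementation used in practice: it assumes that the adjusted general regression weights coincide with ordinary regression (chi-square calibration) weights in which the common variables are added as regressors with the pooled estimate as target total, and it uses the first choice for pooling with lambda = n1 / (n1 + n2). The function names `greg_weights` and `aligned_weights` are hypothetical.

```python
import numpy as np


def greg_weights(d, A, totals):
    """Regression weights w = d + D A (A' D A)^{-1} (totals - A' d).

    d: design weights 1/pi_k, A: n x p matrix of auxiliaries,
    totals: target totals that the weights must reproduce.
    """
    DA = d[:, None] * A
    lagr = np.linalg.solve(A.T @ DA, totals - A.T @ d)
    return d + DA @ lagr


def aligned_weights(d1, X1, Z1, d2, X2, Z2, t_x, lam=None):
    """Weights for two samples that agree on the common variables Z."""
    # Ordinary regression estimates of the Z totals in each sample.
    z1_reg = Z1.T @ greg_weights(d1, X1, t_x)
    z2_reg = Z2.T @ greg_weights(d2, X2, t_x)
    if lam is None:
        lam = len(d1) / (len(d1) + len(d2))  # weight by sample size
    z_pooled = lam * z1_reg + (1.0 - lam) * z2_reg
    # Recalibrate each sample on control and common variables jointly.
    target = np.concatenate([t_x, z_pooled])
    w1 = greg_weights(d1, np.hstack([X1, Z1]), target)
    w2 = greg_weights(d2, np.hstack([X2, Z2]), target)
    return w1, w2, z_pooled
```

Both weight vectors reproduce the known control totals, and applying them to the common variables gives the same pooled estimate in either sample.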

References

Bethlehem, J.G. and Keller, W.J. (1987), Linear Weighting of Sample Survey Data, Journal of Official Statistics, 3, 141-153.

Cochran, W.G. (1977), Sampling Techniques, 3rd-ed. New York: Wiley.

Deville, J.C. and Särndal, C.E. (1992), Calibration Estimators in Survey Sampling, Journal of the American Statistical Association, 87, 376-382.

Deville, J.C., Särndal, C.E., and Sautory, O. (1993), Generalized Raking Procedures in Survey Sampling, Journal of the American Statistical Association, 88, 1013-1020.

Gouweleeuw, J.M., Heerschop, M.J., and van Huis, L.T. (1997), Consistent Estimation of Kind of Activity Units and Local Units using the Calibration Method, Research paper, Department of Statistical Methods, Statistics Netherlands, Voorburg.

Knottnerus, P., Renssen, R.H., and Verboon, P. (1997), Sampling Design and EDI, Research paper, Department of Statistical Methods, Statistics Netherlands, Voorburg.

Koeijers, C.A.J. and Willeboordse, A.J. (eds), (1995), Reference Manual on Design and Implementation of Business Surveys, Statistics Netherlands.

Lemaître, G. and Dufour, J. (1987), An Integrated Method for Weighting Persons and Families, Survey Methodology, 13, 199-207.

Nieuwenbroek, N.J. (1993), An Integrated Method for weighting Characteristics of Persons and Households using the Linear Regression Estimator, Research Paper, Department of Statistical Methods, Statistics Netherlands, Heerlen.

Nieuwenbroek, N.J. (1997), General Regression Estimator in Bascula 3.0: Theoretical Background, Research paper, Department of Statistical Methods, Statistics Netherlands, Heerlen.

Renssen, R.H. and Nieuwenbroek, N.J. (1997), Aligning Estimates for Common Variables in Two or More Sample Surveys, Journal of the American Statistical Association, 92, 368-374.

Särndal, C.E., Swensson, B., and Wretman, J.H. (1992), Model Assisted Survey Sampling, New York: Wiley.

Zeelenberg, C. (1993), A Survey of Matrix Differentiation, Research Paper, Department of Statistical Methods, Statistics Netherlands, Voorburg.

Zieschang, K.D. (1990), Sample Weighting Methods and Estimation of Totals in the Consumer Expenditure Survey, Journal of the American Statistical Association, 85, 986-1001.

Skinner et al. (1989, chap 1) also consider the model based approach: inference proceeds with respect to the sampling distribution of statistics over repeated realizations y1,...,yN generated by a super-population model. We will not elaborate on this approach.
