Advanced sampling theory
Statistics Netherlands
Division Research and Development
Department of Statistical Methods
P.O. box 4481
6401 CZ Heerlen
The Netherlands
A Course in Sampling Theory
by Robbert H. Renssen*
* The views expressed in this paper are those of the author and do not necessarily reflect the policies of Statistics Netherlands.
Project no.: RSM-50351
BPA no.: 2138-98-RSM-1
Date: 16 March 1998
Section Statistical Methods
Division Research and Development

1. Introduction

The purpose of sample surveys is to gather information about a certain finite population by estimating finite population parameters such as means, totals, or fractions. In sampling theory, observations obtained from the sampling units are regarded as fixed. The randomness is introduced because a probability sample is observed instead of the whole population. The population to be sampled (the sampled population) should coincide with the population about which information is wanted (the target population). Sometimes, for reasons of practicability or convenience, the sampled population is more restricted than the target population. If so, it should be remembered that the conclusions drawn from the sample only apply to the sampled population.
Before selecting the sample, the population must be divided into parts that are called sampling units. In principle, these units must cover the whole population and they must not overlap, in the sense that each element in the population belongs to exactly one sampling unit. The construction of a list of sampling units, called a sampling frame, is often one of the major practical problems. Sampling frames are often found to be incomplete, or partly illegible, or to contain an unknown amount of duplication.
A common starting point in a survey design is to concentrate first on the pure effects of sampling and to ignore any frame imperfections or other errors that may occur. When developing a sample survey that is based on a probability sample, the following issues are important:
- the sampling design and the sample selection scheme, both applying before the data collection,
- the estimator by which a particular parameter will be estimated, applying after the data collection.
The sampling design is a set of specifications, which defines the target population, the sampling units, and the probabilities attached to the possible samples. The sample selection scheme describes the mechanical selection of a sample according to the chosen design. Which design fits best for a particular survey depends on the auxiliary information present in the frame. The more information is available before sampling the better the sampling design can be tailored to the survey objectives. The estimator is the mathematical function by which the estimate for a particular parameter is computed. The form of the parameter often induces the choice of an estimator. An estimator may contain auxiliary information, either from the sampling frame or from external sources. The combination of a sampling design and an estimator is called a sampling strategy.
2. The design based approach; definitions and notations

We consider a finite population U of N elements/units and associate with each element k a value $y_k$ of a scalar target variable Y and a p-vector $x_k$ with values of p auxiliary variables X. Note that both Y and X may be interesting for publication purposes. By means of some (complex) design a sample S of fixed size n is drawn from U. Given the sampling design we consider the set $\Omega$ of all possible samples. Denote for each $S \in \Omega$
$p(S)$: probability of a specific sample S and
$\hat{\theta}(S)$: estimator of a (finite) population parameter $\theta$ by means of S.
According to the design based approach, inference proceeds with respect to the sampling distribution of statistics over repeated samples S generated by the sampling design (Skinner et al., 1989, chap. 1). For example, the design expectation and design variance are defined as
$$E(\hat{\theta}) = \sum_{S \in \Omega} p(S)\,\hat{\theta}(S)$$
and
$$\mathrm{var}(\hat{\theta}) = \sum_{S \in \Omega} p(S)\,\{\hat{\theta}(S) - E(\hat{\theta})\}^2 .$$
Let $a_k(S)$ denote the number of times the element k is drawn for a specific S, so that $\sum_{k \in U} a_k(S) = n$. Define the first order inclusion expectation of k as
$$\pi_k = E(a_k) = \sum_{S \in \Omega} p(S)\,a_k(S), \quad (k = 1,\dots,N).$$
The second order inclusion expectation of elements k and l is defined as
$$\pi_{kl} = E(a_k a_l) = \sum_{S \in \Omega} p(S)\,a_k(S)\,a_l(S).$$
In the following we will consider a number of well-known sampling designs and corresponding estimators.
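The definitions of this section can be checked by brute force on a tiny population. The sketch below is only an illustration under stated assumptions: the population values in `y` are made up, and the design is taken to be simple random sampling without replacement (the design treated in the next section), represented as a dictionary from samples to probabilities. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def srs_design(N, n):
    """Assumed example design: every subset of size n has the same probability."""
    samples = list(combinations(range(N), n))
    p = Fraction(1, len(samples))
    return {S: p for S in samples}


def design_expectation(design, estimator):
    """E(theta_hat): sum over all samples of p(S) * theta_hat(S)."""
    return sum(p * estimator(S) for S, p in design.items())


def design_variance(design, estimator):
    """var(theta_hat): sum over all samples of p(S) * (theta_hat(S) - E)^2."""
    mean = design_expectation(design, estimator)
    return sum(p * (estimator(S) - mean) ** 2 for S, p in design.items())


def inclusion_expectation(design, k, l=None):
    """E(a_k), or E(a_k a_l) when l is given; a_k(S) counts how often k occurs in S."""
    if l is None:
        return sum(p * S.count(k) for S, p in design.items())
    return sum(p * S.count(k) * S.count(l) for S, p in design.items())


# Hypothetical population of N = 5 values and samples of size n = 2.
y = [Fraction(v) for v in (3, 7, 8, 12, 20)]
N, n = len(y), 2
design = srs_design(N, n)


def sample_mean(S):
    return sum(y[k] for k in S) / len(S)


pop_mean = sum(y) / N
S2 = sum((v - pop_mean) ** 2 for v in y) / (N - 1)

exp_mean = design_expectation(design, sample_mean)
var_mean = design_variance(design, sample_mean)
pi_0 = inclusion_expectation(design, 0)
pi_01 = inclusion_expectation(design, 0, 1)
```

Exact rational arithmetic is used so that the enumerated quantities can be compared with closed-form expressions without rounding error. Any other design can be plugged in by supplying a different sample-to-probability dictionary.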
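3. Simple random sampling, model assisted approach, and simplified form

By simple random sampling each sample has the same probability of being drawn. We consider only simple random sampling without replacement. The population and sample mean of Y are denoted by respectively
$$\bar{Y} = \frac{1}{N}\sum_{k \in U} y_k$$
and
$$\bar{y}_s = \frac{1}{n}\sum_{k \in S} y_k ,$$
where S is a simple random sample without replacement. The population and sample variance with respect to Y are
$$S_y^2 = \frac{1}{N-1}\sum_{k \in U} (y_k - \bar{Y})^2$$
and
$$s_y^2 = \frac{1}{n-1}\sum_{k \in S} (y_k - \bar{y}_s)^2 .$$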
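The set $\Omega$ consists of $\binom{N}{n}$ samples, of which $\binom{N-1}{n-1}$ contain element k and $\binom{N-2}{n-2}$ contain element k as well as element l, $k \neq l$. So,
$$\pi_k = \frac{n}{N}$$
and
$$\pi_{kl} = \frac{n(n-1)}{N(N-1)} \quad \text{for } k \neq l.$$
Note that $\pi_{kl} = \pi_k$ for k = l. A direct estimator for the population mean in case of simple random sampling (without use of auxiliary information) is the sample mean. The following proofs are illustrative:
$$E(\bar{y}_s) = \frac{1}{n}\sum_{k \in U} E(a_k)\,y_k = \frac{1}{n}\,\frac{n}{N}\sum_{k \in U} y_k = \bar{Y}$$
and
$$\mathrm{var}(\bar{y}_s) = \frac{1}{n^2}\sum_{k \in U}\sum_{l \in U} (\pi_{kl} - \pi_k \pi_l)(y_k - \bar{Y})(y_l - \bar{Y}) = \frac{N-n}{N}\,\frac{S_y^2}{n},$$
where we have made use of the following identity:
$$\sum_{k \in U}\sum_{l \neq k} (y_k - \bar{Y})(y_l - \bar{Y}) = -\sum_{k \in U} (y_k - \bar{Y})^2 .$$
The factor (N-n)/N is called the finite population correction factor. For simple random sampling, the general regression estimator (with use of auxiliary information) is defined as
$$\bar{y}_R = \bar{y}_s + (\bar{X} - \bar{x}_s)^t b ,$$
where $\bar{X}$ and $\bar{x}_s$ are the population and sample mean of the auxiliary vectors $x_k$ and
$$b = \Big(\sum_{k \in S} \lambda_k x_k x_k^t\Big)^{-1} \sum_{k \in S} \lambda_k x_k y_k \quad (3.1)$$
with $\lambda_k > 0$ a scalar. The second term of the general regression estimator can be considered as a correction for the sample mean. An important issue is the choice of $\lambda_k$. It is required that all $\lambda_k$ are known. Särndal et al. (1992) suggested taking $\lambda_k = 1/\sigma_k^2$, where the $\sigma_k^2$ can be interpreted as the variances of independent random variables $Y_k$ defined in a superpopulation model $\xi$, of which the $y_k$ are supposed to be the outcomes. More precisely, the model $\xi$ has the following features:
- $y_1,\dots,y_N$ are assumed to be realized values of independent random variables $Y_1,\dots,Y_N$,
- $E_\xi(Y_k) = x_k^t \beta$, and
- $\mathrm{var}_\xi(Y_k) = \sigma_k^2$.
This model is only used to determine a specific choice of $x_k$ and $\lambda_k$. In other words, the model serves as a vehicle for finding an appropriate general regression estimator. Once the estimator is found, the model is no longer of use. The properties of the general regression estimator (expectation and variance) are still derived from a design based point of view. Finding a suitable general regression estimator by means of a superpopulation model within the framework of the design based approach is called model assisted.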
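Note that $\bar{y}_R = f(\bar{y}_s, \bar{x}_s, b)$ with $f(u, v, \beta) = u + (\bar{X} - v)^t \beta$. By means of the first order terms of the Taylor series the general regression estimator can be linearized. The partial derivatives are
$$\frac{\partial f}{\partial u} = 1, \quad \frac{\partial f}{\partial v} = -\beta, \quad \text{and} \quad \frac{\partial f}{\partial \beta} = \bar{X} - v .$$
The first order Taylor series expansion of f at $(\bar{Y}, \bar{X}, B)$, where B is the population regression coefficient, i.e.
$$B = \Big(\sum_{k \in U} \lambda_k x_k x_k^t\Big)^{-1} \sum_{k \in U} \lambda_k x_k y_k ,$$
equals
$$\bar{y}_R \approx \bar{Y} + (\bar{y}_s - \bar{Y}) - (\bar{x}_s - \bar{X})^t B = \bar{y}_s + (\bar{X} - \bar{x}_s)^t B .$$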
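The general regression estimator is Approximately Design Unbiased (ADU):
$$E(\bar{y}_R) \approx \bar{Y} .$$
Furthermore, the design variance of the general regression estimator can be approximated by
$$\mathrm{var}(\bar{y}_R) \approx \mathrm{var}(\bar{e}_s) = \frac{N-n}{N}\,\frac{S_e^2}{n},$$
where
$$e_k = y_k - x_k^t B$$
and
$$S_e^2 = \frac{1}{N-1}\sum_{k \in U} (e_k - \bar{E})^2 .$$
Note that $\bar{e}_s$ and $\bar{E}$ are respectively the sample mean and population mean of the residuals $e_k$. An important simplified form of the general regression estimator can be derived from the first of the following results. If there is a constant p-vector c such that $\lambda_k c^t x_k = 1$ for all $k \in U$, then
$$\bar{y}_s = \bar{x}_s^t b \quad \text{and hence} \quad \bar{y}_R = \bar{X}^t b ,$$
and
$$\bar{Y} = \bar{X}^t B \quad \text{and hence} \quad \bar{E} = 0 .$$
We only prove the first result; the second result can be proved analogously. Under the stated assumption we have
$$\sum_{k \in S} (y_k - x_k^t b) = c^t \sum_{k \in S} \lambda_k x_k (y_k - x_k^t b) = 0 ,$$
which gives the first result. We may distinguish several special cases of the general regression estimator: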
- the ratio estimator: p = 1, $x_k$ a single (positive) auxiliary variable, and $\lambda_k = 1/x_k$,
- the regression estimator: p = 2, $x_k = (1, x_k)^t$, and $\lambda_k = 1$,
- post-stratification: p = A (number of post-strata), $x_k = \delta_k$, and $\lambda_k = 1$. Note that $\delta_k$ represents A dummy variables. Each dummy variable corresponds to a post-stratum. It equals 1 if the k-th element belongs to that post-stratum, otherwise it equals 0.
These estimators can all be written in the simplified form. Take
$c = 1$ in case of the ratio estimator,
$c = (1, 0)^t$ in case of the regression estimator, and
$c = (1,\dots,1)^t$ in case of post-stratification.
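As an illustration of the simplified form, the following sketch computes the general regression estimator for a single auxiliary variable (p = 1) and checks that the ratio-estimator choice, a factor of $1/x_k$, reproduces the population x-mean times the estimated coefficient. The sample values, the population mean `X_bar`, and the function name are hypothetical; the regression coefficient is assumed to take the weighted least squares form with the scalar factors as weights.

```python
from fractions import Fraction as Fr


def greg_mean_one_aux(y_s, x_s, lam_s, X_bar):
    """General regression estimator of the mean under simple random sampling, p = 1.

    Assumed form: b = sum(lam*x*y) / sum(lam*x*x), estimator = y_bar + (X_bar - x_bar) * b.
    """
    n = len(y_s)
    b = (sum(l * x * y for l, x, y in zip(lam_s, x_s, y_s))
         / sum(l * x * x for l, x in zip(lam_s, x_s)))
    y_bar = sum(y_s) / n
    x_bar = sum(x_s) / n
    return y_bar + (X_bar - x_bar) * b, b


# Hypothetical sample of n = 4 units and a known population mean of x.
x_s = [Fr(2), Fr(4), Fr(5), Fr(9)]
y_s = [Fr(3), Fr(7), Fr(8), Fr(16)]
X_bar = Fr(6)

# Ratio estimator: the scalar factor is 1 / x_k.
lam_ratio = [1 / x for x in x_s]
est, b = greg_mean_one_aux(y_s, x_s, lam_ratio, X_bar)
```

With these factors the coefficient reduces to the ratio of the sample totals of y and x, so the estimator equals `X_bar * b`, the classical ratio estimator.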
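4. Stratified simple random sampling

For stratified sampling designs the population U is divided into H mutually exclusive strata, denoted by $U_1,\dots,U_H$. In each stratum a simple random sample $S_h$ of size $n_h$ is drawn without replacement from the $N_h$ elements of the stratum. The population and sample mean of Y in stratum h are denoted by respectively
$$\bar{Y}_h = \frac{1}{N_h}\sum_{k \in U_h} y_k$$
and
$$\bar{y}_h = \frac{1}{n_h}\sum_{k \in S_h} y_k .$$
Furthermore, the population and sample variance with respect to Y in stratum h are
$$S_h^2 = \frac{1}{N_h - 1}\sum_{k \in U_h} (y_k - \bar{Y}_h)^2$$
and
$$s_h^2 = \frac{1}{n_h - 1}\sum_{k \in S_h} (y_k - \bar{y}_h)^2 .$$
The set $\Omega_h$ consists of $\binom{N_h}{n_h}$ samples $S_h$ and so $\Omega$ consists of $\prod_{h=1}^{H} \binom{N_h}{n_h}$ samples S. A direct estimator for the population mean of Y is obtained by
$$\bar{y}_{st} = \sum_{h=1}^{H} \frac{N_h}{N}\,\bar{y}_h .$$
Note that the known stratum sizes $N_h$ are used in this estimator. The design expectation and design variance are easily derived:
$$E(\bar{y}_{st}) = \bar{Y}$$
and
$$\mathrm{var}(\bar{y}_{st}) = \sum_{h=1}^{H} \Big(\frac{N_h}{N}\Big)^2 \frac{N_h - n_h}{N_h}\,\frac{S_h^2}{n_h} .$$
An important issue for stratified designs is the allocation scheme, i.e. the allocation of the sample over the strata. Two well-known allocation schemes are proportional allocation and the so-called Neyman allocation:
$$n_h = n\,\frac{N_h}{N}$$
and
$$n_h = n\,\frac{N_h S_h}{\sum_{h'=1}^{H} N_{h'} S_{h'}} .$$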
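The Neyman allocation minimizes the design variance of the direct estimator for fixed sample size n. For stratified designs we distinguish two important estimators, namely
- the separated general regression estimator:
$$\bar{y}_{RS} = \sum_{h=1}^{H} \frac{N_h}{N}\,\{\bar{y}_h + (\bar{X}_h - \bar{x}_h)^t b_h\},$$
with $b_h$ the coefficient (3.1) computed within stratum h,
- the combined general regression estimator:
$$\bar{y}_{RC} = \bar{y}_{st} + (\bar{X} - \bar{x}_{st})^t b ,$$
where
$$\bar{x}_{st} = \sum_{h=1}^{H} \frac{N_h}{N}\,\bar{x}_h$$
and
$$b = \Big(\sum_{h=1}^{H} \frac{N_h}{n_h}\sum_{k \in S_h} \lambda_k x_k x_k^t\Big)^{-1} \sum_{h=1}^{H} \frac{N_h}{n_h}\sum_{k \in S_h} \lambda_k x_k y_k .$$
Special cases of the separated general regression estimator are the separated ratio estimator and the separated simple regression estimator, and special cases of the combined general regression estimator are the combined ratio estimator and the combined regression estimator. Note that the separated general regression estimator can also be considered as a special case of the combined general regression estimator. Namely, if $\delta_k \otimes x_k$ is taken as an auxiliary variable for the combined general regression estimator, where $\delta_k$ is the H-vector of stratum dummies, then we obtain the separated general regression estimator with $x_k$ as auxiliary variable. Here, the operator $\otimes$ denotes the Kronecker product, see e.g. Zeelenberg (1993).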
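The direct stratified estimator, its design variance, and the two allocation schemes can be sketched as follows. The stratum sizes and variances are invented for illustration, the function names are hypothetical, and the allocations are left as real numbers (rounding to integer sample sizes is ignored).

```python
from math import sqrt


def stratified_mean(N_h, ybar_h):
    """Direct estimator: stratum sample means weighted by N_h / N."""
    N = sum(N_h)
    return sum(Nh / N * yb for Nh, yb in zip(N_h, ybar_h))


def stratified_variance(N_h, n_h, S2_h):
    """Design variance of the direct estimator under stratified simple random sampling."""
    N = sum(N_h)
    return sum((Nh / N) ** 2 * (1 - nh / Nh) * S2 / nh
               for Nh, nh, S2 in zip(N_h, n_h, S2_h))


def proportional_allocation(n, N_h):
    """n_h proportional to the stratum size N_h."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]


def neyman_allocation(n, N_h, S2_h):
    """n_h proportional to N_h * S_h."""
    weights = [Nh * sqrt(S2) for Nh, S2 in zip(N_h, S2_h)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical population: three strata with increasing within-stratum variance.
N_h = [500, 300, 200]
S2_h = [4.0, 25.0, 100.0]
n = 100

n_prop = proportional_allocation(n, N_h)
n_ney = neyman_allocation(n, N_h, S2_h)
var_prop = stratified_variance(N_h, n_prop, S2_h)
var_ney = stratified_variance(N_h, n_ney, S2_h)
```

In this made-up example the Neyman allocation shifts sample size towards the small but highly variable third stratum, which lowers the design variance relative to proportional allocation.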
5. Cluster sampling

For cluster sampling, the population U is divided into M mutually exclusive clusters (also called primary units): $U_1,\dots,U_M$. Cluster i contains $M_i$ units/elements (called secondary units). We will discuss two sampling designs:
- design 1: A simple random sample (or a stratified sample) of m clusters is drawn without replacement; the complete cluster is observed,
- design 2: A simple random sample (or a stratified sample) of m secondary units is drawn without replacement; the complete corresponding cluster is observed.
The choice of one of these designs may depend on e.g. the available information in the sampling frames. We note that the second design has a practical advantage for panel surveys with households as observational units. Namely, once the panel has been drawn it is easier to follow persons (according to the second design) than households (according to the first design), since the composition of a household may change after the first wave. If a household composition has changed then its first order inclusion expectation should be adjusted accordingly. In case of the second design only the current household composition is needed to do so, while in case of the first design the complete history of the household composition, starting from the first wave, has to be taken into account.
Since in both designs a cluster is observed completely, the Y-variables and the X-variables can be calculated at the cluster level as well as at the level of the secondary units. To be specific, we consider household sampling, where each household member is observed. We may distinguish between household characteristics (composition of household, size of households, region), and person characteristics (sex, age, marital status, region). Study variables (target or auxiliary) may concern both types of characteristics. This implies that some study variables are defined at the level of persons (indicated by two indices) while others are defined at the level of households (indicated by one index).
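Design 1. For a simple random sample S of m clusters, the direct estimator for the population total is defined as (Y concerns person characteristics):
$$\hat{Y} = \frac{M}{m}\sum_{i \in S} y_i^*$$
with
$$y_i^* = \sum_{j=1}^{M_i} y_{ij} . \quad (5.1)$$
If Y concerns household characteristics then the direct estimator is
$$\hat{Y} = \frac{M}{m}\sum_{i \in S}\sum_{j=1}^{M_i} y_{ij}^*$$
with
$$y_{ij}^* = \frac{y_i}{M_i} . \quad (5.2)$$
The star-notation indicates that a characteristic is derived (inherited) from the value of the other observation unit. Obviously, both estimators can be formulated at the level of persons as well as at the level of households. Note that the design expectations and the design variances of these estimators can be derived easily at the level of households. The general regression estimator is
$$\hat{Y}_R = \hat{Y} + (X - \hat{X})^t b , \quad (5.3)$$
where b can be defined at the cluster level or at the level of secondary units. At the cluster level we have
$$b = \Big(\sum_{i \in S} \lambda_i x_i x_i^t\Big)^{-1} \sum_{i \in S} \lambda_i x_i y_i ,$$
e.g. post-stratification with respect to households. Naturally, $\lambda_i$ should concern household characteristics. For example, according to the model assisted approach, $\lambda_i$ can be viewed as an (inverse) model variance defined for clusters. At the person level we have
$$b = \Big(\sum_{i \in S}\sum_{j=1}^{M_i} \lambda_{ij} x_{ij} x_{ij}^t\Big)^{-1} \sum_{i \in S}\sum_{j=1}^{M_i} \lambda_{ij} x_{ij} y_{ij} ,$$
e.g. post-stratification with respect to persons. Here, $\lambda_{ij}$ can be interpreted as an (inverse) model variance for persons. In both cases derived (starred) values are used for variables that are defined at the other level. We will discuss the second design in the next section.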
6. The Horvitz-Thompson estimator

The Horvitz-Thompson estimator can be used as a general tool to construct unbiased estimators for many sampling designs. Let $\pi_k$ denote the first order inclusion expectation of the k-th sampled unit and let $y_k$ denote the observation which corresponds to this unit, then the Horvitz-Thompson estimator for the population mean of Y is defined as
$$\hat{\bar{Y}}_{HT} = \frac{1}{N}\sum_{k \in S} \frac{y_k}{\pi_k} .$$
It follows that
$$E(\hat{\bar{Y}}_{HT}) = \frac{1}{N}\sum_{k \in U} E(a_k)\,\frac{y_k}{\pi_k} = \bar{Y} .$$
If the units are sampled without replacement then the following expression can be derived for the design variance:
$$\mathrm{var}(\hat{\bar{Y}}_{HT}) = \frac{1}{N^2}\sum_{k \in U}\sum_{l \in U} (\pi_{kl} - \pi_k \pi_l)\,\frac{y_k}{\pi_k}\,\frac{y_l}{\pi_l}, \quad \text{with } \pi_{kk} = \pi_k .$$
It is important to note that all direct estimators discussed so far are in fact Horvitz-Thompson estimators. It remains to discuss a direct estimator for the second design in case of cluster sampling: m secondary units are drawn by simple random sampling without replacement and the complete cluster (primary unit) is observed.
Design 2. In order to calculate the first order inclusion expectation of a secondary unit $u_{ij}$ we divide the population U into two parts, namely the $M_i$ secondary units belonging to the i-th primary unit and the remaining $N - M_i$ secondary units. If m secondary units are drawn by simple random sampling without replacement and for each drawn secondary unit the complete cluster is observed, i.e. is in the sample, then according to the (mean of the) hypergeometric distribution we have
$$\pi_{ij} = \frac{m M_i}{N} . \quad (6.1)$$
Now, we construct a sample $S_c$ of m clusters, such that each drawn secondary unit corresponds to precisely one cluster in $S_c$, namely the cluster it belongs to. If two distinct secondary units belonging to the same cluster are drawn, then this cluster is duplicated in $S_c$. By construction (6.1) is also the first order inclusion expectation of cluster $c_i$ with respect to $S_c$. Based on $S_c$ and the Horvitz-Thompson formalism, we may construct the following unbiased estimators for the population total of Y (if Y concerns person characteristics)
$$\hat{Y} = \frac{N}{m}\sum_{i \in S_c} \frac{y_i^*}{M_i}, \quad y_i^* = \sum_{j=1}^{M_i} y_{ij},$$
or (if Y concerns household characteristics)
$$\hat{Y} = \frac{N}{m}\sum_{i \in S_c} \frac{y_i}{M_i} .$$
The design variance of both estimators can be derived easily, since the elements (clusters) in $S_c$ can be considered as a simple random sample without replacement. If Y concerns person characteristics, then $y_i^*/M_i$ is observed at the i-th element; otherwise, if Y concerns household characteristics, $y_i/M_i$ is observed.
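The unbiasedness of the design 2 estimator can be verified by enumerating all samples for a toy population. The sketch below assumes an inclusion expectation of $m M_i / N$ for cluster i and an estimator that adds one term (household total divided by household size) per drawn person, so that a household drawn twice is counted twice. The household data and all names are made up.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population: three households with person-level values y_ij.
households = [[4, 6], [10], [1, 2, 3]]
persons = [(i, j) for i, hh in enumerate(households) for j in range(len(hh))]
N = len(persons)   # number of secondary units (persons)
m = 2              # number of persons drawn


def estimate_total(drawn):
    """Estimator of the population total based on the duplicated cluster sample S_c."""
    total = Fraction(0)
    for i, _ in drawn:                        # one term per drawn person
        hh = households[i]
        total += Fraction(sum(hh), len(hh))   # household total / household size
    return Fraction(N, m) * total


samples = list(combinations(persons, m))      # all equally likely samples of persons

# Design expectation of the estimator, by complete enumeration.
expected = sum(estimate_total(S) for S in samples) / len(samples)
true_total = sum(sum(hh) for hh in households)

# Inclusion expectation of each cluster: expected number of drawn persons per household.
cluster_incl = [
    Fraction(sum(sum(1 for i, _ in S if i == h) for S in samples), len(samples))
    for h in range(len(households))
]
```

Enumeration is only feasible for very small populations, but it makes the role of the duplicated clusters explicit: the count of drawn persons per household is what has expectation $m M_i / N$.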
7. The general regression estimator

Based on Horvitz-Thompson estimators the general regression estimator for the population total is defined as:
$$\hat{Y}_R = \hat{Y}_{HT} + (X - \hat{X}_{HT})^t b$$
with
$$b = \Big(\sum_{k \in S} \frac{\lambda_k x_k x_k^t}{\pi_k}\Big)^{-1} \sum_{k \in S} \frac{\lambda_k x_k y_k}{\pi_k} . \quad (7.1)$$
Here X denotes the vector of known population totals of the auxiliary variables, and $\lambda_k$ the scalar factors of section 3. For simple random sampling (7.1) corresponds to (3.1) and for stratified simple random sampling (7.1) corresponds to the combined general regression estimator. For cluster sampling the auxiliary information in (7.1) may be used at the cluster level or at the level of secondary units, see the next section.
It is convenient and common practice to present the general regression estimator in terms of weights:
$$\hat{Y}_R = \sum_{k \in S} w_k y_k , \quad (7.2)$$
with
$$w_k = \frac{1}{\pi_k}\,g_k , \quad g_k = 1 + (X - \hat{X}_{HT})^t \Big(\sum_{l \in S} \frac{\lambda_l x_l x_l^t}{\pi_l}\Big)^{-1} \lambda_k x_k . \quad (7.3)$$
Often, the $w_k$ are called final weights and the $1/\pi_k$ inclusion weights. The correction weights $g_k$ are often called g-weights. Weighting offers a way to estimate population totals or means of study variables without needing these variables in advance. One only has to determine the vector of auxiliary variables (i.e. determine the weighting model) to calculate the weights according to (7.3). Afterwards one may calculate (7.2) for any variable that is observed in the sample. For each variable the result is a general regression estimate with a predetermined weighting model.
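A small sketch of the weighting step for p = 2 auxiliary variables (a constant and one x-variable) is given below. It assumes the g-weights take the form 1 + (X - X_HT)^t T^{-1} lam_k x_k with T the inclusion-weighted sum of lam_k x_k x_k^t; the 2x2 inverse is written out by hand to stay within the standard library. The data and names are hypothetical.

```python
from fractions import Fraction as Fr


def g_weights_p2(x_s, pi_s, lam_s, X_tot):
    """g-weights and final weights for p = 2, using an explicit 2x2 matrix inverse."""
    d = [1 / p for p in pi_s]                       # inclusion weights 1 / pi_k
    # T = sum over the sample of d_k * lam_k * x_k x_k^t (symmetric 2x2 matrix)
    t11 = sum(dk * lk * xk[0] * xk[0] for dk, lk, xk in zip(d, lam_s, x_s))
    t12 = sum(dk * lk * xk[0] * xk[1] for dk, lk, xk in zip(d, lam_s, x_s))
    t22 = sum(dk * lk * xk[1] * xk[1] for dk, lk, xk in zip(d, lam_s, x_s))
    det = t11 * t22 - t12 * t12
    # Known totals minus their Horvitz-Thompson estimates
    diff = [X_tot[c] - sum(dk * xk[c] for dk, xk in zip(d, x_s)) for c in (0, 1)]
    # Row vector diff^t T^{-1}
    a0 = (diff[0] * t22 - diff[1] * t12) / det
    a1 = (diff[1] * t11 - diff[0] * t12) / det
    g = [1 + lk * (a0 * xk[0] + a1 * xk[1]) for lk, xk in zip(lam_s, x_s)]
    w = [dk * gk for dk, gk in zip(d, g)]
    return g, w


# Hypothetical simple random sample of n = 4 out of N = 10 with x_k = (1, x_k).
N, n = 10, 4
x_s = [(Fr(1), Fr(v)) for v in (2, 4, 6, 8)]
pi_s = [Fr(n, N)] * n
lam_s = [Fr(1)] * n
X_tot = (Fr(N), Fr(60))          # known population size and known x-total

g, w = g_weights_p2(x_s, pi_s, lam_s, X_tot)

# The same weights can be applied to any observed study variable.
y_s = [Fr(v) for v in (5, 9, 14, 17)]
greg_total = sum(wk * yk for wk, yk in zip(w, y_s))
```

The final weights reproduce the known totals of the auxiliary variables exactly, which is the property that makes one set of weights usable for every study variable.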
8. Consistent weighting between persons and households

Cluster designs need special attention because one can weight at two levels: weighting at the level of persons (the index k in (7.3) stands for persons) or at the level of households (the index k in (7.3) stands for households). Again we distinguish between person characteristics and household characteristics. Study variables (target or auxiliary) may concern both types of characteristics. This implies that some study variables are defined at the level of persons while others are defined at the level of households. The main issue of consistent weighting between persons and households is to translate person characteristics into household scores (or vice versa), such that both types of characteristics can be used for either weighting procedure. In addition, all person weights within a household should be the same and equal to the household weight, i.e. $w_{ij} = w_i$ for all $j = 1,\dots,M_i$. We only discuss the method of Lemaître and Dufour (1987).
First note that $\pi_{ij} = \pi_i$ for all $j = 1,\dots,M_i$ (by design). Let $z_i = x_i$ if X concerns a household characteristic and $z_i = \sum_{j=1}^{M_i} x_{ij}$ if X concerns a person characteristic. Define the following weights at the household level:
$$w_i = \frac{1}{\pi_i}\Big\{1 + (X - \hat{X}_{HT})^t \Big(\sum_{i' \in S} \frac{\lambda_{i'} z_{i'} z_{i'}^t}{\pi_{i'}}\Big)^{-1} \lambda_i z_i\Big\} , \quad (8.1)$$
where $\hat{X}_{HT} = \sum_{i \in S} z_i/\pi_i$ is the Horvitz-Thompson estimator for the population total of X (defined at the household level). Furthermore, let $z_{ij} = x_i/M_i$ for all $j = 1,\dots,M_i$ if X concerns a household characteristic and $z_{ij} = \frac{1}{M_i}\sum_{j'=1}^{M_i} x_{ij'}$ for all $j = 1,\dots,M_i$ if X concerns a person characteristic, and define the following weights at the person level:
$$w_{ij} = \frac{1}{\pi_{ij}}\Big\{1 + (X - \hat{X}_{HT})^t \Big(\sum_{i' \in S}\sum_{j'=1}^{M_{i'}} \frac{\lambda_{i'j'} z_{i'j'} z_{i'j'}^t}{\pi_{i'j'}}\Big)^{-1} \lambda_{ij} z_{ij}\Big\} . \quad (8.2)$$
By construction we have $\sum_{j=1}^{M_i} z_{ij} = z_i$ for all i, so it follows that $\sum_{i \in S}\sum_{j=1}^{M_i} z_{ij}/\pi_{ij}$ is a Horvitz-Thompson estimator for the population total of X also (defined at the level of persons). Since the first order inclusion probabilities and the auxiliary variables are equal within households, the weights $w_{ij}$ given by (8.2) must be equal within households if $\lambda_{ij}$ is taken equal within households. Furthermore, person weights should also represent household weights, i.e. (8.1) should equal (8.2). Now, these demands are fulfilled if $\lambda_i^*$ and $\lambda_{ij}^* = M_i \lambda_i^*$ for all $j = 1,\dots,M_i$ are inserted in (8.1) and (8.2) respectively. Two choices for $(\lambda_i^*, \lambda_{ij}^*)$ are
$$(\lambda_i^*, \lambda_{ij}^*) = (1, M_i) \quad \text{and} \quad (\lambda_i^*, \lambda_{ij}^*) = (1/M_i, 1) .$$
If X concerns purely household characteristics, then the first choice is interesting, while the second choice should be considered if X concerns purely person characteristics. In Nieuwenbroek (1993) these choices are motivated from a model assisted point of view. For example, suppose that X concerns person characteristics. If $\lambda_{ij} = 1$ is a suitable choice to weight at the level of persons, then one should take $\lambda_i^* = 1/M_i$ and $\lambda_{ij}^* = 1$ for consistent weighting.
9. Bounding g-weights; the Huang and Fuller algorithm

There is no guarantee that the general regression weights given by (7.3) are strictly positive. Apart from the fact that negative weights may induce negative estimates of population totals, many users of statistics are reluctant to work with negative weights. The problem of unacceptable weights tends to increase when the weighting model is very extensive in comparison with the sample size. Several techniques have been developed to force the weights within a certain interval. The use of calibration estimators to prevent extreme weights will be discussed in the next section. In this section an algorithm largely based on Huang and Fuller (1978) is given. According to this algorithm the correction weights $g_k$ are forced within a certain interval [L,U], with 0 < L < 1 < U, by an iterative process (see Nieuwenbroek, 1997):
Step 1. Choose the lower and upper bounds L and U for the g-weights, and the maximum number of iterations $\tau_{max}$.
Step 2. Set $\tau = 0$ and initialize $q_k^{(0)} = 1$ for all $k \in S$.
Step 3. Calculate the g-weights:
$$g_k^{(\tau)} = 1 + (X - \hat{X}_{HT})^t \Big(\sum_{l \in S} \frac{q_l^{(\tau)} \lambda_l x_l x_l^t}{\pi_l}\Big)^{-1} q_k^{(\tau)} \lambda_k x_k .$$
Step 4. If all $g^{(\tau)}$-weights are within the interval or if $\tau = \tau_{max}$, the process stops; otherwise continue with step 5.
Step 5. Set $\tau = \tau + 1$ and calculate the distance
if
and
if
Note that
.
Step 6. Set
if
if
if
Note that if
falls outside the interval, i.e.
or
, then
Step 7. Repeat from step 3.
Clearly, with the help of the q-factors the g-weights are adjusted such that, hopefully, they fall inside the interval after the last iteration. Nieuwenbroek (1997) strongly advises being careful with the specification of the acceptance interval. In particular, tight bounds may cause problems. It is illustrative to show the g-weights for some particular weighting models in case of simple random sampling:
- post-stratification: $g_k = \dfrac{N_h/N}{n_h/n}$ if k belongs to the h-th post-stratum,
- ratio estimator: $g_k = \bar{X}/\bar{x}_s$, with $\bar{x}_s$ the sample mean of the auxiliary variable,
- simple regression: $g_k = 1 + \dfrac{n\,(\bar{X} - \bar{x}_s)(x_k - \bar{x}_s)}{\sum_{l \in S} (x_l - \bar{x}_s)^2}$, with $\bar{x}_s$ as above.
For the weighting models that correspond to post-stratification and the ratio estimator, a restriction on the g-weights is not sensible. For example, in case of post-stratification the starting g-weights are constant within post-strata. So, within a post-stratum all starting g-weights fall either inside or outside the interval. If they fall outside the interval, then according to steps 5 and 6 the corresponding q-factors will be constant within post-strata. The elaboration of step 3 in case of post-stratification shows that the g-weights are not affected by such q-factors; they remain the same after each iteration.
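The invariance claimed for post-stratification can be made concrete with a short sketch. It assumes that the q-factors enter the g-weights by multiplying the scalar factors, and it uses the fact that for dummy auxiliary variables the matrix to be inverted is diagonal, so each g-weight only involves sums within its own post-stratum. The sample, the inclusion weights, and the q-values are invented; the function name is hypothetical and this is not the full Huang and Fuller iteration.

```python
from fractions import Fraction as Fr


def poststrat_g_weights(stratum, d, q, N_h):
    """g-weights with q-factors for the post-stratification model (dummy x_k, factor 1).

    Diagonal case of: g_k = 1 + (X - X_HT)^t (sum d_l q_l x_l x_l^t)^{-1} q_k x_k.
    """
    N_hat = {}   # Horvitz-Thompson estimate of each post-stratum size
    T = {}       # diagonal elements: sum of d_l * q_l within each post-stratum
    for h, dk, qk in zip(stratum, d, q):
        N_hat[h] = N_hat.get(h, 0) + dk
        T[h] = T.get(h, 0) + dk * qk
    return [1 + (N_h[h] - N_hat[h]) * qk / T[h] for h, qk in zip(stratum, q)]


# Hypothetical sample of five units in two post-strata, all with inclusion weight 4.
stratum = ["a", "a", "b", "b", "b"]
d = [Fr(4)] * 5
N_h = {"a": 10, "b": 10}

# Starting g-weights (all q-factors equal to 1).
g_plain = poststrat_g_weights(stratum, d, [Fr(1)] * 5, N_h)

# q-factors that are constant within each post-stratum.
q_by_stratum = {"a": Fr(3), "b": Fr(1, 2)}
g_scaled = poststrat_g_weights(stratum, d, [q_by_stratum[h] for h in stratum], N_h)
```

Because a q-factor that is constant within a post-stratum appears in both the numerator and the denominator, it cancels, and the g-weights stay at the ratio of the known to the estimated post-stratum size.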
The Huang-Fuller algorithm can be motivated from the model assisted point of view. After convergence the resulting estimates can be considered as generalized regression estimates with modified $\lambda$-factors, namely $q_k \lambda_k$ (in the strict sense the resulting estimates are not generalized regression estimates, because the modified $\lambda$-factors are sample dependent). From the model assisted point of view the original $\lambda$-factors are interpreted as inverse values of the model variances, see section 3. The fact that some regression weights are negative suggests that the model is misspecified. The Huang-Fuller algorithm tries to fit the model (after data collection) via a modification of the model variances.
10. Calibration estimation

Use of auxiliary information by means of the general regression estimator can be justified by a regression relationship between the target variable on the one hand and the auxiliary variables on the other hand. It has been shown that the general regression estimator implicitly defines weights by means of which population totals of study variables can be estimated. In this section we show that the general regression estimator can also be obtained by a different route, namely by focusing on the weights instead of the linear regression relationship. This route offers us 1) a way to generalize the general regression estimator and 2) an alternative tool to restrict the g-weights.
Denote $d_k = 1/\pi_k$. The general regression weights given by (7.3) can also be obtained by minimizing
$$\sum_{k \in S} \frac{(w_k - d_k)^2}{2\,d_k \lambda_k}$$
subject to
$$\sum_{k \in S} w_k x_k = X$$
with respect to $w_1,\dots,w_n$, or equivalently, by minimizing
$$\sum_{k \in S} \frac{(w_k - d_k)^2}{2\,d_k \lambda_k} - \mu^t \Big(\sum_{k \in S} w_k x_k - X\Big) , \quad (10.1)$$
with respect to $w_1,\dots,w_n$ and $\mu_1,\dots,\mu_p$. Here, $\mu = (\mu_1,\dots,\mu_p)^t$ is a p-vector of Lagrange multipliers. By differentiating (10.1) with respect to $w_1,\dots,w_n$ and setting the derivatives at 0, we obtain
$$\frac{w_k - d_k}{d_k \lambda_k} - x_k^t \mu = 0 ,$$
from which it follows that
$$w_k = d_k\,(1 + \lambda_k x_k^t \mu) . \quad (10.2)$$
Differentiating (10.1) with respect to $\mu$, setting the derivative at 0, and inserting (10.2) we obtain
$$\sum_{k \in S} d_k\,(1 + \lambda_k x_k^t \mu)\,x_k = X ,$$
which gives
$$\mu = \Big(\sum_{k \in S} d_k \lambda_k x_k x_k^t\Big)^{-1} (X - \hat{X}_{HT}) , \quad (10.3)$$
provided the inverse exists. The resulting weights can be obtained by inserting (10.3) into (10.2). Indeed, the resulting weights coincide with (7.3), so the resulting estimator is just the general regression estimator. Now, the minimization problem (10.1) and hence the general regression estimator can be generalized as follows.
Let G be a real valued function with the properties: G is positive, strictly convex, $G(1) = G'(1) = 0$ and $G''(1) = 1$. Extending (7.3), a calibration estimator for the population total of Y is defined as
$$\hat{Y}_C = \sum_{k \in S} w_k y_k ,$$
where the calibration weights are obtained by minimizing
$$\sum_{k \in S} \frac{d_k\,G(w_k/d_k)}{\lambda_k} - \mu^t \Big(\sum_{k \in S} w_k x_k - X\Big) \quad (10.4)$$
with respect to $w_1,\dots,w_n$, $\mu_1,\dots,\mu_p$. Roughly, a calibration estimator uses calibration weights which are as close as possible, according to a certain distance measure, to the original sampling weights $d_k$. For the specific distance function
$$G(u) = \tfrac{1}{2}(u - 1)^2$$
the calibration estimator reduces to the general regression estimator.
Differentiating (10.4) with respect to $w_k$ we obtain
$$\frac{G'(w_k/d_k)}{\lambda_k} - x_k^t \mu = 0 ,$$
and solving for $w_k$ we obtain
$$w_k = d_k\,F(\lambda_k x_k^t \mu) , \quad (10.5)$$
where F is the inverse function of $G'$. Note that the existence of F is guaranteed since G is strictly convex, and hence $G'$ is strictly increasing. Differentiating (10.4) with respect to $\mu$ and inserting (10.5) we obtain
$$\sum_{k \in S} d_k\,F(\lambda_k x_k^t \mu)\,x_k = X . \quad (10.6)$$
This is a system of p equations in p unknowns, which should be solved for $\mu$. Let
$$\phi(\mu) = \sum_{k \in S} d_k\,F(\lambda_k x_k^t \mu)\,x_k - X$$
and
$$\phi'(\mu) = \sum_{k \in S} d_k \lambda_k\,F'(\lambda_k x_k^t \mu)\,x_k x_k^t .$$
Then, according to the Newton-Raphson algorithm, a solution may be found by iterating
$$\mu^{(\tau+1)} = \mu^{(\tau)} - \{\phi'(\mu^{(\tau)})\}^{-1} \phi(\mu^{(\tau)}) .$$
Often, (10.3) is taken as a starting value. According to (10.5) the g-weights (obtained by calibration) are given by F. Therefore, the range of the g-weights is restricted by the range of F. Now, instead of (10.4) one could define calibration weights by means of (10.5) and (10.6) with an appropriate F-function. (F should be monotone increasing, $F(0) = 1$ and $F'(0) = 1$, and (10.6) should have a solution, i.e. the range of F should not be too tight.)
Besides $F(u) = 1 + u$, which corresponds to the general regression estimator, we will give two more F-functions, namely
$$F(u) = \begin{cases} L & \text{if } 1 + u < L, \\ 1 + u & \text{if } L \le 1 + u \le U, \\ U & \text{if } 1 + u > U, \end{cases}$$
and
$$F(u) = \exp(u) ,$$
where L and U are defined as in section 9. The first F-function is motivated by the desire to restrict the regression weights. It is called the truncated linear method. The second F-function corresponds to the multiplicative method (or raking method), as will be shown in section 12 for two-way tables. Note that the second F-function is bounded by zero from below.
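The Newton-Raphson solution of the calibration equations can be sketched for a single auxiliary variable (p = 1), where the system reduces to one scalar equation. The sketch assumes calibration weights of the form d_k * F(lam_k * x_k * mu) and a calibration equation sum(w_k * x_k) = X; the data, the starting value 0, the tolerance, and the function name are illustrative choices rather than part of the original notes.

```python
from math import exp


def calibrate(x, d, X_tot, F, F_prime, lam=None, tol=1e-12, max_iter=50):
    """Solve sum(d_k * F(lam_k * x_k * mu) * x_k) = X_tot for scalar mu by Newton-Raphson."""
    lam = lam or [1.0] * len(x)
    mu = 0.0                                           # F(0) = 1: start from the sampling weights
    for _ in range(max_iter):
        phi = sum(dk * F(lk * xk * mu) * xk for dk, lk, xk in zip(d, lam, x)) - X_tot
        if abs(phi) < tol:
            break
        dphi = sum(dk * lk * F_prime(lk * xk * mu) * xk * xk for dk, lk, xk in zip(d, lam, x))
        mu -= phi / dphi                               # Newton-Raphson update
    weights = [dk * F(lk * xk * mu) for dk, lk, xk in zip(d, lam, x)]
    return weights, mu


# Hypothetical sample: four units with sampling weight 2.5 and known x-total 30.
x = [1.0, 2.0, 3.0, 4.0]
d = [2.5] * 4
X_tot = 30.0

# Linear method (general regression estimator): F(u) = 1 + u.
w_lin, mu_lin = calibrate(x, d, X_tot, lambda u: 1.0 + u, lambda u: 1.0)

# Multiplicative (raking) method: F(u) = exp(u), so all weights stay positive.
w_exp, mu_exp = calibrate(x, d, X_tot, exp, exp)
```

For the linear F-function a single Newton-Raphson step from 0 already gives the exact solution, which matches the remark that the general regression solution is a natural starting value. The truncated linear method is left out here because its derivative vanishes on the truncated parts, which needs extra care in the iteration.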
We close this section with an important property of calibration estimators: under certain regularity conditions the calibration estimator is asymptotically equivalent to the general regression estimator. In particular, they have the same asymptotic design expectations and design variances. A heuristic argument is the following. For large sample sizes $\hat{X}_{HT}$ is close to X (since $\hat{X}_{HT}$ is a consistent estimator for X). Then, by (10.6) the F-value should be close to 1, and $\mu$ should be close to 0. But, since $F(0) = F'(0) = 1$ for all F-functions, they have the same behavior in the neighborhood of 0. It follows that all F-functions can be approximated by $F(u) = 1 + u$, i.e. the F-function which corresponds to the general regression estimator.
11. Calibration estimators for post-stratification

An important special case to consider is the calibration estimator which corresponds to (complete) post-stratification: p = A, $x_k = \delta_k$, and $\lambda_k = 1$, see section 3. Then $x_k^t \mu = \mu_h$ if element k belongs to the h-th post-stratum, and (10.6) can be elaborated as
$$\sum_{k \in S_h} d_k\,F(\mu_h) = N_h , \quad h = 1,\dots,A ,$$
where $S_h$ is the part of the sample in the h-th post-stratum. It follows that $F(\mu_h) = N_h/\hat{N}_h$, where
$$\hat{N}_h = \sum_{k \in S_h} d_k , \quad h = 1,\dots,A .$$
So, in case of post-stratification the calibration weights are
$$w_k = d_k\,\frac{N_h}{\hat{N}_h} \quad \text{if } k \in S_h ,$$
regardless of the function F. The resulting calibration estimator corresponds to the well-known post-stratification estimator. Note that, strictly speaking, the calibration estimator is not defined for the post-stratification model if the upper bound of F is smaller than $N_h/\hat{N}_h$. However, as already said, for post-stratification a restriction on the g-weights and hence on the F-function is not sensible.
12. Iterative proportional fitting for two-way tables

In this section we consider estimating a two-way table with calibration on the marginal counts: $x_k = (\delta_{1k}^t, \delta_{2k}^t)^t$ and $\lambda_k = 1$, where $\delta_{1k}$ is an r-vector with dummies denoting to which row element k belongs and $\delta_{2k}$ is a c-vector with dummies denoting to which column element k belongs. Let $u = (u_1,\dots,u_r)^t$ denote a vector of order r and $v = (v_1,\dots,v_c)^t$ a vector of order c. By letting $\mu = (u^t, v^t)^t$ we have $x_k^t \mu = u_i + v_j$ whenever k belongs to the (i,j)-th cell. Let $N_{i+}$ denote the marginal row counts and $N_{+j}$ the marginal column counts. Denote further
$$\hat{N}_{ij} = \sum_{k \in S_{ij}} d_k ,$$
i.e. the Horvitz-Thompson estimator for the population total of the (i,j)-th cell, where $S_{ij}$ is the part of the sample in that cell. Then, the calibration equations given by (10.6) are
$$\sum_{j=1}^{c} \hat{N}_{ij}\,F(u_i + v_j) = N_{i+} , \quad i = 1,\dots,r \quad (12.1)$$
and
$$\sum_{i=1}^{r} \hat{N}_{ij}\,F(u_i + v_j) = N_{+j} , \quad j = 1,\dots,c . \quad (12.2)$$
For the multiplicative method we have $F(u_i + v_j) = \exp(u_i)\exp(v_j)$. In this case (12.1) and (12.2) can be written as
$$\exp(u_i) = \frac{N_{i+}}{\sum_{j=1}^{c} \hat{N}_{ij}\exp(v_j)} , \quad i = 1,\dots,r \quad (12.3)$$
and
$$\exp(v_j) = \frac{N_{+j}}{\sum_{i=1}^{r} \hat{N}_{ij}\exp(u_i)} , \quad j = 1,\dots,c \quad (12.4)$$
respectively. A solution of (12.3) and (12.4) is obtained by carrying out until convergence the classical raking algorithm, often called iterative proportional fitting. First set $\exp(v_j) = 1$ and calculate $\exp(u_i)$ according to (12.3). Then, inserting this value in (12.4), we calculate a new value for $\exp(v_j)$, which in turn can be used to calculate a new value for $\exp(u_i)$, etc. After convergence, the population cell counts are estimated by
$$\hat{N}_{ij}\exp(u_i)\exp(v_j) .$$
According to (10.5) the corresponding calibration weights are
$$w_k = d_k \exp(u_i)\exp(v_j)$$
if k belongs to the (i,j)-th cell.
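The raking iteration described above can be sketched as follows: row factors and column factors are updated in turn until the fitted table reproduces both sets of margins. The table of estimated cell counts and the margins are invented, and the function name, tolerance, and iteration limit are illustrative assumptions.

```python
def raking(N_hat, row_tot, col_tot, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting of a two-way table to known row and column totals."""
    r, c = len(row_tot), len(col_tot)
    eu = [1.0] * r          # row factors exp(u_i)
    ev = [1.0] * c          # column factors exp(v_j), started at 1
    for _ in range(max_iter):
        # Row step: make the fitted row sums equal to the row totals.
        eu = [row_tot[i] / sum(N_hat[i][j] * ev[j] for j in range(c)) for i in range(r)]
        # Column step: make the fitted column sums equal to the column totals.
        ev = [col_tot[j] / sum(N_hat[i][j] * eu[i] for i in range(r)) for j in range(c)]
        # After the column step the columns fit exactly; check the rows for convergence.
        err = max(abs(sum(N_hat[i][j] * eu[i] * ev[j] for j in range(c)) - row_tot[i])
                  for i in range(r))
        if err < tol:
            break
    fitted = [[N_hat[i][j] * eu[i] * ev[j] for j in range(c)] for i in range(r)]
    return fitted, eu, ev


# Hypothetical Horvitz-Thompson cell estimates and known margins (both margins sum to 100).
N_hat = [[30.0, 20.0],
         [10.0, 40.0]]
row_tot = [60.0, 40.0]
col_tot = [45.0, 55.0]

fitted, eu, ev = raking(N_hat, row_tot, col_tot)
```

The calibration weight of a sampled element in cell (i, j) is then its sampling weight times `eu[i] * ev[j]`. The sketch assumes all cell estimates are positive and the two sets of margins have the same grand total; otherwise the iteration need not converge.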
13. Consistent calibration weights in cluster sampling

Again cluster sampling needs some special attention, because then (10.5) and (10.6) can be defined at two levels. We extend the method of Lemaître and Dufour to obtain consistent calibration weights between persons and households. Define $z_i$, $z_{ij}$, $\lambda_i^*$, and $\lambda_{ij}^*$ similarly as in section 8, and note that
$$\lambda_{ij}^* z_{ij} = \lambda_i^* z_i \quad \text{for all } j = 1,\dots,M_i . \quad (13.1)$$
For calibration weights defined at the level of persons we insert $z_{ij}$ and $\lambda_{ij}^*$ in (10.5) instead of $x_{ij}$ and $\lambda_{ij}$, and for calibration weights defined at the household level we insert $z_i$ and $\lambda_i^*$ instead of $x_i$ and $\lambda_i$. It follows from (10.5) and (13.1) that all calibration weights defined at the person level are the same within a cluster, and equal to the corresponding calibration weight defined at the cluster level. It follows from (10.6) that both the person weights and the household weights reproduce the known population totals if these weights are applied to the X-variables.
14. Double sampling/two-phase sampling

The sampling strategies discussed so far depend heavily on the use of auxiliary information. When such information is not available, one could consider taking a large preliminary sample in which only auxiliary variables are observed. We distinguish between double sampling for stratification and double sampling for the (general) regression estimator.

Double sampling for stratification

The population is to be stratified into L strata. The first (preliminary) sample is a simple random sample (without replacement) of size $n_1$. Let $W_h = N_h/N$ denote the proportion of the population falling in stratum h, and $w_h = n_{1h}/n_1$ the proportion of the first sample falling in stratum h. Then $w_h$ is a design unbiased estimator of $W_h$. This estimator is used as auxiliary information for the second sample, which is a stratified simple random (sub)sample from the first sample. In the following we assume that $n_{2h} = v_h n_{1h}$, where $0 < v_h \le 1$, and we assume that the $v_h$ are chosen in advance, i.e. they are fixed. The population mean of Y is estimated by
$$\bar{y}_{ds} = \sum_{h=1}^{L} w_h\,\bar{y}_{2h} ,$$
where $\bar{y}_{2h}$ is the estimated population mean of the h-th stratum based on the second sample. Given the first sample, $\bar{y}_{2h}$ is an unbiased estimate of $\bar{y}_{1h}$, i.e. the (unobservable) estimate for the population mean of the h-th stratum based on the first sample. We have
$$E(\bar{y}_{ds}) = E\{E(\bar{y}_{ds} \mid S_1)\} = E\Big(\sum_{h=1}^{L} w_h\,\bar{y}_{1h}\Big) = E(\bar{y}_1) = \bar{Y} , \quad (14.1)$$
where $\bar{y}_1$ is the sample mean of the first sample. So, $\bar{y}_{ds}$ is design unbiased. Note that we have conditioned on the first sample. In order to derive the design variance, we partition the set of all possible preliminary samples $S_1$, denoted by $\Omega_{sr}$, into a set of all possible samples which would have been obtained by stratified simple random sampling where $n_{1h}$ elements are drawn in stratum h, denoted by $\Omega_{strat}$, and a set of remaining samples. The design variance of $\bar{y}_{ds}$ is obtained by
$$\mathrm{var}(\bar{y}_{ds}) = E\{\mathrm{var}(\bar{y}_{ds} \mid S_1)\} + \mathrm{var}\{E(\bar{y}_{ds} \mid S_1)\} .$$
The second term equals
$$\mathrm{var}(\bar{y}_1) = \Big(\frac{1}{n_1} - \frac{1}{N}\Big) S_y^2 ,$$
and the first term can be elaborated as
$$E\Big\{\sum_{h=1}^{L} w_h^2 \Big(\frac{1}{n_{2h}} - \frac{1}{n_{1h}}\Big) s_{1h}^2\Big\} = E\Big\{\sum_{h=1}^{L} \frac{w_h}{n_1}\Big(\frac{1}{v_h} - 1\Big) s_{1h}^2\Big\} .$$
Note that $S_1$ given $\Omega_{strat}$ can be considered (by construction) as a stratified simple random sample where $n_{1h}$ and hence also $n_{2h}$ and $w_h$ are constant, and the $s_{1h}^2$ are unbiased estimates for the $S_h^2$. So,
$$\mathrm{var}(\bar{y}_{ds}) = \Big(\frac{1}{n_1} - \frac{1}{N}\Big) S_y^2 + \sum_{h=1}^{L} \frac{W_h S_h^2}{n_1}\Big(\frac{1}{v_h} - 1\Big) . \quad (14.2)$$
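Double sampling for the general regression estimator

In some applications of double sampling the preliminary sample is used to provide auxiliary information for the general regression estimator based on the second sample. The estimate of the population mean is
$$\bar{y}_R = \bar{y}_2 + (\bar{x}_1 - \bar{x}_2)^t b_2 ,$$
where the subscripts 1 and 2 refer to the first and second sample and the multiple regression coefficient $b_2$ is estimated from the second sample. Note that $\bar{y}_R = f(\bar{y}_2, \bar{x}_1, \bar{x}_2, b_2)$. A first order Taylor series expansion in $(\bar{Y}, \bar{X}, \bar{X}, B)$ gives
$$\bar{y}_R \approx \bar{y}_2 + (\bar{x}_1 - \bar{x}_2)^t B .$$
We have $E(\bar{y}_2) = \bar{Y}$ and $E(\bar{x}_1) = E(\bar{x}_2) = \bar{X}$, so $E(\bar{y}_R) \approx \bar{Y}$. It follows that the general regression estimator in case of double sampling is ADU. The design variance of this estimator can be approximated by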
.
There is a relationship between double sampling on the one hand and samples with non-response on the other hand. In both cases a sample S is drawn according to some (known) design, but the target variables are observed in a sub-sample. However, in case of double sampling, the sub-sample is drawn with known inclusion probabilities (they may depend on S), while in case of non-response the inclusion probabilities are unknown. The latter are called response probabilities. These probabilities may depend on personal circumstances, but also on the field work organization and the data collection method.
15. Dealing with (unit) non-response
The greater the non-response rate, the more reason one has to worry about its harmful effect on the survey estimates. Strategies for dealing with non-response can be classified as follows (see Särndal et al., 1992):
Before and during data collection, effective measures are taken to reduce the non-response to insignificant levels,
Special, perhaps costly techniques for data collection and estimation are used that induce unbiased estimators,
Model assumptions about the non-response mechanism and about relations between variables are used to construct estimators that adjust for non-response.
We will only discuss a) sub-sampling of non-respondents and b) one specific response model. Both can be linked to double sampling.
Sub-sampling of non-respondents
One approach to deal with non-response is to take a sub-sample of the non-respondents and make every possible effort to obtain responses from all elements in this sub-sample. This idea was developed by Hansen and Hurwitz (1946). Assume that a simple random sample of size n1 is drawn without replacement in the first trial. Let wr and wnr denote the fractions of respondents and non-respondents in this sample, respectively. In the sub-sampling phase a simple random sample (without replacement) of size n2 is drawn from the non-respondents. This procedure resembles double sampling for stratification, with a subdivision of the initial sample into two strata of which one is completely observed and the other is sub-sampled. However, there is an important difference: the division of the population into the strata is not fixed (unless the fixed response model is used), but can be considered as a realization of Poisson sampling; whether unit k belongs to the responding stratum or not depends on the realization of a Bernoulli experiment with its personal response probability as success probability. Let
denote the sample mean of the respondents in the first trial and
the sample mean of the sub-sample among the non-respondents. Then
(15.1)
is an unbiased estimate for the population mean. This is easily seen as follows. Given the realization of the Poisson sampling, the population is divided into two fixed strata, the first consisting of all units for which measurements would be obtained after the first trial, the second of all units for which no measurements would be obtained after the first trial. So, given the realization of the Poisson sampling, the complete sample, i.e. the first sample plus the sub-sample of non-respondents, can be considered as double sampling for stratification with H = 2, v1 = 1, and v2 the sub-sampling fraction. It follows from (14.1) that
.
Given the realization of the Poisson sampling, the variance can easily be obtained from (14.2):
.
Since,
, the unconditional variance of
is
.
Note that this variance cannot be evaluated without exact knowledge of the response behavior. However, it is possible to obtain an unbiased estimator for this expression.
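The sub-sampling estimator can be sketched in a few lines of Python. The sketch assumes the usual Hansen and Hurwitz form: the respondent mean weighted by the fraction of respondents wr, plus the mean of the followed-up sub-sample weighted by the fraction of non-respondents wnr. The function and argument names are hypothetical.

```python
def hansen_hurwitz_mean(respondent_values, n_nonrespondents, subsample_values):
    """Estimate the population mean with sub-sampling of non-respondents.

    respondent_values: y values of the first-trial respondents.
    n_nonrespondents: number of first-trial non-respondents.
    subsample_values: y values obtained from the sub-sample of non-respondents.
    """
    n_r = len(respondent_values)
    n1 = n_r + n_nonrespondents  # size of the first sample
    if n1 == 0:
        raise ValueError("the first sample is empty")
    if n_nonrespondents > 0 and not subsample_values:
        raise ValueError("non-respondents exist but no sub-sample was observed")
    w_r = n_r / n1                # fraction of respondents
    w_nr = n_nonrespondents / n1  # fraction of non-respondents
    mean_r = sum(respondent_values) / n_r if n_r else 0.0
    mean_nr = sum(subsample_values) / len(subsample_values) if n_nonrespondents else 0.0
    return w_r * mean_r + w_nr * mean_nr
```

Note that the sub-sample mean receives the weight of the whole non-responding stratum, not just of the sub-sampled units; this is what removes the bias that would arise from using the respondent mean alone.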
The response homogeneity group model
If there is full response, unbiased or nearly unbiased estimators can be constructed for a given sampling design. When the non-response is not negligible we should distinguish between the intended sampling design (developed by the statistician) and the realized sampling design, which may differ from the intended design due to non-response. By means of a response model both designs can be linked. A response model is a set of assumptions about the true unknown response behavior. According to the response homogeneity group model, the population is divided into G groups and it is assumed that within each group each individual has the same probability to respond if he/she falls into the sample. Furthermore it is assumed that different potential respondents will respond independently of each other. In the following we distinguish between 1) net and gross sample, 2) net and gross sample size, and 3) net and gross inclusion probabilities. It follows from the assumptions that
if k belongs to group g.
Let πk denote the first order gross inclusion probabilities; then the first order net inclusion probabilities are defined as
if k belongs to group g.
The groups are to be chosen so that the response homogeneity group model describes the response behavior as accurately as possible. If the response probabilities are known, then the complete theory discussed above can be applied straightforwardly. However, in general the response probabilities are unknown and have to be estimated. We consider two estimators for the g-th group
(15.2)
and
,
(15.3)
where rk is the realization of a Bernoulli experiment with E(rk) = θg if k belongs to the g-th group. Clearly, under the model assumptions, both estimators are (model) unbiased for θg. The first estimator is the maximum likelihood estimator for θg (the ordinary sample mean). It is the response fraction of the g-th group in the sample. The second estimator is a ratio estimator for the realized response fraction of the g-th group in the finite population. If the πk are constant within a group (simple random sampling or stratified simple random sampling with groups as strata), then both estimators coincide. For convenience we will use the estimator given by (15.2). Based on the estimated net inclusion probabilities, we may define the Horvitz-Thompson estimator for the net sample
.
(15.4)
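The two estimators of the group response probability and the resulting Horvitz-Thompson estimator for the net sample can be illustrated as follows. This Python sketch assumes that (15.2) is the unweighted response fraction in the gross sample, that (15.3) is the ratio of inverse-probability weighted sums, and that the net inclusion probability of a respondent is its gross inclusion probability times the estimated response probability of its group. These forms are inferred from the verbal description; the names are hypothetical.

```python
from collections import defaultdict


def response_fractions(gross_sample):
    """Return (unweighted, weighted) response fractions per group.

    gross_sample: list of (group, gross inclusion probability, responded).
    """
    n = defaultdict(int)
    m = defaultdict(int)
    weighted_n = defaultdict(float)
    weighted_m = defaultdict(float)
    for group, pi, responded in gross_sample:
        r = 1 if responded else 0
        n[group] += 1
        m[group] += r
        weighted_n[group] += 1.0 / pi
        weighted_m[group] += r / pi
    unweighted = {g: m[g] / n[g] for g in n}               # sample response fraction
    weighted = {g: weighted_m[g] / weighted_n[g] for g in n}  # ratio estimator
    return unweighted, weighted


def net_horvitz_thompson_total(gross_sample, y_values, theta_hat):
    """Horvitz-Thompson total using estimated net inclusion probabilities.

    y_values[i] is only used when unit i responded.
    """
    total = 0.0
    for (group, pi, responded), y in zip(gross_sample, y_values):
        if responded:
            total += y / (pi * theta_hat[group])  # net probability = pi * theta
    return total
```

When the gross inclusion probabilities are constant within each group, the two dictionaries returned by `response_fractions` are identical, in line with the remark above.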
For simple random sampling and for stratified simple random sampling where each group corresponds to a stratum (15.4) reduces to
and
respectively. The design expectation and design variance of (15.4) can be obtained by the following reasoning. The net sample can be obtained in two phases. In the first phase the gross sample is drawn according to the intended sampling design. In the second phase the net sample is drawn from the gross sample according to Poisson sampling, with the unknown θk as (conditional) inclusion probabilities. Noting that
,
the Poisson sampling can be interpreted as stratified simple random sampling with stratum sample sizes m1,...,mG, where the mg are independent binomially distributed random variables with success probability θg and number of trials ngross,g, so that E(mg) = θg ngross,g. So, given the gross sample and the realization of the net sample sizes per group, the net sample can be regarded as a stratified simple random sample from the gross sample (see section 14). For example,
.
Note that this expectation is independent of the realization of the net sample sizes per group. It follows that
is an unbiased estimator for the population total. The variance can be expressed as
.
Note that the conditional variance in the second term can be derived easily by considering the net sample as a stratified simple random sample from the gross sample. This second term is the increase in variance due to non-response. The first term is the ordinary design variance of the Horvitz-Thompson estimator in case of full response.
By means of the net inclusion probabilities, it is also possible to formulate the general regression estimator:
Obviously, this estimator is ADU under the response homogeneity group model. In case of post-stratification, where each post-stratum corresponds to a group (it is assumed that the group population totals are known), we obtain
.
Obviously, this post-stratification estimator corrects for bias due to non-response without using the net inclusion probabilities. This result justifies the general regression estimator as a tool to correct for bias due to non-response without formulating the response homogeneity group model explicitly. It is hoped that the auxiliary variables that are incorporated in the general regression estimator will also reflect the response homogeneity group model.
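The claim that post-stratification does not need the net inclusion probabilities can be checked numerically. The Python sketch below assumes the post-stratification estimator has the usual form: for each group, the known group size times a weighted mean of the respondents' y values, with weights equal to the inverse net inclusion probabilities. Since the estimated response probability is constant within a group, it cancels from the weighted mean. All names are hypothetical.

```python
def poststratified_total(respondents, group_sizes, theta_hat=None):
    """Post-stratified estimate of the population total from the net sample.

    respondents: list of (group, gross inclusion probability, y value).
    group_sizes: known population size of every group.
    theta_hat: optional estimated response probability per group.
    """
    numerator = {g: 0.0 for g in group_sizes}
    denominator = {g: 0.0 for g in group_sizes}
    for group, pi, y in respondents:
        theta = 1.0 if theta_hat is None else theta_hat[group]
        weight = 1.0 / (pi * theta)  # inverse net inclusion probability
        numerator[group] += weight * y
        denominator[group] += weight
    total = 0.0
    for group, size in group_sizes.items():
        if denominator[group] == 0.0:
            raise ValueError(f"no respondents in group {group!r}")
        total += size * numerator[group] / denominator[group]
    return total
```

Calling the function with and without `theta_hat` gives the same result, which is the cancellation referred to in the text.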
16. Aligning estimates between two sample surveys
So far, we have discussed a number of sampling designs and estimation procedures for situations where just one sample survey is involved. In this section we extend the weighting technique to weight two (or more) samples simultaneously. The sampling designs of both samples may differ. We will discuss two techniques; one is based on minimizing a distance function under a set of constraints, the other on the general regression estimator. First we need some terminology. We use the term target variables (denoted by Y) for variables observed in only one of the two surveys and for which the population totals are unknown, common variables (denoted by Z) for variables observed in both surveys but for which the population totals are unknown, and control variables (denoted by X) for variables observed in both surveys and for which the population totals are known. All kinds of variables may be interesting for publication purposes. The purpose of aligning estimates is to weight both samples such that the weights reproduce the known population totals of the control variables and yield identical estimates for the population totals of the common variables.
The composite constraints method
Letting the subscripts 1 and 2 refer to sample 1 and sample 2 respectively, we consider the following minimization problem (compare 10.1): Minimize
subject to
,
, and
with respect to w1k and w2k. As with (10.1), this minimization problem can be solved with Lagrange multipliers. The resulting weights reproduce the known population totals of the control variables by the first and second set of constraints. They are mutually consistent with respect to the common variables according to the third set of constraints.
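The Lagrange multiplier solution can be sketched in Python for one particular distance function. The sketch assumes the chi-square type distance, the sum over both samples of (w - d)^2 / d with d the design weight; this choice is an assumption, since the distance function of (10.1) is not reproduced in this section. Under that assumption the weights have the closed form w = d (1 + a'λ), where a collects the constraint coefficients of a unit and λ solves a small linear system. All names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("constraints are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for j in range(col, n + 1):
                a[row][j] -= factor * a[col][j]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][j] * solution[j] for j in range(row + 1, n))
        solution[row] = (a[row][n] - tail) / a[row][row]
    return solution


def composite_constraint_weights(d1, x1, z1, d2, x2, z2, x_totals):
    """Weights for two samples under the composite constraints.

    d1, d2: design weights; x1, x2: control variables per unit (list of rows);
    z1, z2: common variables per unit (list of rows); x_totals: known totals.
    """
    n1, n2 = len(d1), len(d2)
    d = list(d1) + list(d2)
    rows, targets = [], []
    for j, total in enumerate(x_totals):      # control totals in each sample
        rows.append([x[j] for x in x1] + [0.0] * n2)
        targets.append(total)
        rows.append([0.0] * n1 + [x[j] for x in x2])
        targets.append(total)
    for j in range(len(z1[0])):               # equal estimates for common variables
        rows.append([z[j] for z in z1] + [-z[j] for z in z2])
        targets.append(0.0)
    # Lagrange system: (A D A') lam = c - A d, with D = diag(d).
    ada = [[sum(dk * ra[k] * rb[k] for k, dk in enumerate(d)) for rb in rows]
           for ra in rows]
    gap = [t - sum(dk * r[k] for k, dk in enumerate(d))
           for r, t in zip(rows, targets)]
    lam = solve_linear_system(ada, gap)
    w = [dk * (1.0 + sum(l * r[k] for l, r in zip(lam, rows)))
         for k, dk in enumerate(d)]
    return w[:n1], w[n1:]
```

The third group of rows encodes the requirement that the weighted total of each common variable is the same in both samples, which is why its target is zero.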
The adjusted general regression estimator
As already known, the general regression estimator implicitly defines weights which reproduce the known population totals of control variables if these control variables are used as auxiliary variables in the regression estimation. So, the consistency requirement with respect to the control variables is always fulfilled if one takes these control variables as auxiliary variables in the general regression estimator. For the common variables it is proposed to estimate the unknown population totals by pooling the two sample surveys, and then simultaneously using these common variables as additional regressors in the general regression estimator. Let
denote the (pooled) estimates of the population totals of the common variables. The adjusted general regression estimator for the population total of Y is defined as
, i = 1,2,
(16.1)
where the partial regression coefficients are simultaneously obtained from
, i = 1,2.
(16.2)
Both adjusted general regression estimators implicitly define a set of weights, which are reproductive with respect to the control variables and mutually consistent with respect to the common variables. Note that the definition of the adjusted general regression estimator is very similar to the definition of the general regression estimator for double sampling. However, there is an important conceptual difference. The purpose of the adjusted general regression estimator is to obtain mutually consistent weights between two separate sample surveys with some variables in common, while the purpose of double sampling is to reduce the design variance of an estimator within a particular (second phase) sample survey by gathering extra auxiliary information in a corresponding first phase sample.
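The weights implied by the adjusted general regression estimator can be illustrated for the simplest case. This Python sketch assumes a single control variable equal to 1 for every unit (so its known total is the population size N), a single common variable z, and unit variance structure in the regression. Under these assumptions the weights are wk = dk (1 + a + b zk), with a and b solved from a 2 x 2 system so that the weights reproduce N and the pooled estimate of the total of Z. The function and argument names are hypothetical.

```python
def adjusted_greg_weights(d, z, population_size, z_total_pooled):
    """Regression weights reproducing N and a pooled estimate of the Z total.

    d: design weights of one sample; z: common variable for the same units.
    """
    s0 = sum(d)
    s1 = sum(dk * zk for dk, zk in zip(d, z))
    s2 = sum(dk * zk * zk for dk, zk in zip(d, z))
    det = s0 * s2 - s1 * s1
    if abs(det) < 1e-12:
        raise ValueError("z is constant in the sample; the system is singular")
    gap_n = population_size - s0   # shortfall for the control total
    gap_z = z_total_pooled - s1    # shortfall for the pooled common total
    a = (s2 * gap_n - s1 * gap_z) / det   # Cramer's rule, intercept term
    b = (s0 * gap_z - s1 * gap_n) / det   # Cramer's rule, slope term
    return [dk * (1.0 + a + b * zk) for dk, zk in zip(d, z)]


def weighted_total(weights, values):
    """Weighted estimate of a population total."""
    return sum(w * v for w, v in zip(weights, values))
```

Applying `adjusted_greg_weights` to each sample with the same `population_size` and the same `z_total_pooled` gives two weight sets that agree on N and on the estimated total of Z, while each can still be used to estimate the totals of the target variables observed in its own survey.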
In order to derive some properties of the adjusted general regression estimator it is convenient to introduce some matrix notation. We only consider the first sample; the other sample can be treated analogously. Denote
,
,
, and
.
The partial regression coefficients given by (16.2) can be written as
.
Using well-known theory about partitioned matrices (it is assumed that all inverse matrices exist), it can be shown that
and
,
where
,
,
and
.
It follows that the adjusted general regression estimator given by (16.1) can be rewritten as
,
(16.3)
where
and
are the ordinary general regression estimators for the population total of Y and Z respectively. Apparently, the adjusted general regression estimator is equal to the ordinary regression estimator plus an adjustment term. This adjustment term can be viewed as an attempt to further improve the ordinary general regression estimator. However, and probably more important, it is a means to achieve consistent estimates between the two samples with respect to the common variables. The adjusted regression estimator given by (16.3) implicitly defines the following vector of weights:
,
with
,
where
and l a vector of 1s. Noting that
, it is readily seen that indeed
and
. An important issue is the estimation of the population total of Z. A natural choice is
,
where
and
are the ordinary general regression estimators of sample 1 and 2 respectively, and P and Q matrices such that P + Q = I. We give three interesting choices
and
, where λ, 0 ≤ λ ≤ 1, is a crude measure of the amount of confidence in one estimator compared to the other,
the choice
and
gives minimal design variance of
for an arbitrary vector a,
the choice
and
induces the same weights as the composite constraints method.
The first choice is easy to implement. If we take
then this choice takes into account the difference in sample size. More generally, λ may depend on indicators for several survey errors, such as frame errors, sampling errors, non-response errors, and measurement errors. The second choice only deals with sampling errors, but in an optimal way. It takes into account sample size, sampling design, and use of auxiliary information. The third choice shows that the class of weights defined by the adjusted general regression estimator includes the weights defined by the composite constraints method. Furthermore, it reveals a weakness in the composite constraints method. Namely, it can be argued that P and Q according to the third choice both converge to
for large sample sizes of both samples (assuming that in both sample surveys the same set of control variables is used). Obviously, such a choice, and hence the composite constraints method, does not account properly for differences in sample sizes.
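The first two choices for pooling can be illustrated for a single common variable, where P and Q reduce to scalars that sum to one. In this Python sketch, the sample-size weight n1 / (n1 + n2) and the inverse-variance weight are assumptions: the former is one natural way to account for sample size, the latter is the standard minimum-variance combination of two independent unbiased estimators. Neither is a transcription of the original formulas, and the names are hypothetical.

```python
def pooled_total_by_sample_size(z1_hat, z2_hat, n1, n2):
    """Pool two estimates of the same total with weights proportional to sample size."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    lam = n1 / (n1 + n2)  # confidence placed in the first estimate
    return lam * z1_hat + (1.0 - lam) * z2_hat


def pooled_total_min_variance(z1_hat, var1, z2_hat, var2):
    """Pool two independent estimates with inverse-variance weights."""
    if var1 <= 0 or var2 <= 0:
        raise ValueError("variances must be positive")
    p = var2 / (var1 + var2)  # the more precise estimate gets the larger weight
    return p * z1_hat + (1.0 - p) * z2_hat
```

The second function needs variance estimates for both surveys and therefore reflects sample size, design and the use of auxiliary information, whereas the first only reflects sample size.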
References
Bethlehem, J.G. and Keller, W.J. (1987), Linear Weighting of Sample Survey Data, Journal of Official Statistics, 3, 141-153.
Cochran, W.G. (1977), Sampling Techniques, 3rd ed., New York: Wiley.
Deville, J.C. and Särndal, C.E. (1992), Calibration Estimators in Survey Sampling, Journal of the American Statistical Association, 87, 376-382.
Deville, J.C., Särndal, C.E., and Sautory, O. (1993), Generalized Raking Procedures in Survey Sampling, Journal of the American Statistical Association, 88, 1013-1020.
Gouweleeuw, J.M., Heerschop, M.J., and van Huis, L.T. (1997), Consistent Estimation of Kind of Activity Units and Local Units using the Calibration Method, Research paper, Department of Statistical Methods, Statistics Netherlands, Voorburg.
Knottnerus, P., Renssen, R.H., and Verboon, P. (1997), Sampling Design and EDI, Research paper, Department of Statistical Methods, Statistics Netherlands, Voorburg.
Koeijers, C.A.J. and Willeboordse, A.J. (eds), (1995), Reference Manual on Design and Implementation of Business Surveys, Statistics Netherlands.
Lemaître, G. and Dufour, J. (1987), An Integrated Method for Weighting Persons and Families, Survey Methodology, 13, 199-207.
Nieuwenbroek, N.J. (1993), An Integrated Method for Weighting Characteristics of Persons and Households using the Linear Regression Estimator, Research paper, Department of Statistical Methods, Statistics Netherlands, Heerlen.
Nieuwenbroek, N.J. (1997), General Regression Estimator in Bascula 3.0: Theoretical Background, Research paper, Department of Statistical Methods, Statistics Netherlands, Heerlen.
Renssen, R.H. and Nieuwenbroek, N.J. (1997), Aligning Estimates for Common Variables in Two or More Sample Surveys, Journal of the American Statistical Association, 92, 368-374.
Särndal, C.E., Swensson, B., and Wretman, J.H. (1992), Model Assisted Survey Sampling, New York: Wiley.
Zeelenberg, C. (1993), A Survey of Matrix Differentiation, Research paper, Department of Statistical Methods, Statistics Netherlands, Voorburg.
Zieschang, K.D. (1990), Sample Weighting Methods and Estimation of Totals in the Consumer Expenditure Survey, Journal of the American Statistical Association, 85, 986-1001.
Skinner et al. (1989, chap 1) also consider the model based approach: inference proceeds with respect to the sampling distribution of statistics over repeated realizations y1,...,yN generated by a super-population model. We will not elaborate on this approach.