JOINT UN-ECE/EUROSTAT MEETING ON POPULATION AND HOUSING CENSUSES GENEVA, 7-9 JULY 2010
Joint UNECE/Eurostat Meeting on Population and Housing Censuses (13-15 May 2008)
Sample results expected accuracy in the Italian Population and Housing Census
Giancarlo Carbonetti, Marco Fortini
Istat – Italian National Statistical Institute
General Censuses Directorate
May 13th 2008
Joint UNECE Eurostat Meeting
Outline
Introduction
Some aspects related to the use of samples of households for long form enumerations
Sampling strategies
Simulation study
Some results
Conclusions
Introduction - 1
Main critical issue of the last Census
Huge organizational (and economic) effort of Municipal Census Offices:
- sudden and time-concentrated increase of workload
- for the largest municipalities, a massive network of enumerators and coordinators to be trained and managed
- lack of adequately skilled resources, high turnover rates
Main objectives for the next Census:
- to improve the efficiency of census operations
- to reduce the municipalities' workload
- to keep a high level of quality
Introduction - 2
Innovations proposed to reach the objectives:
- the use of population registers
- mail-out of census forms
- mixed-mode data collection, mainly based on mail and web
Expected consequences of the innovations:
- an increase in “back office” work
- a reduction in the number of enumerators (“front office” work)
How can the response rates be increased?
A proposal: the use of a “short form” version of the questionnaire is being considered to reach high response rates.
Introduction - 3
Consequences of the use of the short form:
- increasing the response rates
- reducing the response time delay as much as possible
This approach risks information loss!
How to preserve the richness of the census information:
- by selecting a sample of households to which a “long form” version of the questionnaire is supplied
Strategy: the simultaneous use of short and long forms.
Some aspects related to the use of samples of households for long form enumeration - 1
Which type of information can be surveyed by means of a sample of long forms and which must be collected on the whole population?
The overall set of census variables is partitioned into two subsets:
- the demographic variables (gender, date of birth, marital status, nationality, …)
- the remaining variables (educational level, occupational status, commuting)
The short form covers only the first set of variables, whereas the long form covers the whole set.
Some aspects related to the use of samples of households for long form enumeration - 2
What is the municipality population threshold below which the sampling strategy cannot be adopted?
An option under consideration is to sample in municipalities with more than 5,000 inhabitants:
- long forms will be submitted to a sample of households
- short forms will be administered to the remaining households
In municipalities with fewer than 5,000 inhabitants, long forms will be submitted to the whole population.
Some aspects related to the use of samples of households for long form enumeration - 3
Which domains have to be considered to plan the sample and to produce accurate estimates?
New “census domains” have been defined:
- an appropriate methodology was adopted to build up census domains by aggregating the smallest census areas
- the new “areas” refer to the sub-municipal level
Accuracy of sampling estimates for different territorial levels:
- a similar precision is expected for estimates among areas
- higher precision is expected for larger territorial references (from sub-municipal to nationwide level)
Some aspects related to the use of samples of households for long form enumeration - 4
Which statistical methodology yields the most accurate estimation?
… in terms of …
sampling design
use of appropriate lists
efficient estimation methods
sampling error assessment
Answering this question is the aim of the study, some results of which will be presented.
Sampling strategies
Two different sampling designs have been tested:
- Simple Random Sampling of HOUseholds from Administrative Registers (SRSHOU) managed by municipalities
- Area Frame Sampling based on a Simple Random Sampling of ENumeration Areas (SRSENA), which implies a complete data collection of households dwelling in the selected enumeration areas (from the Digital Geocoded Database)
Different studies have been conducted:
- to compare the two approaches (with a sampling ratio of about one third of the whole population considered)
- to evaluate, for the SRSHOU design, the improvement in the precision of the estimates as the sampling ratio increases (10%, 15%, 20%, 33%)
- to introduce some stratifications of the units involved
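The difference between the two designs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the register layout (pairs of enumeration area id and household id), the function names and the sizes are hypothetical and do not come from the study.

```python
import random

def srs_households(register, ratio, rng):
    """SRSHOU-style draw: simple random sample of households from the register."""
    n = round(len(register) * ratio)
    return rng.sample(register, n)

def srs_enumeration_areas(register, ratio, rng):
    """SRSENA-style draw: sample whole enumeration areas, then keep every
    household dwelling in the selected areas."""
    areas = sorted({ea for ea, _ in register})
    k = round(len(areas) * ratio)
    chosen = set(rng.sample(areas, k))
    return [unit for unit in register if unit[0] in chosen]

# Hypothetical register: 30 enumeration areas with 20 households each,
# every household identified by (enumeration_area_id, household_id).
register = [(ea, hh) for ea in range(30) for hh in range(20)]
rng = random.Random(2008)

hou_sample = srs_households(register, 1 / 3, rng)
ena_sample = srs_enumeration_areas(register, 1 / 3, rng)

# Same number of households, but SRSENA concentrates them in fewer areas.
print(len(hou_sample), len({ea for ea, _ in hou_sample}))
print(len(ena_sample), len({ea for ea, _ in ena_sample}))
```

With equal sampling ratios both draws contain the same number of households here, but the area frame sample is concentrated in one third of the enumeration areas.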
Simulation study - 1
Main features of the sampling designs:
- Domains: the “new areas” corresponding to sub-municipal districts
- Target variables: “variables” related to the cross-classification of educational level, employment status and commuting with demographic variables
- Sampling units: “households” or “enumeration areas”
- Estimator: “calibrated estimators” using final weights properly modified so as to make the sample more representative
The sampling strategies were compared with each other through Monte Carlo sampling replications (carried out on 2001 Italian Census data) in order to assess the sampling error, measured by the coefficient of variation (CV), which represents an accuracy measurement of the sampling estimates.
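A minimal Python sketch of how a Monte Carlo assessment of the CV can work is given below. It is a simplification, not the study's procedure: units are sampled directly without calibration, the population of 3,000 units with p = 5% is hypothetical, and the analytic formula is the textbook variance of a proportion under simple random sampling without replacement.

```python
import math
import random
import statistics

def monte_carlo_cv(population, ratio, replications, rng):
    """Empirical CV (%) of the sample proportion over repeated SRS draws."""
    n = round(len(population) * ratio)
    estimates = [sum(rng.sample(population, n)) / n for _ in range(replications)]
    return 100 * statistics.pstdev(estimates) / statistics.fmean(estimates)

def analytic_cv(N, n, p):
    """CV (%) of the sample proportion under SRS without replacement."""
    variance = (1 - n / N) * (N / (N - 1)) * p * (1 - p) / n
    return 100 * math.sqrt(variance) / p

# Hypothetical area: 3,000 units, 150 of which (5%) have the target characteristic.
population = [1] * 150 + [0] * 2850
rng = random.Random(2001)

cv_mc = monte_carlo_cv(population, 0.33, 1000, rng)
cv_exact = analytic_cv(3000, 990, 0.05)
print(f"Monte Carlo cv% = {cv_mc:.2f}, analytic cv% = {cv_exact:.2f}")
```

The two values should be close; the replication approach becomes necessary once calibrated weights or cluster designs make a closed-form variance impractical.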
Simulation study - 2
Because of the strong differences among Italian municipalities, 40 of them, with different population sizes and from different regions of Italy, were considered.
Number of municipalities by geographical area and class of population size (a)
Geographical area   10,000-20,000   20,000-100,000   more than 100,000   Total
North                           4                6                   6      16
Center                          2                3                   3       8
South                           4                6                   6      16
Total                          10               15                  15      40
(a) The legal (official) population data from the 2001 Census of Population were used.
Simulation study - 3
Number of units involved in the simulation study
                     Sampled units      Universe         %
Areas                          497      3,347(*)    14.85%
Enumeration areas           30,890       382,534     8.08%
Households               2,243,511    21,810,676    10.29%
Individuals              5,537,582    56,594,021     9.78%
(*) Estimated number
Scatter plot of cv and p (estimates) for each census area. SRSHOU design (sampling ratio=33%). City of Perugia.
[Figure: scatter plot of cv% (vertical axis, 0 to 100) against p% (horizontal axis, 0 to 55); the plot carries the annotations 1%, 2% and 3%.]
Distribution of median cv for classes of p for SRSHOU design and SRSENA design (both with sampling ratio=33%). Comparison of 4 municipalities.
Classes of p
Milano (111 areas) Bologna (32 areas) Padova (18 areas) Livorno (13 areas)
SRSHOU SRSENA SRSHOU SRSENA SRSHOU SRSENA SRSHOU SRSENA
< 0.05% 97.78 94.12 96.52 94.31 99.65 98.34 102.21 101.61
0.05%├0.1% 51.61 51.59 50.67 49.54 51.70 54.13 50.69 52.06
0.1%├0.25% 34.67 34.92 35.00 35.20 35.37 36.03 35.08 35.67
0.25%├0.5% 22.96 24.38 24.17 24.73 25.58 26.45 23.70 24.37
0.5%├1% 16.86 18.71 16.85 18.32 16.95 17.81 17.16 18.72
1%├2.5% 10.61 12.21 10.74 11.95 11.07 12.00 11.34 12.90
2.5%├5% 7.02 8.53 7.07 8.25 7.35 8.48 7.17 9.00
5%├10% 4.84 5.97 4.75 5.74 5.05 5.85 4.88 6.39
10%├15% 3.17 4.41 3.09 4.09 3.19 4.37 3.06 4.82
15%├20% 2.44 3.46 2.38 3.12 2.44 3.14 2.44 3.39
20%├30% 1.89 2.61 1.92 2.48 2.08 2.73 2.05 2.88
≥ 30% 1.35 1.78 1.32 1.60 1.39 1.72 1.40 2.00
For classes of p from 0.1% upwards, SRSENA yields higher median CVs than SRSHOU in all four municipalities: this is due to the cluster effect.
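The cluster effect can be summarised with a standard textbook approximation that is not part of the slides. Here deff is read as the ratio between the variances of the two designs at the same sampling ratio; the symbols deff, \bar{m} (average number of units per selected cluster) and \rho (intra-cluster correlation) are assumptions introduced for illustration, and the first formula presumes clusters of roughly equal size.

```latex
\[
\mathrm{deff} \approx 1 + (\bar{m} - 1)\,\rho ,
\qquad
CV_{\mathrm{SRSENA}} \approx \sqrt{\mathrm{deff}}\; CV_{\mathrm{SRSHOU}}
\]
```

As an illustration, for Milano in the class p ≥ 30% the table gives 1.78 / 1.35 ≈ 1.32, which corresponds to an implied design effect of about 1.7.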
Loss of efficiency (in terms of CV for classes of p) of estimation with SRSENA with respect to SRSHOU design (both with sampling ratio=33%). Comparison of 4 municipalities.
Classes of p   Milano (111 areas)   Bologna (32 areas)   Padova (18 areas)   Livorno (13 areas)
< 0.05% 3.65 2.21 1.31 0.60
0.05%├0.1% 0.03 1.13 -2.43 -1.37
0.1%├0.25% -0.25 -0.20 -0.66 -0.59
0.25%├0.5% -1.42 -0.56 -0.87 -0.68
0.5%├1% -1.85 -1.47 -0.87 -1.56
1%├2.5% -1.60 -1.22 -0.93 -1.56
2.5%├5% -1.51 -1.18 -1.13 -1.83
5%├10% -1.13 -0.99 -0.80 -1.51
10%├15% -1.24 -1.01 -1.18 -1.76
15%├20% -1.02 -0.73 -0.70 -0.95
20%├30% -0.72 -0.56 -0.65 -0.82
≥ 30% -0.43 -0.29 -0.33 -0.60
Values are computed as [CV(SRSHOU_s.r. 33%)-CV(SRSENA_s.r. 33%)]; a negative value means that the CV is higher under SRSENA.
Distribution of median cv for classes of p. Comparison of 4 different sampling ratios with the SRSHOU design.
Classes of p   sampling ratio=10% (170 areas)   sampling ratio=15% (140 areas)   sampling ratio=20% (111 areas)   sampling ratio=33% (204 areas)
< 0.05% 220.51 157.20 142.00 98.21
0.05%├0.1% 111.48 87.22 74.20 51.14
0.1%├0.25% 75.57 59.83 49.97 34.76
0.25%├0.5% 50.70 39.92 33.97 23.44
0.5%├1% 35.54 28.10 23.74 16.56
1%├2.5% 23.62 18.56 15.33 10.68
2.5%├5% 15.50 12.29 10.09 7.04
5%├10% 10.46 8.26 6.93 4.82
10%├15% 7.06 5.40 4.40 3.13
15%├20% 5.57 4.27 3.54 2.42
20%├30% 4.50 3.48 2.84 1.93
≥ 30% 3.20 2.42 1.94 1.34
Gain of efficiency (in terms of CV for classes of p) of estimation with SRSHOU design by increasing sampling ratio from 10% to 33%.
Classes of p   increasing s.r. from 10% to 15%   increasing s.r. from 10% to 20%   increasing s.r. from 10% to 33%
< 0.05% 28.71 35.60 55.46
0.05%├0.1% 21.76 33.44 54.13
0.1%├0.25% 20.83 33.88 54.00
0.25%├0.5% 21.26 33.00 53.77
0.5%├1% 20.93 33.20 53.40
1%├2.5% 21.42 35.10 54.78
2.5%├5% 20.71 34.90 54.58
5%├10% 21.03 33.75 53.92
10%├15% 23.51 37.68 55.67
15%├20% 23.34 36.45 56.55
20%├30% 22.67 36.89 57.11
≥ 30% 24.38 39.38 58.13
Gain (%) = [CV(SRSHOU_s.r. 10%)-CV(SRSHOU_s.r. N%)]x100/[CV(SRSHOU_s.r. 10%)]
Increasing s.r. from 10% to 15%: gain between 21 and 23 percent
Increasing s.r. from 10% to 20%: gain between 33 and 38 percent
Increasing s.r. from 10% to 33%: gain between 53 and 58 percent
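The gain formula can be checked against the table of median CVs, and compared with what simple random sampling without replacement would predict. The sketch below is only indicative: the function names are hypothetical, and the theoretical benchmark assumes that the CV is proportional to sqrt((1 - f) / f) for sampling ratio f, which ignores calibration and the fact that different sets of areas were used for each ratio.

```python
import math

def gain_percent(cv_base, cv_new):
    """Gain of efficiency: relative reduction of the CV, in percent."""
    return 100 * (cv_base - cv_new) / cv_base

def srs_expected_gain(f_base, f_new):
    """Gain expected under SRS without replacement, where the CV is
    proportional to sqrt((1 - f) / f) for a fixed population and p."""
    cv_factor_base = math.sqrt((1 - f_base) / f_base)
    cv_factor_new = math.sqrt((1 - f_new) / f_new)
    return 100 * (1 - cv_factor_new / cv_factor_base)

# Median CVs for the class 0.1%-0.25% of p, taken from the SRSHOU table.
median_cv = {0.10: 75.57, 0.15: 59.83, 0.20: 49.97, 0.33: 34.76}

for f in (0.15, 0.20, 0.33):
    observed = gain_percent(median_cv[0.10], median_cv[f])
    expected = srs_expected_gain(0.10, f)
    print(f"s.r. 10% -> {f:.0%}: observed {observed:.2f}, SRS benchmark {expected:.2f}")
```

The observed gains for this class reproduce the corresponding row of the gain table and sit slightly above the simple SRS benchmark.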
Distribution of median cv for five classes of p and three classes of area (according to population size). Comparison of 4 different sampling ratios with the SRSHOU design.
Block headings: classes of p. Rows: population by area (thousands).
Columns: median cv% at sampling ratio 10%, 15%, 20%, 33%
0.1%├0.25%
<10 90.00 71.27 59.48 40.20
10├12 76.23 60.03 50.45 34.41
≥ 12 66.65 53.04 43.51 30.58
0.5%├1%
< 10 43.11 33.50 28.97 19.53
10├12 35.08 27.46 22.99 16.48
≥ 12 31.25 24.95 20.97 14.85
2.5%├5%
< 10 19.12 14.68 12.22 8.25
10├12 15.58 12.36 9.89 7.08
≥ 12 14.00 10.98 9.06 6.35
10%├15%
< 10 8.78 6.44 5.22 3.67
10├12 7.00 5.46 4.41 3.13
≥ 12 6.29 4.79 3.89 2.83
20%├30%
< 10 5.46 4.16 3.44 2.27
10├12 4.57 3.42 2.94 2.01
≥ 12 4.05 3.20 2.59 1.77
Median CV for some classes of p and for three classes of area (according to population size). Comparison of 4 different sampling ratios (s.r.) with the SRSHOU design. Graph referred to area size less than 10,000 inhabitants.
[Figure: median cv% (vertical axis, 0 to 150) by classes of p (0.1-0.25%, 0.5-1%, 2.5-5%, 10-15%, 20-30%), one series for each sampling ratio (s.r.=10%, 15%, 20%, 33%). Area size < 10,000. The plotted values are those of the preceding table.]
Median CV for some classes of p and for three classes of area (according to population size). Comparison of 4 different sampling ratios (s.r.) with the SRSHOU design. Graph referred to area size between 10,000 and 12,000 inhabitants.
[Figure: median cv% (vertical axis, 0 to 150) by classes of p (0.1-0.25%, 0.5-1%, 2.5-5%, 10-15%, 20-30%), one series for each sampling ratio (s.r.=10%, 15%, 20%, 33%). Area size between 10,000 and 12,000.]
The gain of efficiency (in terms of CV) for census areas with size between 10,000 and 12,000 with respect to census areas with fewer than 10,000 inhabitants is about 14-20 percent. Similar results are obtained for all tested sampling ratios.
Median CV for some classes of p and for three classes of area (according to population size). Comparison of 4 different sampling ratios (s.r.) with the SRSHOU design. Graph referred to area size more than 12,000 inhabitants.
[Figure: median cv% (vertical axis, 0 to 150) by classes of p (0.1-0.25%, 0.5-1%, 2.5-5%, 10-15%, 20-30%), one series for each sampling ratio (s.r.=10%, 15%, 20%, 33%). Area size > 12,000.]
The gain of efficiency (in terms of CV) for census areas with size more than 12,000 with respect to census areas with fewer than 10,000 inhabitants is about 22-28 percent. As before, similar results are obtained for all tested sampling ratios.
Distribution of the estimates referred to areas larger than 12,000 inhabitants for classes of cv. Comparison of percentage frequencies for 4 different sampling ratios with the SRSHOU design.
Classes of coefficient of variation %   Sampling ratio: 10% 15% 20% 33%
< 2% 0.57 2.69 6.39 13.14
2%├5% 13.04 17.53 18.40 23.64
5%├10% 16.18 18.02 26.28 28.64
10%├20% 29.14 30.16 23.54 16.20
20%├50% 25.09 19.71 16.75 13.32
50%├100% 9.32 7.21 5.69 3.44
100%├200% 4.40 3.65 2.00 1.61
≥ 200% 2.25 1.03 0.95 -
HA - high accuracy (cv% below 10%)
MA - medium accuracy (cv% from 10% to 50%)
LA - low accuracy (cv% of 50% or more)
Distribution of the estimates referred to areas larger than 12,000 inhabitants for classes of cv. Comparison of percentage frequencies for 4 different sampling ratios with the SRSHOU design - 2
Classes of cv%   Sampling ratio: 10% 15% 20% 33%
< 10% 29.80 38.25 51.07 65.42
10%├50% 54.23 49.87 40.29 29.52
≥ 50% 15.97 11.89 8.64 5.05
[Figure: % of estimates (vertical axis, 0 to 70) by sampling ratio (10%, 15%, 20%, 33%), one series for each class: cv%<10%, cv% in 10-50%, cv%>50%.]
HA - high accuracy (cv%<10%)
MA - medium accuracy (cv% in 10-50%)
LA - low accuracy (cv%>50%)
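The three-class summary can be reproduced from the eight-class distribution. The Python sketch below assumes the mapping suggested by the chart legend (HA below 10%, MA from 10% to 50%, LA from 50% upwards) and treats the missing entry "-" for the class ≥ 200% at sampling ratio 33% as zero; the names are hypothetical.

```python
def accuracy_class(cv_percent):
    """Assumed mapping: HA if cv% < 10, MA if 10 <= cv% < 50, LA otherwise."""
    if cv_percent < 10:
        return "HA"
    if cv_percent < 50:
        return "MA"
    return "LA"

# Lower bounds of the eight cv% classes and their percentage frequencies,
# copied from the detailed table ("-" at s.r. 33% entered as 0.0).
LOWER_BOUNDS = [0, 2, 5, 10, 20, 50, 100, 200]
DETAILED = {
    "10%": [0.57, 13.04, 16.18, 29.14, 25.09, 9.32, 4.40, 2.25],
    "15%": [2.69, 17.53, 18.02, 30.16, 19.71, 7.21, 3.65, 1.03],
    "20%": [6.39, 18.40, 26.28, 23.54, 16.75, 5.69, 2.00, 0.95],
    "33%": [13.14, 23.64, 28.64, 16.20, 13.32, 3.44, 1.61, 0.0],
}

def summarise(frequencies):
    """Collapse the eight classes into HA / MA / LA shares."""
    totals = {"HA": 0.0, "MA": 0.0, "LA": 0.0}
    for lower, freq in zip(LOWER_BOUNDS, frequencies):
        totals[accuracy_class(lower)] += freq
    return {label: round(value, 2) for label, value in totals.items()}

summary = {ratio: summarise(freqs) for ratio, freqs in DETAILED.items()}
for ratio, shares in summary.items():
    print(ratio, shares)
```

The collapsed shares agree with the three-class table up to rounding in the second decimal.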
Estimates of p referred to territory given by aggregation of areas.
Generic sampled area a: estimate $\hat{p}_a$ with coefficient of variation $CV(\hat{p}_a)$
Territory $R_S$ given by aggregation of K sampled areas:
\[ CV(\hat{p}_{R_S}) \approx \frac{1}{\sqrt{K}}\, CV(\hat{p}_a) \]
Percentage expected reduction of CV in $R_S$:
\[ red\% = \left(1 - \frac{1}{\sqrt{K}}\right) \times 100 \]
Territory R given by aggregation of sampled areas and not sampled areas:
\[ CV(\hat{p}_R) \approx \frac{\gamma}{\sqrt{K}}\, CV(\hat{p}_a) \]
Percentage expected reduction of CV in R:
\[ red\% = \left(1 - \frac{\gamma}{\sqrt{K}}\right) \times 100 \]
where $\gamma = N_S / N$ is the share of the population of R eligible for drawing the long form (LF) sample.
Conclusions - 1
As expected, the most accurate estimates were obtained for:
- simple random sampling of households from administrative registers
- the largest sampling ratio
Better efficiency of estimates for the largest areas (>12,000 inhabitants):
- this result could be a suggestion for planning the sampling design by defining larger census areas (of about 15,000 people)
The estimates referred to large domains given by aggregation of areas show high accuracy:
- the accuracy increases with the number of areas in the domain
- when a part of the large domain is totally surveyed, the estimates show a further increase in accuracy
Conclusions - 2
However, area frame sampling is only slightly less efficient than SRSHOU, so it could be adopted where reliable administrative registers are not available.
The sampling ratio will be chosen considering the trade-off between:
- needed financial savings
- accuracy required at different territorial domains
Further analyses will be conducted on small area estimation techniques to produce more accurate estimates for:
- smallest territorial levels
- rare populations
Thank you for your attention and …
… have a good lunch!!!
Simulation study - 4
Cross-classification cells:
- educational level, employment status, commuting and gender
- 90 simple estimation cells
Calibration constraints defined by cross-classifying gender by age, and gender by marital status
Computational algorithm implemented in SAS code for each municipality and for each alternative sampling design:
step 1) selection of a sample (of households or enumeration areas)
step 2) computation of final weights
step 3) estimation of the relative frequency p for each target cell
step 4) iteration of steps 1), 2) and 3) for 1,000 sampling replications
step 5) computation of the sampling distribution mean and standard error for each one of the 90 frequency cells
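The five steps can be sketched in Python as follows. This is not the SAS implementation used in the study: the population is synthetic, individuals rather than households are sampled, a single target cell is estimated, calibration is done by raking (iterative proportional fitting) on the two margins gender by age and gender by marital status, and only 200 replications are run to keep the example fast. All names are hypothetical.

```python
import random
import statistics
from collections import Counter

def gender_age(unit):
    return unit[0], unit[1]

def gender_marital(unit):
    return unit[0], unit[2]

KEYS = (gender_age, gender_marital)  # the two calibration margins

def make_population(size, rng):
    """Synthetic units: (gender, age class, marital status, target flag)."""
    rate = {"young": 0.05, "adult": 0.30, "old": 0.10}
    people = []
    for _ in range(size):
        age = rng.choice(("young", "adult", "old"))
        people.append((rng.choice("MF"), age,
                       rng.choice(("single", "married")),
                       rng.random() < rate[age]))
    return people

def rake(sample, margins, base_weight, n_iter=10):
    """Step 2: adjust the design weights until weighted sample counts match
    the population counts of each margin."""
    weights = [base_weight] * len(sample)
    for _ in range(n_iter):
        for key, target in zip(KEYS, margins):
            totals = Counter()
            for unit, w in zip(sample, weights):
                totals[key(unit)] += w
            weights = [w * target[key(u)] / totals[key(u)]
                       for u, w in zip(sample, weights)]
    return weights

def simulate(population, ratio, replications, rng):
    margins = [Counter(key(u) for u in population) for key in KEYS]
    n = round(len(population) * ratio)
    estimates = []
    for _ in range(replications):                               # step 4
        sample = rng.sample(population, n)                      # step 1
        weights = rake(sample, margins, len(population) / n)    # step 2
        hits = sum(w for u, w in zip(sample, weights) if u[3])
        estimates.append(hits / sum(weights))                   # step 3
    mean = statistics.fmean(estimates)                          # step 5
    se = statistics.stdev(estimates)
    return mean, se, 100 * se / mean

rng = random.Random(2001)
population = make_population(2000, rng)
true_p = sum(u[3] for u in population) / len(population)
mean, se, cv = simulate(population, 0.20, 200, rng)
print(f"true p = {true_p:.4f}, mean = {mean:.4f}, se = {se:.4f}, cv% = {cv:.1f}")
```

In the study the same loop would run over 90 target cells, 1,000 replications and each alternative design.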
Evaluation criterion: the coefficient of variation
In order to compare the sampling strategies, the coefficient of variation (CV) has been used as the evaluation criterion:
\[ cv_x = \frac{\sigma(\hat{p}_x)}{E(\hat{p}_x)} \times 100 \]
which represents an accuracy measurement of the sampling estimates.
Consequently, the percentage maximum expected error implied (with a probability of 0.95) by the estimation method can be computed: Δ% ≈ 1.96 · CV
The distribution of the empirical CVs for all the 90 target cells was determined.
After classifying the target cells according to their value of p, the distribution of the CVs of the cells in the same p group was studied.
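As a small worked example of the relation Δ% ≈ 1.96 · CV, the Python sketch below takes the median CV of 4.82% reported for the class 5%├10% at sampling ratio 33% (SRSHOU). The estimate of 7.5% used for the interval is hypothetical, as are the function names.

```python
def max_expected_error(cv_percent, z=1.96):
    """Percentage maximum expected error: delta% = z * CV (z = 1.96 for probability 0.95)."""
    return z * cv_percent

def interval(p_hat, cv_percent):
    """Bounds p_hat * (1 -/+ delta% / 100), in the same unit as p_hat."""
    half_width = p_hat * max_expected_error(cv_percent) / 100
    return p_hat - half_width, p_hat + half_width

delta = max_expected_error(4.82)   # CV taken from the SRSHOU table, s.r. 33%
low, high = interval(7.5, 4.82)    # hypothetical estimate of 7.5%
print(f"delta% = {delta:.2f}, interval = [{low:.2f}%, {high:.2f}%]")
```

A CV just under 5% therefore corresponds to a maximum expected relative error of roughly 9.4%.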
Estimates of p referred to territory given by aggregation of areas. Case 1: aggregation of sampled areas.
Estimate referred to the generic sampled area a: $\hat{p}_a$ with coefficient of variation $CV(\hat{p}_a)$
Estimate referred to the territory $R_S$ given by aggregation of K sampled areas:
\[ \hat{p}_{R_S} = \sum_{a \in R_S} W_a \, \hat{p}_a \qquad \text{where } W_a = N_a / N_{R_S} \]
\[ CV(\hat{p}_{R_S}) \approx \frac{1}{\sqrt{K}}\, CV(\hat{p}_a) \]
Percentage expected reduction of CV:
\[ red\% = \left(1 - \frac{1}{\sqrt{K}}\right) \times 100 \]
[Figure: percentage expected reduction of CV (vertical axis, 40% to 100%) against the number of areas K (horizontal axis, 0 to 3000).]
for K>5 → red%>50%
for K>30 → red%>80%
for K>100 → red%>90%
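The thresholds quoted for K can be checked numerically. The Python sketch below assumes the relation red% = (1 - 1/sqrt(K)) × 100, which presumes areas of similar size with similar CVs; the function names are hypothetical.

```python
import math

def expected_cv_reduction(k):
    """Percentage expected reduction of CV when K sampled areas are aggregated."""
    return 100 * (1 - 1 / math.sqrt(k))

def areas_needed(reduction_percent):
    """Value of K at which the expected reduction equals the target;
    any larger K gives a larger reduction."""
    return (100 / (100 - reduction_percent)) ** 2

for k in (5, 30, 100, 1000):
    print(f"K = {k:>4}: red% = {expected_cv_reduction(k):.1f}")

for target in (50, 80, 90):
    print(f"red% = {target}% is reached at K = {areas_needed(target):.0f}")
```

The reduction equals 50%, 80% and 90% at K = 4, 25 and 100, so the thresholds K>5, K>30 and K>100 shown with the chart are consistent with this relation.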
Estimates of p referred to territory given by aggregation of areas. Case 2: aggregation of sampled and not sampled areas.
Territory $R_S$ referred to Sampled areas: long form to a sample of households.
Territory $R_{NS}$ of Not Sampled areas: long form to all the households.
\[ R = R_S \cup R_{NS} \]
\[ \gamma = N_S / N \]
Share of the population of R eligible for drawing the LF sample.
\[ \hat{p}_R = \gamma \, \hat{p}_S + (1 - \gamma) \, p_{NS} \]
\[ CV(\hat{p}_R) \approx \gamma \, \frac{1}{\sqrt{K}}\, CV(\hat{p}_a) \]
Percentage expected reduction of CV:
\[ red\% = \left(1 - \frac{\gamma}{\sqrt{K}}\right) \times 100 \]
[Figure: percentage expected reduction of CV (vertical axis, 80% to 100%) against the number of areas K (horizontal axis, 0 to 3000), one curve for each of γ=1, γ=0.7, γ=0.6, γ=0.5 (series LFc_100%, LFc_70%, LFc_60%, LFc_50%).]
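The effect of γ can be illustrated with a short Python sketch. It assumes the relations p̂_R = γ·p̂_S + (1 - γ)·p_NS and red% = (1 - γ/sqrt(K)) × 100; the function names and the numeric inputs are hypothetical.

```python
import math

def combined_estimate(p_sampled, p_not_sampled, gamma):
    """Estimate for R: weighted mean of the sample-based estimate for R_S and
    the fully enumerated value for R_NS, with gamma = N_S / N."""
    return gamma * p_sampled + (1 - gamma) * p_not_sampled

def expected_cv_reduction(k, gamma):
    """Percentage expected reduction of CV for K sampled areas and share gamma."""
    return 100 * (1 - gamma / math.sqrt(k))

# Hypothetical values: 10% estimated in sampled areas, 12% enumerated elsewhere.
print(round(combined_estimate(0.10, 0.12, 0.7), 4))

for gamma in (1.0, 0.7, 0.6, 0.5):
    print(f"gamma = {gamma}: red% at K=100 is {expected_cv_reduction(100, gamma):.1f}")
```

The smaller the share of the population that is subject to sampling, the larger the expected reduction, because the fully enumerated part contributes no sampling error.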