-
Selective Editing with Categorical Variables
Use of R in Official Statistics 2019
M.R. González-García, J. Poch, D. Salgado, T. Vázquez-Gutiérrez
Statistics Spain (INE)
INS, Bucharest, 20-21 May, 2019
-
Overview
1. Statement of the problem
2. The methodological approach
3. Application of random forests
4. Results
5. Implementation
6. Preliminary conclusions
Salgado et al (StatSpain (INE)) Selective Editing with Categorical Variables 20-21 May, 2019 2 / 13
-
Statement of the problem. I
National/European Health Survey:
• 3/2 types of questionnaires (Adult, Household, Underage).
• Around 450 + 50 + 250 questionnaire items.
• Estimators:
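$$\hat{Y}^{\text{target}}_{d} = \sum_{k \in s} \omega_{ks}(x) \cdot \delta^{\text{Domain}}_{k U_d} \cdot \delta^{\text{Target}}_{k}$$
• Focus on the domain variable $\delta^{\text{Domain}}_{k U_d}$ expressing OCCUPATION.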
• Traditional editing work:
systematic interactive editing of each questionnaire, taking into account other core sociodemographic variables such as age, gender, professional situation, economic activity, income interval...
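The estimator above is a weighted sum over the sample in which each unit contributes only if it belongs to the occupation domain and has the target characteristic. A minimal sketch, assuming the weights are already computed and using hypothetical field names (`weight`, `occupation`, `target`) that do not come from the survey files:

```python
def domain_total(sample, domain_code):
    """Weighted total of the target indicator within one occupation domain."""
    return sum(
        unit["weight"]
        * (1 if unit["occupation"] == domain_code else 0)  # domain indicator
        * (1 if unit["target"] else 0)                     # target indicator
        for unit in sample
    )

# Hypothetical toy sample: three units with their sampling weights.
sample = [
    {"weight": 120.0, "occupation": "A", "target": True},
    {"weight": 80.0, "occupation": "A", "target": False},
    {"weight": 200.0, "occupation": "B", "target": True},
]

total_a = domain_total(sample, "A")
total_b = domain_total(sample, "B")
```

If the occupation of the first unit were miscoded as "B", its weight of 120 would move from one domain total to the other, which is why errors in the domain variable matter for these estimates.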
-
Statement of the problem. II
-
The methodological approach
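• General local score, conditional on $Z^{\text{cross}}_k$:
$$s_k = \mathbb{E}\left[\,\omega_k \left|y_k - y^{(0)}_k\right| \,\middle|\, Z^{\text{cross}}_k\right]$$
• Continuous variables (arrow labelled $y_k = \hat{y}_k + \nu_k$, $y_k = y^{(0)}_k + \sigma_k$):
$$s_k = \sqrt{\frac{2}{\pi}} \cdot \omega_k \cdot \frac{M\left(-\tfrac{1}{2}, \tfrac{1}{2}, -\left(\frac{\sigma_k^2}{\sigma_k^2+\nu_k^2}\right)^{2} \frac{(y_k-\hat{y}_k)^2}{2\nu_k^2}\right)}{1 + \frac{1-p}{p}\sqrt{\frac{\nu_k^2+\sigma_k^2}{\nu_k^2}}}$$
• Limit $|y_k - \hat{y}_k| / \sigma_k \to 0$ (vertical arrow):
$$s_k = \omega_k \cdot \left|y_k - \hat{y}_k\right|$$
• Categorical variables (arrow labelled $P\left(y_k \neq y^{(0)}_k\right) = \sum_{m=1}^{M} w_m \cdot I_{R_m}\left(x_k; Z^{\text{cross}}_k\right)$):
$$s_k = \omega_k \cdot P\left(y_k \neq y^{(0)}_k\right)$$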
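For the categorical case, the error probability is written on this slide as a weighted sum of indicator functions of regions R_m, and the local score multiplies it by the sampling weight. A minimal sketch of that arithmetic, assuming the regions can be represented as predicates on the predictors; the regions, weights and predictor names below are hypothetical and chosen only for illustration:

```python
def error_probability(x, regions):
    """Sum the weights w_m of all regions R_m that contain the predictor vector x."""
    return sum(w for w, contains in regions if contains(x))

def local_score(sampling_weight, x, regions):
    """Categorical local score: sampling weight times estimated error probability."""
    return sampling_weight * error_probability(x, regions)

# Hypothetical regions (weight w_m, membership test for R_m).
regions = [
    (0.5, lambda x: x["proxy"] == 1),
    (0.25, lambda x: x["age"] >= 65),
    (0.25, lambda x: x["proxy"] == 0 and x["age"] < 65),
]

unit = {"proxy": 1, "age": 70}
p_error = error_probability(unit, regions)   # first two regions apply
score = local_score(100.0, unit, regions)
```

In a fitted random forest the regions and weights would come from the trees themselves; here they are hand-written so that the sum can be followed by eye.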
-
Application of Random Forests
• Questionnaire for adults.
• Data from year 2011: raw and (manually) edited.
• Sample size: 18486 units.
• Train data set (80%) and test data set (20%), drawn 50 times.
• Target variable: occupation code measurement error (1-, 2-, 3-digit codes).
• Predictors (raw values): Age, Sex, Stratum, Proxy, adultNACE, adultOccup, pensionNACE, currBusNACE, precBusNACE, pensionProfSit, lastProfSit, incomeInterval, studyDegree, samplingWeight.
• model1
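The 80%/20% train and test partition repeated 50 times can be sketched as follows. The slides do not say how the partitions were drawn, so simple random splitting with a fixed seed is an assumption here, and the function name `repeated_splits` is hypothetical:

```python
import random

def repeated_splits(n_units, n_repeats=50, train_fraction=0.8, seed=2019):
    """Return n_repeats random (train_indices, test_indices) partitions of range(n_units)."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * n_units))
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_units))
        rng.shuffle(indices)
        # First n_train shuffled units form the training set, the rest the test set.
        splits.append((indices[:n_train], indices[n_train:]))
    return splits

splits = repeated_splits(18486)
train, test = splits[0]
```

Each partition would then be used to fit one random forest on the training units and to evaluate it on the test units, which gives 50 values of every quality measure.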
-
Results. I
1 train + 1 test datasets
[Figures: ROC curves (sensitivity vs. specificity), "AUC for Classification with Random Forests, Spanish National Occupation Classification List", one panel each for 1, 2 and 3 digits, with and without sampling weight.]
Digits   AUC without sampling weight   AUC with sampling weight
1        0.80                          0.77
2        0.82                          0.77
3        0.82                          0.76
50 train + 50 test datasets
[Figure: "AUC for Classification with Random Forests, Spanish National Occupation Classification List"; x axis: Number of Occupation Code Digits (1, 2, 3); y axis: AUC (ticks 0.775 to 0.850); series: with and without sampling weights.]
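The AUC reported on these slides can be read as the probability that a randomly chosen erroneous questionnaire receives a higher score than a randomly chosen correct one. A minimal sketch of that pairwise definition, with ties counted as one half; the scores and labels are made-up toy values, not survey results:

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of error and non-error scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one error and one non-error case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # error case ranked above non-error case
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy example: 1 marks a questionnaire whose occupation code was in error.
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
toy_labels = [1, 0, 1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
```

The double loop is quadratic in the number of cases, which is acceptable for a test set of a few thousand units; a rank-based formula gives the same value faster.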
-
Results. II
1 train + 1 test datasets
[Figures: "Mean Decrease in Node Impurity, Spanish National Occupation Classification List", one panel each for 1, 2 and 3 digits; x axis: Mean Decrease in Gini Index (ticks up to 200, 200 and 300 respectively); y axis: Variables.]
50 train + 50 test datasets
[Figures: "Variable Importance in the Classification of Influential Errors in the Spanish National Occupation Classification List", one panel each for 1-, 2- and 3-digit codes; x axis: Variable Importance (Mean Decrease in Gini Index) (ticks up to 200, 260 and 300 respectively); y axis: Variable.]
Variables on the y axis, in the order listed for the first panel (the order changes slightly between panels): PROXY_0, F7_2_3, F7_2_1, F7_2_2, F9, SEXOa, F16m_2_3, F16m_2_2, F16a_2_1, F16m_2_1, F16a_2_3, F16a_2_2, D28, CNAE_AS_3, F18, CNAE_AS_1, CNAE_AS_2, CNO_AS_3, CNO_AS_2, A10_i, ESTRATO, CNO_AS_1, EDADa, FACTORADULTO.
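The importance measure on these slides is based on the decrease in Gini impurity. A minimal sketch of that quantity for a single binary split, assuming a two-class target (error / no error); random forest implementations typically accumulate such decreases over all splits made on a variable and average over trees, which is not reproduced here:

```python
def gini(labels):
    """Gini impurity of a node with 0/1 labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its two children."""
    n = len(parent)
    return (
        gini(parent)
        - (len(left) / n) * gini(left)
        - (len(right) / n) * gini(right)
    )

# Toy node with two error cases and two non-error cases.
parent = [1, 1, 0, 0]
perfect_split = gini_decrease(parent, [1, 1], [0, 0])   # children are pure
useless_split = gini_decrease(parent, [1, 0], [1, 0])   # children look like the parent
```

A variable that often produces splits like the first one accumulates a large mean decrease; one that only produces splits like the second accumulates almost none.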
-
Results. III
1 train + 1 test datasets
[Figures: "Classification of Cases in Error Detection (Classifier + Cutoff), Spanish National Occupation Classification List", one panel each for 1, 2 and 3 digits; x axis: Fraction of Edited Questionnaires (0 to 1); left y axis: Fraction of Total Cases; right y axis: Total Number of Cases (ticks 0 to 3000); legend "Variable": Error, NonError, True Positive, False Negative, True Negative, False Positive; legend "Weights": No, Yes.]
50 train + 50 test datasets
[Figure: "Classification of Influential Errors in the Spanish National Occupation Classification List"; panels: With Sampling Weights, Without Sampling Weights; x axis: Number of Classification Code Digits (1, 2, 3); y axis: Sample Percentage (0 to 1); legend "Variable": True Positive, False Negative, True Negative, False Positive.]
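The "classifier + cutoff" counts plotted here can be sketched as follows: units are ordered by decreasing score, the top fraction is sent to editing (flagged as errors), and the four cells of the confusion matrix are counted. This is a sketch under that assumed reading of the cutoff; the function name and toy data are hypothetical:

```python
def confusion_at_fraction(scores, is_error, fraction):
    """Count TP, FP, FN, TN when the top `fraction` of units by score is edited."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    n_edit = int(round(fraction * len(scores)))
    flagged = set(order[:n_edit])
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for k, err in enumerate(is_error):
        if k in flagged:
            counts["TP" if err else "FP"] += 1   # edited units
        else:
            counts["FN" if err else "TN"] += 1   # units left unedited
    return counts

toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
toy_errors = [True, False, True, False, True, False]
half_edited = confusion_at_fraction(toy_scores, toy_errors, 0.5)
```

Evaluating the function on a grid of fractions from 0 to 1 gives curves of the kind described in the figure placeholder above.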
-
Results. IV
1 train + 1 test datasets
[Figures: "Error Detection Quality Indicators (Classifier + Cutoff), Spanish National Occupation Classification List", one panel each for 1, 2 and 3 digits; x axis: Fraction of Sample of Edited Questionnaires (0 to 1); y axis ticks 0 to 1; legend "variable": Error, NonError, Accuracy, Precision, Recall, Specificity, F-measure; legend "Weights": No, Yes.]
50 train + 50 test datasets
[Figures: "Classification of Influential Errors in the Spanish National Occupation Classification List", one figure each for 1-, 2- and 3-digit codes; panel columns: Accuracy, Precision, Recall, Specificity, F-measure; panel rows: With Sampling Weights, Without Sampling Weights; x axis: Sample Percentage (0 to 1); y axis ticks 0 to 1; legend "quantile": 0%, 25%, 50%, 75%, 100%.]
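The five quality indicators named in these figures follow from the confusion-matrix counts by their usual definitions. A minimal sketch, with the convention (an assumption here) that a ratio with zero denominator is reported as 0.0:

```python
def quality_indicators(tp, fp, fn, tn):
    """Accuracy, precision, recall, specificity and F-measure from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0       # flagged units that are errors
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # errors that are flagged
    specificity = tn / (tn + fp) if (tn + fp) else 0.0     # non-errors left unflagged
    f_measure = (
        2.0 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
    }

# Toy counts: 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
toy_indicators = quality_indicators(tp=2, fp=1, fn=1, tn=2)
```

With unbalanced classes, accuracy and specificity can stay high while precision and recall remain low, which is one reason to look at all five together.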
-
Results. V
1 train + 1 test datasets
[Figure: "Editing Efficiency, Spanish National Occupation Classification List, 1 Digit"; one panel per code: *, 00, 000, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P; x axis: Fraction of Sample of Edited Questionnaires (0 to 1); y axis: Absolute Relative PseudoBias; legend "Weights": No, Yes.]
50 train + 50 test datasets
[Figure: "Classification of Influential Errors in the Spanish National Occupation Classification List, 1-Digit Codes"; panel columns: With Sampling Weights, Without Sampling Weights; panel rows: *, 00, 000, A to P; x axis: Sample Percentage (0 to 1); y axis: Absolute Relative PseudoBias; legend "quantile": 0%, 25%, 50%, 75%, 100%.]
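The slides do not define the absolute relative pseudo-bias. One plausible reading, assumed in this sketch, is the relative distance between a domain estimate computed after editing only the n highest-priority units and the same estimate computed from fully edited data. The estimate used below is a plain weighted count of domain membership, and all names and values are hypothetical:

```python
def weighted_domain_count(weights, codes, domain):
    """Sum of the sampling weights of the units coded in the given domain."""
    return sum(w for w, c in zip(weights, codes) if c == domain)

def pseudo_bias_curve(weights, raw, edited, scores, domain):
    """Absolute relative pseudo-bias after editing 0, 1, ..., n units in score order."""
    reference = weighted_domain_count(weights, edited, domain)
    if reference == 0:
        raise ValueError("reference estimate is zero; relative bias is undefined")
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    current = list(raw)
    curve = []
    for n_edited in range(len(order) + 1):
        if n_edited > 0:
            k = order[n_edited - 1]
            current[k] = edited[k]          # replace raw value by edited value
        estimate = weighted_domain_count(weights, current, domain)
        curve.append(abs(estimate - reference) / abs(reference))
    return curve

toy_weights = [100.0, 50.0, 50.0, 200.0]
toy_raw = ["A", "B", "A", "B"]
toy_edited = ["B", "B", "A", "A"]
toy_scores = [0.9, 0.1, 0.2, 0.8]
toy_curve = pseudo_bias_curve(toy_weights, toy_raw, toy_edited, toy_scores, "A")
```

In the toy data the curve first rises and then drops to zero once the two erroneous units are edited, which shows that the pseudo-bias need not decrease monotonically with the number of edited units.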
-
Implementation
Continuous variables
$$S_k = S\left(M^{(1)}, \ldots, M^{(Q)}\right)$$
[Class diagram with arrows labelled "takes" and "takes to update": UnitPrioritization, UnitPrioritizationParam, ErrorMoments, ErrorMomentParam, contObsPredModelParam, ObsErrSTDParam, ErrProbParam, PredParam, StQList.]
Observation-Prediction Model Parameters: $(\omega_k, \hat{p}_k, \hat{\sigma}_k, \hat{\nu}_k, \hat{y}_k)$
$$M_{kk} = \sqrt{\frac{2}{\pi}} \cdot \omega_k \cdot \hat{\nu}_k \cdot {}_1F_1\left(-\frac{1}{2}; \frac{1}{2}; -\frac{(y_k - \hat{y}_k)^2}{2\hat{\nu}_k^2}\right) \cdot \frac{1}{1 + \frac{1-\hat{p}_k}{\hat{p}_k}\left(\frac{\hat{\sigma}_k^2}{\hat{\sigma}_k^2+\hat{\nu}_k^2}\right)^{-1/2} \exp\left(-\frac{(y_k-\hat{y}_k)^2}{2\hat{\nu}_k^2}\right)}$$
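The continuous error moment on this slide combines a confluent hypergeometric function with a factor depending on the error probability. A minimal numerical sketch, assuming the formula reads as sqrt(2/pi) * w * nu * 1F1(-1/2; 1/2; -z) / (1 + (1-p)/p * (sigma^2/(sigma^2+nu^2))^(-1/2) * exp(-z)) with z = (y - y_hat)^2 / (2 nu^2), and using the closed form 1F1(-1/2; 1/2; -z) = exp(-z) + sqrt(pi z) erf(sqrt z) for z >= 0. The function names are hypothetical and are not those of the packages in the diagram:

```python
import math

def kummer_half(z):
    """1F1(-1/2; 1/2; -z) for z >= 0, using its closed form in terms of erf."""
    return math.exp(-z) + math.sqrt(math.pi * z) * math.erf(math.sqrt(z))

def continuous_error_moment(weight, y, y_hat, p, sigma, nu):
    """Error moment M_kk for a continuous variable, under the reading stated above."""
    z = (y - y_hat) ** 2 / (2.0 * nu ** 2)
    variance_ratio = (sigma ** 2 / (sigma ** 2 + nu ** 2)) ** -0.5
    # Factor 1 / (1 + (1-p)/p * ratio * exp(-z)) from the transcribed formula.
    probability_factor = 1.0 / (1.0 + (1.0 - p) / p * variance_ratio * math.exp(-z))
    return (
        math.sqrt(2.0 / math.pi) * weight * nu * kummer_half(z) * probability_factor
    )

# Toy parameters: observed value equal to its prediction, then far from it.
moment_at_prediction = continuous_error_moment(100.0, 10.0, 10.0, 0.5, 1.0, 1.0)
moment_far_away = continuous_error_moment(100.0, 15.0, 10.0, 0.5, 1.0, 1.0)
```

Under this reading the moment grows with the distance between the observed and the predicted value, so units whose values deviate most from their predictions receive higher priority.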
-
Implementation
Categorical variables
$$S_k = S\left(M^{(1)}, \ldots, M^{(Q)}\right)$$
[Class diagram with arrows labelled "takes" and "takes to update": UnitPrioritization, UnitPrioritizationParam, ErrorMoments, ErrorMomentParam, categObsPredModelParam, ErrProbParam–RF.]
Observation-Prediction Model Parameters: $(\omega_k, \hat{p}_k)$
$$M_{kk} = \omega_k \cdot \hat{p}_k$$
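For categorical variables the error moment is the sampling weight times the estimated error probability, and the unit score combines the moments of the Q edited items. A minimal sketch of the resulting prioritization; the slides leave the combining function S unspecified, so the default `sum` is an assumption, and the function and field names are hypothetical rather than those of the classes in the diagram:

```python
def item_moments(weight, error_probs):
    """One moment per edited item: sampling weight times estimated error probability."""
    return [weight * p for p in error_probs]

def prioritize(units, combine=sum):
    """Return unit ids ordered by decreasing global score S(M^(1), ..., M^(Q))."""
    scored = [
        (combine(item_moments(u["weight"], u["error_probs"])), u["id"])
        for u in units
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [unit_id for _, unit_id in scored]

# Toy units with a single edited item (Q = 1), e.g. the occupation code.
units = [
    {"id": "u1", "weight": 100.0, "error_probs": [0.10]},
    {"id": "u2", "weight": 20.0, "error_probs": [0.90]},
    {"id": "u3", "weight": 300.0, "error_probs": [0.05]},
]
ranking = prioritize(units)
```

In the toy data the unit with the largest sampling weight is not edited first: its low error probability gives it a smaller moment than the low-weight unit that is very likely to be in error.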
-
Preliminary Conclusions
• Same underlying methodological approach as with continuous variables.
• Effective prioritization of units with influential errors.
• Details to be worked out for multivariate editing in a fully fledged E&I strategy.
• Unbalanced learning: sub/over-sampling, cost-sensitive learning, ensemble techniques...
• More general threshold selection schemes.
• Sampling weights to be introduced in the generation of the random forest?
• If random forests, why not SVMs, boosting, neural networks...?
• Also for continuous variables?