The Nature of the Bias When Studying Only Linkable Person Records:
Evidence from the American Community Survey
Adela Luque (U.S. Census Bureau)
Brittany Bond (U.S. Department of Commerce)
J. David Brown & Amy O’Hara (U.S. Census Bureau)
FedCASIC March 2014
1
2
Disclaimer
Any opinions and conclusions expressed herein are those of the authors and do not necessarily reflect the views of the U.S. Census Bureau
All results have been reviewed to ensure that no confidential information on individual persons is disclosed
Overview
Motivation Objectives Data & Methodology Background on Anonymous Identifier
Assignment Process Expected Effects Results Conclusions
3
Motivation
Record linkage can enrich data, improve its quality & lead to research not otherwise possible - while reducing respondent burden & operational costs
Linking data requires common identifiers unique to each record that protect confidentiality
Census Bureau assigns Protected Identification Keys (PIKs) via a probabilistic matching algorithm: PVS (Personal Identification Validation System)
Not possible to reliably assign a PIK to every record, which may introduce bias in data analysis
4
Objectives What characteristics are associated with the probability
of receiving a PIK? That is, what is the nature of the bias introduced by incomplete PIK assignment?
Help researchers understand nature of bias, interpret results more accurately, adjust/reweight linked analytical dataset
Examine bias using regression analysis - before & after changes in PVS. Do alterations to PVS improve PIK assignment rates as well as reduce bias? NORC (2011) described some demographic and socio-economic
characteristics of those records not getting a PIK
5
Data & Methodology 2009 & 2010 American Community Survey (ACS) – processed through
PVS Ongoing representative survey of the U.S. population Socioeconomic, demographic & housing characteristics 50 states & DC - Annual sample approximately 4.5 million person records
Probit model for 2009 and 2010 separately Dependent variable = 1 if person record received a PIK (0 otherwise) Covariates:
Demographic characteristics: age, sex, race and Hispanic origin Socio-economic characteristics: employment status, income, poverty status, marital status, level of
education, public program participation, health insurance status, citizenship status, English proficiency, military status, mobility status, and household type
Housing and address-related characteristics: urban vs. rural, type of living quarter, age of living quarter
ACS replicate weights Report marginal effects
2009 & 2010 results compared – before & after changes to PVS
6
Background on PVS
Probabilistic match of data from an incoming file (e.g., survey) to reference file containing data from the Social Security Administration enhanced with address data obtained from federal administrative records
If a match is found, person record receives a PIK or is “validated”
7
Background on PVS
Initial edit to clean & standardize linking fields (name, dob, sex & address)
Incoming data processed through cascading modules (or matching algorithms)
Only records failing a given module move on to the next Impossible to compare all records in incoming file to all
records in reference file → “blocking” Data split into blocks/groups based on exact matches of
certain fields or part of fields – probabilistic matching within block
8
Background on PVS
2009 PVS Modules Verification – Only for incoming files w/ SSNs Geosearch looks for name/dob/gender matches after
blocking on an address or address part (within 3-digit ZIP area)
Namesearch looks for name & dob matches within a block based on parts of name/dob
Each module has several ‘passes’ – different blocking & matching strategies
9
Background on PVS 2010 PVS Enhancements
ZIP3 Adjacency Module looks for name/dob/gender/address matches after blocking on address field parts in areas adjacent to 3-digit ZIP area
DOB Search Module looks for name/gender/dob matches after blocking on month & day of birth
Household Composition Search Module looks for name/dob matches for unmatched records that are seen in past at same address with PIKed record
Inclusion of ITINs in reference file
10
Expected EffectsLess likely to obtain a PIK: Insufficient or inaccurate person identifying info in
incoming record Issues w/ data collection or withholding due to language
barriers, trust in govt., privacy preferences Identifying info in incoming file & reference file more likely to
differ Address info differs/not updated
Movers, rent vs. own, certain types of housing Record not in government reference files
Newborns, recent immigrant, very poor/unemployed/no govt. program recipient
11
Results – Overall Validation Rates
2009 20100.0
10.0
20.0
30.0
40.0
50.0
60.0
70.0
80.0
90.0
100.0
88.192.6
PVS Validation Rate (weighted)
12
Sources: 2009 & 2010 ACS
Probit Results
Hispanic Non-Hispanic (base)
-0.040
-0.035
-0.030
-0.025
-0.020
-0.015
-0.010
-0.005
0.000
Marginal Effect of Hispanic Origin on PVS Validation
20092010
13
Sources: 2009 & 2010 ACS
Probit Results
14
White alone (base)
Black alone AIAN alone Asian alone NHPI alone Some Other Race
-0.035
-0.030
-0.025
-0.020
-0.015
-0.010
-0.005
0.000
0.005
0.010
Marginal Effect of Race on PVS Validation
2009 2010
Note: Dotted bars indicate that change in marginal effect from 2009 to 2010 is not statistically significant.
Probit Results
Non-U.S. Citizen Foreign-Born U.S. Citizen U.S. Citizen (base)
-0.140
-0.120
-0.100
-0.080
-0.060
-0.040
-0.020
0.000
0.020
Marginal Effect of Citizenship Status on PVS Validation
20092010
15
Probit Results
Poor Spoken English Not Poorly Spoken English (base)
-0.050
-0.045
-0.040
-0.035
-0.030
-0.025
-0.020
-0.015
-0.010
-0.005
0.000
Marginal Effect of Home Spoken English Quality on PVS Validation
20092010
16
Note: Dotted bars indicate that change in marginal effect from 2009 to 2010 is not statistically significant.
Probit Results
0-2 3-5 6-9 10-14 15-18 19-24 25-34 (base)
35-44 45-54 55-64 65-74 75 and Older
-0.050
-0.040
-0.030
-0.020
-0.010
0.000
0.010
0.020
0.030
0.040
0.050
Marginal Effect of Age on PVS Validation
2009 2010
17
Note: Dotted bars indicate that change in marginal effect from 2009 to 2010 is not statistically significant.
Probit Results
Non-Mover in 12 Months Before IM (base)
Mover From Abroad in 12 Months before IM
Domestic Mover in 12 Months before IM Moving status, missing
-0.180
-0.160
-0.140
-0.120
-0.100
-0.080
-0.060
-0.040
-0.020
0.000
0.020
Marginal Effect of Mobility Status on PVS Validation
2009 2010
18
Probit Results
Rented Housing Unit Own Home (base)
-0.040
-0.035
-0.030
-0.025
-0.020
-0.015
-0.010
-0.005
0.000
Marginal Effect of Rent vs. Own on PVS Validation
20092010
19
Probit Results
Mobile
Home (b
ase)
Lives i
n Gro
up Quarte
rs
Detached O
ne-Family
House
Attached O
ne-Family
House
Building w
ith 2 Apartm
ents
Building w
ith 3-4 Apartm
ents
Building w
ith 5-9 Apartm
ents
Building w
ith 10-19 Apartm
ents
Building w
ith 20-49 Apartm
ents
Building w
ith 50+ Apartm
ents
Other (Boat/R
V/Van, e
tc.)
-0.040
-0.030
-0.020
-0.010
0.000
0.010
0.020
0.030
0.040
0.050
Marginal Effect of Type of Living Quarter on PVS Validation
2009 2010
20
Note: Dotted bars indicate that change in marginal effect from 2009 to 2010 is not statistically significant.
Probit Results
Non-Family Household (base) Living with Family0.0000
0.0050
0.0100
0.0150
0.0200
0.0250
0.0300
Marginal Effect of Family Household Status on PVS Validation
20092010
21
Probit Results
Rural (base) Urban Area
-0.004
-0.002
0.000
0.002
0.004
0.006
0.008
0.010
Marginal Effect of Rural vs Urban Area on PVS Validation
20092010
22
Probit Results
Private
Employment
Government E
mployment
Self-Employed
Family
Employment
Not Employed - i
ncludes m
issing (b
ase)-0.005
0.0000.0050.0100.0150.0200.0250.0300.035
Marginal Effect of Employment Status on PVS Validation
2009 2010
23
Note: Dotted bars indicate that change in marginal effect from 2009 to 2010 is not statistically significant.
Probit Results
Social S
ecurit
y Recipient
Private
Health
Insu
rance
Public H
ealth In
sura
nce
Both Priv
ate and Public
Health
Insu
rance
Uninsure
d (base
)0.000
0.010
0.020
0.030
0.040
0.050
0.060
Marginal Effect of Health Insurance Status on PVS Validation
20092010
24
Probit Results
Public Assistance Recipient Not Public Assistance Recipient (base)0.0000
0.0020
0.0040
0.0060
0.0080
0.0100
0.0120
0.0140
Marginal Effect of Receiving Public Assistance on PVS Validation
20092010
25
Note: Dotted bars indicate that change in marginal effect from 2009 to 2010 is not statistically significant.
Probit Results
26
Not In Poverty (base) In Poverty
-0.020
-0.018
-0.016
-0.014
-0.012
-0.010
-0.008
-0.006
-0.004
-0.002
0.000
Marginal Effect of Poverty Status on PVS Validation
20092010
Conclusions
Mobile persons, those with lower income, unemployed, in process of integrating in economy/society, non-participants in government programs are less likely to be validated Renters, movers, mobile homes Low income, non-employed, most minorities, non-U.S.
citizens, poor English Non-participants of govt. program, uninsured, non-military
Researchers may wish to reweight observations based on validation propensity
27
Conclusions
Changes to PVS system Increased overall validation rate by 4.5 percentage
points Reduced validation differences across most groups
from 2009 to 2010
Record linkage research can lead to higher PIK assignment rates and less bias
28
Top Related