The Application of the Concept of Uniqueness for Creating Public Use Microdata Files
Jay J. Kim, U.S. National Center for Health Statistics
Dong M. Jeong, Korea National Statistical Office
Contents
Introduction
Intruders and Disclosure
Measures of Disclosure Risk
   1. Narrow Definition of Disclosure Risk
   2. Broader Definition of Disclosure Risk
Evaluation of Definition of Disclosure Risk
Concluding Remarks
1. Introduction
Government agencies release microdata files from their survey data or administrative records data.
Large amounts of information on individuals are available to many organizations and data users, who can become “intruders”.
If a public use microdata file (PUMF) is released, intruders can try to match their records with the ones from the PUMF and gain access to new information.
To link the records on the two files, intruders use the variables common to the PUMF and their own files, which are called “key variables” or “matching variables”.
In the U.S., laws such as Title 13 stipulate protection of the confidentiality of many types of data.
Thus, the data disseminating agencies must protect the confidentiality of the individuals on the PUMFs. On the other hand, they should not ignore the data users’ needs, i.e., the utility of the data files.
Here, we develop probability models quantifying disclosure risk for a microdata file.
This is a modification of the Marsh, et al. (1991) procedure.
The model can use population and sample “uniques” only, or it can also include population twins or triplets.
We will show the results of applying the probability model, using population and sample uniques only, for creating disclosure-limited microdata files using the 2005 Korean demographic census data.
2. Intruders and Disclosure
Potential intruders:
i). Organizational intruders, e.g., credit card companies, mortgage departments of banks, insurance companies, credit bureaus, trade associations, etc.
ii). Individual intruders: with readily available high-powered computers, anyone can assemble his own database using information in the public domain and become an intruder.
Two types of disclosure:
i). Identity disclosure: identification. If the intruder is a journalist who tries to embarrass the data disseminating agencies, his claim that he has been successful in identifying someone on their PUMF would be sufficient. If the intruder publicizes the findings in the news media, it could have a devastating effect on the agencies’ data collection efforts.
ii). Attribute disclosure: after identification is made, one can gain new sensitive information.
For defining a measure of disclosure risk, we will consider that identity disclosure is the same as disclosure.
3. Measures of Disclosure Risk
Define
$P(a)$ = the probability of key variables being recorded identically in both the PUMF and the intruder’s file;
$P(b \mid a)$ = the probability that an individual appears in a PUMF, which is the same as the sampling fraction for that individual in the PUMF;
$P(c \mid a,b)$ = the probability that the individual is a population unique;
and
$P(d \mid a,b,c)$ = the probability of verifying that the individual is a population unique.
Marsh, et al. (1991) defined the probability of correct identification of an individual as
$$P(a)\, P(b \mid a)\, P(c \mid a,b)\, P(d \mid a,b,c)$$
We modify the Marsh, et al. model.
We assume in the Marsh, et al. formula that
i). There are no recording or classification errors for the values of the key variables, i.e., $P(a) = 1$.
ii). We can correctly verify population uniqueness with certainty, i.e., $P(d \mid a,b,c) = 1$.
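As a rough illustration of the product form above, the following Python sketch multiplies the four probabilities and shows what the two assumptions do to it. The function name and all numeric inputs are hypothetical, chosen only for illustration; they are not taken from Marsh, et al. (1991) or from the census application.

```python
def identification_probability(p_a, p_b_given_a, p_c_given_ab, p_d_given_abc):
    """Product form of the probability of correct identification.

    Each argument is one factor of P(a) P(b|a) P(c|a,b) P(d|a,b,c).
    """
    factors = (p_a, p_b_given_a, p_c_given_ab, p_d_given_abc)
    for p in factors:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    result = 1.0
    for p in factors:
        result *= p
    return result


# Hypothetical inputs (not from the source): sampling fraction 0.02,
# probability of a population unique 0.005.
general = identification_probability(0.9, 0.02, 0.005, 0.8)

# With assumptions i) and ii), P(a) = 1 and P(d|a,b,c) = 1, so only the
# sampling fraction and the probability of a population unique remain.
simplified = identification_probability(1.0, 0.02, 0.005, 1.0)

print(general, simplified)
```

Setting the first and last factors to 1 can only raise the product, so under these inputs the simplified form is the more conservative (higher) risk figure.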
Disclosure can occur when all of the following five conditions are met:
i). An individual is unique in the population based on the key variables. If the intruder’s file is a 100 percent population file, he can establish the uniqueness of a certain individual by using his file.
ii). The individual is on the PUMF.
iii). The individual is on the intruder’s file. An intruder can have information on key variables for a specific person and try to examine whether that person appears in the PUMF. In this case, the intruder’s file has a single record.
iv). The individual is unique on the PUMF, AND
v). The individual is unique on the intruder’s file.
Define
$A$ = an individual of interest;
$F_1$ = PUMF;
$F_2$ = an intruder’s file;
$P_1$ = unique class in the population;
$S_{1F_1}$ = unique class in PUMF;
and
$S_{1F_2}$ = unique class in intruder’s file.
3.1 A Narrow Definition of Disclosure Risk
This definition depends on the population and sample uniques only.
3.1.1 Assume an Intruder Does a Phishing (Fishing) Expedition
The probability of correct identification:
$$P(A \in F_1 \cap A \in F_2 \cap A \in S_{1F_1} \cap A \in S_{1F_2} \cap A \in P_1) \quad (1)$$
If an individual is a population unique, it would also be a sample unique, i.e.,
$$P(A \in S_{1F_1} \cap A \in S_{1F_2} \mid A \in P_1) = \frac{P(A \in S_{1F_1} \cap A \in S_{1F_2} \cap A \in P_1)}{P(A \in P_1)} = \frac{P(A \in P_1)}{P(A \in P_1)} = 1$$
Equation (1) reduces to
$$P(A \in F_1 \cap A \in F_2 \cap A \in P_1),$$
which can be further re-expressed as follows:
$$P(A \in F_1 \cap A \in F_2 \mid A \in P_1)\, P(A \in P_1) \quad (2)$$
The event that A is unique in the population is independent of whether A is selected in a sample or not. Thus, equation (2) reduces to
$$P(A \in F_1 \cap A \in F_2)\, P(A \in P_1) \quad (3)$$
The event that A is in the PUMF is usually independent of the event that A is in the intruder’s file. In this case, equation (3) can be simplified as
$$P(A \in F_1)\, P(A \in F_2)\, P(A \in P_1) \quad (4)$$
However, a survey can be a subset of another survey. For example, the U.S. Census Bureau’s PUMF is a subset of their census sample. Thus, if $F_1$ is a subset of $F_2$,
$$P(A \in F_1 \cap A \in F_2) = P(A \in F_1)$$
and equation (3) becomes
$$P(A \in F_1)\, P(A \in P_1) \quad (5)$$
Also,
$$P(A \in F_1 \cap A \in F_2) = P(A \in F_2)\, P(A \in F_1 \mid A \in F_2),$$
so equation (3) can also be written as
$$P(A \in P_1)\, P(A \in F_2) \times (\text{subsampling rate of } F_1 \text{ from } F_2) \quad (6)$$
3.1.2 Assuming an Intruder Already Knows That A is in PUMF
If the intruder has response knowledge, then
$$P(A \in F_1) = 1.$$
Thus, from equation (4), the disclosure risk will be
$$P(A \in F_2)\, P(A \in P_1).$$
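The three cases of the narrow definition (independent files as in equation (4), nested files as in equation (6), and response knowledge) can be sketched in Python as follows. This is only a sketch under stated assumptions: the function names and the illustrative inputs are hypothetical and do not come from the slides.

```python
def risk_independent_files(p_in_pumf, p_in_intruder_file, p_pop_unique):
    """Equation (4): the PUMF and the intruder's file are independent."""
    return p_in_pumf * p_in_intruder_file * p_pop_unique


def risk_nested_files(p_in_intruder_file, subsampling_rate, p_pop_unique):
    """Equation (6): the PUMF is a subsample of the intruder's file.

    subsampling_rate is the rate at which the PUMF is drawn from the
    intruder's file.
    """
    return p_pop_unique * p_in_intruder_file * subsampling_rate


def risk_response_knowledge(p_in_intruder_file, p_pop_unique):
    """Section 3.1.2: the intruder knows A is in the PUMF, so the PUMF
    factor in equation (4) is set to 1."""
    return risk_independent_files(1.0, p_in_intruder_file, p_pop_unique)


# Hypothetical inputs, for illustration only.
p_unique = 0.01
print(risk_independent_files(0.02, 0.1, p_unique))
print(risk_nested_files(0.1, 0.2, p_unique))
print(risk_response_knowledge(0.1, p_unique))
```

Keeping the three cases as separate functions makes explicit which sampling probabilities enter each formula, which is where the cases differ.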
3.2 Broader Definition of Disclosure Risk
Even if an individual is not unique in the population, he can still be identified with additional information.
Suppose $C$ individuals in the population have the same values of the key variables and matching to any one of them is equally likely.
Define
$P_C$ = equivalence class of size $C$ in the population.
Then the probability of correct identification is
$$\frac{1}{C}\, P(A \in F_1 \cap A \in F_2 \cap A \in S_{1F_1} \cap A \in S_{1F_2} \cap A \in P_C)$$
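To make the 1/C factor concrete, the sketch below groups a toy population into equivalence classes on its key variables and reports, for a given record, the chance of picking the right person among the C people who share its values. The records and the function name are hypothetical; only the "equally likely among C" assumption comes from the text above.

```python
from collections import Counter

# Toy population keyed on (gender, age, marital status).
# Hypothetical records, not census data.
population = [
    ("F", 34, "married"),
    ("F", 34, "married"),
    ("F", 34, "married"),
    ("M", 71, "widowed"),
    ("M", 29, "single"),
    ("M", 29, "single"),
]

# Size C of each equivalence class: how many people share the same key values.
class_sizes = Counter(population)


def match_probability(record):
    """Chance of matching the right person among the C people sharing `record`,
    assuming each of them is equally likely to be chosen."""
    return 1.0 / class_sizes[record]


for record, size in class_sizes.items():
    print(record, "C =", size, "match probability =", match_probability(record))
```

A population unique is the special case C = 1, where the factor is 1 and the expression coincides with the narrow definition; twins and triplets correspond to C = 2 and C = 3.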
4. Evaluation of Disclosure Risk
We used the measures of disclosure risk developed here in creating PUMFs from the 2005 Korean census data.
We show the results of the applications on the 2005 census data from Choongchung (CC) Province.
The masking scheme used is to coarsen (group) categories.
Korea National Statistical Office (KNSO) creates the 2 percent PUMFs by taking a 20 percent subsample of the 10 percent census sample (0.1 x 0.2 = 0.02).
$F_1$: 2 percent PUMF.
$F_2$: 10 percent census sample.
Table 1. Population Size, and Number of Households and Housing Units – CC Province
                    | Population | Households | Housing Units
Census              | 1,798,397  | 660,526    | 586,757
Census Sample (10%) | 189,505    | 71,091     | 65,398
2% Microdata        | 38,027     | 14,218     | 13,038
Key variables used (number of categories in parentheses): gender (2); age (111); marital status (4); relationship to householder (14); household type (5); tenure (6); building type of residence (12); and type of housing and number of floors of the building (12).
The probability of a population unique is calculated using the 100 percent census file.
Without grouping, the number of uniques is 9,664. It is 0.54% of 1.8 million.
If we assume that the intruder has a 10 percent census sample file, the disclosure risk is
$$0.1 \times 0.2 \times 0.0054 \approx 0.00011.$$
However, whole blocks are selected in the 10 percent census sample, so residents in the sample blocks know that their neighbors are also in the sample. To those who have response knowledge, the disclosure risk is
$$0.2 \times 0.0054 \approx 0.0011.$$
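The arithmetic behind these two figures can be retraced with a few lines of Python. The counts and rates are the ones quoted above and in Table 1; the variable names are ours. The script uses the unrounded proportion of uniques rather than 0.0054, so the products differ slightly from the slide values before rounding.

```python
census_population = 1_798_397  # persons in the census file, CC Province
population_uniques = 9_664     # uniques on the ungrouped key variables
census_sample_rate = 0.1       # 10 percent census sample (intruder's file)
subsampling_rate = 0.2         # 20 percent subsample that forms the 2 percent PUMF

# Probability that an individual is a population unique.
p_unique = population_uniques / census_population

# Intruder holds the 10 percent census sample; the PUMF is a subsample of it.
risk_fishing = census_sample_rate * subsampling_rate * p_unique

# Intruder already knows the individual is in the 10 percent census sample.
risk_response_knowledge = subsampling_rate * p_unique

print(f"P(unique)               = {p_unique:.4f}")
print(f"risk, fishing           = {risk_fishing:.5f}")
print(f"risk, response knowledge = {risk_response_knowledge:.4f}")
```

Response knowledge removes the factor 0.1, so the risk is ten times larger than in the fishing case.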
Table 2. Number of Unique Persons before Grouping Categories
# of Vars | Gender | Age | Relationship | Marital Status | # of Uniques
1         | x      |     |              |                | 0
1         |        | x   |              |                | 2
1         |        |     | x            |                | 0
1         |        |     |              | x              | 0
2         | x      | x   |              |                | 5
2         | x      |     | x            |                | 0
2         | x      |     |              | x              | 0
2         |        | x   | x            |                | 65
2         |        | x   |              | x              | 11
2         |        |     | x            | x              | 0
3         | x      | x   | x            |                | 167
3         | x      | x   |              | x              | 30
3         | x      |     | x            | x              | 2
3         |        | x   | x            | x              | 349
4         | x      | x   | x            | x              | 713
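Counts of the kind shown in Table 2 can be obtained by tallying, for every subset of the key variables, how many records have a combination of values that occurs exactly once. The Python sketch below does this on a handful of hypothetical records; the field names and data are illustrative assumptions and have nothing to do with the census file.

```python
from collections import Counter
from itertools import combinations

KEY_VARIABLES = ("gender", "age", "relationship", "marital_status")

# Hypothetical toy records, not census data.
records = [
    {"gender": "F", "age": 34, "relationship": "spouse", "marital_status": "married"},
    {"gender": "M", "age": 36, "relationship": "head", "marital_status": "married"},
    {"gender": "F", "age": 34, "relationship": "head", "marital_status": "single"},
    {"gender": "M", "age": 8, "relationship": "child", "marital_status": "single"},
    {"gender": "M", "age": 36, "relationship": "head", "marital_status": "married"},
]


def count_uniques(rows, variables):
    """Number of rows whose values on `variables` occur exactly once."""
    counts = Counter(tuple(row[v] for v in variables) for row in rows)
    return sum(1 for n in counts.values() if n == 1)


# One entry per non-empty subset of the key variables (15 subsets for 4 variables).
uniques_by_subset = {
    subset: count_uniques(records, subset)
    for size in range(1, len(KEY_VARIABLES) + 1)
    for subset in combinations(KEY_VARIABLES, size)
}

for subset, n_unique in uniques_by_subset.items():
    print(len(subset), ", ".join(subset), n_unique)
```

On real data the same tally would be run against the 100 percent file, since uniqueness in the population is what the risk measure needs.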
Table 3. Number of Uniques with 5-Year Intervals for Age
# of Vars | Gender | Grouped Age | Relationship | Marital Status | # of Uniques
1         |        | x           |              |                | 2 → 0
2         | x      | x           |              |                | 5 → 2
2         |        | x           | x            |                | 65 → 6
2         |        | x           |              | x              | 11 → 1
3         | x      | x           | x            |                | 167 → 18
3         | x      | x           |              | x              | 30 → 3
3         |        | x           | x            | x              | 349 → 53
4         | x      | x           | x            | x              | 713 → 106
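The effect of coarsening age into 5-year intervals can be illustrated with a small Python sketch that counts uniques before and after grouping. The band boundaries (0-4, 5-9, ...) and the toy records are assumptions made for illustration; the slides do not state which interval boundaries were used.

```python
from collections import Counter


def count_uniques(rows):
    """Number of rows whose combination of values occurs exactly once."""
    counts = Counter(rows)
    return sum(1 for n in counts.values() if n == 1)


def five_year_band(age):
    """Coarsen a single year of age into a 5-year interval label.

    Assumed boundaries: 0-4, 5-9, 10-14, ...
    """
    low = (age // 5) * 5
    return f"{low}-{low + 4}"


# Hypothetical (gender, age) records, not census data.
people = [("F", 31), ("F", 33), ("M", 35), ("M", 37), ("M", 38), ("F", 62)]

before = count_uniques(people)
after = count_uniques([(gender, five_year_band(age)) for gender, age in people])

print(f"uniques: {before} -> {after}")
```

Grouping merges neighboring categories, so records that were unique on single years of age can fall into a shared class and stop being unique.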
Table 4. Number of Uniques with Grouped Age and Relationship Categories
# of Vars | Gender | Grouped Age | Grouped Relationship | Marital Status | # of Uniques
2         |        | x           | x                    |                | 6 → 2
3         | x      | x           | x                    |                | 18 → 4
3         |        | x           | x                    | x              | 53 → 3
4         | x      | x           | x                    | x              | 106 → 8
Table 5. Number of Uniques with Grouped Age, Relationship and Marital Status Categories
# of Vars | Gender | Grouped Age | Grouped Relationship | Grouped Marital Status | # of Uniques
3         | x      | x           |                      | x                      | 3 → 1
3         |        | x           | x                    | x                      | 3 → 3
4         | x      | x           | x                    | x                      | 8 → 4
Table 6. Two different groupings in the number of categories (original number of categories in parentheses)
           | Relationship | Building Type | Type of Housing and # of Floors | # of Uniques
Grouping 1 | 9 (14)       | 6 (12)        | 6 (12)                          | 501
Grouping 2 | 3 (14)       | 4 (12)        | 4 (12)                          | 495
Probability of unique = 0.028% for both groupings.
If we assume the intruder has the 10 percent census sample file, the disclosure risk is 0.0000056 < 1 in 100,000.
If we assume response knowledge, the disclosure risk goes up to 0.000028.
5. Concluding Remarks
We developed comprehensive probability models quantifying disclosure risk for microdata files and applied them to the Korean census data.
Using the models, we measured the disclosure risks for the original census data. The risks were too high.
We grouped categories of the key variables and re-calculated the disclosure risks. The risks were lowered to a satisfactory level.
For creating their official 2 percent PUMFs from the census data, KNSO used the approaches mentioned here, including the measures of disclosure risks and grouping categories.
Thank you very much!
Jay J. Kim     Dong M. Jeong
[email protected]@cdc.gov [email protected]@nso.go.kr
Disclaimer: This paper represents the views of the authors and should not be interpreted as representing the views, policies or practices of the Centers for Disease Control and Prevention, National Center for Health Statistics.