The Application of the Concept of Uniqueness for Creating Public Use Microdata Files

Jay J. Kim, U.S. National Center for Health Statistics

Dong M. Jeong, Korea National Statistical Office

Contents

Introduction
Intruders and Disclosure
Measures of Disclosure Risk
  1. Narrow Definition of Disclosure Risk
  2. Broader Definition of Disclosure Risk
Evaluation of Definition of Disclosure Risk
Concluding Remarks

1. Introduction

Government agencies release microdata files from their survey data or administrative records data.

Large amounts of information on individuals are available to many organizations and data users, who can become “intruders”.

If a public use microdata file (PUMF) is released, intruders can try to match their records with the ones from the PUMF and gain access to new information.

To link the records on the two files, intruders use variables common to the PUMF and their own files, which are called “key variables” or “matching variables”.

In the U.S., laws such as Title 13 stipulate protection of the confidentiality of many types of data.

Thus, the data disseminating agencies must protect the confidentiality of the individuals on the PUMFs. On the other hand, they should not ignore the data users’ needs, i.e., the utility of the data files.

Here, we develop probability models quantifying disclosure risk for a microdata file.

This is a modification of the Marsh et al. (1991) procedure.

The model can use population and sample “uniques” only, or it can also include population twins or triplets.

We will show the results of applying the probability model, using population and sample uniques only, for creating disclosure-limited microdata files using the 2005 Korean demographic census data.

2. Intruders and Disclosure

Potential intruders:

i). Organizational intruders, e.g., credit card companies, mortgage departments of banks, insurance companies, credit bureaus, trade associations, etc.

ii). Individual intruders: with readily available high-powered computers, anyone can assemble his own database using information in the public domain and become an intruder.

Two types of disclosure:

i). Identity disclosure – identification. If the intruder is a journalist trying to embarrass the data disseminating agencies, his claim that he has been successful in identifying someone on their PUMF would be sufficient.

If the intruder publicizes the findings in the news media, it could have a devastating effect on the agencies’ data collection efforts.

ii). Attribute disclosure: after identification is made, one can gain new sensitive information.

For defining a measure of disclosure risk, we will treat identity disclosure as equivalent to disclosure.

3. Measures of Disclosure Risk

Define

P(a) = the probability of key variables being recorded identically in both the PUMF and the intruder’s file;

P(b|a) = the probability that an individual appears in the PUMF, which is the same as the sampling fraction for that individual in the PUMF;

P(c|a,b) = the probability of being a population unique;

and

P(d|a,b,c) = the probability of verifying the population unique.

Marsh et al. (1991) defined the probability of correct identification of an individual as

P(a) P(b|a) P(c|a,b) P(d|a,b,c)

We modify the model of Marsh et al. In their formula, we assume that

i). There are no recording or classification errors for the values of the key variables, i.e., P(a) = 1.

ii). We can verify population uniqueness with certainty, i.e., P(d|a,b,c) = 1.
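As a minimal sketch of how the product above behaves under the two assumptions, the following Python function multiplies the four probabilities. The function name and the input values (a 2 percent sampling fraction and a 0.5 percent chance of population uniqueness) are illustrative assumptions, not figures from the study.

```python
def marsh_identification_probability(p_a, p_b_given_a, p_c_given_ab, p_d_given_abc):
    """Return the product P(a) P(b|a) P(c|a,b) P(d|a,b,c)."""
    for p in (p_a, p_b_given_a, p_c_given_ab, p_d_given_abc):
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
    return p_a * p_b_given_a * p_c_given_ab * p_d_given_abc


# Assumptions i) and ii): P(a) = 1 and P(d|a,b,c) = 1.
# The sampling fraction and uniqueness probability are hypothetical values.
risk = marsh_identification_probability(
    p_a=1.0,
    p_b_given_a=0.02,
    p_c_given_ab=0.005,
    p_d_given_abc=1.0,
)
print(f"probability of correct identification: {risk:.6f}")
```

With P(a) and P(d|a,b,c) fixed at 1, the product depends only on the sampling fraction and the probability of population uniqueness.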

Disclosure can occur when all of the following 5 conditions are met:

i). An individual is unique in the population based on the key variables. If the intruder’s file is a 100 percent population file, he can establish the uniqueness of a certain individual by using his file.

ii). The individual is on the PUMF.

iii). The individual is on the intruder’s file. An intruder can have information on key variables for a specific person and try to examine whether that person appears in the PUMF. In this case, the intruder’s file has a single record.

iv). The individual is unique on the PUMF, AND

v). The individual is unique on the intruder’s file.
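The uniqueness conditions can be illustrated with a small sketch that counts how many records share each combination of key-variable values. The records, field names, and helper functions below are hypothetical and serve only to show why uniqueness in a file (conditions iv and v) does not by itself imply uniqueness in the population (condition i).

```python
from collections import Counter


def class_sizes(records, key_vars):
    """Count how many records share each combination of key-variable values."""
    return Counter(tuple(rec[k] for k in key_vars) for rec in records)


def unique_records(records, key_vars):
    """Return the records whose key-variable combination occurs exactly once."""
    sizes = class_sizes(records, key_vars)
    return [rec for rec in records if sizes[tuple(rec[k] for k in key_vars)] == 1]


# Hypothetical toy population; ids 1 and 2 share all key-variable values.
population = [
    {"id": 1, "gender": "F", "age": 34, "marital": "married"},
    {"id": 2, "gender": "F", "age": 34, "marital": "married"},
    {"id": 3, "gender": "M", "age": 71, "marital": "widowed"},
    {"id": 4, "gender": "M", "age": 34, "marital": "single"},
]
keys = ["gender", "age", "marital"]

population_uniques = unique_records(population, keys)

# A hypothetical sample that happens to omit person 1.
sample = population[1:]
sample_uniques = unique_records(sample, keys)

print([rec["id"] for rec in population_uniques])
print([rec["id"] for rec in sample_uniques])
```

In this toy data, person 2 is unique in the sample but not in the population, so a match on that record could point to either person 1 or person 2.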

Define

$A$ = an individual of interest;

$F_1$ = PUMF;

$F_2$ = an intruder’s file;

$P_1$ = unique class in the population;

$S_{1F_1}$ = unique class in PUMF;

and

$S_{1F_2}$ = unique class in intruder’s file.

3.1 A Narrow Definition of Disclosure Risk

This definition depends on the population and sample uniques only.

3.1.1 Assume an Intruder Does a Phishing (Fishing) Expedition

The probability of correct identification:

$$P(A \in F_1 \cap A \in F_2 \cap A \in S_{1F_1} \cap A \in S_{1F_2} \cap A \in P_1) \qquad (1)$$

If an individual is a population unique, it would also be a sample unique, i.e.,

$$P(A \in S_{1F_1} \cap A \in S_{1F_2} \mid A \in P_1) = \frac{P(A \in S_{1F_1} \cap A \in S_{1F_2} \cap A \in P_1)}{P(A \in P_1)} = \frac{P(A \in P_1)}{P(A \in P_1)} = 1$$

Equation (1) reduces to

$$P(A \in F_1 \cap A \in F_2 \cap A \in P_1)$$

which can be further re-expressed as follows:

$$P(A \in F_1 \cap A \in F_2 \mid A \in P_1)\, P(A \in P_1) \qquad (2)$$

The event that A is unique in the population is independent of whether A is selected in the sample or not. Thus, equation (2) reduces to

$$P(A \in F_1 \cap A \in F_2)\, P(A \in P_1) \qquad (3)$$

The event that A is in the PUMF is usually independent of the event that A is in the intruder’s file. In this case, equation (3) can be simplified as

$$P(A \in F_1)\, P(A \in F_2)\, P(A \in P_1) \qquad (4)$$

However, a survey can be a subset of another survey. For example, the U.S. Census Bureau’s PUMF is a subset of its census sample. Thus, if $F_1$ is a subset of $F_2$,

$$P(A \in F_1 \cap A \in F_2) = P(A \in F_1)$$

and equation (3) becomes

$$P(A \in F_1)\, P(A \in P_1) \qquad (5)$$

Also,

$$P(A \in F_1 \cap A \in F_2) = P(A \in F_2)\, P(A \in F_1 \mid A \in F_2),$$

so the disclosure risk can be written as

$$P(A \in P_1)\, P(A \in F_2) \times (\text{subsampling rate of } F_1 \text{ from } F_2) \qquad (6)$$
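The difference between the independent-files case, equation (4), and the nested-files case, equations (5) and (6), can be sketched in a few lines of Python. The function names and all numeric inputs are hypothetical; the same sampling fractions are reused in both cases only to make the comparison visible.

```python
def risk_independent_files(p_in_pumf, p_in_intruder_file, p_pop_unique):
    """Equation (4): the PUMF and the intruder's file are independent samples."""
    return p_in_pumf * p_in_intruder_file * p_pop_unique


def risk_nested_files(p_in_intruder_file, subsampling_rate, p_pop_unique):
    """Equation (6): the PUMF is a subsample of the intruder's file."""
    return p_pop_unique * p_in_intruder_file * subsampling_rate


# Hypothetical inputs, not taken from the study.
P_POP_UNIQUE = 0.01        # P(A in P_1)
P_IN_INTRUDER_FILE = 0.1   # P(A in F_2)
SUBSAMPLING_RATE = 0.2     # P(A in F_1 | A in F_2)
P_IN_PUMF = P_IN_INTRUDER_FILE * SUBSAMPLING_RATE  # P(A in F_1) when F_1 is nested in F_2

independent = risk_independent_files(P_IN_PUMF, P_IN_INTRUDER_FILE, P_POP_UNIQUE)
nested = risk_nested_files(P_IN_INTRUDER_FILE, SUBSAMPLING_RATE, P_POP_UNIQUE)
nested_eq5 = P_IN_PUMF * P_POP_UNIQUE  # equation (5) gives the same value as (6)

print(f"independent files (4): {independent:.6f}")
print(f"nested files (6):      {nested:.6f}")
print(f"nested files (5):      {nested_eq5:.6f}")
```

Under these inputs the nested case yields a risk larger by a factor of 1 / P(A in F_2), because membership in the PUMF already implies membership in the intruder's file.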

3.1.2 Assuming an Intruder Already Knows That A Is in the PUMF

If the intruder has response knowledge, then

$$P(A \in F_1) = 1$$

Thus, from equation (4), the disclosure risk will be

$$P(A \in F_2)\, P(A \in P_1)$$

3.2 Broader Definition of Disclosure Risk

Even if an individual is not unique in the population, he can still be identified with additional information.

Suppose C individuals in the population have the same values of the key variables and matching to any one of them is equally likely.

Define

$P_C$ = equivalence class of size C in the population.

Then the probability of correct identification is

$$P(A \in F_1 \cap A \in F_2 \cap A \in S_{1F_1} \cap A \in S_{1F_2} \cap A \in P_C) \times \frac{1}{C}$$
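The 1/C factor can be illustrated with a short sketch that computes, for each record in a toy population, the size C of its equivalence class and the corresponding chance of picking the right person when all C candidates are equally likely. The data and helper name are hypothetical, and the sketch covers only the 1/C factor, not the joint probability of the sampling and uniqueness events.

```python
from collections import Counter


def class_size_per_record(records, key_vars):
    """Return, for each record, the size C of its population equivalence class."""
    sizes = Counter(tuple(rec[k] for k in key_vars) for rec in records)
    return [sizes[tuple(rec[k] for k in key_vars)] for rec in records]


# Hypothetical population with two pairs of "twins" and two uniques.
population = [
    {"gender": "F", "age": 30},
    {"gender": "F", "age": 30},
    {"gender": "M", "age": 40},
    {"gender": "M", "age": 50},
    {"gender": "M", "age": 50},
    {"gender": "F", "age": 60},
]

sizes = class_size_per_record(population, ["gender", "age"])
match_prob = [1 / c for c in sizes]     # chance of a correct match within the class
persons_by_class_size = Counter(sizes)  # number of persons in classes of each size C

print(sizes)
print(match_prob)
print(dict(persons_by_class_size))
```

Population uniques (C = 1) keep a match probability of 1, while twins (C = 2) halve it, which is why the broader definition adds risk on top of the uniques-only definition rather than replacing it.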

4. Evaluation of Disclosure Risk

We used the measures of disclosure risk developed here in creating PUMFs from the 2005 Korean census data.

We show the results of the applications on the 2005 census data from Choongchung (CC) Province.

The masking scheme used is to coarsen (group) categories.

The Korea National Statistical Office (KNSO) creates the 2 percent PUMFs by taking a 20 percent subsample of the 10 percent census sample (0.1 x 0.2 = 0.02).

$F_1$: 2 percent PUMF.

$F_2$: 10 percent census sample.

Table 1. Population Size, and Number of Households and Housing Units – CC Province

| | Population | Households | Housing Units |
|---|---|---|---|
| Census | 1,798,397 | 660,526 | 586,757 |
| Census Sample (10%) | 189,505 | 71,091 | 65,398 |
| 2% Microdata | 38,027 | 14,218 | 13,038 |

Key variables used (number of categories in parentheses): gender (2); age (111); marital status (4); relationship to householder (14); household type (5); tenure (6); building type of residence (12); and type of housing and number of floors of the building (12).

The probability of a population unique is calculated using the 100 percent census file.

Without grouping, the number of uniques is 9,664, which is 0.54% of the population of 1.8 million.

If we assume that the intruder has a 10 percent census sample file, the disclosure risk is

$$0.1 \times 0.2 \times 0.0054 \approx 0.00011$$

However, whole blocks are selected in the 10 percent census sample; thus residents in the sample blocks know that their neighbors are also in the sample. To those who have response knowledge, the disclosure risk is

$$0.2 \times 0.0054 \approx 0.0011$$
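The arithmetic behind these two figures can be checked with a few lines of Python. The population size and the number of uniques are the values reported above; the variable names are illustrative, and the uniqueness probability is rounded to four decimals as in the slides.

```python
N_POPULATION = 1_798_397   # census population of CC Province (Table 1)
N_UNIQUES = 9_664          # population uniques before grouping

CENSUS_SAMPLE_RATE = 0.1   # 10 percent census sample
SUBSAMPLING_RATE = 0.2     # 20 percent subsample of the census sample

p_unique = round(N_UNIQUES / N_POPULATION, 4)  # about 0.0054, i.e., 0.54 percent

# Intruder holds the 10 percent census sample file.
risk_sample_file = CENSUS_SAMPLE_RATE * SUBSAMPLING_RATE * p_unique

# Intruder has response knowledge of the census sample.
risk_response_knowledge = SUBSAMPLING_RATE * p_unique

print(f"P(unique)               = {p_unique:.4f}")
print(f"risk, sample file       = {risk_sample_file:.6f}")
print(f"risk, response knowledge = {risk_response_knowledge:.5f}")
```

The unrounded products are 0.000108 and 0.00108, which the slides report as 0.00011 and 0.0011.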

Table 2. Number of Unique Persons before Grouping Categories

| # of Vars | Gender | Age | Relationship | Marital Status | # of Uniques |
|---|---|---|---|---|---|
| 1 | x | | | | 0 |
| 1 | | x | | | 2 |
| 1 | | | x | | 0 |
| 1 | | | | x | 0 |
| 2 | x | x | | | 5 |
| 2 | x | | x | | 0 |
| 2 | x | | | x | 0 |
| 2 | | x | x | | 65 |
| 2 | | x | | x | 11 |
| 2 | | | x | x | 0 |
| 3 | x | x | x | | 167 |
| 3 | x | x | | x | 30 |
| 3 | x | | x | x | 2 |
| 3 | | x | x | x | 349 |
| 4 | x | x | x | x | 713 |

Table 3. Number of Uniques with 5 Year Intervals for Age

| # of Vars | Gender | Grouped Age | Relationship | Marital Status | # of Uniques (before → after) |
|---|---|---|---|---|---|
| 1 | | x | | | 2 → 0 |
| 2 | x | x | | | 5 → 2 |
| 2 | | x | x | | 65 → 6 |
| 2 | | x | | x | 11 → 1 |
| 3 | x | x | x | | 167 → 18 |
| 3 | x | x | | x | 30 → 3 |
| 3 | | x | x | x | 349 → 53 |
| 4 | x | x | x | x | 713 → 106 |
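The effect of coarsening age into 5-year intervals can be sketched as follows. The records and helper names are hypothetical toy data, not census records; the sketch only shows the mechanism by which grouping merges equivalence classes and reduces the count of uniques.

```python
from collections import Counter


def count_uniques(records, key_vars):
    """Count records whose key-variable combination occurs exactly once."""
    sizes = Counter(tuple(rec[k] for k in key_vars) for rec in records)
    return sum(1 for n in sizes.values() if n == 1)


def group_age(records, width=5):
    """Replace single-year age by the lower bound of its interval of the given width."""
    return [{**rec, "age": (rec["age"] // width) * width} for rec in records]


# Hypothetical toy records.
people = [
    {"gender": "F", "age": 31},
    {"gender": "F", "age": 33},
    {"gender": "M", "age": 70},
    {"gender": "M", "age": 72},
    {"gender": "M", "age": 90},
]
keys = ["gender", "age"]

grouped = group_age(people, width=5)

print(count_uniques(people, keys), "->", count_uniques(grouped, keys))
```

In the toy data every person is unique on single-year age, while after grouping only the 90-year-old remains unique, which mirrors the kind of reduction the tables report.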

Table 4. Number of Uniques with Grouped Age and Relationship Categories

| # of Vars | Gender | Grouped Age | Grouped Relationship | Marital Status | # of Uniques (before → after) |
|---|---|---|---|---|---|
| 2 | | x | x | | 6 → 2 |
| 3 | x | x | x | | 18 → 4 |
| 3 | | x | x | x | 53 → 3 |
| 4 | x | x | x | x | 106 → 8 |

Table 5. Number of Uniques with Grouped Age, Relationship and Marital Status Categories

| # of Vars | Gender | Grouped Age | Grouped Relationship | Grouped Marital Status | # of Uniques (before → after) |
|---|---|---|---|---|---|
| 3 | x | x | | x | 3 → 1 |
| 3 | | x | x | x | 3 → 3 |
| 4 | x | x | x | x | 8 → 4 |

Table 6. Two Different Groupings in the Number of Categories

| | Relationship | Building Type | Type of Housing and # of Floors | # of Uniques |
|---|---|---|---|---|
| Grouping 1 | 9 (14) | 6 (12) | 6 (12) | 501 |
| Grouping 2 | 3 (14) | 4 (12) | 4 (12) | 495 |

Numbers in parentheses are the original numbers of categories.

Probability of unique = 0.028% for both groupings.

If we assume the intruder has the 10 percent census sample file, the disclosure risk is 0.0000056, i.e., less than 1 in 100,000.

If we assume response knowledge, the disclosure risk goes up to 0.000028.

5. Concluding Remarks

We developed comprehensive probability models quantifying disclosure risk for microdata files and applied them to the Korean census data.

Using the models, we measured the disclosure risks for the original census data. The risks were too high.

We grouped categories of the key variables and re-calculated the disclosure risks. The risks were lowered to a satisfactory level.

For creating their official 2 percent PUMFs from the census data, KNSO used the approaches mentioned here, including the measures of disclosure risk and grouping of categories.

Thank you very much!

Jay J. Kim, Dong M. Jeong

[email protected]@cdc.gov [email protected]@nso.go.kr

Disclaimer: This paper represents the views of the authors and should not be interpreted as representing the views, policies or practices of the Centers for Disease Control and Prevention, National Center for Health Statistics.
