1

WP 10

On Risk Definitions and a Neighbourhood Regression Model for Sample Disclosure Risk Estimation

Natalie Shlomo, Hebrew University and Southampton University

Yosi Rinott, Hebrew University

2

Disclosure Risk Measures

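Notation:

Sample (size n): $\mathbf{f} = \{f_k : k = 1, \ldots, K\}$

Population (size N): $\mathbf{F} = \{F_k : k = 1, \ldots, K\}$

Tables with K cells: m-way table, $k = (k_1, \ldots, k_m)$

Risk Measures:

$\tau_1 = \sum_k I(f_k = 1, F_k = 1)$

$\tau_2 = \sum_k I(f_k = 1) \frac{1}{F_k}$ = expected number of correct matches of sample uniques

Estimates:

$\hat\tau_1 = \sum_k I(f_k = 1) \hat P(F_k = 1 \mid f_k = 1)$

$\hat\tau_2 = \sum_k I(f_k = 1) \hat E[1/F_k \mid f_k = 1]$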
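As a small illustration, the Python sketch below computes the two risk measures directly from cell counts: $\tau_1$ counts the sample uniques that are also population uniques, and $\tau_2$ sums $1/F_k$ over the sample uniques. The function name `risk_measures` and the toy counts are hypothetical and chosen only for this example. In practice the $F_k$ are unknown, which is why estimates are needed.

```python
def risk_measures(sample_counts, population_counts):
    """Return (tau1, tau2) from sample counts f_k and population counts F_k."""
    tau1 = 0
    tau2 = 0.0
    for f_k, F_k in zip(sample_counts, population_counts):
        if f_k == 1:                  # only sample uniques contribute
            tau1 += int(F_k == 1)     # sample unique that is a population unique
            tau2 += 1.0 / F_k         # chance that a match to this cell is correct
    return tau1, tau2

# Toy example (hypothetical counts): three sample uniques, one population unique.
f = [1, 0, 1, 2, 1]
F = [1, 3, 4, 10, 2]
tau1, tau2 = risk_measures(f, F)
print(tau1, tau2)
```

Here the three sample uniques sit in population cells of size 1, 4 and 2, so only the first one counts toward $\tau_1$.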

3

On Definitions of Disclosure Risk

• In the statistics literature, we present examples of risk measures, such as $\tau_1$ and $\tau_2$, but we lack formal definitions of when a file is safe

• In the computer science literature, there is a formal definition of disclosure risk (e.g., Dinur, Dwork, Nissim (2004-5), Adam and Wortmann (1989), who write "it may be argued that elimination of disclosure is possible only by elimination of statistics")

In some of the CS literature, any data must be released with noise.

The noise must be small enough that legitimate information on large subsets of the data remains useful, and large enough that information on small subsets or individuals is too noisy to be useful (regardless of whether it is obtained by direct queries, differencing, etc.)

4

On Definitions of Disclosure Risk

The worst-case scenario of the CS approach (for example, that the intruder has all information on everyone in the data set except the individual being snooped on) simplifies definitions: there is no need to consider other, more realistic but more complicated scenarios.

But would Statistics Bureaus and statisticians agree to adding noise to any data?

Other approaches like query restriction or query auditing do not lead to formal definitions.

5

Definition of Disclosure Risk

Numerical database $D = \{d_i : i = 1, \ldots, n\}$, $d_i \in \{0, 1\}$

A query is a sum over a subset of the $d_i$. The query is perturbed by adding noise of some magnitude.

It has been proven that almost all $d_i$ can be reconstructed if the noise is $o(\sqrt{n})$, and none of them can be reconstructed if the noise is of order $\sqrt{n}$.

Adding noise of order $\sqrt{n}$ hides information on individuals and small groups, but yields meaningful information about sums of O(n) units, for which noise of order $\sqrt{n}$ is natural.

This work was further expanded to lessen the magnitude of the noise by limiting the number of queries.
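To make the scale argument concrete, the Python sketch below answers subset-sum queries on a binary database with additive noise of standard deviation $\sqrt{n}$. The Gaussian noise, the database size and the name `noisy_query` are assumptions made for this illustration and are not taken from the cited work. It only compares the noise level with the size of a large sum and of a single record.

```python
import math
import random

random.seed(0)                        # fixed seed so the sketch is repeatable
n = 10_000
d = [random.randint(0, 1) for _ in range(n)]    # hypothetical binary database

def noisy_query(db, subset, noise_sd):
    """Subset-sum query answered with additive Gaussian noise (an assumption)."""
    return sum(db[i] for i in subset) + random.gauss(0.0, noise_sd)

noise_sd = math.sqrt(n)               # noise of order sqrt(n), here 100

# Query over all n units: the true sum is about n/2, so the noise is minor.
true_total = sum(d)
noisy_total = noisy_query(d, range(n), noise_sd)
rel_error_total = abs(noisy_total - true_total) / true_total

# Query on one individual: the true answer is 0 or 1, far below the noise level.
signal_to_noise_single = 1.0 / noise_sd
signal_to_noise_total = (n / 2) / noise_sd

print(rel_error_total, signal_to_noise_single, signal_to_noise_total)
```

With these settings a sum over all units is about 50 noise standard deviations in size, while a single record is one hundredth of a standard deviation.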

6

Definition of Disclosure Risk

Collaboration between the CS and statistical communities, where:

1. In the statistical community, there is a need for more formal and clear definitions of disclosure risk

2. In the CS community, there is a need for statistical methods to preserve the utility of the data

- allow sufficient statistics to be released without perturbation

- methods for adding correlated noise

- sub-sampling and other methods for data masking

Can the formal notions from CS and the practical approach of statisticians lead to a compromise that will allow us to set practical but well-defined standards for disclosure risk?

7

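Probabilistic Models

Focus on sample microdata and not the whole population (sampling provides a priori protection against disclosure)

Standard (natural) Assumptions:

$F_k \mid \gamma_k \sim \mathrm{Poisson}(N\gamma_k)$, $\sum_k \gamma_k = 1$

$f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$ (independent Bernoulli or Poisson sampling)

$f_k \mid \gamma_k \sim \mathrm{Poisson}(N\gamma_k\pi_k)$

$F_k \mid f_k, \gamma_k \sim f_k + \mathrm{Poisson}(N\gamma_k(1-\pi_k))$

In particular,

$F_k \mid f_k = 1, \gamma_k \sim 1 + \mathrm{Poisson}(N\gamma_k(1-\pi_k))$,

the size-biased Poisson distribution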
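Under the Poisson model, a sample unique has $F_k - 1 \sim \mathrm{Poisson}(\mu_k)$ with $\mu_k = N\gamma_k(1-\pi_k)$, which gives $P(F_k = 1 \mid f_k = 1) = e^{-\mu_k}$ and $E[1/F_k \mid f_k = 1] = (1 - e^{-\mu_k})/\mu_k$. The Python sketch below evaluates these two per-cell terms. The symbol $\mu_k$, the function name `poisson_unique_risk` and the parameter values are introduced here for illustration only.

```python
import math

def poisson_unique_risk(N, gamma_k, pi_k):
    """Per-cell risk terms for a sample unique when F_k - 1 ~ Poisson(mu)."""
    mu = N * gamma_k * (1.0 - pi_k)
    p_pop_unique = math.exp(-mu)              # P(F_k = 1 | f_k = 1)
    # E[1/F_k | f_k = 1] = (1 - exp(-mu)) / mu, with limit 1 as mu -> 0
    exp_inv_F = 1.0 if mu == 0 else -math.expm1(-mu) / mu
    return p_pop_unique, exp_inv_F

# Hypothetical cell: N = 1000, gamma_k = 0.002, pi_k = 0.5, so mu = 1.
p1, e1 = poisson_unique_risk(N=1000, gamma_k=0.002, pi_k=0.5)
print(p1, e1)
```

Summing the first term over all sample uniques gives an estimate of the first risk measure, and summing the second term gives an estimate of the second, once $\gamma_k$ has been estimated.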

8

Probabilistic Models

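Add $\gamma_k \sim \mathrm{Gamma}(\alpha, \beta)$ iid, with $\alpha\beta = 1/K$ so that $E\sum_k \gamma_k = 1$

$F_k \mid f_k \sim f_k + \mathrm{NB}\left(\alpha + f_k, \frac{N\pi_k + 1/\beta}{N + 1/\beta}\right)$

As $\alpha \to 0$ ($\beta \to \infty$) we obtain the mu-argus assumption

$F_k \mid f_k \sim f_k + \mathrm{NB}(f_k, \pi_k)$

As $\alpha \to \infty$ ($\alpha\beta^2 \to 0$) we obtain the above Poisson Model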
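The two limits can be checked numerically. If $F_k - f_k$ is negative binomial with shape $\alpha + f_k$ and success probability $p_k = (N\pi_k + 1/\beta)/(N + 1/\beta)$, counting failures before the required number of successes, then for a sample unique $P(F_k = 1 \mid f_k = 1) = p_k^{1+\alpha}$. The Python sketch below is a rough check under that parameterization, with $\alpha\beta = 1/K$. The function name `nb_unique_prob` and the numbers are hypothetical.

```python
import math

def nb_unique_prob(N, K, pi_k, alpha):
    """P(F_k = 1 | f_k = 1) when F_k - 1 ~ NB(alpha + 1, p_k), alpha*beta = 1/K."""
    inv_beta = K * alpha                       # 1/beta, from alpha * beta = 1/K
    p_k = (N * pi_k + inv_beta) / (N + inv_beta)
    return p_k ** (1.0 + alpha)                # NB probability of zero failures

N, K, pi_k = 1000, 500, 0.1                    # hypothetical values

near_argus = nb_unique_prob(N, K, pi_k, alpha=1e-9)    # alpha -> 0
near_poisson = nb_unique_prob(N, K, pi_k, alpha=1e7)   # alpha -> infinity

argus_limit = pi_k                                      # NB(1, pi_k) at zero
poisson_limit = math.exp(-N * (1.0 / K) * (1.0 - pi_k)) # gamma_k fixed at 1/K

print(near_argus, argus_limit)
print(near_poisson, poisson_limit)
```

In the second limit the Gamma distribution concentrates at its mean $1/K$, so the Poisson model appears with every $\gamma_k$ equal to $1/K$.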

9

Mu-Argus Model (Benedetti, Capobianchi, Franconi (1998))

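$\hat\pi_k = f_k / \hat F_k$, where $\hat F_k = \sum_{i \in \text{sample cell } k} w_i$

$w_i$ is the sampling weight of individual i obtained from design or post-stratification

If $f_k = 0$ then $\hat F_k = 0$, but $\sum_k \hat F_k = \sum_i w_i = N$, so the $\hat\pi_k$ for cells with $f_k > 0$ are underestimated and hence risk is underestimated

Monotonicity: if we replace $\hat\pi_k$ by some larger value, risk estimates increase toward the correct level, but how should that value be estimated?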
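A minimal Python sketch of the weight-based estimates follows, assuming the negative binomial form $F_k \mid f_k \sim f_k + \mathrm{NB}(f_k, \pi_k)$. For a sample unique, $F_k - 1$ is then geometric, so $P(F_k = 1 \mid f_k = 1) = \pi_k$ and $E[1/F_k \mid f_k = 1] = -\pi_k \log \pi_k / (1 - \pi_k)$. The record layout, the function name `argus_risk` and the weights are hypothetical. This is an illustration of the formulas, not the mu-Argus software.

```python
import math
from collections import defaultdict

def argus_risk(records):
    """records: iterable of (cell_key, weight). Returns (tau1_hat, tau2_hat)."""
    f = defaultdict(int)          # sample count per cell
    F_hat = defaultdict(float)    # sum of sampling weights per cell
    for cell, w in records:
        f[cell] += 1
        F_hat[cell] += w
    tau1 = tau2 = 0.0
    for cell, count in f.items():
        if count != 1:            # only sample uniques contribute
            continue
        pi_hat = count / F_hat[cell]
        tau1 += pi_hat
        if pi_hat >= 1.0:         # weight of 1 or less: treated as a certain match
            tau2 += 1.0
        else:
            tau2 += -pi_hat * math.log(pi_hat) / (1.0 - pi_hat)
    return tau1, tau2

# Hypothetical sample: cells "a" and "b" are sample uniques, "c" is not.
records = [("a", 50.0), ("b", 2.0), ("c", 40.0), ("c", 60.0)]
tau1_hat, tau2_hat = argus_risk(records)
print(tau1_hat, tau2_hat)
```

Both per-cell terms increase with $\hat\pi_k$, which is the monotonicity referred to on the slide.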

10

Poisson Log-linear Models (Skinner, Holmes (1998), Elamir, Skinner (2005), Skinner, Shlomo (2005))

$f_k \sim \mathrm{Poisson}(N\pi_k\gamma_k)$ with $N\pi_k\gamma_k = \exp(\mathbf{x}_k'\boldsymbol{\beta})$

Monotonicity in the size of the model (number of parameters):

Saturated ("big") model: data overfitted, risk underestimated

Independence ("small") model: data underfitted, risk overestimated

Intermediate models with conditional independence involve smaller products of marginal proportions, so we expect the risk estimates to be monotone in the model size. Hence, as with the choice of $\pi_k$ in the Argus model, there will be a model that gives a good risk estimate.
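For the smallest model, the fitted values have a closed form. Under independence in a two-way table, the Poisson maximum likelihood fit of the sample counts is (row total × column total) / n. The Python sketch below plugs these fitted means into the Poisson risk terms, assuming an equal sampling fraction $\pi$ in every cell so that the Poisson mean of $F_k - 1$ for a sample unique is the fitted mean times $(1-\pi)/\pi$. The function name `independence_risk`, the toy table and the equal-fraction assumption are illustrative choices, not part of the source.

```python
import math

def independence_risk(table, pi):
    """Risk estimates (tau1_hat, tau2_hat) from an independence fit, 0 < pi < 1."""
    n = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    tau1 = tau2 = 0.0
    for i, r in enumerate(table):
        for j, f_ij in enumerate(r):
            if f_ij != 1:                       # only sample uniques contribute
                continue
            mu_hat = row[i] * col[j] / n        # fitted sample mean for cell (i, j)
            lam = mu_hat * (1.0 - pi) / pi      # Poisson mean of F_k - 1
            tau1 += math.exp(-lam)              # estimated P(F_k = 1 | f_k = 1)
            tau2 += -math.expm1(-lam) / lam     # estimated E[1/F_k | f_k = 1]
    return tau1, tau2

# Hypothetical 2 x 2 sample table with one sample unique, sampling fraction 0.1.
table = [[1, 0],
         [3, 6]]
t1, t2 = independence_risk(table, pi=0.1)
print(t1, t2)
```

A saturated model would instead set the fitted mean of every sample unique to 1, whatever its margins look like.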

11

Neighborhood of a Log-linear Model

Log-linear models take into account a neighborhood of cells to infer on $\gamma_k$ for determining the risk.

For example: Independence Neighborhood, k = (i, j):

The estimate $\hat\gamma_k$ is the product of marginal proportions obtained by fixing one attribute at a time. Thus, if one attribute is income group, then inference on the very rich involves information on the very poor, provided there is another attribute in common, such as marital status.

[Figure: cell k = (i, j) in a two-way table with axes i and j]

12

Discussion of Neighborhoods

How likely is a sample unique to be a population unique?

If a sample unique has mostly small or empty neighboring cells, it is more likely to be a population unique.

• Argus is based on weights and no learning from other cells.

• The log-linear Poisson model takes neighborhoods into account, reduces the number of parameters, and also reduces their standard deviation and hence that of the risk measures (provided that the model is valid).

Are there other types of neighborhoods which may be more natural?

We focus on ORDINAL variables

13

Proposed Neighborhoods

• Local smoothers for large sparse (ordinal) tables, e.g. Bishop, Fienberg, Holland (1975), Simonoff (1998)

• Use local neighborhoods to fit a simple smooth function to $f_k$ or to estimate $\gamma_k$ smoothly

• Construct a neighborhood $N^{(k)}$ of cells around cell k by varying the coordinates of ordinal attributes and fixing non-ordinal attributes

$N_c^{(k)}$: neighborhood of cell k at distance c from cell k

14

Proposed Neighborhoods

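[Figure: two-way table with axes i and j]

$f_k \mid \gamma_k \sim \mathrm{Poisson}(n\gamma_k)$

$F_k \mid f_k, \gamma_k \sim f_k + \mathrm{Poisson}(N\gamma_k(1-\pi_k))$

Regressors, for cell k:

$x_c^{(k)} = \sum_{l \in N_c^{(k)}} f_l$

$\gamma_k = \exp\{\beta_0 + \sum_{c \in C} \beta_c x_c^{(k)}\}$

Define a cell as a structural zero if all of its neighborhoods that are used in the regression contain only empty cells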
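The regressors and the structural-zero rule can be sketched in Python for a two-way ordinal table. The slides do not say how the distance between cells is measured, so this sketch assumes the city-block distance $|i - i'| + |j - j'|$. That choice, the function names and the toy table are assumptions made for illustration.

```python
def neighborhood_regressors(table, k, distances=(1, 2)):
    """Sum the sample counts over the cells at each distance c from cell k.

    Distance is city-block distance on the two ordinal coordinates
    (an assumption; the source does not define it).
    """
    i0, j0 = k
    x = {c: 0 for c in distances}
    for i, r in enumerate(table):
        for j, f_l in enumerate(r):
            c = abs(i - i0) + abs(j - j0)
            if c in x:                 # cell k itself (c = 0) is skipped
                x[c] += f_l
    return x

def is_structural_zero(table, k, distances=(1, 2)):
    """True if every neighborhood used in the regression holds only empty cells."""
    return all(v == 0 for v in neighborhood_regressors(table, k, distances).values())

# Hypothetical 4 x 4 table of sample counts.
table = [[1, 0, 0, 0],
         [0, 2, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
x_corner = neighborhood_regressors(table, (0, 0))
far_cell_is_zero = is_structural_zero(table, (3, 3))
print(x_corner, far_cell_is_zero)
```

The resulting sums would enter a Poisson regression of $f_k$ on the neighborhood counts, with the cells flagged as structural zeros handled separately.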

15

Example

Population from 1995 Israeli Census File, Age>15, N=746,949, n=14,939, and K=337,920

Key: Sex(2), Age groups(16), Groups of years of study(10), Number of years in Israel(11), Income groups(12), Number of persons in household (8)

Sex is not ordinal and is fixed

Weights for Argus obtained by post-stratification on weighting classes: sex, age and geographical location

16

Example

Model                                                       $\tau_1$    $\tau_2$
True Values                                                 430.0       1,125.8
Argus                                                       114.5       456.0
Log-linear model: Independence                              773.8       1,774.1
Log-linear model: 2-way Interactions                        470.0       1,178.1
Neighborhood Method $M_a^{(k)}$                             786.8       2,146.9
Neighborhood Method $M_a^{(k)}$ without structural zeros    385.4       1,674.1
Neighborhood Method $N_c^{(k)}$                             723.3       2,099.6
Neighborhood Method $N_c^{(k)}$ without structural zeros    344.8       1,624.2

17

Results of Example

• The independence log-linear model and the neighborhood methods overestimate the two risk measures

• The Argus model underestimates them

• The log-linear Poisson model with all 2-way interactions gives the best estimates

• Taking into account the structural zeros in the neighborhoods yields more reasonable estimates

18

Conclusions

• Need to refine the neighborhood approach, define the model better and develop MLE theory

• We expect the new model to work well in multi-way tables when simple log-linear models are not valid

• Incorporate the approach into a more general regression model, the Negative Binomial Regression, which subsumes both the Poisson Risk Model and the Argus Model
