1
WP 10
On Risk Definitions and a Neighbourhood Regression Model for Sample Disclosure Risk Estimation
Natalie Shlomo, Hebrew University and Southampton University
Yosi Rinott, Hebrew University
2
Disclosure Risk Measures
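Notation:
Sample (size n): $\mathbf{f} = \{f_k : k = 1, \ldots, K\}$
Population (size N): $\mathbf{F} = \{F_k : k = 1, \ldots, K\}$
Tables with K cells: m-way table, $k = (k_1, \ldots, k_m)$
Risk Measures:
$\tau_1 = \sum_k I(f_k = 1, F_k = 1)$
$\tau_2 = \sum_k I(f_k = 1) \frac{1}{F_k}$ = expected number of correct matches of sample uniques
Estimates:
$\hat\tau_1 = \sum_k I(f_k = 1) \hat{P}(F_k = 1 \mid f_k = 1)$
$\hat\tau_2 = \sum_k I(f_k = 1) \hat{E}[1/F_k \mid f_k = 1]$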
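When both the sample and the population counts are known (as in a simulation study), the two risk measures can be computed directly. The following is a minimal sketch; the function name and the toy counts are illustrative assumptions, not part of the original slides.

```python
def risk_measures(f, F):
    """Return (tau1, tau2) for paired sample counts f and population counts F."""
    if len(f) != len(F):
        raise ValueError("f and F must list the same cells")
    # tau1: number of sample uniques that are also population uniques
    tau1 = sum(1 for fk, Fk in zip(f, F) if fk == 1 and Fk == 1)
    # tau2: sum of 1/F_k over sample uniques (expected number of correct matches)
    tau2 = sum(1.0 / Fk for fk, Fk in zip(f, F) if fk == 1)
    return tau1, tau2

# Toy table with K = 5 cells (hypothetical counts)
f = [1, 0, 1, 2, 1]
F = [1, 3, 4, 10, 2]
tau1, tau2 = risk_measures(f, F)
```

In this toy table three cells are sample uniques, but only the first is also a population unique.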
3
On Definitions of Disclosure Risk
• In the statistics literature, we present examples of risk measures, $\tau_1$ and $\tau_2$, but we lack formal definitions of when a file is safe
• In the computer science literature, there is a formal definition of disclosure risk (e.g., Dinur, Dwork, Nissim (2004-5); Adam and Wortmann (1989), who write “it may be argued that elimination of disclosure is possible only by elimination of statistics”)
In some of the CS literature, any data must be released with noise.
The noise must be small enough that legitimate information on large subsets of the data remains useful, and large enough that information on small subsets or individuals is too noisy to be useful (regardless of whether it is obtained by direct queries, differencing, etc.)
4
On Definitions of Disclosure Risk
The worst-case scenario of the CS approach (for example, that the intruder has all information on everyone in the data set except the individual being snooped) simplifies the definitions: there is no need to consider other, more realistic but more complicated scenarios.
But would Statistics Bureaus and statisticians agree to adding noise to any data?
Other approaches like query restriction or query auditing do not lead to formal definitions.
5
Definition of Disclosure Risk
Numerical database $D = \{d_i : i = 1, \ldots, n\}$, $d_i \in \{0, 1\}$
A query is a sum over a subset of the $d_i$. The query is perturbed by adding noise of magnitude $m$.
It has been proven that almost all $d_i$ can be reconstructed if $m$ is of smaller order than $\sqrt{n}$, and none of them can be reconstructed if $m$ is of larger order than $\sqrt{n}$.
Adding noise of order $\sqrt{n}$ hides information on individuals and small groups, but yields meaningful information about sums of O(n) units, for which noise of order $\sqrt{n}$ is natural.
This work was further expanded to lessen the magnitude of the noise by limiting the number of queries.
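The effect of noise of order $\sqrt{n}$ can be illustrated with a small simulation. This is only a sketch: Gaussian noise, the seed, and the function name `noisy_sum` are assumptions made for illustration, and the CS literature cited above analyses more general perturbations.

```python
import math
import random

def noisy_sum(data, subset, scale, rng):
    """Answer a subset-sum query with additive Gaussian noise (assumed noise law)."""
    true_value = sum(data[i] for i in subset)
    return true_value + rng.gauss(0.0, scale)

rng = random.Random(0)
n = 10_000
data = [rng.randint(0, 1) for _ in range(n)]  # binary database d_1, ..., d_n
scale = math.sqrt(n)                          # noise of order sqrt(n)

# Query over O(n) units: the noise is small relative to the answer
true_total = sum(data)
noisy_total = noisy_sum(data, range(n), scale, rng)
relative_error_total = abs(noisy_total - true_total) / true_total

# Query about a single individual: the answer (0 or 1) is swamped by the noise
noisy_individual = noisy_sum(data, [0], scale, rng)
```

The total is a sum of about n/2 ones, so noise with standard deviation 100 perturbs it only slightly, whereas the same noise makes a single 0/1 value unrecoverable from one query.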
6
Definition of Disclosure Risk
Collaboration between the CS and statistical communities, where:
1. In the statistical community, there is a need for more formal and clear definitions of disclosure risk
2. In the CS community, there is a need for statistical methods to preserve the utility of the data
- allow sufficient statistics to be released without perturbation
- methods for adding correlated noise
- sub-sampling and other methods for data masking
Can the formal notions from CS and the practical approach of statisticians lead to a compromise that will allow us to set practical but well-defined standards for disclosure risk?
7
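Probabilistic Models
Focus on sample microdata and not the whole population (sampling provides a priori protection against disclosure)
Standard (natural) assumptions:
$F_k \mid \gamma_k \sim \mathrm{Poisson}(N\gamma_k)$, $\sum_k \gamma_k = 1$
Independent Bernoulli or Poisson sampling: $f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$
$f_k \mid \gamma_k \sim \mathrm{Poisson}(N\pi_k\gamma_k)$
$F_k \mid f_k \sim f_k + \mathrm{Poisson}(N\gamma_k(1 - \pi_k))$
In particular, $F_k \mid f_k = 1 \sim 1 + \mathrm{Poisson}(N\gamma_k(1 - \pi_k))$, the size biased Poisson distribution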
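Under the size biased Poisson distribution, the per-cell contributions to the two risk measures have closed forms: with $\mu = N\gamma_k(1 - \pi_k)$, $P(F_k = 1 \mid f_k = 1) = e^{-\mu}$ and $E[1/F_k \mid f_k = 1] = (1 - e^{-\mu})/\mu$. The sketch below evaluates them; the function name and the numerical inputs are illustrative assumptions.

```python
import math

def poisson_unique_risks(N, gamma_k, pi_k):
    """Per-cell risks for a sample unique when F_k - 1 ~ Poisson(mu)."""
    mu = N * gamma_k * (1.0 - pi_k)
    p_unique = math.exp(-mu)                  # P(F_k = 1 | f_k = 1)
    if mu == 0.0:
        expected_inverse = 1.0                # F_k = 1 with certainty
    else:
        expected_inverse = (1.0 - math.exp(-mu)) / mu  # E[1/F_k | f_k = 1]
    return p_unique, expected_inverse

# Hypothetical cell: N = 1000, gamma_k = 0.002, pi_k = 0.5, so mu = 1
p_unique, expected_inverse = poisson_unique_risks(1000, 0.002, 0.5)
```

Summing these quantities over the sample uniques, with $\gamma_k$ replaced by an estimate, gives estimates of the two risk measures.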
8
Probabilistic Models
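Add $\gamma_k \sim \mathrm{Gamma}(\alpha, \beta)$ iid, with $\alpha\beta = 1/K$ so that $E\sum_k \gamma_k = 1$
$F_k \mid f_k \sim f_k + NB\left(\alpha + f_k, \frac{N\pi_k + 1/\beta}{N + 1/\beta}\right)$
As $\alpha \to 0$ ($\beta \to \infty$) we obtain the mu-argus assumption $F_k \mid f_k \sim f_k + NB(f_k, \pi_k)$
As $\alpha \to \infty$ ($\alpha\beta^2 \to 0$) we obtain the above Poisson Model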
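The two limits can be checked numerically for a sample unique. The sketch assumes the negative binomial counts failures before the given number of successes, so that $P(NB(r, \rho) = 0) = \rho^r$, and that $\beta$ is a scale parameter tied to $\alpha$ through $\alpha\beta = 1/K$; the function name and inputs are illustrative.

```python
import math

def nb_unique_prob(alpha, N, K, pi_k):
    """P(F_k = 1 | f_k = 1) when F_k - 1 ~ NB(alpha + 1, rho)."""
    beta = 1.0 / (K * alpha)                       # enforces alpha * beta = 1/K
    rho = (N * pi_k + 1.0 / beta) / (N + 1.0 / beta)
    return rho ** (alpha + 1.0)                    # probability of zero extra units

N, K, pi_k = 1000, 100, 0.1

# alpha -> 0: approaches pi_k, the mu-argus value P(NB(1, pi_k) = 0)
argus_limit = nb_unique_prob(1e-8, N, K, pi_k)

# alpha -> infinity: approaches exp(-N * (1/K) * (1 - pi_k)), the Poisson value
poisson_limit = nb_unique_prob(1e8, N, K, pi_k)
```

Intermediate values of $\alpha$ interpolate between the two models.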
9
Mu-Argus Model (Benedetti, Capobianchi, Franconi (1998))
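$\hat\pi_k = f_k / \hat{F}_k$, where $\hat{F}_k = \sum_{i \in \text{sample cell } k} w_i$
$w_i$ is the sampling weight of individual i obtained from design or post-stratification
If $f_k = 0$ then $\hat{F}_k = 0$, but $\sum_k \hat{F}_k = \sum_i w_i = N$, so the $\hat{F}_k$ of the non-empty cells are too large, the $\hat\pi_k$ are underestimated, and risk is underestimated
Monotonicity: risk estimates increase in $\hat\pi_k$, so replacing $\hat\pi_k$ by some suitable larger value would raise them to the correct level, but how should that value be estimated?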
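A minimal sketch of the weight-based estimate follows. It assumes that under $F_k \mid f_k \sim f_k + NB(f_k, \pi_k)$ a sample unique has $P(F_k = 1 \mid f_k = 1) = \pi_k$ (negative binomial counting failures before successes); the data layout, function name, and weights are hypothetical.

```python
def argus_tau1(cell_weights):
    """Estimate tau1 from the sampling weights of the sampled individuals.

    cell_weights maps each non-empty sample cell to the list of weights w_i
    of the individuals sampled in that cell.
    """
    tau1_hat = 0.0
    for weights in cell_weights.values():
        f_k = len(weights)       # sample count in the cell
        F_hat_k = sum(weights)   # weighted estimate of the population count
        if f_k == 1:
            tau1_hat += f_k / F_hat_k  # pi_hat_k for a sample unique
    return tau1_hat

# Hypothetical sample: cells "a" and "c" are sample uniques
cell_weights = {"a": [50.0], "b": [40.0, 60.0], "c": [25.0]}
tau1_hat = argus_tau1(cell_weights)
```

Empty sample cells never enter the computation, which is the source of the underestimation discussed above.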
10
Poisson Log-linear Models (Skinner, Holmes (1998), Elamir, Skinner (2005), Skinner, Shlomo (2005))
Monotonicity in the size of the model (number of parameters):
Saturated (“big”) model: data overfitted, risk underestimated
Independence (“small”) model: data underfitted, risk overestimated
Intermediate models with conditional independence involve smaller products of marginal proportions, so we expect the risk estimates to be monotone in the size of the model. Hence, as with the choice of $\hat\pi_k$ in the Argus model, there will be a model that gives a good risk estimate
Model: $N\gamma_k = \exp(x_k'\beta)$
11
Neighborhood of a Log-linear Model
Log-linear models take into account a neighborhood of cells to infer on $\gamma_k$ for determining the risk.
For example: Independence Neighborhood, k = (i, j):
The estimate $\hat\gamma_k$ is the product of marginal proportions obtained by fixing one attribute at a time. Thus, if one attribute is income group, then inference on the very rich involves information on the very poor, provided there is another attribute in common, such as marital status.
[Figure: two-way table with axes i and j]
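For a two-way table, the independence estimate can be sketched as follows. The function name and the toy counts are illustrative assumptions; the point is that every cell in row i and column j contributes to the estimate for cell (i, j).

```python
def independence_estimates(table):
    """Return gamma_hat[i][j] = (row proportion i) * (column proportion j)."""
    n = sum(sum(row) for row in table)
    row_prop = [sum(row) / n for row in table]
    col_prop = [sum(col) / n for col in zip(*table)]
    return [[r * c for c in col_prop] for r in row_prop]

# Hypothetical 2 x 3 table of sample counts f_(i,j)
table = [[4, 1, 0],
         [3, 0, 2]]
gamma_hat = independence_estimates(table)

sample_unique_estimate = gamma_hat[0][1]   # cell (0, 1) has f = 1
empty_cell_estimate = gamma_hat[0][2]      # cell (0, 2) has f = 0 but a positive estimate
total = sum(sum(row) for row in gamma_hat)
```

Even the empty cell receives a positive estimate, because its row and column both contain sampled units.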
12
Discussion of Neighborhoods
How likely is a sample unique to be a population unique?
If a sample unique has mostly small or empty neighboring cells, it is more likely to be a population unique.
• Argus is based on weights and does not learn from other cells.
• The log-linear Poisson model takes into account neighborhoods, reduces the number of parameters and also reduces their standard deviation and hence that of the risk measures (provided that the model is valid).
Are there other types of neighborhoods which may be more natural?
We focus on ORDINAL variables
13
Proposed Neighborhoods
• Local smoothers for large sparse (ordinal) tables, e.g. Bishop, Fienberg, Holland (1975), Simonoff (1998)
• Use local neighborhoods to fit a simple smooth function to $f_k$ or to estimate $\gamma_k$ smoothly
• Construct a neighborhood $N_k$ of cells around cell k by varying the coordinates of ordinal attributes and fixing non-ordinal attributes
$N_c^{(k)}$: neighborhood of cell k at distance c from cell k
14
Proposed Neighborhoods
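[Figure: two-way table with axes i and j]
$f_k \mid \gamma_k \sim \mathrm{Poisson}(n\gamma_k)$
$F_k \mid f_k \sim f_k + \mathrm{Poisson}(N\gamma_k(1 - \pi_k))$
Regressors, for cell k: $x_c^{(k)} = \sum_{l \in N_c^{(k)}} f_l$
$\gamma_k = \exp\{\beta_0 + \sum_{c \in C} \beta_c x_c^{(k)}\}$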
Define a cell as a structural zero if all of its neighborhoods that are used in the regression contain only empty cells
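The regressors and the structural-zero rule can be sketched as below. The slides do not specify the distance between cells, so this example assumes the Chebyshev distance (largest coordinate difference) and treats every coordinate as ordinal; non-ordinal attributes would be held fixed. All names and counts are hypothetical.

```python
from itertools import product

def neighborhood(k, c, shape):
    """Cells at Chebyshev distance exactly c from cell k (assumed distance)."""
    ranges = [range(max(0, ki - c), min(s, ki + c + 1)) for ki, s in zip(k, shape)]
    return [l for l in product(*ranges)
            if max(abs(li - ki) for li, ki in zip(l, k)) == c]

def regressor(f, k, c, shape):
    """Sum of sample counts over the neighborhood of k at distance c."""
    return sum(f.get(l, 0) for l in neighborhood(k, c, shape))

def is_structural_zero(f, k, distances, shape):
    """True if every neighborhood used in the regression holds only empty cells."""
    return all(regressor(f, k, c, shape) == 0 for c in distances)

# Sparse 5 x 5 table of sample counts, stored as {cell: count}
shape = (5, 5)
f = {(2, 2): 1, (2, 3): 2, (0, 0): 1}

x1 = regressor(f, (2, 2), 1, shape)   # counts at distance 1 from (2, 2)
x2 = regressor(f, (2, 2), 2, shape)   # counts at distance 2 from (2, 2)
corner_zero = is_structural_zero(f, (4, 4), (1,), shape)
corner_zero_wide = is_structural_zero(f, (4, 4), (1, 2), shape)
```

Whether a cell counts as a structural zero depends on which distances enter the regression: the corner cell is isolated at distance 1 but not once distance 2 is included.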
15
Example
Population from 1995 Israeli Census File, Age>15, N=746,949, n=14,939, and K=337,920
Key: Sex(2), Age groups(16), Groups of years of study(10), Number of years in Israel(11), Income groups(12), Number of persons in household (8)
Sex is not ordinal and is fixed
Weights for Argus obtained by post-stratification on weighting classes: sex, age and geographical location
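The number of cells K is the product of the numbers of categories of the key variables, which can be verified along with the sampling fraction. The dictionary labels below are shorthand for the key variables listed above.

```python
import math

# Number of categories of each key variable, as listed in the example
key_levels = {
    "sex": 2,
    "age_group": 16,
    "years_of_study_group": 10,
    "years_in_israel": 11,
    "income_group": 12,
    "persons_in_household": 8,
}

K = math.prod(key_levels.values())  # number of cells in the 6-way table
N, n = 746_949, 14_939
sampling_fraction = n / N           # about 2%
mean_sample_count = n / K           # average sample count per cell
```

With far fewer sampled individuals than cells, the table is very sparse, which is the setting the neighborhood approach is aimed at.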
16
Example
Model                                                          $\tau_1$    $\tau_2$
True Values                                                      430.0     1,125.8
Argus                                                            114.5       456.0
Log-linear model: Independence                                   773.8     1,774.1
Log-linear model: 2-way Interactions                             470.0     1,178.1
Neighborhood Method, $M_a^{(k)}$                                 786.8     2,146.9
Neighborhood Method, $M_a^{(k)}$, w/out structural zeros         385.4     1,674.1
Neighborhood Method, $N_c^{(k)}$                                 723.3     2,099.6
Neighborhood Method, $N_c^{(k)}$, w/out structural zeros         344.8     1,624.2
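The ratio of each estimate to the true value makes the comparison easier to read. The sketch below uses only the figures reported in the table; the row labels are shorthand introduced here.

```python
true_tau1, true_tau2 = 430.0, 1125.8

# (tau1 estimate, tau2 estimate) for each method, from the table
estimates = {
    "Argus": (114.5, 456.0),
    "Log-linear: independence": (773.8, 1774.1),
    "Log-linear: 2-way interactions": (470.0, 1178.1),
    "Neighborhood M_a": (786.8, 2146.9),
    "Neighborhood M_a, w/out structural zeros": (385.4, 1674.1),
    "Neighborhood N_c": (723.3, 2099.6),
    "Neighborhood N_c, w/out structural zeros": (344.8, 1624.2),
}

# Ratio of estimate to true value: above 1 is overestimation, below 1 is underestimation
ratios = {name: (t1 / true_tau1, t2 / true_tau2)
          for name, (t1, t2) in estimates.items()}

closest_tau1 = min(estimates, key=lambda m: abs(estimates[m][0] - true_tau1))
closest_tau2 = min(estimates, key=lambda m: abs(estimates[m][1] - true_tau2))
```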
17
Results of Example
• The independence log-linear model and the neighborhood methods overestimate the two risk measures
• The Argus model underestimates them
• The all 2-way interaction log-linear Poisson model has the best estimates
• Taking into account the structural zeros in the neighborhoods yields more reasonable estimates
18
Conclusions
• Need to refine the neighborhood approach, define the model better and develop MLE theory
• We expect the new model to work well in multi-way tables when simple log-linear models are not valid
• Incorporate the approach into a more general regression model, the Negative Binomial Regression, which subsumes both the Poisson Risk Model and the Argus Model