Issues in Designing a Confidentiality Preserving Model Server by Philip M Steel & Arnold Reznek.

Issues in Designing a Confidentiality Preserving Model Server

by Philip M Steel & Arnold Reznek

Talk Outline

Background

Basic design

Description of operation

Confidentiality outline

Constraints on universe formation

Other constraints

Summary

Background

PUBLIC remote access to confidential data

Restriction of queries and responses rather than registering and monitoring the user

Current Population Survey (CPS): employment and economic well-being; demographic supplement

Software development by Synectics

HTML, MySQL, PHP, to develop the query …

SAS as the statistical package run against the data

Risk Model for Microdata

• Intruder has access to record linkage software and identified data sources

• Disclosure occurs if the intruder is successful in linking his identified data with the published microdata

Risk Model for a Model Server

• Intruder has access to record linkage software and identified data sources

• Intruder uses model server to reconstruct microdata for both the variables overlapping his data sources and a sensitive variable

• Disclosure occurs if the intruder is successful in linking his identified data with the reconstructed microdata and has valid estimate of a sensitive characteristic or value

Basic Design Choice

Enable: Choose which functions will operate
– Must construct a friendly interface
– Limited to the procedures developed
– Safe from unknown code

Disable: Choose which functions will not operate
– User free to program within disabling constraints
– No limit on complexity
– Must be monitored (human, program or mix)

Operation

User visits web site, chooses data set, explores data, chooses geography, analysis type

User chooses population, constructs model, selects output

Web site constructs code to send behind firewall

Code checked and run against data at Census

Results checked and returned to user

Structure of Confidentiality Rules

Data preparation

Data exploration

Model universe formation

Model Statement

Model Output

Data exploration rules

Users may request tables for categorical variables and numeric recodes up to e1 dimensions. (start e1=4 including geo)

Users may transform numeric recodes using a limited set of functions: log, root, square.
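
The two exploration rules above can be sketched as simple request checks. This is only an illustration, not the server's actual code: the function names are hypothetical, "root" is assumed to mean square root, and a table's dimension is assumed to be the number of distinct variables requested (geography included).

```python
import math

E1 = 4  # starting value from the slide; geography counts as one dimension

# Only the three transforms named on the slide are enabled.
# Assumption: "root" means square root.
ALLOWED_TRANSFORMS = {
    "log": math.log,
    "root": math.sqrt,
    "square": lambda x: x * x,
}


def table_request_allowed(variables, e1=E1):
    """Return True if the requested table has between 1 and e1 distinct dimensions."""
    return 0 < len(set(variables)) <= e1


def apply_transform(name, values):
    """Apply one of the enabled transforms; anything else is refused."""
    if name not in ALLOWED_TRANSFORMS:
        raise ValueError(f"transform {name!r} is not permitted")
    return [ALLOWED_TRANSFORMS[name](v) for v in values]
```

This follows the "enable" design choice: anything not on the list is simply unavailable, so no user-written code has to be inspected.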

Universe formation: Categorical Variables

Example: Hispanic heads of household with a college degree.

Conditions: X1=H,X2=1,X3=5 (table cell)

Implication: Data preparation must support safe lower dimensional tables

Universe formation rules: Categorical Variables

Limit on the number of categorical variables (u1=3)

Minimum on the size of the universe selected (u2=75)
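
A minimal sketch of how the u1 and u2 limits could be applied to a categorical selection such as X1=H, X2=1, X3=5. The record layout (a list of dicts) and the function name are assumptions made for illustration; the slides do not say how the server reports a selection that is too small.

```python
U1 = 3   # maximum number of categorical conditions (slide value)
U2 = 75  # minimum universe size (slide value)


def select_universe(records, conditions, u1=U1, u2=U2):
    """Return the records matching every condition, enforcing u1 and u2.

    records    -- list of dicts, one per observation (hypothetical layout)
    conditions -- dict mapping variable name to the required category
    """
    if len(conditions) > u1:
        raise ValueError(f"at most {u1} categorical variables may be used")
    universe = [
        r for r in records
        if all(r.get(var) == cat for var, cat in conditions.items())
    ]
    if len(universe) < u2:
        raise ValueError(f"universe must contain at least {u2} observations")
    return universe
```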

Universe Formation: Numeric Variables

Example: Families in poverty

Condition: Family income<18,500. Or Family income<18,501?

Implication: Rounding or pre-assigned cutpoints.
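
The concern behind the two nearly identical conditions can be shown with a toy example. The incomes and cutpoints below are made up, and snapping a requested threshold down to the nearest pre-assigned cutpoint is one possible reading of the implication, not necessarily the server's behaviour.

```python
from bisect import bisect_right

# Made-up family incomes; exactly one family sits at 18,500.
incomes = [12_000, 15_250, 18_500, 18_750, 22_000, 40_000]


def universe(threshold):
    """Families with income strictly below the threshold."""
    return [x for x in incomes if x < threshold]


# With free-form thresholds, two queries one dollar apart differ by one
# record, and subtracting their totals reveals that record's income.
below_18500 = universe(18_500)
below_18501 = universe(18_501)
isolated_income = sum(below_18501) - sum(below_18500)

# Hypothetical pre-assigned cutpoints: both requests map to the same
# cutpoint, so they define the same universe and the difference vanishes.
CUTPOINTS = [10_000, 20_000, 30_000]


def snap_to_cutpoint(threshold, cutpoints=CUTPOINTS):
    """Return the largest cutpoint that does not exceed the threshold."""
    i = bisect_right(cutpoints, threshold)
    if i == 0:
        raise ValueError("threshold is below the lowest cutpoint")
    return cutpoints[i - 1]
```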

Universe formation rules: Numeric variables

Users will select categorical variables first

Numeric variables can be used only at pre-assigned cutpoints.

The number of observations in the whole CPS universe between cutpoints shall be at least u3 for every numeric variable. (start u3=80)

Universe formation rules (cont)

If a cutpoint is used in universe formation then the difference in the size of the model universe obtained by incrementing the cutpoint up or down cannot be less than u4. (start u4=4)

The universe for the model must have at least u2 observations. (start u2=75)

There will be no cutpoints above the 97th percentile of nonzero points or the last half percentile of all points.
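
The u3 and u4 rules can be sketched as two checks on a sorted list of cutpoints. The function names are hypothetical, intervals are assumed to be half-open (cutpoint <= value < next cutpoint), and a universe is assumed to be formed with a "value < cutpoint" condition. The percentile ceiling on cutpoints is left out because the slide does not say how its two clauses combine.

```python
from bisect import bisect_left

U3 = 80  # minimum observations between cutpoints in the whole file
U4 = 4   # minimum change in universe size between adjacent cutpoints


def interval_counts(values, cutpoints):
    """Count observations in [c_i, c_(i+1)) for each consecutive cutpoint pair."""
    s = sorted(values)
    return [
        bisect_left(s, hi) - bisect_left(s, lo)
        for lo, hi in zip(cutpoints, cutpoints[1:])
    ]


def cutpoints_satisfy_u3(values, cutpoints, u3=U3):
    """Data preparation check: every interval holds at least u3 observations."""
    return all(n >= u3 for n in interval_counts(values, cutpoints))


def cutpoint_satisfies_u4(universe_values, cutpoints, index, u4=U4):
    """Query-time check: moving the chosen cutpoint one step up or down
    must change the size of the universe (value < cutpoint) by at least u4."""
    s = sorted(universe_values)

    def size(cut):
        return bisect_left(s, cut)  # number of values strictly below cut

    here = size(cutpoints[index])
    neighbours = [
        cutpoints[j] for j in (index - 1, index + 1)
        if 0 <= j < len(cutpoints)
    ]
    return all(abs(size(c) - here) >= u4 for c in neighbours)
```

The first check runs once against the whole CPS file when cutpoints are prepared; the second depends on the universe the user has already built from categorical variables, which is consistent with categorical variables being selected first.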

Model statement rules

At most m1 variables may be used in the model statement (start m1=20)  

Dummy variables must distinguish at least m2 observations (start m2=20)

No interaction term may involve more than 4 variables. (m3=4)

No model involving 3 or more variables can be fully interacted. (m4=3)
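
The four model statement rules could be checked along the following lines. The representation is an assumption: each model term is a tuple of variable names, with an interaction written as a tuple of two or more names. "Fully interacted" is read here as the model containing a term that involves every variable in the model, and a dummy variable is taken to "distinguish" the observations it flags; both readings are guesses at the slide's intent.

```python
M1, M2, M3, M4 = 20, 20, 4, 3  # starting values from the slide


def model_violations(terms, dummy_counts=None, m1=M1, m2=M2, m3=M3, m4=M4):
    """Return the codes of the rules a model statement breaks.

    terms        -- list of tuples of variable names (hypothetical format)
    dummy_counts -- dict mapping each dummy variable to the number of
                    observations it flags (hypothetical input)
    """
    variables = set().union(*[set(t) for t in terms])
    violations = []
    if len(variables) > m1:
        violations.append("m1")
    if any(n < m2 for n in (dummy_counts or {}).values()):
        violations.append("m2")
    if any(len(set(t)) > m3 for t in terms):
        violations.append("m3")
    if len(variables) >= m4 and any(set(t) == variables for t in terms):
        violations.append("m4")
    return violations
```

For example, a two-variable model with its interaction passes, while adding the three-way interaction to a three-variable model triggers the m4 rule.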

Model Output

Residuals will be based on synthetic data

Limit on the number of significant digits?

R² cannot be 1?

Rules for other diagnostics

Synthetic Residuals

Users may see synthetic bar charts or distributions and synthetic 2-way plots.

Synthetic data must be generated from fixed random number starts and topcoded (and bottom coded where appropriate) at 4 standard deviations from the mean. 
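
A minimal sketch of the two stated requirements, a fixed random number start and coding at 4 standard deviations. The slides do not say how the synthetic values themselves are generated; drawing from a normal distribution matched to the mean and standard deviation of the real residuals is an assumption made only for this illustration, as are the function name and the seed value.

```python
import random
import statistics


def synthetic_residuals(residuals, seed=12345, k=4.0):
    """Return synthetic residuals to display in place of the real ones.

    Assumption: values are drawn from a normal distribution with the same
    mean and standard deviation as the real residuals.
    """
    mean = statistics.fmean(residuals)
    sd = statistics.pstdev(residuals)
    rng = random.Random(seed)  # fixed start: the same request gives the same output
    lo, hi = mean - k * sd, mean + k * sd
    # Top- and bottom-code at k standard deviations from the mean.
    return [min(max(rng.gauss(mean, sd), lo), hi) for _ in residuals]
```

With a fixed start, repeating the same request returns identical synthetic values, so a user gains nothing by resubmitting the query and comparing draws.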

Data preparation

The topcode for numeric data needs to be calculated

Cutpoints must be determined

Separate lists of variables for exploration, universe formation, dependent and independent variables, model estimation

Standard recodes added

Inference from the collection of all 4-way categorical tables checked

Major Hurdles

Implementing facility for dummy variables

Presentation of geographic options

Implementing synthetic residuals

Architecture for differing variable roles

Future development

Relaxation of top codes

Implementation of model variance estimation (NSO weighting)

Introduction of new dataset

Introduction of new statistical procedures

Facility to add contextual data or merge files

Use of non-sampled data

Overview

• Avoids (as much as possible) tests which accept or reject a user's choice.

• Restricts the dimension of the data access.

• Has some flexibility in setting system confidentiality parameters.

• Changes the intruder model.

• Introduces a modification of k-anonymity.

My thanks to Jerry Reiter, Laura Zayatz and Stephen Wenck

http://204.52.186.190/

Contact: [email protected]