Differential Privacy in US Census


Transcript of Differential Privacy in US Census

Page 1: Differential Privacy in US Census


Differential Privacy in US Census

CompSci 590.03, Instructor: Ashwin Machanavajjhala

Page 2: Differential Privacy in US Census


Announcements

• No class on Wednesday, Oct 31

• Guest Lecture: Prof. Jerry Reiter (Duke Stats), Friday Nov 2, “Privacy in the U.S. Census and Synthetic Data Generation”

Page 3: Differential Privacy in US Census


Outline

• (Continuation of last class) Relaxing differential privacy for utility – E-privacy [M et al VLDB ‘09]

• Application of Differential Privacy in US Census [M et al ICDE ‘08]

Page 4: Differential Privacy in US Census


E-PRIVACY

Page 5: Differential Privacy in US Census


Defining Privacy

“... nothing about an individual should be learnable from the database that cannot be learned without access to the database.”
– T. Dalenius, 1977

Problem with this approach:
• Analyst knows Bob has green hair.
• Analyst learns from published data that people with green hair have 99% probability of cancer.

• Therefore analyst knows Bob has high risk of cancer, even if Bob is not in the published data.

Page 6: Differential Privacy in US Census


Defining Privacy
• Therefore analyst knows Bob has high risk of cancer, even if Bob is not in the published data.

• This should not be considered a privacy breach – such correlations are exactly what we want the analyst to learn.

Page 7: Differential Privacy in US Census


Counterfactual Approach
Consider 2 distributions:

• Pr[Bob has cancer | adversary’s prior + output of mechanism on D]

– “What adversary learns about Bob after seeing the published information”

• Pr[Bob has cancer | adversary’s prior + output of mechanism on D-Bob], where D-Bob = D – {Bob}
– “What adversary would have learnt about Bob even in the hypothetical case when Bob was not in the data”
– Must be careful … when removing Bob you may also need to remove other tuples correlated with Bob …

Page 8: Differential Privacy in US Census


Counterfactual Privacy
• Consider a set of data evolution scenarios (adversaries) {θ}
• For every property sBob about Bob, and output of mechanism w

|log P(sBob| θ, M(D) = w) - log P(sBob| θ, M(D-Bob) = w) | ≤ ε

• When {θ} is the set of all product distributions which are independent across individuals,
– D-Bob = D – {Bob’s record}
– A mechanism satisfies the above definition if and only if it satisfies differential privacy.
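Not on the original slides: a minimal numerical sketch of one direction of this statement, assuming a hypothetical ε-DP mechanism (a count with two-sided geometric noise) and an independent Bernoulli prior over each record; it checks that the posterior ratio above stays within ε. All parameter values are illustrative.

```python
import numpy as np
from scipy.stats import binom

eps = 0.5                      # privacy parameter of the noisy-count mechanism
alpha = np.exp(-eps)
n, p = 10, 0.3                 # 10 individuals; product prior: P(cancer) = p independently

def noise_pmf(k):
    # Two-sided geometric (discrete Laplace) noise; the noisy count is eps-DP.
    return (1 - alpha) / (1 + alpha) * alpha ** abs(k)

def p_output_given_bob(w, bob):
    # P(M(D) = w | Bob's value = bob), marginalising the other n-1 records.
    return sum(binom.pmf(k, n - 1, p) * noise_pmf(w - bob - k) for k in range(n))

worst = 0.0
for w in range(-5, n + 6):     # a range of plausible noisy counts
    a1, a0 = p_output_given_bob(w, 1), p_output_given_bob(w, 0)
    post_with_bob = p * a1 / (p * a1 + (1 - p) * a0)   # P(sBob | theta, M(D) = w)
    post_without_bob = p       # M(D - Bob) reveals nothing about Bob under a product prior
    worst = max(worst, abs(np.log(post_with_bob) - np.log(post_without_bob)))

print(f"max |log posterior ratio| = {worst:.3f}  (bound: eps = {eps})")
```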

Page 9: Differential Privacy in US Census


Counterfactual Privacy
• Consider a set of data evolution scenarios (adversaries) {θ}
• For every property sBob about Bob, and output of mechanism w

|log P(sBob| θ, M(D) = w) - log P(sBob| θ, M(D-Bob) = w) | ≤ ε

• What about other sets of prior distributions {θ}?
– Open Question: If {θ} contains correlations, then the definition of D-Bob itself is not very clear (as discussed in the previous class about count constraints and social networks).

Page 10: Differential Privacy in US Census


Certain vs Uncertain Adversaries
Suppose an adversary has an uncertain prior. Consider a two-sided coin.
• A certain adversary knows the bias of the coin = p (for some p)
– Exactly knows the bias of the coin
– Knows every coin flip is a random draw which is H with prob p
• An uncertain adversary may think the coin’s bias is in [p – δ, p + δ]
– Does not exactly know the bias of the coin. Assumes the coin’s bias θ is drawn from some probability distribution π
– Given θ, every coin flip is a random draw which is H with prob θ

Page 11: Differential Privacy in US Census


Learning
• In machine learning/statistics, you want to use the observed data to learn something about the population.
• E.g., given 10 flips of a coin, what is the bias of the coin?
• Assume your population is drawn from some prior distribution θ. We don’t know θ, but may know that some θ’s are more likely than others (via π, a probability distribution over θ’s). We want to learn the best θ that explains the observations …
• If you are certain about θ in the first place, there is no need for statistics/machine learning.
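A tiny illustrative sketch (not from the slides) of this contrast, with a made-up flip sequence: an uncertain adversary with a Beta prior over the bias updates on the data, while a certain adversary's belief never moves.

```python
from scipy.stats import beta

flips = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]            # 10 observed flips (1 = heads)
heads, tails = sum(flips), len(flips) - sum(flips)

# Uncertain adversary: theta ~ Beta(a, b); posterior after the flips is Beta(a+heads, b+tails)
a, b = 2, 2
print("uncertain adversary, posterior mean of theta:", beta(a + heads, b + tails).mean())

# Certain adversary: knows theta = 0.5 exactly, so the observed flips change nothing
print("certain adversary, belief about theta:", 0.5)
```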

Page 12: Differential Privacy in US Census


Uncertain Priors and Learning
• In many privacy scenarios, the statistician (advertiser, epidemiologist, machine learning operator, …) is the adversary.
• But the statistician does not have a certain prior … (otherwise there is no learning)
• Maybe we can model a class of “realistic” adversaries using uncertain priors.

Page 13: Differential Privacy in US Census


E-Privacy
• Counterfactual privacy with realistic adversaries

• Consider a set of data evolution scenarios (uncertain adversaries) {π}

• For every property sBob about Bob, and output of mechanism w

|log P(sBob| π, M(D) = w) - log P(sBob| π, M(D-Bob) = w) | ≤ ε

where P(sBob| π, M(D) = w) = ∫θ P(sBob| θ, M(D) = w) P(θ | π) dθ
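As an aside (not on the slides): the outer integral is just a prior-weighted average of certain-adversary posteriors. A hedged sketch with a discretized prior π; the function posterior_given_theta is a hypothetical placeholder for P(sBob | θ, M(D) = w).

```python
import numpy as np

thetas = np.linspace(0.01, 0.99, 99)            # candidate data-generating parameters
pi = np.exp(-((thetas - 0.3) ** 2) / 0.02)      # made-up uncertain prior, peaked near 0.3
pi /= pi.sum()

def posterior_given_theta(theta):
    # Placeholder for P(sBob | theta, M(D) = w); a real instantiation would compute
    # this from the mechanism, as in the noisy-count sketch a few slides back.
    return theta

# P(sBob | pi, M(D) = w) = sum over theta of P(sBob | theta, M(D) = w) * P(theta | pi)
print("P(sBob | pi, M(D) = w) ≈", float(np.sum(posterior_given_theta(thetas) * pi)))
```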

Page 14: Differential Privacy in US Census


Realistic Adversaries
• Suppose your domain has 3 values: green (g), red (r) and blue (b).
• Suppose individuals are assumed to be drawn from some common distribution θ = {pg, pr, pb}

[Figure: the simplex of distributions θ = {pg, pr, pb}, with corners {pg = 1, pr = 0, pb = 0}, {pg = 0, pr = 1, pb = 0}, and {pg = 0, pr = 0, pb = 1}]

Page 15: Differential Privacy in US Census


Modeling Realistic Adversaries
• E.g., Dirichlet D(αg, αr, αb) priors to model uncertainty.
– Maximum probability given to {p*g, p*r, p*b}, where p*g = αg / (αg + αr + αb)

[Figure: example Dirichlet priors D(6,2,2), D(3,7,5), D(2,3,4), and D(6,2,6); e.g., D(3,7,5) peaks at {p*g = 0.20, p*r = 0.47, p*b = 0.33} and D(6,2,6) at {p*g = 0.43, p*r = 0.14, p*b = 0.43}]

Page 16: Differential Privacy in US Census


E.g., Dirichlet Prior
• Dirichlet D(αg, αr, αb).
– Call α = (αg + αr + αb) the stubbornness of the prior.
– As α increases, more probability is given to {p*g, p*r, p*b}.
– When α → ∞, {p*g, p*r, p*b} has probability 1 and we get the independence assumption.

[Figure: Dirichlet densities D(2,3,4) (α = 10) and D(6,2,6) (α = 14)]
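An illustrative sketch (not from the slides) of stubbornness: Dirichlet draws with the same shape but a larger α concentrate more tightly around p* = αi / α.

```python
import numpy as np

rng = np.random.default_rng(0)

def spread_around_pstar(alpha_vec, n_samples=100_000):
    # Sample Dirichlet(alpha_vec) and report the mean L1 distance of a draw from p*.
    alpha_vec = np.asarray(alpha_vec, dtype=float)
    p_star = alpha_vec / alpha_vec.sum()
    draws = rng.dirichlet(alpha_vec, size=n_samples)
    return p_star, np.abs(draws - p_star).sum(axis=1).mean()

for scale in (1, 10, 100):                      # same shape, increasing stubbornness
    alpha_vec = scale * np.array([2, 3, 4])
    p_star, spread = spread_around_pstar(alpha_vec)
    print(f"stubbornness {int(alpha_vec.sum()):4d}: p* = {np.round(p_star, 2)}, "
          f"mean spread = {spread:.3f}")
```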

Page 17: Differential Privacy in US Census


Better Utility …
• Suppose we consider a class of uncertain adversaries, who are characterized by Dirichlet distributions with known stubbornness (α), but unknown shape (αg, αb, αr).
• Algorithm: Generalization
– Size of each group > α/ε − 1.
– If the size of a group = δ(α/ε − 1), then the most frequent sensitive value appears in at most a 1 − 1/(ε + δ) fraction of tuples in the group.
• Hence, we now also have a sliding scale to assess the power of adversaries (based on stubbornness)
– More α implies more coarsening implies less utility

Page 18: Differential Privacy in US Census


Summary
Lots of Open Questions:

• What is the relationship between counterfactual privacy and Pufferfish?

• In what other ways can causality theory be used? For defining correlations?

• What are other interesting ways to instantiate E-privacy, and what are efficient algorithms for E-privacy?

• …

Page 19: Differential Privacy in US Census


DIFFERENTIAL PRIVACY IN US CENSUS

Page 20: Differential Privacy in US Census


OnTheMap: A Census application that plots commuting patterns of workers

[Map: Workplace (Public); Residences (Sensitive)]

http://onthemap.ces.census.gov/

Page 21: Differential Privacy in US Census


OnTheMap: A Census application that plots commuting patterns of workers

Worker ID   Origin     Destination
1223        MD11511    DC22122
1332        MD2123     DC22122
1432        VA11211    DC22122
2345        PA12121    DC24132
1432        PA11122    DC24132
1665        MD1121     DC24132
1244        DC22122    DC22122

(Origins and destinations are Census blocks. Residence/Origin is sensitive; Workplace/Destination is the quasi-identifier.)

Page 22: Differential Privacy in US Census


Why publish commute patterns?
• To compute Quarterly Workforce Indicators
– Total employment
– Average earnings
– New hires & separations
– Unemployment statistics

E.g., the state of Missouri used this data to formulate a method allowing QWI to suggest industrial sectors where transitional training might be most effective … to proactively reduce time spent on unemployment insurance …

Page 23: Differential Privacy in US Census


A Synthetic Data Generator (Dirichlet resampling)

Step 1: Noise Addition (for each destination)
• Multi-set of origins for workers in Washington DC: D = (7, 5, 4) over the origin blocks (Washington DC, Somerset, Fuller).
• Noise (fake workers): A = (2, 3, 3).
• Noise-infused data: D + A = (9, 8, 7).
• Noise added to an origin with at least 1 worker is > 0.

Page 24: Differential Privacy in US Census


A Synthetic Data Generator (Dirichlet resampling)

Step 2: Dirichlet Resampling (for each destination)
• Draw a point at random from the noise-infused urn D + A and replace it with two of the same kind (e.g., (9, 8, 7) → (9, 9, 7)); each drawn point becomes a record in S, the synthetic data.
• If the frequency of block b in D + A is 0, then the frequency of b in S is 0, i.e., block b is ignored by the algorithm.
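A minimal Python sketch (mine, not the official implementation) of the two steps as described on these slides: add the fake workers, then repeatedly draw one point and put back two of the same kind, recording each draw as a synthetic worker. The example counts follow the slides; the synthetic population size of 16 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize(origin_counts, noise_counts, m):
    urn = np.asarray(origin_counts) + np.asarray(noise_counts)   # Step 1: D + A
    synthetic = np.zeros_like(urn)
    for _ in range(m):                                           # Step 2: Dirichlet resampling
        block = rng.choice(len(urn), p=urn / urn.sum())          # draw a point at random
        urn[block] += 1                                          # replace two of the same kind
        synthetic[block] += 1                                    # the drawn point joins S
    return synthetic

# D = (7, 5, 4) over (Washington DC, Somerset, Fuller); noise A = (2, 3, 3)
print(synthesize([7, 5, 4], [2, 3, 3], m=16))
```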

Page 25: Differential Privacy in US Census


How should we add noise?
• Intuitively, more noise yields more privacy …
• How much noise should we add? To which blocks should we add noise?
• Currently this is poorly understood.
– Total amount of noise added is a state secret.
– Only 3-4 people in the US know this value in the current implementation of OnTheMap.

Page 26: Differential Privacy in US Census



1. How much noise should we add?
2. To which blocks should we add noise?

Page 27: Differential Privacy in US Census


Privacy of Synthetic Data

Theorem 1: The Dirichlet resampling algorithm preserves privacy if and only if for every destination d, the noise added to each block is at least

m(d) / (e^ε − 1)

where m(d) is the size of the synthetic population for destination d and ε is the privacy parameter.

Page 28: Differential Privacy in US Census


1. How much noise should we add?
Noise required per block (differential privacy):

Privacy (e^ε =)             5      10     20     50
Noise per block (× 10^6)    0.25   0.11   0.05   0.02

(1 million original and synthetic workers; larger e^ε means lesser privacy.)

Add noise to every block on the map. There are 8 million Census blocks on the map! 1 million original workers and 16 billion fake workers!!!

2. To which blocks should we add noise?
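A quick arithmetic check (assuming, as reconstructed above, that the Theorem 1 bound is m(d)/(e^ε − 1)): it reproduces the noise-per-block row for m(d) = 10^6.

```python
def noise_per_block(m_d, e_eps):
    # Theorem 1 bound as reconstructed: at least m(d) / (e^eps - 1) fake workers per block
    return m_d / (e_eps - 1)

m_d = 1_000_000                                  # 1 million synthetic workers at destination d
for e_eps in (5, 10, 20, 50):
    print(e_eps, round(noise_per_block(m_d, e_eps)))
# -> 250000, 111111, 52632, 20408, i.e. 0.25, 0.11, 0.05, 0.02 (x 10^6)
```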

Page 29: Differential Privacy in US Census


Intuition behind Theorem 1.

Two possible inputs D1 and D2 (blue and red are two different origin blocks).
• Adversary knows individual 1 is either blue or red.
• Adversary knows individuals [2..n] are blue.

Page 30: Differential Privacy in US Census


Intuition behind Theorem 1.

Noise addition is applied to the two possible inputs D1 and D2 (blue and red are two different origin blocks).

Page 31: Differential Privacy in US Census


Intuition behind Theorem 1.

Noise-infused inputs D1 and D2 (blue and red are two different origin blocks) are passed through Dirichlet resampling. For every output O:

Pr[D1 → O] = 1/10 · 2/11 · 3/12 · 4/13 · 5/14 · 6/15
Pr[D2 → O] = 2/10 · 3/11 · 4/12 · 5/13 · 6/14 · 7/15
Pr[D2 → O] / Pr[D1 → O] = 7
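The two products above are Pólya-urn probabilities; a small sketch (with the urn sizes taken from this example) reproduces them and the ratio of 7.

```python
from fractions import Fraction

def polya_prob(start_red, start_total, red_draws):
    # Probability that a Polya urn (draw one, put back two of the same kind) starting
    # with start_red red points out of start_total yields red on red_draws consecutive draws.
    p = Fraction(1)
    for i in range(red_draws):
        p *= Fraction(start_red + i, start_total + i)
    return p

p_d1 = polya_prob(start_red=1, start_total=10, red_draws=6)   # D1: one red point after noise
p_d2 = polya_prob(start_red=2, start_total=10, red_draws=6)   # D2: two red points after noise
print(p_d1, p_d2, p_d2 / p_d1)                                # the ratio is exactly 7
```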

Page 32: Differential Privacy in US Census


Intuition behind Theorem 1.

For the noise-infused inputs D1 and D2 (blue and red are two different origin blocks) and every output O of Dirichlet resampling, the adversary infers that it is very likely individual 1 is red … unless the noise added is very large.

Page 33: Differential Privacy in US Census


Privacy Analysis: Summary
• Chose differential privacy.
– Guards against powerful adversaries.
– Measures privacy as a distance between prior and posterior.
• Derived necessary and sufficient conditions when OnTheMap preserves privacy.
• The above conditions make the data published by OnTheMap useless.

Page 34: Differential Privacy in US Census


But, breach occurs with very low probability.

For the noise-infused inputs D1 and D2 (blue and red are two different origin blocks), the breaching output O of Dirichlet resampling occurs with probability ≈ 10^-4.

Page 35: Differential Privacy in US Census


Negligible function

Definition: f(x) is negligible if it goes to 0 faster than the inverse of any polynomial, e.g., 2^-x and e^-x are negligible functions.

Page 36: Differential Privacy in US Census


(ε,δ)-Indistinguishability

For every pair of inputs D1 and D2 that differ in one value, and for any subset of outputs T = {O1, O2, O3, O4, …}:

Pr[D1 → T] ≤ e^ε · Pr[D2 → T] + δ(|D2|)

If T occurs with negligible probability, the adversary is allowed to distinguish between D1 and D2 by a factor > e^ε using outputs Oi in T.

Page 37: Differential Privacy in US Census


Conditions for (ε,δ)-Indistinguishability

Theorem 2: The Dirichlet resampling algorithm preserves (ε,δ)-indistinguishability if, for every destination d, the noise added to each block is at least

log n(d)

where n(d) is the number of workers commuting to d and m(d) ≤ n(d).

Page 38: Differential Privacy in US Census


Probabilistic Differential Privacy
• (ε,δ)-Indistinguishability is an asymptotic measure
– May not guarantee privacy when the number of workers at a destination is small.

Definition (Disclosure Set Disc(D, ε)): The set of output tables that breach ε-differential privacy for D and some other table D’ that differs from D in one value.

Page 39: Differential Privacy in US Census


Probabilistic Differential Privacy

For every pair of inputs D1 and D2 that differ in one value, and for every probable output O:

Pr[ Pr[D1 → O] / Pr[D2 → O] < e^ε ] > 1 − δ

i.e., the adversary may distinguish between D1 and D2 only on a set of unlikely outputs, which has probability at most δ.
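An illustrative brute-force check (not from the paper) of this definition for a tiny urn instance: enumerate every output sequence of the resampling step and sum the probability of those on which the likelihood ratio reaches e^ε; that mass plays the role of δ.

```python
import itertools
from fractions import Fraction

def polya_sequence_prob(urn, seq):
    # Probability that a Polya urn with initial counts `urn` produces the colour
    # sequence `seq` (draw one, put back two of the same colour).
    counts, p = list(urn), Fraction(1)
    for c in seq:
        p *= Fraction(counts[c], sum(counts))
        counts[c] += 1
    return p

def breach_mass(d1, d2, m, e_eps):
    # Total probability under D1 of outputs O whose likelihood ratio reaches e_eps.
    mass = Fraction(0)
    for seq in itertools.product(range(len(d1)), repeat=m):
        p1, p2 = polya_sequence_prob(d1, seq), polya_sequence_prob(d2, seq)
        if p1 >= e_eps * p2 or p2 >= e_eps * p1:
            mass += p1
    return float(mass)

# Two noise-infused inputs differing in one worker; a small synthetic population.
print(breach_mass((4, 3, 3), (3, 4, 3), m=4, e_eps=3))
```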

Page 40: Differential Privacy in US Census


1. How much noise should we add?
Noise required per block:

Privacy (e^ε =)                                   5         10        20       50
Differential privacy: noise per block             25×10^4   11×10^4   5×10^4   2×10^4
Probabilistic differential privacy (δ = 10^-5)    17.5      5.5       2.16     0.74

(1 million original and synthetic workers; larger e^ε means lesser privacy.)

Page 41: Differential Privacy in US Census


Prob. Differential Privacy: Summary
• Ignoring privacy breaches that occur due to low probability outputs drastically reduces noise.
• Two ways to bound low probability outputs:
– (ε,δ)-Indistinguishability and negligible functions: noise required for privacy ≥ log n(d) per block.
– (ε,δ)-Probabilistic differential privacy and disclosure sets: efficient algorithm to calculate noise per block (see paper).
• Does probabilistic differential privacy allow useful information to be published?

Page 42: Differential Privacy in US Census



2. To which blocks should we add noise?
Why not add noise to every block?

Page 43: Differential Privacy in US Census


Why not add noise to every block?
Noise required per block (probabilistic differential privacy):

Privacy (e^ε =)       5      10     20     50
Noise per block       17.5   5.5    2.16   0.74

(1 million original and synthetic workers; larger e^ε means lesser privacy.)

• There are about 8 million blocks on the map!
– Total noise added is about 6 million.
• Causes non-trivial spurious commute patterns.
– Roughly 1 million fake workers from the West Coast (out of a total of 7 million points in D+A).
– Hence, 1/7 of the synthetic data have residences on the West Coast and work in Washington DC.

Page 44: Differential Privacy in US Census


2. To which blocks should we add noise?
Adding noise to all blocks creates spurious commute patterns.
Why not add noise only to blocks that appear in the original data?

Page 45: Differential Privacy in US Census


Theorem 3: Adding noise only to blocks that appear in the data breaches privacy.

If a block b does not appear in the original data and no noise is added to b, then b cannot appear in the synthetic data.

Page 46: Differential Privacy in US Census


Theorem 3: Adding noise only to blocks that appear in the data breaches privacy.

• Worker W comes from Somerset or Fayette.
• No one else comes from there.
• If S has a synthetic worker from Somerset, then W comes from Somerset!!

[Two possible inputs: {Somerset: 1, Fayette: 0} and {Somerset: 0, Fayette: 1}]

Page 47: Differential Privacy in US Census


Ignoring outliers degrades utility

• Each of these points is an outlier.
• Together they contribute about half of the workers.

Page 48: Differential Privacy in US Census


Our solution to “Where to add noise?”
Step 1: Coarsen the domain

– Based on an existing public dataset (Census Transportation Planning Package, CTPP).

Page 49: Differential Privacy in US Census


Our solution to “Where to add noise?”
Step 1: Coarsen the domain
Step 2: Probabilistically drop blocks with 0 support
– Pick a function f: {b1, …, bk} → (0,1] (based on external data)
– For every block b with 0 support, ignore b with probability f(b)

Theorem 4: Parameter ε increases by max_b ( max( 2 · (noise per block), f(b) ) )
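A minimal sketch of Step 2 as stated above (my illustration, with made-up counts and drop probabilities; in practice f would be derived from external public data such as CTPP).

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_zero_support_blocks(original_counts, f):
    # Blocks with zero support in the original data are ignored (receive no noise,
    # so they can never appear in the output) with probability f[b]; others are kept.
    return [b for b, c in enumerate(original_counts)
            if c > 0 or rng.random() >= f[b]]

original_counts = [9, 0, 7, 0, 0, 3]        # hypothetical per-block worker counts
f = [0.9, 0.9, 0.9, 0.5, 0.05, 0.9]         # hypothetical drop probabilities from external data
print(prune_zero_support_blocks(original_counts, f))   # blocks kept in the coarsened domain
```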

Page 50: Differential Privacy in US Census


Utility of the provably private algorithm

Experimental Setup:
• OTM: Currently published OnTheMap data used as original data.
• All destinations in Minnesota.
• 120, 690 origins per destination – chosen by pruning out blocks that are > 100 miles from the destination.
• ε = 100, δ = 10^-5
• Additional leakage due to probabilistic pruning = 4 (min f(b) = 0.0378)

Utility measured by average commute distance for each destination block.

Page 51: Differential Privacy in US Census


Utility of the provably private algorithm
Utility measured by average commute distance for each destination block.
Short commutes have low error in both sparse and dense regions.

Page 52: Differential Privacy in US Census


Utility of the provably private algorithm

Long commutes in sparse regions are overestimated.

Page 53: Differential Privacy in US Census


OnTheMap: Summary
• OnTheMap: A real Census application.
– Synthetically generated data published for economic research.
– Currently, privacy implications are poorly understood.
• Parameters to the algorithm are a state secret.
• First formal privacy analysis of this application.
– Analyzed the privacy of OnTheMap using variants of differential privacy.
– First solutions to publish useful information despite sparse data.

Page 54: Differential Privacy in US Census


Next Class
• No class on Wednesday, Oct 31
• Guest Lecture: Prof. Jerry Reiter (Duke Stats), Friday Nov 2, “Privacy in the U.S. Census and Synthetic Data Generation”

Page 55: Differential Privacy in US Census


References

[M et al ICDE ‘08] A. Machanavajjhala, D. Kifer, J. Abowd, J. Gehrke, L. Vilhuber, “Privacy: From Theory to Practice on the Map”, ICDE 2008.

[M et al VLDB ‘09] A. Machanavajjhala, J. Gehrke, M. Gotz, “Data Publishing against Realistic Adversaries”, PVLDB 2(1), 2009.