
The Role of History and Prediction in Data Privacy

Kristen LeFevre

University of Michigan

May 13, 2009


Data Privacy

• Personal information collected every day:
– Healthcare, insurance information
– Supermarket transaction data
– RFID, GPS data
– E-mail
– Employment history
– Web search / clickstream


Data Privacy

• Legal, ethical, and technical issues surrounding:
– Data ownership
– Data collection
– Data dissemination and use

• Considerable recent interest from the technical community:
– High-profile mishaps and lawsuits
– Compliance with data-sharing mandates

Privacy Protection Technologies for Public Datasets

• Goal: Protect sensitive personal information while preserving data utility

• Privacy Policies and Mechanisms

• Example Policies:
– Protect individual identities
– Protect the values of sensitive attributes
– Differential privacy [Dwork 06]

• Example Mechanisms:
– Generalize (“coarsen”) the data
– Aggregate the data
– Add random noise to the data
– Add random noise to query results (see the sketch below)
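To make the last mechanism concrete, here is a minimal sketch (not from the talk) of adding random noise to a query answer via the Laplace mechanism standardly associated with differential privacy [Dwork 06]; the query, count, and epsilon value are illustrative assumptions.

```python
import numpy as np

def noisy_count(true_count, epsilon):
    # A count query changes by at most 1 when one person's record changes
    # (sensitivity 1), so adding Laplace(0, 1/epsilon) noise to the answer
    # yields epsilon-differential privacy. Sketch only, not the talk's method.
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical query: how many users passed this intersection at 7 AM?
print(noisy_count(true_count=128, epsilon=0.1))
```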


Observations

• Much work has focused on static data:
– One-time snapshot publishing
– Disclosure by composing multiple different snapshots of a static database [Xiao 07, Ganta 08]
– Auditing queries on a static database [Chin 81, Kenthapadi 06, …]

• What are the unique challenges when the data evolves over time?


Outline

• Sample Problem: Continuously publishing privacy-sensitive GPS traces
– Motivation & problem setup
– Framework for reasoning about privacy
– Algorithms for continuous publishing
– Experimental results

• Applications to other dynamic data (speculation)


GPS Traces (ongoing work with Wen Jin and Jignesh Patel)

• GPS devices attached to phones, cars
• Interest in collecting and distributing location traces in real time:
– Real-time traffic reporting
– Adaptive pricing / placement of outdoor ads
• Simultaneous concern for personal privacy
• Challenge: Can we continuously collect and publish location traces without compromising individual privacy?


Problem Setting

[Figure: GPS users report their locations to a central trace repository, which enforces a privacy policy and publishes a “sanitized” location snapshot to data recipients at each epoch (e.g., 7:00 AM, 7:05 AM).]

Problem Setting

• Finite population of n users with unique identifiers {u1, …, un}

• Assume users’ locations are reported and published in discrete epochs t1, t2, …

• Location snapshot D(tj):
– Associates each user with a location during epoch tj

• Publish a sanitized version D*(tj) at each epoch (sketched below)
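A minimal sketch of this setup; the names `Snapshot`, `Location`, and `sanitize` are illustrative assumptions, not from the talk. A snapshot maps user identifiers to locations, and a sanitized version is published each epoch:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Location = Tuple[float, float]        # (x, y) position during an epoch

@dataclass
class Snapshot:
    """D(t_j): associates each user with a location during epoch t_j."""
    epoch: int
    locations: Dict[str, Location]    # user identifier -> location

def publish_stream(snapshots, sanitize: Callable[[Snapshot], Snapshot]):
    """Yield the sanitized version D*(t_j) of each incoming snapshot."""
    for snapshot in snapshots:
        yield sanitize(snapshot)
```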


Threat Model

• Attacker wants to determine the location of a target user ui during epoch tj

• Auxiliary Information: Attacker knows location information during some other epochs (e.g., Yellow Pages)


Some Naïve Solutions

• Strawman 1: Replace users’ identifiers ({u1, …, un}) with pseudonyms ({p1, …, pn})
– Problem: Once the attacker “unmasks” user pi, he can track her location forever

• Strawman 2: New pseudonyms ({p1^j, …, pn^j}) at each epoch tj
– Problem: Users can still be tracked using multi-target tracking tools [Gruteser 05, Krumm 07]


Key Problem: Motion Prediction

[Figure: A road divided into segments 1–6. In one published snapshot, the cluster {Alice, Bob, Charlie} occupies segments 1–3; in the next, {Alice, Bob, Charlie} occupies segments 4–6. What if the speed limit is 60 mph? Motion prediction then links Alice’s position in the first snapshot to her position in the second.]

Threat Model

• Attacker wants to determine the location of a target user ui during epoch tj

• Auxiliary Information: Attacker knows location information during some other epochs (e.g., Yellow Pages)

• Motion prediction: Given one or more locations for ui, attacker can predict (probabilistically) ui’s location during following and preceding epochs


Privacy Principle: Temporal Unlinkability

• Consider an attacker who is able to identify (locate) a target user ui during m sequential epochs

• Under reasonable assumptions, he should not be able to locate ui with high confidence during any other epochs*

*Similar in spirit to “mix zones” [Beresford 03], which addressed a related problem in a less formal way.


Sanitization Mechanism

• Needed to select a sanitization mechanism; chose one for maximum flexibility

• Assign each user ui a consistent pseudonym pi

• Divide users into clusters:
– Within each cluster, break the association between pseudonyms and locations

• Release candidate for D(tj):
D*(tj) = {(C1(tj), L1(tj)), …, (CB(tj), LB(tj))}
– ∪ i=1..B Ci(tj) = {p1, …, pn}
– Ci(tj) ∩ Ch(tj) = ∅ (i ≠ h)
– Each Li(tj) contains the locations of the users in Ci(tj)
(A code sketch of this mechanism follows below.)
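A minimal sketch of this mechanism, assuming pseudonyms have already been assigned and a clustering is given; the function name and the sorting trick for hiding the association are illustrative, not the talk’s implementation:

```python
from typing import Dict, List, Tuple

Location = Tuple[float, float]

def sanitize(snapshot: Dict[str, Location],
             clusters: List[List[str]]) -> List[Tuple[List[str], List[Location]]]:
    """Build the release candidate D*(t_j) = {(C_1, L_1), ..., (C_B, L_B)}.

    The clusters must partition the pseudonyms; each cluster C_i is
    published with the multiset L_i of its members' locations, without
    revealing which pseudonym is at which location.
    """
    release = []
    for cluster in clusters:
        locations = sorted(snapshot[p] for p in cluster)  # drop ordering info
        release.append((list(cluster), locations))
    return release

# Example with four pseudonyms in two clusters, as on the next slide:
D = {"p1": (1.0, 0.0), "p2": (2.0, 0.0), "p3": (3.0, 0.0), "p4": (4.0, 0.0)}
print(sanitize(D, [["p1", "p2"], ["p3", "p4"]]))
```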


Sanitization Mechanism: Example

• Pseudonyms {p1, p2, p3, p4}

[Figure: Three published snapshots. At t0, clusters {p1, p2} and {p3, p4} are released with locations 1–4; at t1, the same clusters are released with locations 5–8; at t2, the cluster composition changes to {p1, p3} and {p2, p4}, released with locations 9–12. Within each cluster, the pseudonym-to-location association is hidden.]

Reasoning about Privacy

• How can we guarantee temporal unlinkability under the threats of auxiliary information and motion prediction (using the cluster-based sanitization mechanism)?

• Novel framework with two key components:
– A motion model describes location correlations between epochs
– A breach probability function describes an attacker’s ability to compromise temporal unlinkability


Motion Models

• Model motion using an h-step Markov chain:
– Conditional probability of a user’s location, given his locations during the h prior (or future) epochs
– The same motion model is used by the attacker and the publisher

• Forward motion model template:
– Pr[Loc(P,Tj) = Lj | Loc(P,Tj-1) = Lj-1, …, Loc(P,Tj-h) = Lj-h]

• Backward motion model template:
– Pr[Loc(P,Tj) = Lj | Loc(P,Tj+1) = Lj+1, …, Loc(P,Tj+h) = Lj+h]

• Independent and replaceable component:
– For this work, used a 1-step motion model based on a velocity distribution (speed and direction); see the sketch below
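A minimal sketch of a 1-step forward motion model in this spirit; the Gaussian speed distribution and its parameters are assumptions for illustration (the talk’s model also uses direction, which is omitted here):

```python
import math

def forward_prob(loc_prev, loc_cur, dt, mean_speed, sigma_speed):
    """Density for Pr[Loc(P,Tj) = loc_cur | Loc(P,Tj-1) = loc_prev] under
    a 1-step model where speed between epochs is roughly Gaussian."""
    dx = loc_cur[0] - loc_prev[0]
    dy = loc_cur[1] - loc_prev[1]
    speed = math.hypot(dx, dy) / dt           # speed implied by this move
    z = (speed - mean_speed) / sigma_speed
    return math.exp(-0.5 * z * z) / (sigma_speed * math.sqrt(2.0 * math.pi))

# A transition implying 120 km/h is far less likely than one implying 60 km/h
# (distances in km, dt in hours):
print(forward_prob((0.0, 0.0), (1.0, 0.0), dt=1/60, mean_speed=60, sigma_speed=15))
print(forward_prob((0.0, 0.0), (2.0, 0.0), dt=1/60, mean_speed=60, sigma_speed=15))
```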


Motion Models: Example

• Pseudonyms {p1, p2, p3, p4}
• Epochs t0, t1, t2

[Figure: Users p1–p4 in clusters {p1, p2} and {p3, p4} at epoch t0, candidate locations a–d at epoch t1, and the users again at epoch t2. The motion model assigns transition probabilities such as:
Pr[Loc(p1,t1) = a | Loc(p1,t0) = x]
Pr[Loc(p1,t1) = b | Loc(p1,t0) = x]
Pr[Loc(p1,t1) = a | Loc(p1,t2) = y] ]

Privacy Breaches

• Forward breach probability:
– Pr[Loc(P,Tj) = Lj | D(Tj-1), …, D(Tj-h), D*(Tj)]

• Backward breach probability:
– Pr[Loc(P,Tj) = Lj | D(Tj+1), …, D(Tj+h), D*(Tj)]

• Privacy Breach: Release candidate D*(Tj) causes a breach iff either of the following holds for a threshold C:
– max over P, Lj of Pr[Loc(P,Tj) = Lj | D(Tj-1), …, D(Tj-h), D*(Tj)] > C
– max over P, Lj of Pr[Loc(P,Tj) = Lj | D(Tj+1), …, D(Tj+h), D*(Tj)] > C


Privacy Breaches: Example

[Figure: Cluster {p1, p2} published at epochs t0 and t1; p1 is at location x and p2 at location y during t0, and locations a and b are published for the cluster at t1 (a second cluster {p3, p4} with locations c, d is shown analogously).]

e1 = Pr[Loc(p1,t1) = a | Loc(p1,t0) = x]
e2 = Pr[Loc(p1,t1) = b | Loc(p1,t0) = x]
e3 = Pr[Loc(p2,t1) = a | Loc(p2,t0) = y]
e4 = Pr[Loc(p2,t1) = b | Loc(p2,t0) = y]

Pr[Loc(p1,t1) = a | D(t0), D*(t1)] = (e1 · e4) / (e1 · e4 + e2 · e3)

…

Goal: Verify that all (forward and backward) breach probabilities are below the threshold C.

Checking for Breaches

• Does release candidate D*(Tj) cause a breach?

• Brute force algorithm– Exponential in release candidate cluster size

• Heuristic pruning tools– Reduce the search space considerably in

practice
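A minimal brute-force sketch, assuming the cluster-based release and a pluggable motion model (function names are illustrative; the talk’s pruning heuristics are not reproduced). It enumerates every pseudonym-to-location assignment within a cluster, exactly the exponential computation noted above, and on the two-user example from the previous slide it recovers the posterior (e1 · e4) / (e1 · e4 + e2 · e3) for (p1, a):

```python
from itertools import permutations

def forward_breach_probs(cluster, locations, prior_loc, motion_prob):
    """Compute Pr[Loc(p,t_j) = l | D(t_{j-1}), D*(t_j)] for all (p, l).

    cluster:     pseudonyms [p_1, ..., p_k] in one published cluster
    locations:   the published multiset of k locations
    prior_loc:   pseudonym -> location in the previous snapshot D(t_{j-1})
    motion_prob: motion_prob(prev, cur) from the forward motion model
    """
    mass = {(p, l): 0.0 for p in cluster for l in locations}
    total = 0.0
    for assignment in permutations(locations):   # k! assignments: exponential
        weight = 1.0
        for p, l in zip(cluster, assignment):
            weight *= motion_prob(prior_loc[p], l)
        total += weight
        for p, l in zip(cluster, assignment):
            mass[(p, l)] += weight
    return {pl: w / total for pl, w in mass.items()} if total > 0 else {}

def causes_breach(probs, C):
    # D*(t_j) causes a breach iff some posterior exceeds the threshold C.
    return any(prob > C for prob in probs.values())
```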


Publishing Algorithms

• How to publish useful data without causing a privacy breach?

• The cluster-based sanitization mechanism offers two main options:
– Increase cluster size (or change cluster composition)
– Reduce publication frequency


Publishing Algorithms

• General Case:
– At each epoch Tj, publish the most compact release candidate D*(Tj) that does not cause a breach
– Need to delay publishing until epoch Tj+h to check for backward breaches
– NP-hard optimization problem; proposed alternative heuristics (one greedy variant is sketched below)

• Special Case:
– Durable clusters (same individuals at each epoch)
– Motion model satisfies a symmetry property
– No need to delay publishing
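One plausible greedy heuristic in this spirit (an assumption for illustration; the talk’s actual heuristics are not specified in these slides): coarsen the clustering until the release candidate passes the breach check, suppressing the epoch entirely if even a single all-users cluster fails:

```python
def publish_epoch(clusters, causes_breach, merge_once):
    """Greedy sketch: grow clusters until the candidate D*(t_j) is safe.

    causes_breach(clusters) -> True if the release candidate built from
    `clusters` causes a forward or backward breach; merge_once(clusters)
    returns a coarser clustering, e.g., merging the two nearest clusters.
    """
    while causes_breach(clusters):
        if len(clusters) == 1:
            return None              # no safe release: suppress this epoch
        clusters = merge_once(clusters)
    return clusters                  # most compact safe candidate found
```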


Experimental Study

• Used real highway traffic data from the UM Transportation Research Institute:
– GPS data sampled from the cars of 72 volunteers
– Sampling rate (epoch) = 0.01 seconds
– Speed range 0–170 km/h

• Also synthetic data:
– Able to control the generative motion distribution


Experimental Study

• All static “snapshot” anonymization mechanisms are vulnerable to motion prediction attacks:
– Applied two representative algorithms (r-Gather [Aggarwal 06] and k-Condense [Aggarwal 04])
– Each produces a set of clusters with k users each

[Figures: experimental results for r-Gather and k-Condense.]

Speculation / Future Work

• The GPS example illustrates the importance of reasoning about data dynamics, history, and predictable patterns of change in data privacy

• Dynamic private data arises in other applications:
– E.g., longitudinal social science data
• Study subjects age predictably
• Most people don’t move very far
• Income changes predictably

• Hypothesis: History and prediction are important in these settings, too!