De-anonymizing Data


Transcript of De-anonymizing Data

Page 1: De-anonymizing Data

De-anonymizing Data

CompSci 590.03
Instructor: Ashwin Machanavajjhala

Source (http://xkcd.org/834/)

Page 2: De-anonymizing Data

Announcements
• Project ideas will be posted on the site by Friday.

– You are welcome to send me (or talk to me about) your own ideas.

Page 3: De-anonymizing Data

Outline
• Recap & Intro to Anonymization

• Algorithmically De-anonymizing Netflix Data

• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks

Page 4: De-anonymizing Data

Outline
• Recap & Intro to Anonymization

• Algorithmically De-anonymizing Netflix Data

• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks

Page 5: De-anonymizing Data

Personal Big-Data

[Diagram: Person 1 … Person N contribute records r1 … rN to databases (DB) held by Google, the Census, and hospitals; these databases are used by doctors, medical researchers, economists, information retrieval researchers, and recommendation algorithms.]

Page 6: De-anonymizing Data

The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]

Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
Voter List: Name, Address, Date Registered, Party affiliation, Date last voted
Attributes shared by both: Zip, Birth date, Sex

• Governor of MA uniquely identified using Zip Code, Birth Date, and Sex. Name linked to Diagnosis

Page 7: De-anonymizing Data

The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]

Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
Voter List: Name, Address, Date Registered, Party affiliation, Date last voted
Quasi Identifier (shared by both): Zip, Birth date, Sex (unique for 87% of the US population)

• Governor of MA uniquely identified using Zip Code, Birth Date, and Sex.
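A linkage attack of this kind is easy to reproduce on toy data. Below is a minimal sketch (my own illustration; the tables, column names, and pandas usage are hypothetical, not from the slides): joining the two tables on the quasi-identifier links a name to a diagnosis whenever the (Zip, Birth date, Sex) combination is unique.

import pandas as pd

# Toy stand-ins for the two datasets; all values and column names are made up.
medical = pd.DataFrame([
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1972-01-15", "sex": "F", "diagnosis": "asthma"},
])
voters = pd.DataFrame([
    {"name": "Alice", "zip": "02139", "dob": "1972-01-15", "sex": "F"},
    {"name": "Bob",   "zip": "02144", "dob": "1980-03-02", "sex": "M"},
])

# Join on the quasi-identifier: any unique (zip, dob, sex) combination re-identifies a patient.
linked = voters.merge(medical, on=["zip", "dob", "sex"])
print(linked[["name", "diagnosis"]])   # Alice is linked to "asthma"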

Page 8: De-anonymizing Data

Statistical Privacy (Trusted Collector) Problem

[Diagram: Individuals 1 … N send their records r1 … rN to a trusted server, which stores them in a database DB.
Utility: analysts can still learn useful aggregate information. Privacy: no breach about any individual.]

Page 9: De-anonymizing Data

Statistical Privacy (Untrusted Collector) Problem

[Diagram: Individuals 1 … N do not trust the server; each sends a perturbed version f(ri) of their record, and the server's database DB holds only these perturbed values.]

Page 10: De-anonymizing Data

Randomized Response
• Flip a coin
  – heads with probability p, and
  – tails with probability 1 - p    (p > ½)

• Answer question according to the following table:

           True Answer = Yes    True Answer = No
  Heads    Yes                  No
  Tails    No                   Yes
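To see why this protects individuals while still giving useful statistics, here is a minimal simulation sketch (my own illustration, with p = 0.75 as an example value): each respondent answers truthfully on heads and flips the answer on tails, and the analyst inverts the noise to recover the population fraction.

import random

def randomized_response(true_answer, p=0.75):
    # Heads (probability p): report the truth; tails (probability 1 - p): report the opposite.
    return true_answer if random.random() < p else not true_answer

def estimate_true_fraction(reports, p=0.75):
    # E[reported "Yes"] = p*pi + (1 - p)*(1 - pi), so pi = (observed - (1 - p)) / (2p - 1).
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

population = [random.random() < 0.3 for _ in range(100_000)]   # 30% true "Yes" answers
reports = [randomized_response(answer) for answer in population]
print(round(estimate_true_fraction(reports), 3))               # close to 0.3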

Page 11: De-anonymizing Data

Statistical Privacy (Trusted Collector) Problem

[Diagram (repeated): Individuals 1 … N send their records r1 … rN to a trusted server holding database DB.]

Page 12: De-anonymizing Data

Query Answering

[Diagram: Individuals 1 … N send records r1 … rN to a hospital database DB; analysts ask queries such as "How many allergy patients?" and "Correlate genome to disease."]

Page 13: De-anonymizing Data

Query Answering
• Need to know the list of questions up front

• Each answer will leak some information about individuals. After answering a few questions, the server will run out of its privacy budget and will not be able to answer any more questions.

• Will see this in detail later in the course.

Page 14: De-anonymizing Data

Anonymous/Sanitized Data Publishing

[Diagram: Individuals 1 … N send records r1 … rN to a hospital database DB. Analyst: "I won't tell you what questions I am interested in!" (Image: writingcenterunderground.wordpress.com)]

Page 15: De-anonymizing Data

Anonymous/Sanitized Data Publishing

[Diagram: Individuals 1 … N send records r1 … rN to a hospital database DB; the hospital publishes a sanitized version DB'. Analysts can answer any number of questions directly on DB', without any modifications.]

Page 16: De-anonymizing Data

Today’s class
• Identifying individual records and their sensitive values from data publishing (with insufficient sanitization).

Page 17: De-anonymizing Data

Outline
• Recap & Intro to Anonymization

• Algorithmically De-anonymizing Netflix Data

• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks

Page 18: De-anonymizing Data

Terms
• Coin tosses of an algorithm

• Union Bound

• Heavy Tailed Distribution

Page 19: De-anonymizing Data

Terms (contd.)
• Heavy Tailed Distribution

[Plot: the Normal distribution is not heavy tailed.]

Page 20: De-anonymizing Data

Terms (contd.)
• Heavy Tailed Distribution

[Plot: the Laplace distribution is heavy tailed.]

Page 21: De-anonymizing Data

Terms (contd.)
• Heavy Tailed Distribution

[Plot: the Zipf distribution is heavy tailed.]

Page 22: De-anonymizing Data

Terms (contd.)
• Cosine Similarity
  [Figure: two rating vectors with angle θ between them.]

• Collaborative filtering
  – Problem of recommending new items to a user based on their ratings on previously seen items.
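For concreteness, a small sketch of cosine similarity between two users' rating vectors, using the standard formula cos θ = x·y / (||x|| ||y||) (the example vectors are my own):

import math

def cosine_similarity(x, y):
    # cos(theta) = dot(x, y) / (||x|| * ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Two users' ratings on the same five movies (0 = not rated).
print(cosine_similarity([3, 4, 2, 1, 5], [3, 3, 0, 0, 5]))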

Page 23: De-anonymizing Data

Netflix Dataset

[Figure: a sparse user-by-movie ratings matrix. Rows are users, columns are movies (attributes); each non-empty cell holds a rating plus a timestamp. Each row is one record r.]
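One natural way to hold such a sparse matrix in memory (a sketch of my own, not the paper's code) is a dictionary per user that stores only the movies actually rated:

# Each record r maps movie -> (rating, timestamp); unrated movies are simply absent.
ratings = {
    "user_1": {"m1": (3, "2005-01-02"), "m2": (4, "2005-01-09"), "m5": (5, "2005-02-11")},
    "user_2": {"m2": (1, "2004-12-25"), "m3": (1, "2005-03-01")},
}

# The non-null attributes of a record are just its keys.
print(sorted(ratings["user_1"]))   # ['m1', 'm2', 'm5']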

Page 24: De-anonymizing Data

Definitions
• Support

– Set (or number) of non-null attributes in a record or column

• Similarity

• Sparsity
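The slide's formulas for Similarity and Sparsity did not survive the transcript. The sketch below follows my reading of Narayanan and Shmatikov (SSP 2008), so the exact functional forms should be treated as assumptions: per-attribute similarity is aggregated across the attributes two records share, and a dataset is sparse when a random record has no other record close to it.

def support(record):
    # Support: set of non-null attributes of a record (here, the movies a user rated).
    return set(record)

def attr_similarity(a, b, rating_scale=4):
    # Per-attribute similarity on a 1-5 rating scale: 1 when equal, decaying with the gap (assumed form).
    return 1 - abs(a - b) / rating_scale

def similarity(r1, r2):
    # Record-level similarity, averaged over shared attributes (assumed aggregation).
    common = support(r1) & support(r2)
    if not common:
        return 0.0
    return sum(attr_similarity(r1[m], r2[m]) for m in common) / len(common)

r1 = {"m1": 3, "m2": 4, "m5": 5}
r2 = {"m2": 4, "m5": 4, "m7": 1}
print(similarity(r1, r2))   # 0.875; a dataset is "sparse" if such near-matches between
                            # distinct records are rare.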

Page 25: De-anonymizing Data

Adversary Model
• Aux(r) – some subset of attributes from r

Page 26: De-anonymizing Data

Privacy Breach
• Definition 1: An algorithm A outputs an r' such that

• Definition 2: (When only a sample of the dataset is input)

Page 27: De-anonymizing Data

Algorithm: Scoreboard

• For each record r’, compute Score(r’, aux) to be the minimum similarity of an attribute in aux to the same attribute in r’.

• Pick r’ with the maximum score

OR

• Return all records with Score > α
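A minimal sketch of Scoreboard as described above (my own code; the per-attribute similarity function is left as a parameter, since the slide does not fix one):

def scoreboard(dataset, aux, attr_sim, alpha=None):
    # dataset: {record_id: {attribute: value}}; aux: {attribute: value} known to the adversary;
    # attr_sim: per-attribute similarity in [0, 1].
    def score(record):
        # Minimum similarity over the aux attributes, as on the slide; missing attributes score 0.
        return min(attr_sim(aux[a], record[a]) if a in record else 0.0 for a in aux)

    scores = {rid: score(rec) for rid, rec in dataset.items()}
    if alpha is not None:
        return [rid for rid, s in scores.items() if s > alpha]   # all candidates above threshold
    return max(scores, key=scores.get)                           # single best match

With exact auxiliary information, attr_sim can be as simple as lambda a, b: 1.0 if a == b else 0.0; with noisy knowledge of ratings or dates, a tolerance-based similarity would be used instead.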

Page 28: De-anonymizing Data

Analysis
Theorem 1: Suppose we use Scoreboard with α = 1 – ε. If Aux contains m randomly chosen attributes such that m > log(N/ε) / log(1/(1 – δ)), then Scoreboard returns a record r' such that

Pr[ Sim(Aux, r') > 1 – ε – δ ] > 1 – ε

Page 29: De-anonymizing Data

Proof of Theorem 1
• Call r' a false match if Sim(Aux, r') < 1 – ε – δ.

• For any false match and a randomly chosen attribute i, Pr[ Sim(Aux_i, r'_i) > 1 – ε ] < 1 – δ

• Sim(Aux, r') = min_i Sim(Aux_i, r'_i)

• Therefore, Pr[ Sim(Aux, r') > 1 – ε ] < (1 – δ)^m

• By the union bound over all N records, Pr[ some false match has similarity > 1 – ε ] < N(1 – δ)^m

• N(1 – δ)^m < ε when m > log(N/ε) / log(1/(1 – δ))
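As a sanity check on this bound (the numbers below are purely illustrative, not from the paper):

import math

N, eps, delta = 500_000, 0.05, 0.5    # illustrative values
m = math.log(N / eps) / math.log(1 / (1 - delta))
print(math.ceil(m))                   # 24: a few dozen known attributes already suffice here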

Page 30: De-anonymizing Data

Other results
• If dataset D is (1 – ε – δ, ε)-sparse, then D can be (1, 1 – ε)-deanonymized.

• Analogous results hold when a list of candidate records is returned

Page 31: De-anonymizing Data

Netflix Dataset
• Slightly different algorithm

Page 32: De-anonymizing Data

Summary of Netflix Paper
• Adversary can use a subset of ratings made by a user to uniquely identify the user’s record from the “anonymized” dataset with high probability

• Simple Scoreboard algorithm provably guarantees identification of records.

• A variant of Scoreboard can de-anonymize Netflix dataset.

• Algorithms are robust to noise in the adversary’s background knowledge

Page 33: De-anonymizing Data

Outline
• Recap & Intro to Anonymization

• Algorithmically De-anonymizing Netflix Data

• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks

Page 34: De-anonymizing Data

Social Network Data

• Social networks: graphs where each node represents a social entity, and each edge represents a certain relationship between two entities

• Examples: email communication graphs, social interactions as in Facebook, Yahoo! Messenger, etc.

Page 35: De-anonymizing Data

Anonymizing Social Networks

• Naïve anonymization
  – removes the label of each node and publishes only the structure of the network

• Information Leaks
  – Nodes may still be re-identified based on network structure

[Figure: the anonymized network over seven individuals (Alice, Bob, Cathy, Diane, Ed, Fred, Grace), with all node labels removed.]

Page 36: De-anonymizing Data

Passive Attacks on an Anonymized Network

• Consider the above email communication graph
  – Each node represents an individual
  – Each edge between two individuals indicates that they have exchanged emails


Page 37: De-anonymizing Data

Passive Attacks on an Anonymized Network

• Alice has sent emails to three individuals only


Page 38: De-anonymizing Data

Passive Attacks on an Anonymized Network

• Alice has sent emails to three individuals only
• Only one node in the anonymized network has degree three
• Hence, Alice can re-identify herself
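The same degree-based reasoning can be automated on any anonymized edge list. A minimal sketch (my own illustration with networkx; the graph below is hypothetical, not the one on the slides): every node whose degree is unique in the published graph can be pinned down exactly.

from collections import Counter
import networkx as nx

# Hypothetical anonymized graph: labels replaced by numeric IDs.
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (3, 5), (3, 6), (3, 7), (4, 5)])

degree_counts = Counter(d for _, d in G.degree())
unique_degree_nodes = [n for n, d in G.degree() if degree_counts[d] == 1]
print(unique_degree_nodes)   # nodes re-identifiable from their degree alone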


Page 39: De-anonymizing Data

Passive Attacks on an Anonymized Network

• Cathy has sent emails to five individuals


Page 40: De-anonymizing Data

Passive Attacks on an Anonymized Network

• Cathy has sent emails to five individuals
• Only one node has degree five
• Hence, Cathy can re-identify herself


Page 41: De-anonymizing Data

Passive Attacks on an Anonymized Network

• Now consider that Alice and Cathy share their knowledge about the anonymized network

• What can they learn about the other individuals?


Page 42: De-anonymizing Data

Passive Attacks on an Anonymized Network

• First, Alice and Cathy know that only Bob has sent emails to both of them


Page 43: De-anonymizing Data

Passive Attacks on an Anonymized Network

• First, Alice and Cathy know that only Bob has sent emails to both of them

• Bob can be identified


Page 44: De-anonymizing Data

Passive Attacks on an Anonymized Network

• Alice has sent emails to Bob, Cathy, and Ed only


Page 45: De-anonymizing Data

Passive Attacks on an Anonymized Network

• Alice has sent emails to Bob, Cathy, and Ed only
• Ed can be identified


Page 46: De-anonymizing Data

Passive Attacks on an Anonymized Network

• Alice and Cathy can learn that Bob and Ed are connected


Page 47: De-anonymizing Data

Passive Attacks on an Anonymized Network

• The above attack is based on knowledge about degrees of nodes. [Liu and Terzi, SIGMOD 2008]

• More sophisticated attacks can be launched given additional knowledge about the network structure, e.g., a subgraph of the network. [Zhou and Pei, ICDE 2008; Hay et al., VLDB 2008]

• Protecting privacy becomes even more challenging when the nodes in the anonymized network are labeled. [Pang et al., SIGCOMM CCR 2006]

Page 48: De-anonymizing Data

Inferring Sensitive Values on a Network
• Each individual has a single sensitive attribute.
  – Some individuals share the sensitive attribute, while others keep it private

• GOAL: Infer the private sensitive attributes using
  – Links in the social network
  – Groups that the individuals belong to

• Approach: Learn a predictive model (think classifier) using public profiles as training data. [Zheleva and Getoor, WWW 2009]

Page 49: De-anonymizing Data

Inferring Sensitive Values on a Network
• Baseline: Most commonly appearing sensitive value amongst all public profiles.

Page 50: De-anonymizing Data

Inferring Sensitive Values on a Network

• LINK: Each node x has a list of binary features Lx, one for every node in the social network.
  – Feature value Lx[y] = 1 if and only if (x, y) is an edge.
  – Train a model on all pairs (Lx, sensitive value(x)), for x’s with public sensitive values.
  – Use learnt model to predict private sensitive values (see the sketch below).
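A minimal sketch of the LINK predictor (my own illustration with scikit-learn; the specific model, logistic regression, and the toy data are assumptions, since the slide does not name them):

import numpy as np
from sklearn.linear_model import LogisticRegression

nodes = ["a", "b", "c", "d", "e"]
edges = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")}
public = {"a": 1, "b": 1, "c": 0, "d": 0}   # sensitive values shared publicly (toy data)

def link_features(x):
    # L_x[y] = 1 iff (x, y) is an edge: one binary feature per node in the network.
    return [1 if (x, y) in edges or (y, x) in edges else 0 for y in nodes]

X = np.array([link_features(x) for x in public])
y = np.array([public[x] for x in public])
model = LogisticRegression().fit(X, y)
print(model.predict([link_features("e")]))   # predicted sensitive value for the private profile "e"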

Page 51: De-anonymizing Data

Inferring Sensitive Values on a Network

• GROUP: Each node x has a list of binary features Gx, one for every group in the social network.
  – Feature value Gx[y] = 1 if and only if x belongs to group y.
  – Train a model on all pairs (Gx, sensitive value(x)), where x’s sensitive value is public.
  – Use model to predict private sensitive values
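GROUP is the same pipeline with group memberships in place of edges; only the feature function changes (again a sketch of my own, with hypothetical group names):

groups = {"g1": {"a", "b", "e"}, "g2": {"c", "d"}, "g3": {"b", "c", "e"}}

def group_features(x):
    # G_x[g] = 1 iff x belongs to group g: one binary feature per group.
    return [1 if x in members else 0 for members in groups.values()]

Training and prediction then proceed exactly as in the LINK sketch above.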

Page 52: De-anonymizing Data

Inferring Sensitive Values on a Network

            Flickr        Facebook    Facebook            Dogster
            (Location)    (Gender)    (Political View)    (Dog Breed)
Baseline    27.7%         50%         56.5%               28.6%
LINK        56.5%         68.6%       58.1%               60.2%
GROUP       83.6%         77.2%       46.6%               82.0%

[Zheleva and Getoor, WWW 2009]

Page 53: De-anonymizing Data

Active Attacks on Social Networks [Backstrom et al., WWW 2007]

• Attacker may create a few nodes in the graph
  – Creates a few ‘fake’ Facebook user accounts.

• Attacker may add edges from the new nodes.
  – Create friends using ‘fake’ accounts.

• Goal: Discover an edge between two legitimate users.

Page 54: De-anonymizing Data

High Level View of Attack
• Step 1: Create a graph structure with the ‘fake’ nodes such that it can be identified in the anonymous data.

Page 55: De-anonymizing Data

High Level View of Attack
• Step 2: Add edges from the ‘fake’ nodes to real nodes.

Page 56: De-anonymizing Data

High Level View of Attack
• Step 3: From the anonymized data, identify the fake graph due to its special graph structure.

Page 57: De-anonymizing Data

High Level View of Attack
• Step 4: Deduce edges by following links

Page 58: De-anonymizing Data

Details of the Attack
• Choose k real users W = {w_1, …, w_k}
• Create k fake users X = {x_1, …, x_k}
• Create edges (x_i, w_i)
• Create edges (x_i, x_{i+1})
• Create all other edges within X with probability 0.5. (A construction sketch follows below.)

[Figure: the fake subgraph X attached to the large social graph.]
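A sketch of this construction (my own code; the target users and the value of k are illustrative):

import random
import networkx as nx

def plant_attack_subgraph(G, targets):
    # Add fake nodes x_1..x_k, attach x_i to target w_i, chain x_i - x_{i+1},
    # and add each remaining edge inside X independently with probability 0.5.
    k = len(targets)
    fake = [f"x{i}" for i in range(1, k + 1)]
    G.add_nodes_from(fake)
    for xi, wi in zip(fake, targets):
        G.add_edge(xi, wi)                    # edge (x_i, w_i) to a real user
    for i in range(k - 1):
        G.add_edge(fake[i], fake[i + 1])      # path edges (x_i, x_{i+1})
    for i in range(k):
        for j in range(i + 2, k):             # all remaining pairs inside X
            if random.random() < 0.5:
                G.add_edge(fake[i], fake[j])
    return fake

G = nx.erdos_renyi_graph(1000, 0.01)          # stand-in for the large real network
fake_nodes = plant_attack_subgraph(G, targets=[3, 57, 214, 599, 801])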

Page 59: De-anonymizing Data

Why does it work?
• Given a graph G and a set of nodes S, G[S] = the graph induced by the nodes in S.

• There is an isomorphism between two sets of nodes S, S' if
  – there is a function f mapping each node in S to a node in S'
  – (u, v) is an edge in G[S] if and only if (f(u), f(v)) is an edge in G[S']

• An isomorphism from S to S is called an automorphism
  – Think: permuting the nodes

Page 60: De-anonymizing Data

Why does it work?

• There is no other subset of nodes S such that G[S] is isomorphic to G[X] (call it H).

• H can be efficiently found from G.
• H has no non-trivial automorphisms.

[Figure: H planted in the large graph of size N.]

Page 61: De-anonymizing Data

Recovery
• Subgraph isomorphism is NP-hard
  – i.e., finding X could be hard in general.

• But since X has a path with random edges, there is a simple brute-force-with-pruning search algorithm.

Run Time: O(N · 2^{O((log log N)^2)})

Page 62: De-anonymizing Data

Works in Real Life!
• LiveJournal: 4.4 million nodes, 77 million edges

• Success all but guaranteed by adding 10 nodes.

• Recovery typically takes a second.

[Chart: probability of a successful attack.]

[Backstrom et al., WWW 2007]

Page 63: De-anonymizing Data

Summary of Social Networks
• Nodes in a graph can be re-identified using background knowledge of the structure of the graph

• Link and group structure provide valuable information for accurately inferring private sensitive values.

• Active attacks that add nodes and edges are shown to be very successful.

• Guarding against these attacks is an open area for research!

Page 64: De-anonymizing Data

Next Class
• K-Anonymity + Algorithms: How to limit de-anonymization?

Page 65: De-anonymizing Data

References
L. Sweeney, “k-Anonymity: a model for protecting privacy”, IJUFKS 2002.
A. Narayanan & V. Shmatikov, “Robust De-anonymization of Large Sparse Datasets”, SSP 2008.
L. Backstrom, C. Dwork & J. Kleinberg, “Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography”, WWW 2007.
E. Zheleva & L. Getoor, “To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles”, WWW 2009.