De-anonymizing Data

Transcript of De-anonymizing Data
Lecture 2 : 590.03 Fall 12 1
De-anonymizing Data
CompSci 590.03. Instructor: Ashwin Machanavajjhala
Source (http://xkcd.org/834/)
Announcements
• Project ideas will be posted on the site by Friday.
– You are welcome to send me (or talk to me about) your own ideas.
Outline
• Recap & Intro to Anonymization
• Algorithmically De-anonymizing Netflix Data
• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks
Outline
• Recap & Intro to Anonymization
• Algorithmically De-anonymizing Netflix Data
• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks
Personal Big-Data

(Diagram: persons 1, …, N each contribute a record r1, …, rN to databases such as a Census DB or a Hospital DB, which are in turn used by doctors, medical researchers, economists, information-retrieval researchers, and recommendation algorithms.)
The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]

Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
Voter List: Name, Address, Date Registered, Party affiliation, Date last voted
Attributes in both: Zip, Birth date, Sex

• The Governor of MA was uniquely identified using Zip Code, Birth Date, and Sex; his name was thereby linked to his diagnosis.
The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]
• {Zip, Birth date, Sex} form a quasi-identifier: 87% of the US population is uniquely identified by this combination.
Statistical Privacy (Trusted Collector) Problem

(Diagram: individuals 1, …, N send their records r1, …, rN to a trusted server, which stores them in a database DB.)

Utility: …
Privacy: No breach about any individual
Statistical Privacy (Untrusted Collector) Problem

(Diagram: individuals 1, …, N each apply a randomizing function f to their record, and send f(r1), …, f(rN) to the untrusted server's DB.)
Randomized Response
• Flip a coin:
  – heads with probability p, and
  – tails with probability 1 – p (p > ½).
• Answer the question according to the following table:

          True Answer = Yes    True Answer = No
  Heads   Yes                  No
  Tails   No                   Yes
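As a sketch, randomized response and the corresponding unbiased estimate of the true "yes" fraction can be simulated as follows (the 30% true rate and p = 0.75 are illustrative choices, not values from the lecture):

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Report the true answer on heads (probability p), the opposite on tails."""
    return truth if random.random() < p else not truth

def estimate_yes_fraction(responses, p: float = 0.75) -> float:
    """Invert the noise: E[reported yes] = p*f + (1-p)*(1-f),
    so f = (reported - (1-p)) / (2p - 1)."""
    reported = sum(responses) / len(responses)
    return (reported - (1 - p)) / (2 * p - 1)

random.seed(0)
truths = [random.random() < 0.3 for _ in range(100_000)]   # ~30% true "yes"
reports = [randomized_response(t) for t in truths]
print(round(estimate_yes_fraction(reports), 3))            # close to 0.3
```

Note that any single reported answer is deniable (it may be a coin-flip artifact), yet the population fraction is still recoverable.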
Statistical Privacy (Trusted Collector) Problem

(Diagram as before: individuals 1, …, N send r1, …, rN to the trusted server's DB.)
Query Answering

(Diagram: individuals 1, …, N contribute records r1, …, rN to a Hospital DB; analysts pose queries such as "How many allergy patients?" and "Correlate genome to disease".)
Query Answering
• Need to know the list of questions up front.
• Each answer leaks some information about individuals. After answering a few questions, the server will run out of privacy budget and not be able to answer any more.
• Will see this in detail later in the course.
Anonymous/Sanitized Data Publishing

(Diagram: individuals 1, …, N contribute records r1, …, rN to a Hospital DB; the analyst says: "I won't tell you what questions I am interested in!")
Image source: writingcenterunderground.wordpress.com
Anonymous/Sanitized Data Publishing

(Diagram: the Hospital DB is sanitized into a published dataset DB'. Analysts can then answer any number of questions directly on DB', without any modifications.)
Today’s class
• Identifying individual records and their sensitive values from data publishing (with insufficient sanitization).
Outline
• Recap & Intro to Anonymization
• Algorithmically De-anonymizing Netflix Data
• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks
Terms
• Coin tosses of an algorithm: the algorithm's internal source of randomness.
• Union Bound: Pr[A1 ∪ … ∪ An] ≤ Pr[A1] + … + Pr[An].
• Heavy Tailed Distribution: a distribution whose tail decays more slowly than an exponential.
Terms (contd.)
• Heavy Tailed Distribution
(Plot: the Normal distribution. Not heavy tailed.)
Terms (contd.)
• Heavy Tailed Distribution
(Plot: the Laplace distribution. Heavy tailed.)
Terms (contd.)
• Heavy Tailed Distribution
(Plot: the Zipf distribution. Heavy tailed.)
Terms (contd.)
• Cosine Similarity: cos θ = (x · y) / (‖x‖ ‖y‖), the cosine of the angle θ between two vectors.
• Collaborative filtering: the problem of recommending new items to a user based on their ratings of previously seen items.
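A minimal sketch of cosine similarity over rating vectors (the two rating vectors are made up for illustration):

```python
import math

def cosine_similarity(x, y):
    """cos(theta) between vectors x and y: (x . y) / (|x| |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# Ratings by two users on the same 5 movies (0 = unrated):
alice = [3, 4, 2, 0, 5]
bob   = [3, 5, 1, 0, 4]
print(cosine_similarity(alice, bob))   # close to 1: very similar taste
```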
Netflix Dataset

(Matrix: rows are users, columns are movies; each non-empty cell holds a rating + timestamp. A row is a record r, and a column is an attribute. Most cells are empty: the matrix is sparse.)
Definitions
• Support: supp(r), the set (or number) of non-null attributes in a record or column.
• Similarity: Sim(r, r′), a measure of how close two records are on their shared attributes (e.g., based on cosine similarity).
• Sparsity: a dataset is sparse if, with high probability, no record is too similar to any other record.
Adversary Model
• Aux(r): some subset of attributes from r.
Privacy Breach
• Definition 1: An algorithm A outputs an r′ such that …
• Definition 2 (when only a sample of the dataset is input): …
Algorithm: Scoreboard
• For each record r′, compute Score(r′, aux) = the minimum similarity of an attribute in aux to the same attribute in r′.
• Pick the r′ with the maximum score, OR
• Return all records with Score > α.
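A minimal sketch of Scoreboard. The dataset, the movie/record names, and the per-attribute similarity (a simple closeness of ratings on a 1-5 scale) are all illustrative assumptions, not the paper's exact choices:

```python
def score(candidate: dict, aux: dict) -> float:
    """Minimum per-attribute similarity between aux and the candidate record.
    Records are {attribute: rating}; ratings assumed on a 1-5 scale."""
    sims = []
    for attr, val in aux.items():
        if attr not in candidate:
            sims.append(0.0)                                  # attribute absent
        else:
            sims.append(1.0 - abs(candidate[attr] - val) / 4.0)
    return min(sims)

def scoreboard(dataset: dict, aux: dict, alpha: float):
    """Return all record ids whose score exceeds alpha."""
    return [rid for rid, rec in dataset.items() if score(rec, aux) > alpha]

dataset = {
    "r1": {"m1": 3, "m2": 4, "m3": 2},
    "r2": {"m1": 1, "m2": 1, "m4": 1},
    "r3": {"m1": 5, "m3": 5, "m4": 1},
}
aux = {"m1": 3, "m2": 4}            # adversary knows two of the target's ratings
print(scoreboard(dataset, aux, 0.9))   # -> ['r1']
```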
Analysis
Theorem 1: Suppose we use Scoreboard with α = 1 – ε. If Aux contains m randomly chosen attributes such that
  m > log(N/ε) / log(1/(1 – δ)),
then Scoreboard returns a record r′ such that
  Pr[ Sim(Aux, r′) > 1 – ε – δ ] > 1 – ε.
Proof of Theorem 1
• Call r′ a false match if Sim(Aux, r′) < 1 – ε – δ.
• For any false match and a randomly chosen attribute i, Pr[ Sim(Auxi, ri′) > 1 – ε ] < 1 – δ.
• Sim(Aux, r′) = mini Sim(Auxi, ri′).
• Therefore, for a false match, Pr[ Sim(Aux, r′) > 1 – ε ] < (1 – δ)^m.
• By the union bound, Pr[ some false match has similarity > 1 – ε ] < N(1 – δ)^m.
• N(1 – δ)^m < ε when m > log(N/ε) / log(1/(1 – δ)).
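The bound above can be checked numerically. In this sketch the dataset size N = 500,000 and the choices ε = 0.05, δ = 0.2 are illustrative assumptions, not values from the lecture:

```python
import math

def min_aux_size(n_records: int, eps: float, delta: float) -> int:
    """Smallest m with N(1-delta)^m < eps, i.e. m > log(N/eps)/log(1/(1-delta))."""
    return math.ceil(math.log(n_records / eps) / math.log(1 / (1 - delta)))

# Netflix-scale example: half a million records.
m = min_aux_size(500_000, eps=0.05, delta=0.2)
print(m)  # -> 73
assert 500_000 * (1 - 0.2) ** m < 0.05   # the union bound indeed drops below eps
```

The point of the calculation: even for a very large dataset, a few dozen known attributes suffice, because N enters the bound only logarithmically.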
Other results
• If dataset D is (1 – ε – δ, ε)-sparse, then D can be (1, 1 – ε)-deanonymized.
• Analogous results hold when a list of candidate records is returned.
Netflix Dataset
• A slightly different algorithm is used for the actual Netflix data.
Summary of Netflix Paper
• An adversary can use a subset of ratings made by a user to uniquely identify the user’s record in the “anonymized” dataset with high probability.
• The simple Scoreboard algorithm provably guarantees identification of records.
• A variant of Scoreboard can de-anonymize the Netflix dataset.
• The algorithms are robust to noise in the adversary’s background knowledge.
Outline
• Recap & Intro to Anonymization
• Algorithmically De-anonymizing Netflix Data
• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks
Social Network Data
• Social networks: graphs where each node represents a social entity and each edge represents a relationship between two entities.
• Examples: email communication graphs, and social interactions as in Facebook, Yahoo! Messenger, etc.
Anonymizing Social Networks
• Naïve anonymization
  – removes the label of each node and publishes only the structure of the network
• Information Leaks
  – Nodes may still be re-identified based on network structure

(Example graph, used throughout the following slides: Alice, Bob, Cathy, Diane, Ed, Fred, Grace.)
Passive Attacks on an Anonymized Network
• Consider the above email communication graph:
  – Each node represents an individual.
  – Each edge between two individuals indicates that they have exchanged emails.
Passive Attacks on an Anonymized Network
• Alice has sent emails to three individuals only
Passive Attacks on an Anonymized Network
• Alice has sent emails to three individuals only.
• Only one node in the anonymized network has degree three.
• Hence, Alice can re-identify herself.
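The degree-based re-identification can be sketched as follows. The edge list is a hypothetical graph consistent with the slides (Alice's node is the only one of degree three, Cathy's the only one of degree five); the letters stand for the hidden node labels of the anonymized graph:

```python
from collections import Counter

# Hypothetical anonymized email graph: A..G are opaque node ids.
edges = [("A", "B"), ("A", "C"), ("A", "E"), ("B", "C"), ("B", "E"),
         ("B", "F"), ("C", "D"), ("C", "F"), ("C", "G"), ("D", "F"),
         ("F", "G")]

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# A node with a unique degree is re-identifiable by anyone who knows
# that degree (e.g. Alice knows she emailed exactly three people).
deg_counts = Counter(degree.values())
unique = [n for n, d in degree.items() if deg_counts[d] == 1]
print(sorted(unique))   # -> ['A', 'C']  (Alice and Cathy, in this sketch)
```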
Passive Attacks on an Anonymized Network
• Cathy has sent emails to five individuals
Passive Attacks on an Anonymized Network
• Cathy has sent emails to five individuals.
• Only one node has degree five.
• Hence, Cathy can re-identify herself.
Passive Attacks on an Anonymized Network
• Now consider that Alice and Cathy share their knowledge about the anonymized network
• What can they learn about the other individuals?
Passive Attacks on an Anonymized Network
• First, Alice and Cathy know that only Bob has sent emails to both of them.
Passive Attacks on an Anonymized Network
• First, Alice and Cathy know that only Bob has sent emails to both of them.
• Bob can be identified.
Passive Attacks on an Anonymized Network
• Alice has sent emails to Bob, Cathy, and Ed only
Passive Attacks on an Anonymized Network
• Alice has sent emails to Bob, Cathy, and Ed only.
• Ed can be identified.
Passive Attacks on an Anonymized Network
• Alice and Cathy can learn that Bob and Ed are connected
Passive Attacks on an Anonymized Network
• The above attack is based on knowledge of node degrees. [Liu and Terzi, SIGMOD 2008]
• More sophisticated attacks can be launched given additional knowledge about the network structure, e.g., a subgraph of the network. [Zhou and Pei, ICDE 2008; Hay et al., VLDB 2008]
• Protecting privacy becomes even more challenging when the nodes in the anonymized network are labeled. [Pang et al., SIGCOMM CCR 2006]
Inferring Sensitive Values on a Network
• Each individual has a single sensitive attribute.
  – Some individuals make the sensitive attribute public, while others keep it private.
• GOAL: Infer the private sensitive attributes using
  – Links in the social network
  – Groups that the individuals belong to
• Approach: Learn a predictive model (think: classifier) using public profiles as training data. [Zheleva and Getoor, WWW 2009]
Inferring Sensitive Values on a Network
• Baseline: predict the most commonly appearing sensitive value among all public profiles.
Inferring Sensitive Values on a Network
• LINK: Each node x has a list of binary features Lx, one for every node in the social network.
  – Feature value Lx[y] = 1 if and only if (x, y) is an edge.
  – Train a model on all pairs (Lx, sensitive value of x), for x’s with public sensitive values.
  – Use the learnt model to predict private sensitive values.
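A sketch of LINK. A simple nearest-neighbour vote stands in for the classifiers used in the paper, and the graph and labels are made up (two 4-cliques joined by one bridge edge); GROUP works the same way with group-membership features in place of adjacency features:

```python
# Two cliques: {A,B,C,G} (label "red") and {D,E,F,H} (label "blue"),
# joined by the bridge C-D. C's and D's labels are private.
nodes = ["A", "B", "C", "D", "E", "F", "G", "H"]
edges = {("A","B"), ("A","C"), ("A","G"), ("B","C"), ("B","G"), ("C","G"),
         ("D","E"), ("D","F"), ("D","H"), ("E","F"), ("E","H"), ("F","H"),
         ("C","D")}
public = {"A": "red", "B": "red", "G": "red",
          "E": "blue", "F": "blue", "H": "blue"}
private = ["C", "D"]

def link_features(x):
    """Lx[y] = 1 iff (x, y) is an edge."""
    return [1 if (x, y) in edges or (y, x) in edges else 0 for y in nodes]

def predict(x):
    """Label of the public node whose feature vector overlaps most with x's
    (i.e. the public node sharing the most neighbours with x)."""
    fx = link_features(x)
    def overlap(y):
        return sum(a & b for a, b in zip(fx, link_features(y)))
    return public[max(public, key=overlap)]

print({x: predict(x) for x in private})   # -> {'C': 'red', 'D': 'blue'}
```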
Inferring Sensitive Values on a Network
• GROUP: Each node x has a list of binary features Gx, one for every group in the social network.
  – Feature value Gx[y] = 1 if and only if x belongs to group y.
  – Train a model on all pairs (Gx, sensitive value of x), where x’s sensitive value is public.
  – Use the model to predict private sensitive values.
Inferring Sensitive Values on a Network

           Flickr       Facebook   Facebook           Dogster
           (Location)   (Gender)   (Political View)   (Dog Breed)
Baseline   27.7%        50%        56.5%              28.6%
LINK       56.5%        68.6%      58.1%              60.2%
GROUP      83.6%        77.2%      46.6%              82.0%

[Zheleva and Getoor, WWW 2009]
Active Attacks on Social Networks [Backstrom et al., WWW 2007]
• The attacker may create a few nodes in the graph.
  – Creates a few ‘fake’ Facebook user accounts.
• The attacker may add edges from the new nodes.
  – Creates friendships using the ‘fake’ accounts.
• Goal: Discover an edge between two legitimate users.
High Level View of Attack
• Step 1: Create a graph structure with the ‘fake’ nodes such that it can be identified in the anonymized data.
• Step 2: Add edges from the ‘fake’ nodes to real nodes.
• Step 3: From the anonymized data, identify the fake subgraph by its special graph structure.
• Step 4: Deduce edges between real nodes by following links from the identified fake nodes.
Details of the Attack
• Choose k real users W = {w1, …, wk}.
• Create k fake users X = {x1, …, xk}.
• Create edges (xi, wi).
• Create edges (xi, xi+1).
• Create every other edge within X independently with probability 0.5.
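The construction above can be sketched as follows (the node names, target list, and fixed seed are illustrative):

```python
import random

def build_attack_subgraph(k: int, targets, seed: int = 7):
    """Create fake nodes x0..x{k-1}: a path x_i - x_{i+1}, one edge from
    each x_i to its real target w_i, and every other pair inside X
    independently with probability 1/2."""
    rng = random.Random(seed)
    fake = [f"x{i}" for i in range(k)]
    edges = set()
    for i in range(k - 1):                       # the path through X
        edges.add((fake[i], fake[i + 1]))
    for i, w in enumerate(targets):              # one edge per real target
        edges.add((fake[i], w))
    for i in range(k):                           # random internal edges
        for j in range(i + 2, k):                # skip path edges and self
            if rng.random() < 0.5:
                edges.add((fake[i], fake[j]))
    return fake, edges

fake, attack_edges = build_attack_subgraph(5, ["w0", "w1", "w2", "w3", "w4"])
print(len(fake), len(attack_edges))
```

The random internal edges are what make the planted subgraph unique and automorphism-free with high probability; in the real attack k is only O(log N).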
Why does it work?
• Given a graph G and a set of nodes S, G[S] = the subgraph induced by the nodes in S.
• There is an isomorphism between two sets of nodes S and S′ if
  – there is a bijection f mapping each node in S to a node in S′, and
  – (u, v) is an edge in G[S] if and only if (f(u), f(v)) is an edge in G[S′].
• An isomorphism from S to S is called an automorphism.
  – Think: permuting the nodes.
Why does it work?
• With high probability, there is no other set S such that G[S] is isomorphic to G[X] (call G[X] H).
• H can be efficiently found in G.
• H has no non-trivial automorphisms.
Recovery
• Subgraph isomorphism is NP-hard in general, i.e., finding X could be hard.
• But since X contains a path and random internal edges, a simple brute-force-with-pruning search algorithm works.
• Run Time: O(N · 2^O(log log N))
Works in Real Life!
• LiveJournal: 4.4 million nodes, 77 million edges.
• Success all but guaranteed by adding 10 nodes.
• Recovery typically takes a second.

(Plot: probability of successful attack.) [Backstrom et al., WWW 2007]
Summary of Social Networks
• Nodes in a graph can be re-identified using background knowledge of the structure of the graph.
• Link and group structure provide valuable information for accurately inferring private sensitive values.
• Active attacks that add nodes and edges are shown to be very successful.
• Guarding against these attacks is an open area for research!
Next Class
• k-Anonymity + Algorithms: How to limit de-anonymization?
References
L. Sweeney, “k-Anonymity: a model for protecting privacy”, IJUFKS 2002.
A. Narayanan & V. Shmatikov, “Robust De-anonymization of Large Sparse Datasets”, SSP 2008.
L. Backstrom, C. Dwork & J. Kleinberg, “Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography”, WWW 2007.
E. Zheleva & L. Getoor, “To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles”, WWW 2009.