De-anonymizing Data

Transcript of De-anonymizing Data
Lecture 2 : 590.03 Fall 12 1
De-anonymizing Data
CompSci 590.03. Instructor: Ashwin Machanavajjhala
Source (http://xkcd.org/834/)
Announcements
• Project ideas will be posted on the site by Friday.
– You are welcome to send me (or talk to me about) your own ideas.
Outline
• Recap & Intro to Anonymization
• Algorithmically De-anonymizing Netflix Data
• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks
Outline
• Recap & Intro to Anonymization
• Algorithmically De-anonymizing Netflix Data
• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks
Personal Big-Data

(Diagram: persons 1, …, N each contribute a record r1, …, rN to databases such as a Census DB or a Hospital DB, which are in turn used by doctors, medical researchers, economists, information-retrieval researchers, and recommendation algorithms.)
The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]

Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
Voter List: Name, Address, Date Registered, Party affiliation, Date last voted
Attributes in both: Zip, Birth date, Sex

• The Governor of MA was uniquely identified using Zip Code, Birth Date, and Sex; his name was thereby linked to his diagnosis.
The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]
• {Zip, Birth date, Sex} form a quasi-identifier: 87% of the US population is uniquely identified by this combination.
Statistical Privacy (Trusted Collector) Problem

(Diagram: individuals 1, …, N send their records r1, …, rN to a trusted server, which stores them in a database DB.)

Utility: …
Privacy: No breach about any individual
Statistical Privacy (Untrusted Collector) Problem

(Diagram: individuals 1, …, N each apply a randomizing function f to their record, and send f(r1), …, f(rN) to the untrusted server's DB.)
Randomized Response
• Flip a coin:
  – heads with probability p, and
  – tails with probability 1 – p (p > ½).
• Answer the question according to the following table:

          True Answer = Yes    True Answer = No
  Heads   Yes                  No
  Tails   No                   Yes
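As a sketch, randomized response and the corresponding unbiased estimate of the true "yes" fraction can be simulated as follows (the 30% true rate and p = 0.75 are illustrative choices, not values from the lecture):

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Report the true answer on heads (probability p), the opposite on tails."""
    return truth if random.random() < p else not truth

def estimate_yes_fraction(responses, p: float = 0.75) -> float:
    """Invert the noise: E[reported yes] = p*f + (1-p)*(1-f),
    so f = (reported - (1-p)) / (2p - 1)."""
    reported = sum(responses) / len(responses)
    return (reported - (1 - p)) / (2 * p - 1)

random.seed(0)
truths = [random.random() < 0.3 for _ in range(100_000)]   # ~30% true "yes"
reports = [randomized_response(t) for t in truths]
print(round(estimate_yes_fraction(reports), 3))            # close to 0.3
```

Note that any single reported answer is deniable (it may be a coin-flip artifact), yet the population fraction is still recoverable.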
Statistical Privacy (Trusted Collector) Problem

(Diagram as before: individuals 1, …, N send r1, …, rN to the trusted server's DB.)
Query Answering

(Diagram: individuals 1, …, N contribute records r1, …, rN to a Hospital DB; analysts pose queries such as "How many allergy patients?" and "Correlate genome to disease".)
Query Answering
• Need to know the list of questions up front.
• Each answer leaks some information about individuals. After answering a few questions, the server will run out of privacy budget and not be able to answer any more.
• Will see this in detail later in the course.
Anonymous/Sanitized Data Publishing

(Diagram: individuals 1, …, N contribute records r1, …, rN to a Hospital DB; the analyst says: "I won't tell you what questions I am interested in!")
Image source: writingcenterunderground.wordpress.com
Anonymous/Sanitized Data Publishing

(Diagram: the Hospital DB is sanitized into a published dataset DB'. Analysts can then answer any number of questions directly on DB', without any modifications.)
Today’s class
• Identifying individual records and their sensitive values from data publishing (with insufficient sanitization).
Outline
• Recap & Intro to Anonymization
• Algorithmically De-anonymizing Netflix Data
• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks
Terms
• Coin tosses of an algorithm: the algorithm's internal source of randomness.
• Union Bound: Pr[A1 ∪ … ∪ An] ≤ Pr[A1] + … + Pr[An].
• Heavy Tailed Distribution: a distribution whose tail decays more slowly than an exponential.
Terms (contd.)
• Heavy Tailed Distribution
(Plot: the Normal distribution. Not heavy tailed.)
Terms (contd.)
• Heavy Tailed Distribution
(Plot: the Laplace distribution. Heavy tailed.)
Terms (contd.)
• Heavy Tailed Distribution
(Plot: the Zipf distribution. Heavy tailed.)
Terms (contd.)
• Cosine Similarity: cos θ = (x · y) / (‖x‖ ‖y‖), the cosine of the angle θ between two vectors.
• Collaborative filtering: the problem of recommending new items to a user based on their ratings of previously seen items.
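A minimal sketch of cosine similarity over rating vectors (the two rating vectors are made up for illustration):

```python
import math

def cosine_similarity(x, y):
    """cos(theta) between vectors x and y: (x . y) / (|x| |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# Ratings by two users on the same 5 movies (0 = unrated):
alice = [3, 4, 2, 0, 5]
bob   = [3, 5, 1, 0, 4]
print(cosine_similarity(alice, bob))   # close to 1: very similar taste
```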
Netflix Dataset

(Matrix: rows are users, columns are movies; each non-empty cell holds a rating + timestamp. A row is a record r, and a column is an attribute. Most cells are empty: the matrix is sparse.)
Definitions
• Support: supp(r), the set (or number) of non-null attributes in a record or column.
• Similarity: Sim(r, r′), a measure of how close two records are on their shared attributes (e.g., based on cosine similarity).
• Sparsity: a dataset is sparse if, with high probability, no record is too similar to any other record.
Adversary Model
• Aux(r): some subset of attributes from r.
Privacy Breach
• Definition 1: An algorithm A outputs an r′ such that …
• Definition 2 (when only a sample of the dataset is input): …
Algorithm: Scoreboard
• For each record r′, compute Score(r′, aux) = the minimum similarity of an attribute in aux to the same attribute in r′.
• Pick the r′ with the maximum score, OR
• Return all records with Score > α.
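A minimal sketch of Scoreboard. The dataset, the movie/record names, and the per-attribute similarity (a simple closeness of ratings on a 1-5 scale) are all illustrative assumptions, not the paper's exact choices:

```python
def score(candidate: dict, aux: dict) -> float:
    """Minimum per-attribute similarity between aux and the candidate record.
    Records are {attribute: rating}; ratings assumed on a 1-5 scale."""
    sims = []
    for attr, val in aux.items():
        if attr not in candidate:
            sims.append(0.0)                                  # attribute absent
        else:
            sims.append(1.0 - abs(candidate[attr] - val) / 4.0)
    return min(sims)

def scoreboard(dataset: dict, aux: dict, alpha: float):
    """Return all record ids whose score exceeds alpha."""
    return [rid for rid, rec in dataset.items() if score(rec, aux) > alpha]

dataset = {
    "r1": {"m1": 3, "m2": 4, "m3": 2},
    "r2": {"m1": 1, "m2": 1, "m4": 1},
    "r3": {"m1": 5, "m3": 5, "m4": 1},
}
aux = {"m1": 3, "m2": 4}            # adversary knows two of the target's ratings
print(scoreboard(dataset, aux, 0.9))   # -> ['r1']
```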
Analysis
Theorem 1: Suppose we use Scoreboard with α = 1 – ε. If Aux contains m randomly chosen attributes such that
  m > log(N/ε) / log(1/(1 – δ)),
then Scoreboard returns a record r′ such that
  Pr[ Sim(Aux, r′) > 1 – ε – δ ] > 1 – ε.
Proof of Theorem 1
• Call r′ a false match if Sim(Aux, r′) < 1 – ε – δ.
• For any false match and a randomly chosen attribute i, Pr[ Sim(Auxi, ri′) > 1 – ε ] < 1 – δ.
• Sim(Aux, r′) = mini Sim(Auxi, ri′).
• Therefore, for a false match, Pr[ Sim(Aux, r′) > 1 – ε ] < (1 – δ)^m.
• By the union bound, Pr[ some false match has similarity > 1 – ε ] < N(1 – δ)^m.
• N(1 – δ)^m < ε when m > log(N/ε) / log(1/(1 – δ)).
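The bound above can be checked numerically. In this sketch the dataset size N = 500,000 and the choices ε = 0.05, δ = 0.2 are illustrative assumptions, not values from the lecture:

```python
import math

def min_aux_size(n_records: int, eps: float, delta: float) -> int:
    """Smallest m with N(1-delta)^m < eps, i.e. m > log(N/eps)/log(1/(1-delta))."""
    return math.ceil(math.log(n_records / eps) / math.log(1 / (1 - delta)))

# Netflix-scale example: half a million records.
m = min_aux_size(500_000, eps=0.05, delta=0.2)
print(m)  # -> 73
assert 500_000 * (1 - 0.2) ** m < 0.05   # the union bound indeed drops below eps
```

The point of the calculation: even for a very large dataset, a few dozen known attributes suffice, because N enters the bound only logarithmically.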
Other results
• If dataset D is (1 – ε – δ, ε)-sparse, then D can be (1, 1 – ε)-deanonymized.
• Analogous results hold when a list of candidate records is returned.
Netflix Dataset
• A slightly different algorithm is used for the actual Netflix data.
Summary of Netflix Paper
• An adversary can use a subset of ratings made by a user to uniquely identify the user’s record in the “anonymized” dataset with high probability.
• The simple Scoreboard algorithm provably guarantees identification of records.
• A variant of Scoreboard can de-anonymize the Netflix dataset.
• The algorithms are robust to noise in the adversary’s background knowledge.
Outline
• Recap & Intro to Anonymization
• Algorithmically De-anonymizing Netflix Data
• Algorithmically De-anonymizing Social Networks
  – Passive Attacks
  – Active Attacks
Social Network Data
• Social networks: graphs where each node represents a social entity and each edge represents a relationship between two entities.
• Examples: email communication graphs, and social interactions as in Facebook, Yahoo! Messenger, etc.
Anonymizing Social Networks
• Naïve anonymization
  – removes the label of each node and publishes only the structure of the network
• Information Leaks
  – Nodes may still be re-identified based on network structure

(Example graph, used throughout the following slides: Alice, Bob, Cathy, Diane, Ed, Fred, Grace.)
Passive Attacks on an Anonymized Network
• Consider the above email communication graph:
  – Each node represents an individual.
  – Each edge between two individuals indicates that they have exchanged emails.
Passive Attacks on an Anonymized Network
• Alice has sent emails to three individuals only
Passive Attacks on an Anonymized Network
• Alice has sent emails to three individuals only.
• Only one node in the anonymized network has degree three.
• Hence, Alice can re-identify herself.
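The degree-based re-identification can be sketched as follows. The edge list is a hypothetical graph consistent with the slides (Alice's node is the only one of degree three, Cathy's the only one of degree five); the letters stand for the hidden node labels of the anonymized graph:

```python
from collections import Counter

# Hypothetical anonymized email graph: A..G are opaque node ids.
edges = [("A", "B"), ("A", "C"), ("A", "E"), ("B", "C"), ("B", "E"),
         ("B", "F"), ("C", "D"), ("C", "F"), ("C", "G"), ("D", "F"),
         ("F", "G")]

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# A node with a unique degree is re-identifiable by anyone who knows
# that degree (e.g. Alice knows she emailed exactly three people).
deg_counts = Counter(degree.values())
unique = [n for n, d in degree.items() if deg_counts[d] == 1]
print(sorted(unique))   # -> ['A', 'C']  (Alice and Cathy, in this sketch)
```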
Passive Attacks on an Anonymized Network
• Cathy has sent emails to five individuals
Passive Attacks on an Anonymized Network
• Cathy has sent emails to five individuals.
• Only one node has degree five.
• Hence, Cathy can re-identify herself.
Passive Attacks on an Anonymized Network
• Now consider that Alice and Cathy share their knowledge about the anonymized network
• What can they learn about the other individuals?
Passive Attacks on an Anonymized Network
• First, Alice and Cathy know that only Bob has sent emails to both of them.
Passive Attacks on an Anonymized Network
• First, Alice and Cathy know that only Bob has sent emails to both of them.
• Bob can be identified.
Passive Attacks on an Anonymized Network
• Alice has sent emails to Bob, Cathy, and Ed only
Passive Attacks on an Anonymized Network
• Alice has sent emails to Bob, Cathy, and Ed only.
• Ed can be identified.
Passive Attacks on an Anonymized Network
• Alice and Cathy can learn that Bob and Ed are connected
Passive Attacks on an Anonymized Network
• The above attack is based on knowledge of node degrees. [Liu and Terzi, SIGMOD 2008]
• More sophisticated attacks can be launched given additional knowledge about the network structure, e.g., a subgraph of the network. [Zhou and Pei, ICDE 2008; Hay et al., VLDB 2008]
• Protecting privacy becomes even more challenging when the nodes in the anonymized network are labeled. [Pang et al., SIGCOMM CCR 2006]
Inferring Sensitive Values on a Network
• Each individual has a single sensitive attribute.
  – Some individuals make the sensitive attribute public, while others keep it private.
• GOAL: Infer the private sensitive attributes using
  – Links in the social network
  – Groups that the individuals belong to
• Approach: Learn a predictive model (think: classifier) using public profiles as training data. [Zheleva and Getoor, WWW 2009]
Inferring Sensitive Values on a Network
• Baseline: predict the most commonly appearing sensitive value among all public profiles.
Inferring Sensitive Values on a Network
• LINK: Each node x has a list of binary features Lx, one for every node in the social network.
  – Feature value Lx[y] = 1 if and only if (x, y) is an edge.
  – Train a model on all pairs (Lx, sensitive value of x), for x’s with public sensitive values.
  – Use the learnt model to predict private sensitive values.
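A sketch of LINK. A simple nearest-neighbour vote stands in for the classifiers used in the paper, and the graph and labels are made up (two 4-cliques joined by one bridge edge); GROUP works the same way with group-membership features in place of adjacency features:

```python
# Two cliques: {A,B,C,G} (label "red") and {D,E,F,H} (label "blue"),
# joined by the bridge C-D. C's and D's labels are private.
nodes = ["A", "B", "C", "D", "E", "F", "G", "H"]
edges = {("A","B"), ("A","C"), ("A","G"), ("B","C"), ("B","G"), ("C","G"),
         ("D","E"), ("D","F"), ("D","H"), ("E","F"), ("E","H"), ("F","H"),
         ("C","D")}
public = {"A": "red", "B": "red", "G": "red",
          "E": "blue", "F": "blue", "H": "blue"}
private = ["C", "D"]

def link_features(x):
    """Lx[y] = 1 iff (x, y) is an edge."""
    return [1 if (x, y) in edges or (y, x) in edges else 0 for y in nodes]

def predict(x):
    """Label of the public node whose feature vector overlaps most with x's
    (i.e. the public node sharing the most neighbours with x)."""
    fx = link_features(x)
    def overlap(y):
        return sum(a & b for a, b in zip(fx, link_features(y)))
    return public[max(public, key=overlap)]

print({x: predict(x) for x in private})   # -> {'C': 'red', 'D': 'blue'}
```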
Inferring Sensitive Values on a Network
• GROUP: Each node x has a list of binary features Gx, one for every group in the social network.
  – Feature value Gx[y] = 1 if and only if x belongs to group y.
  – Train a model on all pairs (Gx, sensitive value of x), where x’s sensitive value is public.
  – Use the model to predict private sensitive values.
Inferring Sensitive Values on a Network

           Flickr       Facebook   Facebook           Dogster
           (Location)   (Gender)   (Political View)   (Dog Breed)
Baseline   27.7%        50%        56.5%              28.6%
LINK       56.5%        68.6%      58.1%              60.2%
GROUP      83.6%        77.2%      46.6%              82.0%

[Zheleva and Getoor, WWW 2009]
Active Attacks on Social Networks [Backstrom et al., WWW 2007]
• The attacker may create a few nodes in the graph.
  – Creates a few ‘fake’ Facebook user accounts.
• The attacker may add edges from the new nodes.
  – Creates friendships using the ‘fake’ accounts.
• Goal: Discover an edge between two legitimate users.
High Level View of Attack
• Step 1: Create a graph structure with the ‘fake’ nodes such that it can be identified in the anonymized data.
• Step 2: Add edges from the ‘fake’ nodes to real nodes.
• Step 3: From the anonymized data, identify the fake subgraph by its special graph structure.
• Step 4: Deduce edges between real nodes by following links from the identified fake nodes.
Details of the Attack
• Choose k real users W = {w1, …, wk}.
• Create k fake users X = {x1, …, xk}.
• Create edges (xi, wi).
• Create edges (xi, xi+1).
• Create every other edge within X independently with probability 0.5.
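The construction above can be sketched as follows (the node names, target list, and fixed seed are illustrative):

```python
import random

def build_attack_subgraph(k: int, targets, seed: int = 7):
    """Create fake nodes x0..x{k-1}: a path x_i - x_{i+1}, one edge from
    each x_i to its real target w_i, and every other pair inside X
    independently with probability 1/2."""
    rng = random.Random(seed)
    fake = [f"x{i}" for i in range(k)]
    edges = set()
    for i in range(k - 1):                       # the path through X
        edges.add((fake[i], fake[i + 1]))
    for i, w in enumerate(targets):              # one edge per real target
        edges.add((fake[i], w))
    for i in range(k):                           # random internal edges
        for j in range(i + 2, k):                # skip path edges and self
            if rng.random() < 0.5:
                edges.add((fake[i], fake[j]))
    return fake, edges

fake, attack_edges = build_attack_subgraph(5, ["w0", "w1", "w2", "w3", "w4"])
print(len(fake), len(attack_edges))
```

The random internal edges are what make the planted subgraph unique and automorphism-free with high probability; in the real attack k is only O(log N).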
Why does it work?
• Given a graph G and a set of nodes S, G[S] = the subgraph induced by the nodes in S.
• There is an isomorphism between two sets of nodes S and S′ if
  – there is a bijection f mapping each node in S to a node in S′, and
  – (u, v) is an edge in G[S] if and only if (f(u), f(v)) is an edge in G[S′].
• An isomorphism from S to S is called an automorphism.
  – Think: permuting the nodes.
Why does it work?
• With high probability, there is no other set S such that G[S] is isomorphic to G[X] (call G[X] H).
• H can be efficiently found in G.
• H has no non-trivial automorphisms.
Recovery
• Subgraph isomorphism is NP-hard in general, i.e., finding X could be hard.
• But since X contains a path and random internal edges, a simple brute-force-with-pruning search algorithm works.
• Run Time: O(N · 2^O(log log N))
Works in Real Life!
• LiveJournal: 4.4 million nodes, 77 million edges.
• Success all but guaranteed by adding 10 nodes.
• Recovery typically takes a second.

(Plot: probability of successful attack.) [Backstrom et al., WWW 2007]
Summary of Social Networks
• Nodes in a graph can be re-identified using background knowledge of the structure of the graph.
• Link and group structure provide valuable information for accurately inferring private sensitive values.
• Active attacks that add nodes and edges are shown to be very successful.
• Guarding against these attacks is an open area for research!
Next Class
• k-Anonymity + Algorithms: How to limit de-anonymization?
References
L. Sweeney, “k-Anonymity: a model for protecting privacy”, IJUFKS 2002.
A. Narayanan & V. Shmatikov, “Robust De-anonymization of Large Sparse Datasets”, SSP 2008.
L. Backstrom, C. Dwork & J. Kleinberg, “Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography”, WWW 2007.
E. Zheleva & L. Getoor, “To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles”, WWW 2009.