Efficient and Flexible Data Anonymization
Grigorios Loukides, PhD [email protected]
Computer Science & Informatics, Cardiff University
IBM Research, Ireland, Sept. 26, 2012
Content
Motivation – setting, applications, need & benefits
Background – data, attacks, methods
Efficient and flexible anonymization: privacy models, data transformation strategies, algorithms, applications
Conclusions
2
The setting of privacy-preserving data publishing
[Diagram: data owners / producers (supermarket, hospital, insurance company, marketing company, …) provide individuals' info, actions, and processes to a data anonymizer (custodian), who publishes data to data users (e.g., researchers issuing SQL queries); the goal is to prevent the disclosure of private and sensitive information]
3
Data publishing helps supporting the Smarter Cities initiative
Adapted from http://www.ibm.com/smarterplanet
4
… by allowing improved decision making
Adapted from http://www.ibm.com/smarterplanet
Research: 600 studies per year
Patient care: disease management
Mobility management, LBS advertisements
Load management
5
... but much of these data are sensitive
Adapted from http://www.ibm.com/smarterplanet
6
Privacy must be preserved
[1]
Legal requirements Data Protection Act, EU legislation (95/46/EC, 2009/136/EC etc.), HIPAA
Genetic data (e.g., diagnoses + DNA) must be deposited into bio-repositories after being anonymized
Telecommunication data (e.g., web visits) can be anonymized so that they can be retained and used for more than 6 months.
Economic benefits Cost of privacy breach - $7.2M on average* and up to $35.3M
Wider data use - fewer restrictions
Benefits from sharing anonymized fit note data in the UK
Increased joint sales, reduced costs
Increased customer retention (appreciation of privacy, more trust)
7
Content
Motivation – setting, applications, need & benefits
Background – data, attacks, methods
Efficient and flexible anonymization: privacy models, data transformation strategies, algorithms, applications
Conclusions
8
Background – problem dimensions
Methods
Data transformation
Synthetic data generation
9
Background – data
Type of data | Individuals' info | Example
Relational | Demographics | US census
Transactional | Purchased items | Walmart, Tesco
Sequential | Webpage visits | AOL
Trajectory | GPS traces | TomTom
Graph | Social connections | Facebook
Text | Clinical notes | Fit notes
Different data types model various individuals’ information that is required in many applications
10
Background – attacks
Identity disclosure* – relational data: individuals are linked to their published records based on quasi-identifiers (attributes that in combination can identify an individual)
De-identified data:
Age Postcode Sex
20 NW10 M
45 NW15 M
22 NW30 M
50 NW25 F

External data:
Name Age Postcode Sex
Greg 20 NW10 M
Jim 45 NW15 M
Jack 22 NW30 M
Anne 50 NW25 F
87% of US citizens can be identified by DOB, Gender, 5-digit ZIP code*
* Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. IJUFKS. 2002. 11
Background – attacks
Identity disclosure – transaction data: individuals are linked to their published records based on public items
Research data:
Diagnoses
333.4, 401.0, 401.1
401.0, 401.1
401.0, 401.1

External data – EMR:
Name Diagnoses
Greg 333.4, 401.0, 401.1
Jim 401.0, 401.1
Jack 401.0, 401.1
96.5% of patients uniquely re-identifiable based on diagnoses*
(research data: 2.7K VUMC patients; external EMR: 1.2M VUMC patients)
* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants’ Privacy. Journal of American Medical Informatics Association, 2010. (AMIA Best paper award)
12
~60% re-identifiable based on just 3 diagnoses*
Background – attacks
Sensitive information disclosure: individuals are linked to information they do not want to be associated with (i.e., sensitive values or sensitive items)
Purchased items:
strawberries, milk
beer, adult dvd
chocolate, icecream

Postcode Salary
NW10 100K
NW10 100K
NW10 100K
NW25 19K

Diagnoses DNA
404.2 C…A
404.0, 404.1 T…T
404.0, 404.1 C…C

DNA is highly sensitive; so are web search terms and rated movies

External data:
Name Postcode
Greg NW10

13
Background – attacks
Inferential disclosure: sensitive knowledge is inferred through aggregate information
Aggregate queries in deep-web repositories**: e.g., "10% of cars stocked by a car dealer are Fords" → business rivalry (a competitor sells Ford cars at a higher price)
Frequent sequences in sequence databases*: e.g., "most of a car dealer's customers part-exchange their BMW to buy a Ford" → unsolicited advertising (a competitor approaches BMW owners to offer them Ford cars)
* Gkoulalas-Divanis and G. Loukides. Revisiting sequential pattern hiding to enhance utility. KDD, 2011.
** Dasgupta et al. Privacy preservation of aggregates in hidden databases: why and how? SIGMOD, 2009.
14
Background – problem dimensions
Methods
Data transformation
Synthetic data generation
15
Background – methods
Synthetic data generation – build a statistical model from a noise-infused version of the data, then generate synthetic data by randomly sampling from this model
Data are synthetic and only preserve predetermined statistics
Data transformation methods
Perturbative – aim to preserve privacy and aggregate statistics (e.g., means and correlation coefficients); randomization, data swapping, microaggregation, rounding
Data are not truthful: individuals are associated with false information
Non-perturbative – aim to change the granularity of the data
Data are truthful: individuals are associated with more general information
16
Background – non-perturbative data transformation methods
Suppression – removes values or items before data publication
External data:
Name Age Sex
Greg 20 M
Jim 45 M

Original data:
Age Sex Disease
20 M HIV
46 M Flu

Original data:
Diagnoses
333.4, 401.0, 401.1
401.0, 401.1

External data:
Name Diagnoses
Greg 333.4, 401.0, 401.1
Jim 401.0, 401.1
17
Background – non-perturbative data transformation methods
Suppression – removes values or items before data publication
External data:
Name Age Sex
Greg 20 M
Jim 45 M

Suppressed data:
Age Sex Disease
* M HIV
* M Flu

Original data:
Diagnoses
333.4, 401.0, 401.1
401.0, 401.1

External data:
Name Diagnoses
Greg 333.4, 401.0, 401.1
Jim 401.0, 401.1
18
Distinguishing information is removed, so identity disclosure is prevented
Background – non-perturbative data transformation methods
Generalization – replaces values or items with more general values before data publication
Generalized data:
(Diagnosis, Age)
(401.0, [33-36]) (401, 38)
(401.0, [33-36]) (401, 38)

Original data:
Name (Diagnosis, Age)
Greg (401.0, 33) (401.0, 35) (401.9, 38)
Jim (401.0, 34) (401.1, 38)

(401.0, 401.1, and 401.9 are 3 types of hypertension; 401 denotes any hypertension; [33-36] is a generalized age value)
* Tamersoy et al. Anonymization of longitudinal electronic medical records. IEEE Trans. on Inform. Technol. in Biom., 2012.
19
Is preserving privacy sufficient?
Finding a good utility/privacy trade-off is challenging!
[Figure: R-U confidentiality map plotting Privacy Risk (low–high) against Data Utility (low–high). "No publishing" and "Original data publishing" mark the two extremes, with candidate anonymizations (c) in between; dashed lines mark the minimum level of protection required and the minimum level of utility required.]
20
Content
Motivation – setting, applications, need & benefits
Background – data, attacks, scenarios, methods
Efficient and flexible anonymization: privacy models, data transformation strategies, algorithms, applications
Conclusions
21
Focus of this talk
Methods
Data transformation
Synthetic data generation
22
Overview of the problem we consider
Anonymization Algorithm: de-identified data in, anonymized data out
Privacy component (privacy model): extract / specify what needs protection; evaluate anonymized data vs. the privacy model
Utility component (utility model): extract / specify what is useful; evaluate anonymized data vs. the utility policy
23
Privacy models – Privacy-constrained anonymity
Protects from identity disclosure: the probability of re-identification is limited
* Loukides et al. COAT: Constraint-based Anonymization of Transactions. Knowledge and Information Systems, 2011.
We must specify which itemsets lead to identity disclosure
Privacy constraint*: an itemset that may lead to identity disclosure
Privacy constraints: {401.1}, {401.2, 401.9}
24
Privacy models – Privacy-constrained anonymity
Each individual’s transaction should be indistinguishable from at least k-1 other transactions w.r.t. the specified privacy constraints
Privacy-constrained anonymity*
Protected data:
Diagnoses
(401.1, 401.2, 401.9)
(401.1, 401.2, 401.9)

External data:
Name Diagnoses
Greg 401.0, 401.1, 401.9
Jim 401.2, 401.9

Privacy constraints: {401.1}, {401.2, 401.9}
k =2
* Loukides et al. COAT: Constraint-based Anonymization of Transactions. Knowledge and Information Systems, 2011.
Re-identification probability given the data and the privacy constraints ≤ 1/k
25
any combination of {401.1,401.2,401.9}
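To make this concrete, below is a minimal Python sketch (an illustration of the model, not the COAT implementation; the function names are hypothetical) that checks whether every privacy-constraint itemset matches either no transaction or at least k transactions, which bounds the re-identification probability by 1/k.

```python
def support(itemset, dataset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in dataset if itemset <= t)

def satisfies_privacy_constraints(dataset, constraints, k):
    """Each constrained itemset must match no record or at least k records,
    so knowing a constrained itemset narrows an attacker down to >= k records."""
    for p in constraints:
        s = support(p, dataset)
        if 0 < s < k:
            return False
    return True

# Toy data from the slide: both protected transactions are generalized to
# (401.1, 401.2, 401.9), modelled here as sets of the original codes.
data = [{"401.1", "401.2", "401.9"}, {"401.1", "401.2", "401.9"}]
constraints = [frozenset({"401.1"}), frozenset({"401.2", "401.9"})]
print(satisfies_privacy_constraints(data, constraints, k=2))  # True
```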
Privacy models – PS-rule based model
PS-rules: a → j, cd → g
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
We must distinguish between public and sensitive itemsets
PS-rule*: an implication I → J between a public itemset I and a sensitive itemset J
Protects from identity and sensitive information disclosure: the probabilities of re-identification and of inferring sensitive information are limited
26
Prevent the inference of j based on knowledge of a
Privacy models – PS-rule based model
Purchased items
a, b, c, h, i, j
a, b, c, e, f, i
c, d, g
c, d, j
Protected data
PS-rules: a → j, cd → g
k =2, c=0.5
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
I appears in ≥ k transactions, and J appears in at most a fraction c of them
Protected PS-rule
sup(a) = 2, conf(a → j) = 0.5
sup(I) ≥ k and conf(I → J) ≤ c
27
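A hedged sketch of the PS-rule protection test in Python, using the slide's toy transactions; sup and conf follow the definitions above, and the code is illustrative rather than the published implementation.

```python
def support(itemset, dataset):
    return sum(1 for t in dataset if itemset <= t)

def confidence(I, J, dataset):
    """conf(I -> J) = sup(I u J) / sup(I); defined as 0 if I never occurs."""
    s = support(I, dataset)
    return support(I | J, dataset) / s if s else 0.0

def rule_protected(I, J, dataset, k, c):
    """A PS-rule I -> J is protected iff sup(I) >= k and conf(I -> J) <= c."""
    return support(I, dataset) >= k and confidence(I, J, dataset) <= c

# Slide example: four transactions, rules a -> j and cd -> g, k = 2, c = 0.5.
D = [set("abchij"), set("abcefi"), set("cdg"), set("cdj")]
print(rule_protected(frozenset("a"), frozenset("j"), D, 2, 0.5))   # True
print(rule_protected(frozenset("cd"), frozenset("g"), D, 2, 0.5))  # True
```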
Privacy models – PS-rule based model
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
Protection from identity and sensitive information disclosure: the probability of associating an individual with the antecedent of any PS-rule is ≤ 1/k, and the probability of associating an individual with the consequent of any PS-rule is ≤ c
Purchased items
a, b, c, h, i, j
a, b, c, e, f, i
c, d, g
c, d, j
PS-rules: a → j, cd → g
k =2, c=0.5
sup(a) = 2, conf(a → j) = 0.5
Protected data
PS-rule based model satisfaction: all specified PS-rules must be protected in the published data
28
Privacy models – properties and benefits
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press. ** Terrovitis et al. Privacy-preserving anonymization of set-valued data. PVLDB, 2008.
The privacy-constrained and PS-rule based privacy models
have existing models as their special cases (k^m-anonymity, …)
can capture detailed requirements crucial for enhancing data utility in real-world applications
a and cd need protection from identity disclosure, and j from sensitive inf. disclosure
Example – the state-of-the-art privacy model
k^m-anonymity: all public m-itemsets must be protected from identity disclosure
29
Privacy models – properties and benefits
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press. ** Terrovitis et al. Privacy-preserving anonymization of set-valued data. PVLDB, 2008.
The privacy-constrained and PS-rule based privacy models
Purchased items
a, b, c, h, i, j
a, b, c, e, f, i
c, d, g
c, d, j
PS-rules: a → j, cd → g
k =2, c=0.5
Purchased items
(a,b,c) (d,e,h) i j
(a,b,c) (d,e,h) i
(a,b,c) (d,e,h) g
(a,b,c) (d,e,h) j
35 itemsets (instead of 2) protected from identity disclosure → no protection from sensitive information disclosure → much lower data utility
Protection using PS-rules vs. protection using k^m-anonymity**
Example – the state-of-the-art privacy model
30
Privacy models – privacy requirement specification with PS-rules
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
Scenarios that capture many real data publishing applications*
Data publisher knows which itemsets are public or sensitive – e.g., Electronic Medical Record (EMR) data, Octopus card data
Data publisher knows that a class of items is public or sensitive
e.g., healthcare data sharing policies: "DVDs" may lead to identity disclosure (though it is not known which ones) and "all pills" are sensitive
Data publisher must protect all items, or has no specific requirements
31
Privacy models – privacy requirement specification with PS-rules
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
Data publishers' knowledge is modelled using hierarchies*
Data publisher selects an ordered pair of nodes <uP, uS> from HP (for public items) and HS (for sensitive items), e.g., a "dvds" node in HP
Root: least specific requirement; leaf: most specific
Set of PS-rules to protect any item in uP from identity disclosure, and any itemset in uS from sensitive information disclosure
de → i is part of this set, but dc → i is not
32
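The construction of this rule set can be sketched as follows. This is one plausible reading of the slide (enumerate itemsets over the public items under uP, up to some size m, and pair each with the sensitive items under uS); the names and the size bound are assumptions, not the paper's algorithm, which also eliminates redundant rules.

```python
from itertools import combinations

def rules_from_pair(up_leaves, us_leaves, m=2):
    """Pair every non-empty combination (up to size m) of the public items
    under uP with the sensitive items under uS as one PS-rule I -> J."""
    rules = []
    for size in range(1, m + 1):
        for antecedent in combinations(sorted(up_leaves), size):
            rules.append((frozenset(antecedent), frozenset(us_leaves)))
    return rules

# Toy nodes: uP covers public items {d, e}, uS covers the sensitive item {i}.
for I, J in rules_from_pair({"d", "e"}, {"i"}):
    print(sorted(I), "->", sorted(J))
# ['d'] -> ['i'], ['e'] -> ['i'], ['d', 'e'] -> ['i']; dc -> i is not
# generated because c is not under uP.
```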
Privacy models – privacy requirement specification with PS-rules
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
Data publishers' knowledge is modelled using hierarchies*
Algorithm to construct PS-rules from ordered pairs of nodes <uP, uS>
Ensures that all constructed rules require protection Deals with multiple ordered pairs Eliminates redundant rules
The use of detailed PS-rules significantly improves data utility
33
Overview
Anonymization Algorithm: de-identified data in, anonymized data out
Privacy component (privacy model): extract / specify what needs protection; evaluate anonymized data vs. the privacy model
Utility component (utility model): extract / specify what is useful; evaluate anonymized data vs. the utility policy
34
Data utility – measures
Suppression and generalization reduce data utility because they affect the granularity of published information
We need to measure how much data utility is lost:
by measuring information loss – assumes that we do not know the applications the data will be used for
by measuring the accuracy of performing a specific task using anonymized data – reasonable for several data sharing applications
35
Data utility – measures
Information loss measures for transaction data
Utility Loss (UL)* - captures the uncertainty of interpreting a generalized item
Utility Criterion (UC)** - accurate objective measure, takes into account how generalized items are created by algorithms
Diagnoses
(401.1, 401.2)
(174.5, 372.5, 401.1)
(174.5, 372.5, 401.1)
* Loukides et al. COAT: Constraint-based Anonymization of Transactions Knowledge and Information Systems, 2011.
** Loukides et al. Utility-preserving transaction data anonymization with low information loss. Expert Systems with Applications, 2012.
(UL of a generalized item depends on the number of items it replaces, its weight, and its relative support)
36
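A sketch of how a UL-style score could be computed from the three quantities the slide names; the exact combination below (2^r − 1 possible interpretations of a generalized item of r items, times a weight, times relative support) is an assumption modelled on the UL description, not a verified formula.

```python
def utility_loss(generalized_item, weight, sup, n_transactions):
    """Uncertainty of interpreting a generalized item, scaled by its
    weight and its relative support (all inputs are illustrative)."""
    r = len(generalized_item)        # number of items it replaces
    uncertainty = 2 ** r - 1         # non-empty subsets it could stand for
    return uncertainty * weight * (sup / n_transactions)

# e.g., (401.1, 401.2) replaces two codes; weight 1.0; support 3 of 4 records
print(utility_loss(("401.1", "401.2"), 1.0, 3, 4))  # 2.25
```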
Data utility – measures
Task-based measure for transaction data
Relative Error (RE)* - captures the fraction of records that are retrieved incorrectly when COUNT() query is applied to anonymized data
* Loukides et al. COAT: Constraint-based Anonymization of Transactions Knowledge and Information Systems, 2011.
COUNT(*) from T where Diagnosis is “401.2”
RE(q) = |act(q) − est(q)| / act(q), where act(q) is the answer of q on the original data and est(q) = |g| × p is the answer estimated from the anonymized data (|g| occurrences of a generalized item, each matching the queried item with probability p)
Anonymized data:
Diagnoses
(401.1, 401.2)
(401.1, 401.2), 401.3
(401.1, 401.2), 401.4
401.3

Original data:
Diagnoses
401.1
401.2, 401.3
401.1, 401.4
401.3
37
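A minimal sketch of answering the COUNT query on the generalized data and measuring RE, under the assumption that a generalized item of r original codes matches the queried code with probability 1/r (a common uniformity assumption; the paper's estimator may differ).

```python
def estimated_count(query_item, generalized_data):
    """Expected number of matching transactions in the generalized data;
    each transaction is a list of tuples of original codes."""
    est = 0.0
    for t in generalized_data:
        for g in t:
            if query_item in g:
                est += 1.0 / len(g)
    return est

def relative_error(act, est):
    return abs(act - est) / act

orig = [[("401.1",)], [("401.2",), ("401.3",)],
        [("401.1",), ("401.4",)], [("401.3",)]]
anon = [[("401.1", "401.2")], [("401.1", "401.2"), ("401.3",)],
        [("401.1", "401.2"), ("401.4",)], [("401.3",)]]
act = sum(1 for t in orig if ("401.2",) in t)   # actual answer: 1
est = estimated_count("401.2", anon)            # 3 x 1/2 = 1.5
print(relative_error(act, est))                 # 0.5
```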
Data utility – guarantees
There are 2^|I| ways to generalize the data, and |I| is in the order of thousands!
Data publishers are interested in only certain of these ways
Utility constraint*: controls how items may be generalized, which leads to useful solutions
e.g., (Cold, Cough) – "I am interested in patients with cold or cough"
* Loukides et al. COAT: Constraint-based Anonymization of Transactions Knowledge and Information Systems, 2011.
38
Data utility – guarantees
Utility constraint satisfaction*: a specified generalized item will not be more general than required
Utility-constrained anonymization*: data remain as useful as the original for counting aggregate concepts
COUNT(*) from T where Diagnosis is 401.1 "Cold" or 401.2 "Cough"

Anonymized data:
Diagnoses
(401.1, 401.2)
(401.1, 401.2), 401.3
(401.1, 401.2), 401.4
401.3

Original data:
Diagnoses
401.1
401.2, 401.3
401.1, 401.4
401.3
Important applications in biomedicine (later on)
* Loukides et al. COAT: Constraint-based Anonymization of Transactions Knowledge and Information Systems, 2011.
39
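To illustrate the guarantee on toy data mirroring the slide, the sketch below (count_concept is a hypothetical helper) shows that a COUNT over the aggregate concept {401.1, 401.2} returns the same answer on the original and the anonymized data, because no generalized item is more general than the utility constraint itself.

```python
def count_concept(concept, data):
    """Transactions containing at least one item of the concept; entries
    may be original codes (strings) or generalized items (tuples)."""
    def entry_matches(entry):
        items = entry if isinstance(entry, tuple) else (entry,)
        return any(i in concept for i in items)
    return sum(1 for t in data if any(entry_matches(e) for e in t))

concept = {"401.1", "401.2"}
orig = [["401.1"], ["401.2", "401.3"], ["401.1", "401.4"], ["401.3"]]
anon = [[("401.1", "401.2")], [("401.1", "401.2"), "401.3"],
        [("401.1", "401.2"), "401.4"], ["401.3"]]
print(count_concept(concept, orig), count_concept(concept, anon))  # 3 3
```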
Overview
Anonymization Algorithm: de-identified data in, anonymized data out
Privacy component (privacy model): extract / specify what needs protection; evaluate anonymized data vs. the privacy model
Utility component (utility model): extract / specify what is useful; evaluate anonymized data vs. the utility policy
40
Anonymization algorithms (since 2011)
Algorithm | Privacy model | Search strategy | Transform. strategy | Utility constr.
COAT | Priv. constr. an. | Greedy search | Set-based an. | Yes
PCTA | Priv. constr. an. | Item clustering | Generalization | No
UPCTA | Priv. constr. an. | Item clustering | Set-based an. | Yes
UAR | Priv. constr. an. | Priv. constr. reordering | Set-based an. | Yes
RBAT | PS-rule based | Top-down partitioning | Generalization | No
Tree-based | PS-rule based | Top-down & bottom-up partitioning | Generalization | No
Sample-based | PS-rule based | Sample-based partitioning & cut revision | Generalization | No
41
4 specialized algorithms for privacy-preserving medical data sharing
Anonymization algorithms – Tree-based
Start with all items generalized into one generalized item
Split it into two to enhance data utility (more specific generalized items), e.g., (c,d) and (h,i)
Check if rules are protected by computing their support and confidence in the temporary dataset
Continue splitting to enhance utility
Return the anonymized dataset (see the sketch below)
42
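The control flow of this top-down splitting can be sketched as follows; it is a simplified illustration of the idea (naive halving splits and a pluggable rules_protected check), not the published Tree-based/RBAT algorithm, whose splits are utility-aware.

```python
def split(gen_item):
    """Naive 2-way split of a generalized item (illustrative only)."""
    items = sorted(gen_item)
    mid = len(items) // 2
    return frozenset(items[:mid]), frozenset(items[mid:])

def anonymize(public_items, rules_protected):
    """Top-down partitioning: keep a split only if all rules stay protected."""
    cut, frontier = [], [frozenset(public_items)]
    while frontier:
        g = frontier.pop()
        if len(g) > 1:
            left, right = split(g)
            if rules_protected(cut + frontier + [left, right]):
                frontier += [left, right]   # accept the split, go deeper
                continue
        cut.append(g)                       # keep g as-is in the final cut
    return cut

# With a permissive check, every item ends up fully specialized:
print(anonymize("abcd", lambda cut: True))
```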
Anonymization algorithms – Tree-based
43
Can we do better?
Efficiency: rule checking is computationally expensive, particularly in the dataset resulting from the first few splits
Data utility: Tree-based may “stop” early (i.e., data can be split more to increase utility) due to non-monotonicity of confidence
Generalized items by Tree-based
More specific generalized items that can be constructed
Anonymization algorithms – Sample-based
Start with all items generalized into one
Split it into two to enhance data utility
Check to see if rules have enough support in a random sample (“sufficiently” close to the dataset)
Phase 1: Sample-based Partitioning
Theorem*: how to determine a "good" sample size
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
44
Anonymization algorithms – Sample-based
Start with all items generalized into one
Split it into two to enhance data utility
Check to see if rules have enough support in a random sample (“sufficiently” close to the dataset)
Phase 1: Sample-based Partitioning
Phase 2: Top-down cut revision
Split generalized items
Check to see if rules have enough support in the dataset
Continue splitting to enhance utility
45
In Phase 2, we want to avoid the early stopping problem of Tree-based and help data utility.
Anonymization algorithms – Sample-based
Phase 1: Sample-based Partitioning
Phase 2: Top-down cut revision
Phase 3: Bottom-up cut revision
Merge generalized items with their siblings
Check rules for protection in the dataset
Continue as long as all rules are protected
Return the anonymized dataset
46
At this stage, data are not anonymous, because we did not check for rule confidence. In phase 3, we ensure data protection.
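A hedged sketch of which test runs in which phase: Phase 1 applies a cheap support-only test to a random sample, while Phase 3 applies the full support-and-confidence test to the complete dataset; the function names and toy data are illustrative, not the published algorithm.

```python
import random

def support(itemset, dataset):
    return sum(1 for t in dataset if itemset <= t)

def phase1_support_ok(rules, sample, k):
    # Phase 1: support-only test on the sample; confidence is deferred
    return all(support(I, sample) >= k for I, _ in rules)

def phase3_rules_ok(rules, dataset, k, c):
    # Phase 3: full support + confidence test on the complete dataset
    def conf(I, J):
        s = support(I, dataset)
        return support(I | J, dataset) / s if s else 0.0
    return all(support(I, dataset) >= k and conf(I, J) <= c
               for I, J in rules)

D = [set("abchij"), set("abcefi"), set("cdg"), set("cdj")]
rules = [(frozenset("a"), frozenset("j")), (frozenset("cd"), frozenset("g"))]
random.seed(0)
sample = random.sample(D, 2)                 # the random sample of Phase 1
print(phase1_support_ok(rules, sample, k=1)) # depends on the sample drawn
print(phase3_rules_ok(rules, D, k=2, c=0.5)) # True for the slide's example
```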
Datasets – BMS1, BMS2: click-stream data; POS: sales transaction data
Evaluation
Effectiveness: utility using ARE (query answering measure) and protection
Efficiency
Scalability
*M. Terrovitis, N. Mamoulis, P. Kalnis. Privacy-preserving anonymization of set-valued data, PVLDB, 2008. ** Xu et al. Anonymizing transaction databases for publication. KDD, 2008.
Anonymization algorithms – Experimental evaluation
47
Baseline (no rule pruning)
*M. Terrovitis, N. Mamoulis, P. Kalnis. Privacy-preserving anonymization of set-valued data, PVLDB, 2008. ** Xu et al. Anonymizing transaction databases for publication. KDD, 2008.
Anonymization algorithms – Competitors
Apriori Anonymization* (sketch)
  Start with the original data
  For j = 1 to m
    For each transaction T
      Find all j-itemsets with support less than k
      For each of these itemsets
        Generate all possible generalizations
        Find the generalization that satisfies k^m-anonymity and has minimum information loss

Greedy**
  Protects all p-itemsets induced from public items from identity disclosure, and considers all non-public items as sensitive
  Employs suppression
48
Anonymization algorithms – Effectiveness (Data Utility)
Splitting is good for utility, yet it considers only O(|P|) ≪ O(2^|P|) generalizations
Avoid overprotecting data
* Wong et al. Minimality attack in privacy-preserving data publishing. VLDB, 2007. ** Xiao et al. Transparent anonymization: thwarting adversaries who know the algorithm. ACM TODS, 2010.
Work well with various privacy requirements
49
Anonymization algorithms – Effectiveness (Data Protection)
Protection from both identity and sensitive information disclosure (due to PS-rule based model)
Protection even when the attacker knows all public items and the workings of the anonymization algorithm (due to the algorithmic design) – minimality* and transparency** attacks
* Wong et al. Minimality attack in privacy-preserving data publishing. VLDB, 2007. ** Xiao et al. Transparent anonymization: thwarting adversaries who know the algorithm. ACM TODS, 2010.
50
Anonymization algorithms – Efficiency
Checking all PS-rules for protection is computationally expensive (a dataset scan for each rule)
We can "prune" certain types of rules (i.e., not check them for protection)
If c → g is protected, then so is c → gj
Before anonymization: if sup(I, D) ≥ k and sup(J, D) ≤ c × k, then I → J is protected in any generalized dataset that can be constructed from D
During anonymization: if I → J is protected, we can prune all the rules whose antecedent is I and whose consequent is a superset of J
… more strategies in *
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
51
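The "before anonymization" test can be coded directly from the inequality above; the reasoning is that generalization can only increase the antecedent's support while sensitive items stay intact, so conf(I → J) ≤ sup(J)/k ≤ c. A small sketch on the slide's toy transactions:

```python
def support(itemset, dataset):
    return sum(1 for t in dataset if itemset <= t)

def prunable_before(I, J, dataset, k, c):
    """If sup(I, D) >= k and sup(J, D) <= c * k, the rule I -> J stays
    protected in any generalized dataset derived from D: no re-checking."""
    return support(I, dataset) >= k and support(J, dataset) <= c * k

D = [set("abchij"), set("abcefi"), set("cdg"), set("cdj")]
# sup(c) = 4 >= k and sup(g) = 1 <= c * k = 1, so c -> g never needs checking
print(prunable_before(frozenset("c"), frozenset("g"), D, k=2, c=0.5))  # True
```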
Anonymization algorithms – Efficiency
We can find which rules to check efficiently (helps pruning)
Rule-tree data structure
e.g., we check only rules that contain a certain item
52
Anonymization algorithms – Scalability
Worst-case time complexity: O(2^|P| × |S| × N); |P| is small in practice
Worst-case space complexity: O(2^|P| × |S| + N × |I|)
53
Anonymization algorithms – Applications / Case studies
* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants’ Privacy. Journal of American Medical Informatics Association, 2010. (AMIA Best paper award) ** Loukides et al. Anonymization of electronic medical records for validating genome-wide association studies. Proc. of the Nat. Acad. of Science (PNAS), 2010.
[Figure: % of re-identified patients (60%–100%) vs. distinguishability (1–1000) for GWAS-related diseases]
96.5% of patients are vulnerable [4]
Anonymized data that remain useful for GWAS**
Patient identities can be disclosed through diagnosis codes
Acknowledged as an important advance in the last 10 years of genomic research in a Nature paper by E. Green, NHGRI Director
Algorithms for EMRs
54
Anonymization algorithms – Applications / Case studies
* Loukides et al. Utility-Aware Anonymization of Diagnosis Codes, IEEE Transactions on Information Technology in Biomedicine, In press.
Anonymized data that support clinical case count studies
Low ARE and support for clinical case count studies
Data in the VUMC biobank can be anonymized and remain useful
VUMC Biobank (diagnosis, DNA) of 79K patients
Largest dataset in medical data privacy; VUMC experts' requirements
55
Conclusions
Need for sharing data that remains protected
Several methods to do so
Effective and efficient approaches for transaction data and some of their applications
… but still a long way to go: complex & heterogeneous data; integration with other data management operations; use in applications; software
56
Is preserving privacy easy?
Can we simply remove everything that "looks" sensitive?
A lawsuit was filed, and Netflix settled it: "We will find new ways to collaborate with researchers"
57
We need better solutions!
Smarter cities initiative means doing better in
Adapted from http://www.ibm.com/smarterplanet 58
Background – attacks
Sensitive information disclosure cannot be prevented by guarding against identity disclosure* **
Specialized methods to prevent this attack are required
* Loukides et al. Preventing range disclosure in k-anonymised data. Expert Systems with Applications, 2011.
** Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Inf. Systems, In press.
59
Anonymization algorithms – Applications / Case studies
* Loukides et al. On balancing disclosure risk and data utility in transaction data sharing using R-U confidentiality map. Joint UNECE/Eurostat work session on SDC, 2012.
Construct anonymizations with a desired trade-off
Allow intuitive comparison between different methods
R-U map and knee point method to track utility/privacy trade-off*
60
Content
Motivation – setting, applications, need & benefits
Background – data, attacks, methods, scenarios
Efficient and flexible anonymization: privacy models, data transformation strategies, algorithms, applications
Conclusions
62
Overview
Anonymization Algorithm: de-identified data in, anonymized data out
Privacy component (privacy model): extract / specify what needs protection; evaluate anonymized data vs. the privacy model
Utility component (utility model): extract / specify what is useful; evaluate anonymized data vs. the utility policy
63
Objectives
Privacy How to publish protected data? How to capture data publishers’ privacy requirements?
Data utility How to measure data utility? How to guarantee data utility?
Algorithmic design How to design effective, efficient, scalable algorithms?
Applications / Case studies How to address real anonymization problems?
64
Datasets – BMS1, BMS2 contain click-stream data and POS contains sales transaction data
Evaluation
Effectiveness: ARE (query answering measure), protection
Efficiency, scalability
Methods: Tree-based, Sample-based vs. Baseline (no pruning), Apriori Anonymization*, Greedy**
*M. Terrovitis, N. Mamoulis, P. Kalnis. Privacy-preserving anonymization of set-valued data, PVLDB, 2008. ** Xu et al. Anonymizing transaction databases for publication. KDD, 2008.
Anonymization algorithms – Experimental evaluation
65
Privacy-preserving data sharing – Models / algorithms
Methods: generalization, suppression, encryption, ???
66
Privacy-preserving data sharing – Applications / Case studies
[Figure: % of re-identified patients (60%–100%) vs. distinguishability (1–1000) for GWAS-related diseases]
96.5% of patients are vulnerable [4]
Anonymized data guaranteed to remain useful for genetic studies [5]
“One of the important advances in the last 10 years of genomic research” – E. Green, director of the National Human Genome Research Institute
67
Privacy-preserving data sharing – Prototypes
68
Thanks & Acknowledgements
References
Aris Gkoulalas-Divanis (Ireland), Michail Vlachos (Zurich)
Bradley Malin, Joshua Denny
Robert Gwadera
1. L. Sweeney. k-anonymity: a model for protecting privacy. Int. Journal of Uncertainty, Fuzziness, and Knowledge-Based Systems, 2002.
2. Ponemon Institute/Symantec Corporation. 2010 Annual Study: US Cost of a Data Breach.
3. Open Data White Paper – Unleashing the Potential, 2012. http://data.gov.uk/sites/default/files/Open_data_White_Paper.pdf
Questions, comments, requests for materials [email protected] or
http://users.cs.cf.ac.uk/G.Loukides
69
Background – scenarios

Interactive: privacy-preserving query answering – data users send a data request to a data repository and receive a privacy-protected result
Strong privacy, but few applications and short data lifespan

Non-interactive: data owners provide their data to a data publisher, who releases anonymized data to data users
Most popular, but several privacy and data utility challenges
70
Background – attacks
Membership disclosure: individuals' presence in the published dataset is inferred
Discrimination (refused employment, higher insurance premiums, …)
* Nergiz et al. Hiding the presence of individuals from shared databases. SIGMOD, 2007.
** Homer et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLOS Genetics, 2008.
Crime, financial, medical data*: presence in crime, bankruptcy, or HIV-positive databases
DNA mixture data**: not all participants suffer from a genetic disorder, but patients who do can still be identified
dbGaP withdrew open access to GWAS study results
71
[Figure: linking individuals' identities to their DNA through mixture DNA]