Efficient and Flexible Data Anonymization
Grigorios Loukides, PhD [email protected]
Computer Science & Informatics, Cardiff University
IBM Research, Ireland, Sept. 26, 2012
Content
Motivation – setting, applications, need & benefits
Background – data, attacks, methods
Efficient and flexible anonymization: privacy models, data transformation strategies, algorithms, applications
Conclusions
2
The setting of privacy-preserving data publishing
[Diagram: data owners / producers (supermarket, hospital, insurance company, marketing company, …) provide individuals' info, actions, and processes to a data anonymizer (custodian), who publishes data to data users (e.g., researchers issuing SQL queries); the goal is to prevent the disclosure of private and sensitive information]
3
Data publishing helps supporting the Smarter Cities initiative
Adapted from http://www.ibm.com/smarterplanet
4
… by allowing improved decision making
Adapted from http://www.ibm.com/smarterplanet
Research: 600 studies per year
Patient care: disease management
Mobility management, LBS advertisements
Load management
5
... but much of these data are sensitive
Adapted from http://www.ibm.com/smarterplanet
6
Privacy must be preserved
[1]
Legal requirements Data Protection Act, EU legislation (95/46/EC, 2009/136/EC etc.), HIPAA
Genetic data (e.g., diagnoses + DNA) must be deposited into bio-repositories after being anonymized
Telecommunication data (e.g., web visits) can be anonymized so that they can be retained and used for more than 6 months.
Economic benefits Cost of privacy breach - $7.2M on average* and up to $35.3M
Wider data use - fewer restrictions
Benefits from sharing anonymized fit note data in the UK
Increased joint sales, reduced costs
Increased customer retention (appreciation of privacy, more trust)
7
Content
Motivation – setting, applications, need & benefits
Background – data, attacks, methods
Efficient and flexible anonymization: privacy models, data transformation strategies, algorithms, applications
Conclusions
8
Background – problem dimensions
Methods
Data transformation
Synthetic data generation
9
Background – data
Type of data | Individuals' info | Example
Relational | Demographics | US census
Transactional | Purchased items | Walmart, Tesco
Sequential | Webpage visits | AOL
Trajectory | GPS traces | TomTom
Graph | Social connections | Facebook
Text | Clinical notes | Fit notes
Different data types model various individuals’ information that is required in many applications
10
Background – attacks
Identity disclosure* – relational data: individuals are linked to their published records based on quasi-identifiers (attributes that in combination can identify an individual)
De-identified data:
Age Postcode Sex
20 NW10 M
45 NW15 M
22 NW30 M
50 NW25 F

External data:
Name Age Postcode Sex
Greg 20 NW10 M
Jim 45 NW15 M
Jack 22 NW30 M
Anne 50 NW25 F
87% of US citizens can be identified by DOB, Gender, 5-digit ZIP code*
* Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. IJUFKS. 2002. 11
Background – attacks
Identity disclosure – transaction data: individuals are linked to their published records based on public items
Research data:
Diagnoses
333.4, 401.0, 401.1
401.0, 401.1
401.0, 401.1

External data – EMR:
Name Diagnoses
Greg 333.4, 401.0, 401.1
Jim 401.0, 401.1
Jack 401.0, 401.1
96.5% of patients uniquely re-identifiable based on diagnoses*
(research data: 2.7K VUMC patients; external EMR: 1.2M VUMC patients)
* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants’ Privacy. Journal of American Medical Informatics Association, 2010. (AMIA Best paper award)
12
~60% re-identifiable based on just 3 diagnoses*
Background – attacks
Sensitive information disclosure: individuals are linked to information they do not want to be associated with (i.e., sensitive values or sensitive items)
Purchased items:
strawberries, milk
beer, adult dvd
chocolate, icecream

Postcode Salary
NW10 100K
NW10 100K
NW10 100K
NW25 19K

Diagnoses DNA
404.2 C…A
404.0, 404.1 T…T
404.0, 404.1 C…C

DNA is highly sensitive; so are web search terms and rated movies

External data:
Name Postcode
Greg NW10

13
Background – attacks
Inferential disclosure: sensitive knowledge is inferred through aggregate information
Aggregate queries in deep-web repositories**: e.g., "10% of cars stocked by a car dealer are Fords" → business rivalry (a competitor sells Ford cars at a higher price)
Frequent sequences in sequence databases*: e.g., "most of a car dealer's customers part-exchange their BMW to buy a Ford" → unsolicited advertising (a competitor approaches BMW owners to offer them Ford cars)
* Gkoulalas-Divanis and G. Loukides. Revisiting sequential pattern hiding to enhance utility. KDD, 2011.
** Dasgupta et al. Privacy preservation of aggregates in hidden databases: why and how? SIGMOD, 2009.
14
Background – problem dimensions
Methods
Data transformation
Synthetic data generation
15
Background – methods
Synthetic data generation – build a statistical model from a noise-infused version of the data, then generate synthetic data by randomly sampling from this model
Data are synthetic and only preserve predetermined statistics
Data transformation methods
Perturbative – aim to preserve privacy and aggregate statistics (e.g., means and correlation coefficients); randomization, data swapping, microaggregation, rounding
Data are not truthful: individuals are associated with false information
Non-perturbative – aim to change the granularity of the data
Data are truthful: individuals are associated with more general information
16
Background – non-perturbative data transformation methods
Suppression – removes values or items before data publication
External data:
Name Age Sex
Greg 20 M
Jim 45 M

Original data:
Age Sex Disease
20 M HIV
46 M Flu

Original data:
Diagnoses
333.4, 401.0, 401.1
401.0, 401.1

External data:
Name Diagnoses
Greg 333.4, 401.0, 401.1
Jim 401.0, 401.1
17
Background – non-perturbative data transformation methods
Suppression – removes values or items before data publication
External data:
Name Age Sex
Greg 20 M
Jim 45 M

Suppressed data:
Age Sex Disease
* M HIV
* M Flu

Original data:
Diagnoses
333.4, 401.0, 401.1
401.0, 401.1

External data:
Name Diagnoses
Greg 333.4, 401.0, 401.1
Jim 401.0, 401.1
18
Distinguishing information is removed, so identity disclosure is prevented
Background – non-perturbative data transformation methods
Generalization – replaces values or items with more general values before data publication
Generalized data:
(Diagnosis, Age)
(401.0, [33-36]) (401, 38)
(401.0, [33-36]) (401, 38)

Original data:
Name (Diagnosis, Age)
Greg (401.0, 33) (401.0, 35) (401.9, 38)
Jim (401.0, 34) (401.1, 38)

(401.0, 401.1, and 401.9 are 3 types of hypertension; 401 denotes any hypertension; [33-36] is a generalized age value)
* Tamersoy et al. Anonymization of longitudinal electronic medical records. IEEE Trans. on Inform. Technol. in Biom., 2012.
19
Is preserving privacy sufficient?
Finding a good utility/privacy trade-off is challenging!
[Figure: R-U confidentiality map plotting Privacy Risk (low–high) against Data Utility (low–high). "No publishing" and "Original data publishing" mark the two extremes, with candidate anonymizations (c) in between; dashed lines mark the minimum level of protection required and the minimum level of utility required.]
20
Content
Motivation – setting, applications, need & benefits
Background – data, attacks, scenarios, methods
Efficient and flexible anonymization: privacy models, data transformation strategies, algorithms, applications
Conclusions
21
Focus of this talk
Methods
Data transformation
Synthetic data generation
22
Overview of the problem we consider
Anonymization Algorithm: de-identified data in, anonymized data out
Privacy component (privacy model): extract / specify what needs protection; evaluate anonymized data vs. the privacy model
Utility component (utility model): extract / specify what is useful; evaluate anonymized data vs. the utility policy
23
Privacy models – Privacy-constrained anonymity
Protects from identity disclosure: the probability of re-identification is limited
* Loukides et al. COAT: Constraint-based Anonymization of Transactions. Knowledge and Information Systems, 2011.
We must specify which itemsets lead to identity disclosure
Privacy constraint*: an itemset that may lead to identity disclosure
Privacy constraints: {401.1}, {401.2, 401.9}
24
Privacy models – Privacy-constrained anonymity
Each individual’s transaction should be indistinguishable from at least k-1 other transactions w.r.t. the specified privacy constraints
Privacy-constrained anonymity*
Protected data:
Diagnoses
(401.1, 401.2, 401.9)
(401.1, 401.2, 401.9)

External data:
Name Diagnoses
Greg 401.0, 401.1, 401.9
Jim 401.2, 401.9

Privacy constraints: {401.1}, {401.2, 401.9}
k =2
* Loukides et al. COAT: Constraint-based Anonymization of Transactions. Knowledge and Information Systems, 2011.
Re-identification probability given the data and the privacy constraints ≤ 1/k
25
any combination of {401.1,401.2,401.9}
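To make this concrete, below is a minimal Python sketch (an illustration of the model, not the COAT implementation; the function names are hypothetical) that checks whether every privacy-constraint itemset matches either no transaction or at least k transactions, which bounds the re-identification probability by 1/k.

```python
def support(itemset, dataset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in dataset if itemset <= t)

def satisfies_privacy_constraints(dataset, constraints, k):
    """Each constrained itemset must match no record or at least k records,
    so knowing a constrained itemset narrows an attacker down to >= k records."""
    for p in constraints:
        s = support(p, dataset)
        if 0 < s < k:
            return False
    return True

# Toy data from the slide: both protected transactions are generalized to
# (401.1, 401.2, 401.9), modelled here as sets of the original codes.
data = [{"401.1", "401.2", "401.9"}, {"401.1", "401.2", "401.9"}]
constraints = [frozenset({"401.1"}), frozenset({"401.2", "401.9"})]
print(satisfies_privacy_constraints(data, constraints, k=2))  # True
```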
Privacy models – PS-rule based model
PS-rules: a → j, cd → g
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
We must distinguish between public and sensitive itemsets
PS-rule*: an implication I → J between a public itemset I and a sensitive itemset J
Protects from identity and sensitive information disclosure: the probabilities of re-identification and of inferring sensitive information are limited
26
Prevent the inference of j based on knowledge of a
Privacy models – PS-rule based model
Purchased items
a, b, c, h, i, j
a, b, c, e, f, i
c, d, g
c, d, j
Protected data
PS-rules: a → j, cd → g
k =2, c=0.5
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
I appears in ≥ k transactions, and J appears in at most a fraction c of them
Protected PS-rule
sup(a) = 2, conf(a → j) = 0.5
sup(I) ≥ k and conf(I → J) ≤ c
27
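A hedged sketch of the PS-rule protection test in Python, using the slide's toy transactions; sup and conf follow the definitions above, and the code is illustrative rather than the published implementation.

```python
def support(itemset, dataset):
    return sum(1 for t in dataset if itemset <= t)

def confidence(I, J, dataset):
    """conf(I -> J) = sup(I u J) / sup(I); defined as 0 if I never occurs."""
    s = support(I, dataset)
    return support(I | J, dataset) / s if s else 0.0

def rule_protected(I, J, dataset, k, c):
    """A PS-rule I -> J is protected iff sup(I) >= k and conf(I -> J) <= c."""
    return support(I, dataset) >= k and confidence(I, J, dataset) <= c

# Slide example: four transactions, rules a -> j and cd -> g, k = 2, c = 0.5.
D = [set("abchij"), set("abcefi"), set("cdg"), set("cdj")]
print(rule_protected(frozenset("a"), frozenset("j"), D, 2, 0.5))   # True
print(rule_protected(frozenset("cd"), frozenset("g"), D, 2, 0.5))  # True
```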
Privacy models – PS-rule based model
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
Protection from identity and sensitive information disclosure: the probability of associating an individual with the antecedent of any PS-rule is ≤ 1/k, and the probability of associating an individual with the consequent of any PS-rule is ≤ c
Purchased items
a, b, c, h, i, j
a, b, c, e, f, i
c, d, g
c, d, j
PS-rules: a → j, cd → g
k =2, c=0.5
sup(a) = 2, conf(a → j) = 0.5
Protected data
PS-rule based model satisfaction: all specified PS-rules must be protected in the published data
28
Privacy models – properties and benefits
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press. ** Terrovitis et al. Privacy-preserving anonymization of set-valued data. PVLDB, 2008.
The privacy-constrained and PS-rule based privacy models
have existing models as their special cases (k^m-anonymity, …)
can capture detailed requirements crucial for enhancing data utility in real-world applications
a and cd need protection from identity disclosure, and j from sensitive inf. disclosure
Example – the state-of-the-art privacy model
k^m-anonymity: all public m-itemsets must be protected from identity disclosure
29
Privacy models – properties and benefits
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press. ** Terrovitis et al. Privacy-preserving anonymization of set-valued data. PVLDB, 2008.
The privacy-constrained and PS-rule based privacy models
Purchased items
a, b, c, h, i, j
a, b, c, e, f, i
c, d, g
c, d, j
PS-rules: a → j, cd → g
k =2, c=0.5
Purchased items
(a,b,c) (d,e,h) i j
(a,b,c) (d,e,h) i
(a,b,c) (d,e,h) g
(a,b,c) (d,e,h) j
35 itemsets (instead of 2) protected from identity disclosure → no protection from sensitive information disclosure → much lower data utility
Protection using PS-rules vs. protection using k^m-anonymity**
Example – the state-of-the-art privacy model
30
Privacy models – privacy requirement specification with PS-rules
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
Scenarios that capture many real data publishing applications*
Data publisher knows which itemsets are public or sensitive – e.g., Electronic Medical Record (EMR) data, Octopus card data
Data publisher knows that a class of items is public or sensitive
e.g., healthcare data sharing policies: "DVDs" may lead to identity disclosure (though it is not known which ones) and "all pills" are sensitive
Data publisher must protect all items, or has no specific requirements
31
Privacy models – privacy requirement specification with PS-rules
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
Data publishers' knowledge is modelled using hierarchies*
Data publisher selects an ordered pair of nodes <uP, uS> from HP (for public items) and HS (for sensitive items), e.g., a "dvds" node in HP
Root: least specific requirement; leaf: most specific
Set of PS-rules to protect any item in uP from identity disclosure, and any itemset in uS from sensitive information disclosure
de → i is part of this set, but dc → i is not
32
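The construction of this rule set can be sketched as follows. This is one plausible reading of the slide (enumerate itemsets over the public items under uP, up to some size m, and pair each with the sensitive items under uS); the names and the size bound are assumptions, not the paper's algorithm, which also eliminates redundant rules.

```python
from itertools import combinations

def rules_from_pair(up_leaves, us_leaves, m=2):
    """Pair every non-empty combination (up to size m) of the public items
    under uP with the sensitive items under uS as one PS-rule I -> J."""
    rules = []
    for size in range(1, m + 1):
        for antecedent in combinations(sorted(up_leaves), size):
            rules.append((frozenset(antecedent), frozenset(us_leaves)))
    return rules

# Toy nodes: uP covers public items {d, e}, uS covers the sensitive item {i}.
for I, J in rules_from_pair({"d", "e"}, {"i"}):
    print(sorted(I), "->", sorted(J))
# ['d'] -> ['i'], ['e'] -> ['i'], ['d', 'e'] -> ['i']; dc -> i is not
# generated because c is not under uP.
```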
Privacy models – privacy requirement specification with PS-rules
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
Data publishers' knowledge is modelled using hierarchies*
Algorithm to construct PS-rules from ordered pairs of nodes <uP, uS>
Ensures that all constructed rules require protection Deals with multiple ordered pairs Eliminates redundant rules
The use of detailed PS-rules significantly improves data utility
33
Overview
Anonymization Algorithm: de-identified data in, anonymized data out
Privacy component (privacy model): extract / specify what needs protection; evaluate anonymized data vs. the privacy model
Utility component (utility model): extract / specify what is useful; evaluate anonymized data vs. the utility policy
34
Data utility – measures
Suppression and generalization reduce data utility because they affect the granularity of published information
We need to measure how much data utility is lost:
by measuring information loss – assumes that we do not know the applications the data will be used for
by measuring the accuracy of performing a specific task using anonymized data – reasonable for several data sharing applications
35
Data utility – measures
Information loss measures for transaction data
Utility Loss (UL)* - captures the uncertainty of interpreting a generalized item
Utility Criterion (UC)** - accurate objective measure, takes into account how generalized items are created by algorithms
Diagnoses
(401.1, 401.2)
(174.5, 372.5, 401.1)
(174.5, 372.5, 401.1)
* Loukides et al. COAT: Constraint-based Anonymization of Transactions Knowledge and Information Systems, 2011.
** Loukides et al. Utility-preserving transaction data anonymization with low information loss. Expert Systems with Applications, 2012.
(UL of a generalized item depends on the number of items it replaces, its weight, and its relative support)
36
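A sketch of how a UL-style score could be computed from the three quantities the slide names; the exact combination below (2^r − 1 possible interpretations of a generalized item of r items, times a weight, times relative support) is an assumption modelled on the UL description, not a verified formula.

```python
def utility_loss(generalized_item, weight, sup, n_transactions):
    """Uncertainty of interpreting a generalized item, scaled by its
    weight and its relative support (all inputs are illustrative)."""
    r = len(generalized_item)        # number of items it replaces
    uncertainty = 2 ** r - 1         # non-empty subsets it could stand for
    return uncertainty * weight * (sup / n_transactions)

# e.g., (401.1, 401.2) replaces two codes; weight 1.0; support 3 of 4 records
print(utility_loss(("401.1", "401.2"), 1.0, 3, 4))  # 2.25
```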
Data utility – measures
Task-based measure for transaction data
Relative Error (RE)* - captures the fraction of records that are retrieved incorrectly when COUNT() query is applied to anonymized data
* Loukides et al. COAT: Constraint-based Anonymization of Transactions Knowledge and Information Systems, 2011.
COUNT(*) from T where Diagnosis is “401.2”
RE(q) = |act(q) − est(q)| / act(q), where act(q) is the answer of q on the original data and est(q) = |g| × p is the answer estimated from the anonymized data (|g| occurrences of a generalized item, each matching the queried item with probability p)
Anonymized data:
Diagnoses
(401.1, 401.2)
(401.1, 401.2), 401.3
(401.1, 401.2), 401.4
401.3

Original data:
Diagnoses
401.1
401.2, 401.3
401.1, 401.4
401.3
37
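A minimal sketch of answering the COUNT query on the generalized data and measuring RE, under the assumption that a generalized item of r original codes matches the queried code with probability 1/r (a common uniformity assumption; the paper's estimator may differ).

```python
def estimated_count(query_item, generalized_data):
    """Expected number of matching transactions in the generalized data;
    each transaction is a list of tuples of original codes."""
    est = 0.0
    for t in generalized_data:
        for g in t:
            if query_item in g:
                est += 1.0 / len(g)
    return est

def relative_error(act, est):
    return abs(act - est) / act

orig = [[("401.1",)], [("401.2",), ("401.3",)],
        [("401.1",), ("401.4",)], [("401.3",)]]
anon = [[("401.1", "401.2")], [("401.1", "401.2"), ("401.3",)],
        [("401.1", "401.2"), ("401.4",)], [("401.3",)]]
act = sum(1 for t in orig if ("401.2",) in t)   # actual answer: 1
est = estimated_count("401.2", anon)            # 3 x 1/2 = 1.5
print(relative_error(act, est))                 # 0.5
```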
Data utility – guarantees
There are 2^|I| ways to generalize the data, and |I| is in the order of thousands!
Data publishers are interested in only certain of these ways
Utility constraint*: controls how items may be generalized, which leads to useful solutions
e.g., (Cold, Cough) – "I am interested in patients with cold or cough"
* Loukides et al. COAT: Constraint-based Anonymization of Transactions Knowledge and Information Systems, 2011.
38
Data utility – guarantees
Utility constraint satisfaction*: a specified generalized item will not be more general than required
Utility-constrained anonymization*: data remain as useful as the original for counting aggregate concepts
COUNT(*) from T where Diagnosis is 401.1 "Cold" or 401.2 "Cough"

Anonymized data:
Diagnoses
(401.1, 401.2)
(401.1, 401.2), 401.3
(401.1, 401.2), 401.4
401.3

Original data:
Diagnoses
401.1
401.2, 401.3
401.1, 401.4
401.3
Important applications in biomedicine (later on)
* Loukides et al. COAT: Constraint-based Anonymization of Transactions Knowledge and Information Systems, 2011.
39
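To illustrate the guarantee on toy data mirroring the slide, the sketch below (count_concept is a hypothetical helper) shows that a COUNT over the aggregate concept {401.1, 401.2} returns the same answer on the original and the anonymized data, because no generalized item is more general than the utility constraint itself.

```python
def count_concept(concept, data):
    """Transactions containing at least one item of the concept; entries
    may be original codes (strings) or generalized items (tuples)."""
    def entry_matches(entry):
        items = entry if isinstance(entry, tuple) else (entry,)
        return any(i in concept for i in items)
    return sum(1 for t in data if any(entry_matches(e) for e in t))

concept = {"401.1", "401.2"}
orig = [["401.1"], ["401.2", "401.3"], ["401.1", "401.4"], ["401.3"]]
anon = [[("401.1", "401.2")], [("401.1", "401.2"), "401.3"],
        [("401.1", "401.2"), "401.4"], ["401.3"]]
print(count_concept(concept, orig), count_concept(concept, anon))  # 3 3
```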
Overview
Anonymization Algorithm: de-identified data in, anonymized data out
Privacy component (privacy model): extract / specify what needs protection; evaluate anonymized data vs. the privacy model
Utility component (utility model): extract / specify what is useful; evaluate anonymized data vs. the utility policy
40
Anonymization algorithms (since 2011)
Algorithm | Privacy model | Search strategy | Transform. strategy | Utility constr.
COAT | Priv. constr. an. | Greedy search | Set-based an. | Yes
PCTA | Priv. constr. an. | Item clustering | Generalization | No
UPCTA | Priv. constr. an. | Item clustering | Set-based an. | Yes
UAR | Priv. constr. an. | Priv. constr. reordering | Set-based an. | Yes
RBAT | PS-rule based | Top-down partitioning | Generalization | No
Tree-based | PS-rule based | Top-down & bottom-up partitioning | Generalization | No
Sample-based | PS-rule based | Sample-based partitioning & cut revision | Generalization | No
41
4 specialized algorithms for privacy-preserving medical data sharing
Anonymization algorithms – Tree-based
Start with all items generalized into one generalized item
Split it into two to enhance data utility (more specific generalized items), e.g., (c,d) and (h,i)
Check if rules are protected by computing their support and confidence in the temporary dataset
Continue splitting to enhance utility
Return the anonymized dataset (see the sketch below)
42
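The control flow of this top-down splitting can be sketched as follows; it is a simplified illustration of the idea (naive halving splits and a pluggable rules_protected check), not the published Tree-based/RBAT algorithm, whose splits are utility-aware.

```python
def split(gen_item):
    """Naive 2-way split of a generalized item (illustrative only)."""
    items = sorted(gen_item)
    mid = len(items) // 2
    return frozenset(items[:mid]), frozenset(items[mid:])

def anonymize(public_items, rules_protected):
    """Top-down partitioning: keep a split only if all rules stay protected."""
    cut, frontier = [], [frozenset(public_items)]
    while frontier:
        g = frontier.pop()
        if len(g) > 1:
            left, right = split(g)
            if rules_protected(cut + frontier + [left, right]):
                frontier += [left, right]   # accept the split, go deeper
                continue
        cut.append(g)                       # keep g as-is in the final cut
    return cut

# With a permissive check, every item ends up fully specialized:
print(anonymize("abcd", lambda cut: True))
```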
Anonymization algorithms – Tree-based
43
Can we do better?
Efficiency: rule checking is computationally expensive, particularly in the dataset resulting from the first few splits
Data utility: Tree-based may “stop” early (i.e., data can be split more to increase utility) due to non-monotonicity of confidence
Generalized items by Tree-based
More specific generalized items that can be constructed
Anonymization algorithms – Sample-based
Start with all items generalized into one
Split it into two to enhance data utility
Check to see if rules have enough support in a random sample (“sufficiently” close to the dataset)
Phase 1: Sample-based Partitioning
Theorem*: how to determine a "good" sample size
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
44
Anonymization algorithms – Sample-based
Start with all items generalized into one
Split it into two to enhance data utility
Check to see if rules have enough support in a random sample (“sufficiently” close to the dataset)
Phase 1: Sample-based Partitioning
Phase 2: Top-down cut revision
Split generalized items
Check to see if rules have enough support in the dataset
Continue splitting to enhance utility
45
In Phase 2, we want to avoid the early stopping problem of Tree-based and help data utility.
Anonymization algorithms – Sample-based
Phase 1: Sample-based Partitioning
Phase 2: Top-down cut revision
Phase 3: Bottom-up cut revision
Merge generalized items with their siblings
Check rules for protection in the dataset
Continue as long as all rules are protected
Return the anonymized dataset
46
At this stage, data are not anonymous, because we did not check for rule confidence. In phase 3, we ensure data protection.
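A hedged sketch of which test runs in which phase: Phase 1 applies a cheap support-only test to a random sample, while Phase 3 applies the full support-and-confidence test to the complete dataset; the function names and toy data are illustrative, not the published algorithm.

```python
import random

def support(itemset, dataset):
    return sum(1 for t in dataset if itemset <= t)

def phase1_support_ok(rules, sample, k):
    # Phase 1: support-only test on the sample; confidence is deferred
    return all(support(I, sample) >= k for I, _ in rules)

def phase3_rules_ok(rules, dataset, k, c):
    # Phase 3: full support + confidence test on the complete dataset
    def conf(I, J):
        s = support(I, dataset)
        return support(I | J, dataset) / s if s else 0.0
    return all(support(I, dataset) >= k and conf(I, J) <= c
               for I, J in rules)

D = [set("abchij"), set("abcefi"), set("cdg"), set("cdj")]
rules = [(frozenset("a"), frozenset("j")), (frozenset("cd"), frozenset("g"))]
random.seed(0)
sample = random.sample(D, 2)                 # the random sample of Phase 1
print(phase1_support_ok(rules, sample, k=1)) # depends on the sample drawn
print(phase3_rules_ok(rules, D, k=2, c=0.5)) # True for the slide's example
```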
Datasets – BMS1, BMS2: click-stream data; POS: sales transaction data
Evaluation
Effectiveness: utility using ARE (query answering measure) and protection
Efficiency
Scalability
*M. Terrovitis, N. Mamoulis, P. Kalnis. Privacy-preserving anonymization of set-valued data, PVLDB, 2008. ** Xu et al. Anonymizing transaction databases for publication. KDD, 2008.
Anonymization algorithms – Experimental evaluation
47
Baseline (no rule pruning)
*M. Terrovitis, N. Mamoulis, P. Kalnis. Privacy-preserving anonymization of set-valued data, PVLDB, 2008. ** Xu et al. Anonymizing transaction databases for publication. KDD, 2008.
Anonymization algorithms – Competitors
Apriori Anonymization* (sketch)
  Start with the original data
  For j = 1 to m
    For each transaction T
      Find all j-itemsets with support less than k
      For each of these itemsets
        Generate all possible generalizations
        Find the generalization that satisfies k^m-anonymity and has minimum information loss

Greedy**
  Protects all p-itemsets induced from public items from identity disclosure, and considers all non-public items as sensitive
  Employs suppression
48
Anonymization algorithms – Effectiveness (Data Utility)
Splitting is good for utility, yet it considers only O(|P|) ≪ O(2^|P|) generalizations
Avoid overprotecting data
* Wong et al. Minimality attack in privacy-preserving data publishing. VLDB, 2007. ** Xiao et al. Transparent anonymization: thwarting adversaries who know the algorithm. ACM TODS, 2010.
Work well with various privacy requirements
49
Anonymization algorithms – Effectiveness (Data Protection)
Protection from both identity and sensitive information disclosure (due to PS-rule based model)
Protection even when the attacker knows all public items and the workings of the anonymization algorithm (due to the algorithmic design) – minimality* and transparency** attacks
* Wong et al. Minimality attack in privacy-preserving data publishing. VLDB, 2007. ** Xiao et al. Transparent anonymization: thwarting adversaries who know the algorithm. ACM TODS, 2010.
50
Anonymization algorithms – Efficiency
Checking all PS-rules for protection is computationally expensive (a dataset scan for each rule)
We can "prune" certain types of rules (i.e., not check them for protection)
If c → g is protected, then so is c → gj
Before anonymization: if sup(I, D) ≥ k and sup(J, D) ≤ c × k, then I → J is protected in any generalized dataset that can be constructed from D
During anonymization: if I → J is protected, we can prune all the rules whose antecedent is I and whose consequent is a superset of J
… more strategies in *
* Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Information Systems, In press.
51
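The "before anonymization" test can be coded directly from the inequality above; the reasoning is that generalization can only increase the antecedent's support while sensitive items stay intact, so conf(I → J) ≤ sup(J)/k ≤ c. A small sketch on the slide's toy transactions:

```python
def support(itemset, dataset):
    return sum(1 for t in dataset if itemset <= t)

def prunable_before(I, J, dataset, k, c):
    """If sup(I, D) >= k and sup(J, D) <= c * k, the rule I -> J stays
    protected in any generalized dataset derived from D: no re-checking."""
    return support(I, dataset) >= k and support(J, dataset) <= c * k

D = [set("abchij"), set("abcefi"), set("cdg"), set("cdj")]
# sup(c) = 4 >= k and sup(g) = 1 <= c * k = 1, so c -> g never needs checking
print(prunable_before(frozenset("c"), frozenset("g"), D, k=2, c=0.5))  # True
```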
Anonymization algorithms – Efficiency
We can find which rules to check efficiently (helps pruning)
Rule-tree data structure
e.g., we check only rules that contain a certain item
52
Anonymization algorithms – Scalability
Worst-case time complexity: O(2^|P| × |S| × N); |P| is small in practice
Worst-case space complexity: O(2^|P| × |S| + N × |I|)
53
Anonymization algorithms – Applications / Case studies
* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants’ Privacy. Journal of American Medical Informatics Association, 2010. (AMIA Best paper award) ** Loukides et al. Anonymization of electronic medical records for validating genome-wide association studies. Proc. of the Nat. Acad. of Science (PNAS), 2010.
[Figure: % of re-identified patients (60%–100%) vs. distinguishability (1–1000) for GWAS-related diseases]
96.5% of patients are vulnerable [4]
Anonymized data that remain useful for GWAS**
Patient identities can be disclosed through diagnosis codes
Acknowledged as an important advance in the last 10 years of genomic research in a Nature paper by E. Green, NHGRI Director
Algorithms for EMRs
54
Anonymization algorithms – Applications / Case studies
* Loukides et al. Utility-Aware Anonymization of Diagnosis Codes, IEEE Transactions on Information Technology in Biomedicine, In press.
Anonymized data that support clinical case count studies
Low ARE and support for clinical case count studies
Data in the VUMC biobank can be anonymized and remain useful
VUMC Biobank (diagnosis, DNA) of 79K patients
Largest dataset in medical data privacy; VUMC experts' requirements
55
Conclusions
Need for sharing data that remains protected
Several methods to do so
Effective and efficient approaches for transaction data and some of their applications
… but still a long way to go: complex & heterogeneous data; integration with other data management operations; use in applications; software
56
Is preserving privacy easy?
Can we simply remove everything that "looks" sensitive?
A lawsuit was filed, and Netflix settled it: "We will find new ways to collaborate with researchers"
57
We need better solutions!
Smarter cities initiative means doing better in
Adapted from http://www.ibm.com/smarterplanet 58
Background – attacks
Sensitive information disclosure cannot be prevented by guarding against identity disclosure* **
Specialized methods to prevent this attack are required
* Loukides et al. Preventing range disclosure in k-anonymised data. Expert Systems with Applications, 2011.
** Loukides et al. Efficient and flexible transaction data anonymization. Knowledge and Inf. Systems, In press.
59
Anonymization algorithms – Applications / Case studies
* Loukides et al. On balancing disclosure risk and data utility in transaction data sharing using R-U confidentiality map. Joint UNECE/Eurostat work session on SDC, 2012.
Construct anonymizations with a desired trade-off
Allow intuitive comparison between different methods
R-U map and knee point method to track utility/privacy trade-off*
60
Content
Motivation – setting, applications, need & benefits
Background – data, attacks, methods, scenarios
Efficient and flexible anonymization: privacy models, data transformation strategies, algorithms, applications
Conclusions
62
Overview
Anonymization Algorithm: de-identified data in, anonymized data out
Privacy component (privacy model): extract / specify what needs protection; evaluate anonymized data vs. the privacy model
Utility component (utility model): extract / specify what is useful; evaluate anonymized data vs. the utility policy
63
Objectives
Privacy How to publish protected data? How to capture data publishers’ privacy requirements?
Data utility How to measure data utility? How to guarantee data utility?
Algorithmic design How to design effective, efficient, scalable algorithms?
Applications / Case studies How to address real anonymization problems?
64
Datasets – BMS1, BMS2 contain click-stream data and POS contains sales transaction data
Evaluation
Effectiveness: ARE (query answering measure), protection
Efficiency, scalability
Methods: Tree-based, Sample-based vs. Baseline (no pruning), Apriori Anonymization*, Greedy**
*M. Terrovitis, N. Mamoulis, P. Kalnis. Privacy-preserving anonymization of set-valued data, PVLDB, 2008. ** Xu et al. Anonymizing transaction databases for publication. KDD, 2008.
Anonymization algorithms – Experimental evaluation
65
Privacy-preserving data sharing – Models / algorithms
Methods: generalization, suppression, encryption, ???
66
Privacy-preserving data sharing – Applications / Case studies
[Figure: % of re-identified patients (60%–100%) vs. distinguishability (1–1000) for GWAS-related diseases]
96.5% of patients are vulnerable [4]
Anonymized data guaranteed to remain useful for genetic studies [5]
“One of the important advances in the last 10 years of genomic research” – E. Green, director of the National Human Genome Research Institute
67
Privacy-preserving data sharing – Prototypes
68
Thanks & Acknowledgements
References
Aris Gkoulalas-Divanis (Ireland), Michail Vlachos (Zurich)
Bradley Malin, Joshua Denny
Robert Gwadera
1. L. Sweeney. k-anonymity: a model for protecting privacy. Int. Journal of Uncertainty, Fuzziness, and Knowledge-Based Systems, 2002.
2. Ponemon Institute/Symantec Corporation. 2010 Annual Study: US Cost of a Data Breach.
3. Open Data White Paper – Unleashing the Potential, 2012. http://data.gov.uk/sites/default/files/Open_data_White_Paper.pdf
Questions, comments, requests for materials [email protected] or
http://users.cs.cf.ac.uk/G.Loukides
69
Background – scenarios

Interactive: privacy-preserving query answering – data users send a data request to a data repository and receive a privacy-protected result
Strong privacy, but few applications and short data lifespan

Non-interactive: data owners provide their data to a data publisher, who releases anonymized data to data users
Most popular, but several privacy and data utility challenges
70
Background – attacks
Membership disclosure: individuals' presence in the published dataset is inferred
Discrimination (refused employment, higher insurance premiums, …)
* Nergiz et al. Hiding the presence of individuals from shared databases. SIGMOD, 2007.
** Homer et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLOS Genetics, 2008.
Crime, financial, medical data*: presence in crime, bankruptcy, or HIV-positive databases
DNA mixture data**: not all participants suffer from a genetic disorder, but patients who do can still be identified
dbGaP withdrew open access to GWAS study results
71
[Figure: linking individuals' identities to their DNA through mixture DNA]