Anonymizing Sequential Releases


Anonymizing Sequential Releases

ACM SIGKDD 2006

Benjamin C. M. Fung

Simon Fraser University

[email protected]

Ke Wang

Simon Fraser University

[email protected]

Motivation: Sequential Releases

• Previous works address single release only.

• Data are typically released sequentially in multiple versions:
– New information becomes available.
– A tailored view is prepared for each data sharing purpose.
– Sensitive and identifying information are released separately.

T2: Previous Release

Pid  Job       Disease
1    Banker    Cancer
2    Banker    Cancer
3    Clerk     HIV
4    Driver    Cancer
5    Engineer  HIV

T1: Current Release

Pid  Name   Job       Class
1    Alice  Banker    c1
2    Alice  Banker    c1
3    Bob    Clerk     c2
4    Bob    Driver    c3
5    Cathy  Engineer  c4

The join on T1.Job = T2.Job

Pid  Name   Job       Disease  Class
1    Alice  Banker    Cancer   c1
2    Alice  Banker    Cancer   c1
-    Alice  Banker    Cancer   c1
-    Alice  Banker    Cancer   c1
3    Bob    Clerk     HIV      c2
4    Bob    Driver    Cancer   c3
5    Cathy  Engineer  HIV      c4

We do not want Name to be linked to Disease in the join of the two releases.

The join sharpens identification: {Bob, HIV} has group size 1 in the join above.

The join weakens identification: {Alice, Cancer} has group size 4 in the join above.

A lossy join can thus be used to combat the join attack.

The join also enables inferences across tables: Alice → Cancer holds with 100% confidence.
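To make the attack concrete, here is a minimal sketch (plain Python, my own illustration rather than code from the paper) that reproduces the group sizes and the confidence value quoted above.

```python
from collections import Counter

T1 = [  # current release: (Pid, Name, Job, Class)
    (1, "Alice", "Banker", "c1"), (2, "Alice", "Banker", "c1"),
    (3, "Bob", "Clerk", "c2"), (4, "Bob", "Driver", "c3"),
    (5, "Cathy", "Engineer", "c4"),
]
T2 = [  # previous release: (Pid, Job, Disease)
    (1, "Banker", "Cancer"), (2, "Banker", "Cancer"),
    (3, "Clerk", "HIV"), (4, "Driver", "Cancer"), (5, "Engineer", "HIV"),
]

# The attacker's view: join on T1.Job = T2.Job.
join = [(name, job1, disease)
        for (_, name, job1, _) in T1
        for (_, job2, disease) in T2 if job1 == job2]

groups = Counter((name, disease) for (name, _, disease) in join)
print(groups[("Bob", "HIV")])       # 1 -- the join sharpens identification
print(groups[("Alice", "Cancer")])  # 4 -- the join weakens identification

# Confidence of Alice -> Cancer: fraction of Alice's join rows with Cancer.
alice = [disease for (name, _, disease) in join if name == "Alice"]
print(alice.count("Cancer") / len(alice))  # 1.0 -- 100% confidence
```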

Related Work

• k-anonymity [SS98, FWY05, BA05, LDR05, WYC04, WLFW06]

– Quasi-identifier (QID): e.g., {Job, birth date, Zip}.

– The database is made anonymous to its local QID.

– In sequential releases, the database must be made anonymous to a global QID spanning the join of all releases thus far.

– Typical table schema: Explicit ID (removed) | QID (anonymized to groups of size ≥ k) | Sensitive attributes.

Related Work

• l-diversity [MGK06]

– Sensitive values are “well-represented” in each QID group (measured by entropy).

• Confidence limiting [WFY05, WFY06]:

  qid → s, confidence < h

  where qid is a QID group and s is a sensitive value.

Related Work

• View releases
– T1 and T2 are two views in one release; both can be modified before the release.
– [MW04, DP05] measure the information disclosure of a view set wrt a secret view.
– [YWJ05, KG06] detect privacy violation by a view set over a base table.
– These works detect, but do not eliminate, violations.

Sequential Release

• Sequential release:
– Current release T1; previous release T2.
– T1 was unknown when T2 was released.
– T2 cannot be modified when T1 is released.

• Solution #1: k-anonymize all attributes in T1 – causes excessive distortion.

• Solution #2: generalize T1 based on T2 – the later release is monotonically more distorted.

• Solution #3: anonymize a complete cohort of all potential releases at one time – requires predicting all future releases.

Intuition of Our Approach

• A lossy join hides the true join relationship to cripple a global QID.

• Generalize T1 so that the join with T2 becomes lossy enough to disorient the attacker.

• Two general privacy notions: (X,Y)-anonymity and (X,Y)-linkability, where X and Y are sets of attributes.

(X,Y)-Privacy

• k-anonymity: # of distinct records for each QID group ≥ k.

• (X,Y)-anonymity: # of distinct Y values for each X group ≥ k.

• (X,Y)-linkability: the maximum confidence of having a Y value given having an X value is ≤ k.

• Generalize k-anonymity [SS98] and confidence limiting [WFY05, WFY06].
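As a concrete reference point, both measures can be checked directly on a table. The sketch below is my own illustration (not code from the paper); it treats a table as a list of dicts and X, Y as attribute lists.

```python
from collections import defaultdict

def xy_anonymity(table, X, Y):
    """Smallest number of distinct Y values linked to any X group."""
    groups = defaultdict(set)
    for row in table:
        groups[tuple(row[a] for a in X)].add(tuple(row[a] for a in Y))
    return min(len(ys) for ys in groups.values())

def xy_linkability(table, X, Y):
    """Largest confidence with which some Y value can be inferred from an X group."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[a] for a in X)].append(tuple(row[a] for a in Y))
    confidences = []
    for ys in groups.values():
        counts = defaultdict(int)
        for y in ys:
            counts[y] += 1
        confidences.append(max(counts.values()) / len(ys))
    return max(confidences)
```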

Example: (X,Y)-Anonymity

Pid  Job     Zip  PoB     Test
1    Banker  123  Canada  HIV
1    Banker  123  Canada  Diabetes
1    Banker  123  Canada  Eye
2    Clerk   456  Japan   HIV
2    Clerk   456  Japan   Diabetes
2    Clerk   456  Japan   Eye
2    Clerk   456  Japan   Heart

• k-anonymity uses the # of records as anonymity, and fails to ensure k distinct patients.

Example: (X,Y)-Anonymity

• Anonymity wrt patients (instead of records):
– X = {Job, Zip, PoB} and Y = Pid
– Each X group is linked to at least k distinct values on Pid.

• Anonymity wrt tests:
– X = {Job, Zip, PoB} and Y = Test
– Each X group is linked to at least k distinct tests.
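Using the xy_anonymity sketch from the (X,Y)-Privacy slide (an assumed helper, not from the paper), both requirements can be checked on the example table:

```python
rows = [
    {"Pid": 1, "Job": "Banker", "Zip": "123", "PoB": "Canada", "Test": t}
    for t in ("HIV", "Diabetes", "Eye")
] + [
    {"Pid": 2, "Job": "Clerk", "Zip": "456", "PoB": "Japan", "Test": t}
    for t in ("HIV", "Diabetes", "Eye", "Heart")
]

X = ["Job", "Zip", "PoB"]
print(xy_anonymity(rows, X, ["Pid"]))   # 1 -- only one distinct patient per X group
print(xy_anonymity(rows, X, ["Test"]))  # 3 -- each X group links to >= 3 distinct tests
# The table has 7 records, so record-based k-anonymity looks fine for k <= 3,
# even though it never guarantees k distinct patients.
```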

Example: (X,Y)-Linkability

Pid  Job     Zip  PoB     Test
1    Banker  123  Canada  HIV
2    Banker  123  Canada  HIV
3    Banker  123  Canada  HIV
4    Banker  123  Canada  Diabetes
5    Clerk   456  Japan   Diabetes
6    Clerk   456  Japan   Diabetes

• {Banker, 123, Canada} → HIV (75% confidence).
• With Y = Test, (X,Y)-linkability states that no test can be inferred from an X group with confidence > a given threshold.
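The 75% figure can be checked the same way with the xy_linkability sketch from the (X,Y)-Privacy slide (again an assumed helper):

```python
rows = (
    [{"Pid": i, "Job": "Banker", "Zip": "123", "PoB": "Canada", "Test": "HIV"}
     for i in (1, 2, 3)]
    + [{"Pid": 4, "Job": "Banker", "Zip": "123", "PoB": "Canada", "Test": "Diabetes"}]
    + [{"Pid": i, "Job": "Clerk", "Zip": "456", "PoB": "Japan", "Test": "Diabetes"}
       for i in (5, 6)]
)

# Confidence of {Banker, 123, Canada} -> HIV.
banker = [r["Test"] for r in rows if r["Job"] == "Banker"]
print(banker.count("HIV") / len(banker))  # 0.75

# Over the whole table the maximum confidence is even higher: the Clerk group
# implies Diabetes with certainty.
print(xy_linkability(rows, ["Job", "Zip", "PoB"], ["Test"]))  # 1.0
```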

Problem Statement

• The data holder made previous release T2 and now makes current release T1, where T2 and T1 are projections of the same underlying table.

• Want to ensure (X,Y)-privacy on the join of T1 and T2, where X and Y are attribute sets on the join.

• Sequential anonymization: generalize T1 on X ∩ att(T1) so that the join satisfies (X,Y)-privacy and T1 remains as useful as possible.

Generalization / Specialization

• Each generalization replaces all child values with the parent value.
– A cut contains exactly one value on every root-to-leaf path.

• Conversely, each specialization replaces the parent value with a consistent child value in the record.

Job taxonomy: ANY → {Professional, Admin}; Professional → {Engineer, Lawyer}; Admin → {Banker, Clerk}
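A small sketch (my own illustration) of how values are generalized to a cut of this taxonomy:

```python
# Parent links for the Job taxonomy above.
PARENT = {
    "Engineer": "Professional", "Lawyer": "Professional",
    "Banker": "Admin", "Clerk": "Admin",
    "Professional": "ANY", "Admin": "ANY", "ANY": None,
}

def generalize(value, cut):
    """Walk up the taxonomy from `value` until a value in the cut is reached."""
    while value is not None and value not in cut:
        value = PARENT[value]
    return value

cut = {"Professional", "Admin"}      # exactly one value on every root-to-leaf path
print(generalize("Banker", cut))     # Admin
print(generalize("Engineer", cut))   # Professional
# Specializing "Admin" back into {"Banker", "Clerk"} reverses this step.
```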

Match Function

• The attacker applies prior knowledge to match the records in T1 and T2.

• So the data holder applies the same prior knowledge during sequential anonymization.

• We consider prior knowledge:

– schema information of T1 and T2.

– taxonomies for attributes.

– the inclusion-exclusion principle.

Match Function

• Let t1 ∈ T1 and t2 ∈ T2.
• Inclusion Predicate: t1.A matches t2.A if they are on the same generalization path for attribute A.
– e.g., Male matches Single Male.

• Exclusion Predicate: t1.A matches t2.B only if they are not semantically inconsistent (based on common sense).
– To exclude impossible matches.
– e.g., Male and Pregnant are semantically inconsistent, and so are Married Male and 6-Month Pregnant.
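A minimal sketch (my own illustration) of the inclusion predicate, treating two values as a match when one is an ancestor of the other in the attribute's taxonomy:

```python
# Parent links for a small sex/marital-status style taxonomy (hypothetical values).
PARENT = {
    "Single Male": "Male", "Married Male": "Male", "Male": "ANY",
    "Single Female": "Female", "Married Female": "Female", "Female": "ANY",
    "ANY": None,
}

def ancestors(value):
    path = set()
    while value is not None:
        path.add(value)
        value = PARENT[value]
    return path

def matches(v1, v2):
    """Inclusion predicate: v1 and v2 lie on the same generalization path."""
    return v1 in ancestors(v2) or v2 in ancestors(v1)

print(matches("Male", "Single Male"))  # True  (same path)
print(matches("Male", "Female"))       # False (different branches)
# The exclusion predicate would additionally reject semantically impossible
# cross-attribute pairs, e.g. Male with Pregnant.
```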

Algorithm Overview

Top-Down Specialization

Input: T1, T2, an (X,Y)-privacy requirement, a taxonomy tree for each attribute in X1 = X ∩ att(T1).
Output: a generalized T1 satisfying the privacy requirement.

1. generalize every value of Aj to ANYj where Aj ∈ X1;
2. while there is a valid candidate in ∪Cutj do
3.   find the winner w of highest Score(w) from ∪Cutj;
4.   specialize w on T1 and remove w from ∪Cutj;
5.   update Score(v) and the valid status for all v in ∪Cutj;
6. end while
7. output the generalized T1 and ∪Cutj;

Anti-Monotone Privacy

• Theorem 1: On a single table, (X,Y)-privacy is anti-monotone wrt specialization on X: if it is violated, it remains violated after any further specialization.

• On the join of T1 and T2, (X,Y)-privacy is not anti-monotone wrt specialization of T1.
– Specializing T1 may create dangling records: e.g., after specializing "CA" into "LA" and "San Francisco", the "LA" records in T1 no longer match the "San Francisco" records in T2.

Anti-Monotone Privacy

• Theorem 2: Assume that T1 and T2 are projections of the same underlying table. Then (X,Y)-privacy on the join of T1 and T2 is anti-monotone wrt specialization of T1 on X ∩ att(T1).

Score Metric

• Each specialization gains some information and loses some privacy. We maximize the gain per unit of privacy loss (sketched below).

• InfoGain(v) is measured on T1.
• PrivLoss(v) is measured on the join of T1 and T2.
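A hedged sketch of the "gain per loss" idea. The slides do not give the exact formula; InfoGain below is the usual entropy reduction over the Class attribute, and the combination InfoGain / (PrivLoss + 1) follows the related top-down specialization work [FWY05], so treat that form as an assumption.

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(parent_labels, child_label_lists):
    """Entropy reduction on the Class attribute when a value is specialized."""
    n = len(parent_labels)
    after = sum(len(ls) / n * entropy(ls) for ls in child_label_lists)
    return entropy(parent_labels) - after

def score(parent_labels, child_label_lists, priv_loss):
    # Assumed combination of gain and loss, following [FWY05].
    return info_gain(parent_labels, child_label_lists) / (priv_loss + 1)

# Specializing ANY_Job into Professional / Admin separates the two classes perfectly:
print(info_gain(["c1", "c1", "c2", "c2"], [["c1", "c1"], ["c2", "c2"]]))  # 1.0
```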

Challenges

• Each specialization affects the matching of the join, Score(v), and the privacy check.
• Rejoining T1 and T2 for each specialization is too expensive.
• Materializing the join is impractical because a lossy join can be very large.

Our solution: incrementally maintain count statistics without executing the join
– an extension of Top-Down Specialization [FWY05, WFY05].

Empirical Study

• The Adult data set. 45222 records. Categorical attributes only.

Schema for T1 and T2:

Department         Attribute             # of Leaves   # of Levels
Taxation (T1)      Education (E)         16            5
                   Occupation (O)        14            3
                   Work-class (W)        8             5
Common (T1 & T2)   Marital-status (M)    7             4
                   Relationship (Re)     6             3
                   Sex (S)               2             2
Immigration (T2)   Native-country (Nc)   40            5
                   Race (Ra)             5             3

• T1 also contains the Class attribute (Income level).

Empirical Study

• Classification metric
– Classification error on the generalized testing set of T1.
• Distortion metric [SS98]
– 1 unit of distortion for the generalization of each value in each record.
– Normalized by the number of records.
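A minimal sketch (my own illustration) of the distortion metric as described above: one unit for every value that ends up generalized in a record, normalized by the number of records.

```python
def distortion(original, generalized):
    """Count generalized values across all records, divided by # of records."""
    changed = sum(
        1
        for orig_row, gen_row in zip(original, generalized)
        for attr in orig_row
        if orig_row[attr] != gen_row[attr]
    )
    return changed / len(original)

orig = [{"Job": "Banker", "Sex": "Male"}, {"Job": "Clerk", "Sex": "Female"}]
gen  = [{"Job": "Admin",  "Sex": "Male"}, {"Job": "Admin", "Sex": "Female"}]
print(distortion(orig, gen))  # 1.0 -- one generalized value per record on average
```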

(X,Y)-Anonymity

• TopN attributes: the most important attributes for classification.
• Join attributes are the Top3 attributes.
• X contains
– the TopN attributes in T1 (to ensure that the generalization is performed on important attributes),
– all join attributes,
– all attributes in T2 (to ensure X is global).

Distortion of (X,Y)-anonymity

• Ki denotes the key in Ti.
• XYD: our method with Y = K1.
• KAD: k-anonymization on QID = att(T1).

Classification error of (X,Y)-anonymity

• XYE: our method with Y = K1.
• XYE(row): our method with Y = {K1, K2}.
• BLE: the unmodified data.
• KAE: k-anonymization on QID = att(T1).
• RJE: removing all join attributes from T1.

(X,Y)-Linkability

• Y contains the TopN attributes.
– If they were not important, we could simply remove them.

• X contains the rest of the attributes in T1 and T2.

• Focus on classification error because no previous work studies distortion for (X,Y)-linkability.

Classification error of (X,Y)-linkability

• XYE: our method with Y = TopN.
• BLE: the unmodified data.
• RJE: removing all join attributes from T1.
• RSE: removing all attributes in Y from T1.

Scalability

• Evaluated for (X,Y)-anonymity (k = 40) and (X,Y)-linkability (k = 90%).

Conclusion

• Previous k-anonymization work focused on a single release of data.
• We studied the sequential anonymization problem, where data are released sequentially and a global QID may span several releases.

• Introduced lossy join to hide the join relationship and weaken the global QID.

• Addressed challenges due to large size of lossy join.

• Extendable to more than two releases T2,…,Tp.

References

[BA05] R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In IEEE ICDE, pages 217-228, 2005.

[DP05] A. Deutsch and Y. Papakonstantinou. Privacy in database publishing. In ICDT, 2005.

[FWY05] B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In IEEE ICDE, pages 205-216, April 2005.

[KG06] D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. In ACM SIGMOD, Chicago, IL, June 2006.

[LDR05] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In ACM SIGMOD, 2005.

[MGK06] A. Machanavajjhala, J. Gehrke, and D. Kifer. l-diversity: Privacy beyond k-anonymity. In IEEE ICDE, 2006.

[MW04] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS, 2004.

[SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In IEEE Symposium on Research in Security and Privacy, May 1998.

[WFY05] K. Wang, B. C. M. Fung, and P. S. Yu. Template-based privacy preservation in classification problems. In IEEE ICDM, pages 466-473, November 2005.

[WFY06] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's confidence: An alternative to k-anonymization. Knowledge and Information Systems: An International Journal, 2006.

[WYC04] K. Wang, P. S. Yu, and S. Chakraborty. Bottom-up generalization: A data mining solution to privacy protection. In IEEE ICDM, November 2004.

[WLFW06] R. C. W. Wong, J. Li, A. W. C. Fu, and K. Wang. (α,k)-anonymity: An enhanced k-anonymity model for privacy preserving data publishing. In ACM SIGKDD, 2006.

[YWJ05] C. Yao, X. S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In VLDB, 2005.