Differential Privacy: Is it the dawn of data science tomorrow? Dr. Zhenjie Zhang, Advanced Digital Sciences Center University of Illinois at Urbana Champaign

Page 1:

Differential Privacy: Is it the dawn of data science tomorrow?

Dr. Zhenjie Zhang, Advanced Digital Sciences Center

University of Illinois at Urbana Champaign

Page 2:

The Force of Big Data is Huge

• Health Care
  – Disease Study
• Internet-based Economy
  – E-Commerce
  – Online Advertising

Page 3:

The Dark Side: Data Privacy

• Personal Sensitive Information
  – Medical Prescription -> Disease
  – Movie Rental History -> Sexual Orientation
  – Trajectories -> Home

Page 4:

Privacy incident: the MGIC case

• Time: mid-1990s
• Publisher: Massachusetts Group Insurance Commission (MGIC)
• Data released: “anonymized” medical records
• Result: A PhD student at MIT was able to identify the medical record of the governor of Massachusetts

Medical Records

Birth Date   Gender  ZIP    Disease
1960/01/01   F       10000  flu
1965/02/02   M       20000  dyspepsia
1970/03/03   F       30000  pneumonia
1975/04/04   M       40000  gastritis

Voter Registration List

Name   Birth Date   Gender  ZIP
Alice  1960/01/01   F       10000
Bob    1965/02/02   M       20000
Cathy  1970/03/03   F       30000
David  1975/04/04   M       40000

Match: the rows of the two tables agree on Birth Date, Gender and ZIP.
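The match between the two tables can be sketched in a few lines of Python. This is only an illustration using the toy rows from the slide; the field names and the `link_records` helper are assumptions made for the example, not part of the original incident analysis.

```python
# Toy rows copied from the slide's two tables (field names are hypothetical).
medical_records = [
    {"birth_date": "1960/01/01", "gender": "F", "zip": "10000", "disease": "flu"},
    {"birth_date": "1965/02/02", "gender": "M", "zip": "20000", "disease": "dyspepsia"},
    {"birth_date": "1970/03/03", "gender": "F", "zip": "30000", "disease": "pneumonia"},
    {"birth_date": "1975/04/04", "gender": "M", "zip": "40000", "disease": "gastritis"},
]
voter_list = [
    {"name": "Alice", "birth_date": "1960/01/01", "gender": "F", "zip": "10000"},
    {"name": "Bob", "birth_date": "1965/02/02", "gender": "M", "zip": "20000"},
    {"name": "Cathy", "birth_date": "1970/03/03", "gender": "F", "zip": "30000"},
    {"name": "David", "birth_date": "1975/04/04", "gender": "M", "zip": "40000"},
]

QUASI_IDENTIFIERS = ("birth_date", "gender", "zip")


def link_records(medical, voters, keys=QUASI_IDENTIFIERS):
    """Return {name: disease} for medical records that match exactly one voter."""
    voters_by_key = {}
    for voter in voters:
        voters_by_key.setdefault(tuple(voter[k] for k in keys), []).append(voter)

    linked = {}
    for record in medical:
        candidates = voters_by_key.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            linked[candidates[0]["name"]] = record["disease"]
    return linked


reidentified = link_records(medical_records, voter_list)
print(reidentified)
```

No name appears in the medical table, yet every row is re-identified because each quasi-identifier combination is unique in the voter list.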

Page 5:

Privacy incident: the MGIC case

• Research [Golle 06] shows that 63% of Americans can be uniquely identified by {date of birth, gender, zip code}

(Figure: the Medical Records and Voter Registration List tables from the previous slide, matched on birth date, gender and ZIP)

Page 6:

Lesson Learned

• What went wrong?
• Intuition: Although the identifiers are removed from the data, some quasi-identifiers remain
• Can we solve the problem by removing quasi-identifiers?

Unfortunately, no.

(Figure: the Medical Records and Voter Registration List tables from the MGIC case)

Page 7:

Privacy incident: the AOL case

• In 2006, AOL released an “anonymized” log of its search engine to support research
• Example of the log:

User ID  Query                             Date/Time  …
4417749  “Data privacy workshop location”  …          …
4417749  “COE price”                       …          …
4417749  “Jurong Point opening hours”      …          …

• Each user only has an ID, i.e., no identifier or quasi-identifier is released
• However, the New York Times was able to identify a user from the log

Page 8:

Privacy incident: the AOL case

• What the New York Times did:
  – Find all log entries for AOL user 4417749
  – Multiple queries for businesses and services in Lilburn, GA (population 11K)
  – Several queries for Jarrett Arnold
  – Lilburn has 14 people with the last name Arnold
  – NYT contacts them, finds out AOL User 4417749 is Thelma Arnold
• The CTO of AOL resigned after the incident

Page 9:

Lesson Learned

• What went wrong?
• Intuition: Although all identifiers and quasi-identifiers are removed, the users’ behavior traces (i.e., their search keywords) reveal their identities
• The same problem occurred in another incident in 2006

Page 10:

Privacy incident: the Netflix case

• In 2006, the Netflix movie rental service released some movie ratings made by its users, for a competition with a 1M USD prize
• Example of data:

User ID  Movie          Rating  Date
123      Scary Movie 1  5       2006.07.01
123      Scary Movie 2  4       2006.07.08
123      Scary Movie 3  4       2006.07.15

• Each user only has an ID, i.e., no identifier or quasi-identifier is released
• However, two researchers from U. Texas were able to link some users to some online identities

Page 11:

Privacy incident: the Netflix case

• What the researchers did:
  – Go to the movie review site IMDB, and get the ratings made by the IMDB users, as well as the dates
  – Match an IMDB user to a Netflix user, if both users give the same ratings to the same movies on similar dates

User ID  Movie          Rating  Date
123      Scary Movie 1  5       2006.07.01
123      Scary Movie 2  4       2006.07.08
123      Scary Movie 3  4       2006.07.16

IMDB ID  Movie          Rating  Date
456      Scary Movie 1  5       2006.07.01
456      Scary Movie 2  4       2006.07.09
456      Scary Movie 3  4       2006.07.15
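The matching rule described above (same movie, same rating, similar date) can be sketched as follows. This is a simplified illustration on the slide's toy rows, not the researchers' actual algorithm; the three-day tolerance, the minimum of three matching ratings, and the function names are all assumptions chosen for the example.

```python
from datetime import date

# Toy rows from the slide: (movie, rating, rating date) per user ID.
netflix = {
    123: [("Scary Movie 1", 5, date(2006, 7, 1)),
          ("Scary Movie 2", 4, date(2006, 7, 8)),
          ("Scary Movie 3", 4, date(2006, 7, 16))],
}
imdb = {
    456: [("Scary Movie 1", 5, date(2006, 7, 1)),
          ("Scary Movie 2", 4, date(2006, 7, 9)),
          ("Scary Movie 3", 4, date(2006, 7, 15))],
}


def count_matches(netflix_ratings, imdb_ratings, max_days_apart=3):
    """Count movies rated identically by both users within max_days_apart days."""
    imdb_by_movie = {movie: (rating, day) for movie, rating, day in imdb_ratings}
    matches = 0
    for movie, rating, day in netflix_ratings:
        if movie not in imdb_by_movie:
            continue
        other_rating, other_day = imdb_by_movie[movie]
        if rating == other_rating and abs((day - other_day).days) <= max_days_apart:
            matches += 1
    return matches


def link_users(netflix_users, imdb_users, min_matches=3, max_days_apart=3):
    """Pair every Netflix ID with every IMDB ID that shares enough matching ratings."""
    return [
        (netflix_id, imdb_id)
        for netflix_id, netflix_ratings in netflix_users.items()
        for imdb_id, imdb_ratings in imdb_users.items()
        if count_matches(netflix_ratings, imdb_ratings, max_days_apart) >= min_matches
    ]


links = link_users(netflix, imdb)
print(links)
```

Allowing a few days of slack matters here: with an exact-date requirement only one of the three toy ratings would match.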

Page 12:

Privacy incident: the Netflix case

• In general, 99% of users can be identified with 8 ratings + dates
• Result: Netflix was sued; case settled out of court

(Figure: the Netflix and IMDB rating tables from the previous slide)

Page 13:

Lessons learned

• Now we know that it is risky to publish detailed records of individual data, since
  – quasi-identifiers may reveal identities
  – behavior information may reveal identities, too
• What if we don’t release detailed records, but only aggregate information?
• Answer: it could still fail to protect privacy

Page 14:

Agenda

• Basics of Differential Privacy
• Optimization and Use Cases
• Limitations of Differential Privacy

Page 15:

Is It a Privacy Leakage?

Yes and No

Page 16:

Why Yes?

• Every individual contributes to the average height

Average Height = (Total Height of Others + My Height) / Singapore Population

Page 17:

Why No?

• Nobody can infer your height from the statistic itself
• Let us consider a special case
  – Average Height: 50 cm
  – Bob is 40 cm tall, and he happens to know Stuart is 50 cm tall
  – With only three people in the data, Bob can work out that Kevin is 3 × 50 − 40 − 50 = 60 cm tall

Page 18:

What does the story tell?

• Background knowledge of adversary
  – The adversaries are not innocent!
• The impact of individual record matters
  – Including Kevin, the average height is 50 cm
  – Excluding Kevin, the average height is 45 cm
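The arithmetic behind this story is short enough to check directly. The sketch below uses the three heights that appear in the talk's example database (Bob 40, Kevin 60, Stuart 50); the variable names are only illustrative.

```python
# Heights in cm from the talk's toy database.
heights_cm = {"Bob": 40, "Kevin": 60, "Stuart": 50}


def average(values):
    values = list(values)
    return sum(values) / len(values)


# The only thing that is published: the exact average over everybody.
published_average = average(heights_cm.values())

# Background knowledge of the adversary: every height except Kevin's.
known = {"Bob": 40, "Stuart": 50}

# Undo the average to recover the one missing value.
inferred_kevin = published_average * (len(known) + 1) - sum(known.values())

# The impact of Kevin's record on the statistic.
average_without_kevin = average(known.values())

print(published_average, inferred_kevin, average_without_kevin)
```

An exact aggregate plus enough background knowledge pins down the remaining record, which is why the impact of a single record has to be hidden.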

Page 19:

Differential Privacy in a Nutshell

• Question-Answering Interface between data and human
  – No direct access to the database
  – The answer is (almost) the same, regardless of the existence of any individual record in the database
  – Even if the adversary knows everybody in the database except Kevin, he cannot infer any information about Kevin by looking at the results

Page 20:

How to enforce Differential Privacy?

(Figure: the pipeline that answers the query “Average Height?”)

• Calculation -> Exact Answer: 50 cm; Sensitivity: 17 cm (What’s the maximal impact of an individual record?)
• Random Num Generation, with Privacy Budget: 0.5 -> Noise: -3 cm
• Noise Injection -> Random Answer: 47 cm

Page 21:

Privacy Budget: Tradeoff Between Utility and Privacy

• Privacy Budget is a positive real number
  – How much privacy do you want to trade for the accuracy of the result?

(Figure: Sensitivity: 17 cm (What’s the maximal impact of an individual record?) and Privacy Budget: 0.5 feed the random number generation, which outputs Noise: -3 cm)

Noise Scale: 34 (= Sensitivity / Privacy Budget = 17 / 0.5)

Page 22:

Random Number Generation

• Dwork et al. [2006] Laplace Mechanism
  – Given the sensitivity, generate the noise from a Laplace distribution

Scale = Sensitivity / Budget

(Figure: probability density of the Laplace distribution; x-axis from -10 to 10, y-axis from 0 to 0.5)
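The rule Scale = Sensitivity / Budget can be turned into a minimal sketch of the Laplace mechanism. The function names are assumptions made for this example, and the sampler relies on the fact that the difference of two independent exponential draws follows a Laplace distribution. It is meant for illustration only: plain floating-point sampling like this is not hardened for production use.

```python
import random


def sample_laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(exact_answer, sensitivity, epsilon, rng):
    """Return the exact answer plus Laplace noise of scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    scale = sensitivity / epsilon
    return exact_answer + sample_laplace(scale, rng)


# Numbers from the running example: exact answer 50 cm, sensitivity 17 cm,
# privacy budget 0.5, so the noise scale is 17 / 0.5 = 34.
rng = random.Random(0)  # fixed seed only to make the demo repeatable
noisy_height = laplace_mechanism(exact_answer=50.0, sensitivity=17.0, epsilon=0.5, rng=rng)
print(noisy_height)
```

Each call returns a different random answer centred on the exact one; a real deployment would not fix the seed.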

Page 23:

Mathematical Interpretation of Laplace Mechanism

• Neighbor Databases

Original database:

Name    Height
Bob     40
Kevin   60
Stuart  50

Neighbor databases (each omits one record):

Name    Height
Kevin   60
Stuart  50

Name    Height
Bob     40
Kevin   60

Name    Height
Bob     40
Stuart  50

Page 24:

Mathematical Interpretation of Laplace Mechanism

• Privacy Guarantee
  – Query Q on database D
  – Any neighbor database D’
  – Any possible answer R
  – Privacy budget epsilon
  – Under Laplace mechanism, we have

$$\Pr[\tilde{Q}(D) = R] \le e^{\epsilon} \cdot \Pr[\tilde{Q}(D') = R]$$

where $\tilde{Q}$ denotes the answer to Q after the Laplace noise is added (for continuous answers, read Pr as a probability density).
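The bound behind this guarantee, namely that the likelihood of any answer R changes by at most a factor of e^ε between D and D’, can be derived in two lines. The symbols below are introduced only for this sketch: S is the sensitivity of Q, b = S/ε is the Laplace noise scale, and p_D(R) is the density of the noisy answer on database D.

```latex
\begin{align*}
\frac{p_D(R)}{p_{D'}(R)}
  &= \frac{\frac{1}{2b}\exp\left(-\frac{|R - Q(D)|}{b}\right)}
          {\frac{1}{2b}\exp\left(-\frac{|R - Q(D')|}{b}\right)}
   = \exp\left(\frac{|R - Q(D')| - |R - Q(D)|}{b}\right) \\
  &\le \exp\left(\frac{|Q(D) - Q(D')|}{b}\right)
   \le \exp\left(\frac{S}{b}\right)
   = e^{\epsilon}
\end{align*}
```

The first inequality is the triangle inequality; the second uses the definition of sensitivity as the largest change in Q between neighbor databases.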

Page 25:

Utility-Privacy Tradeoff of Laplace Mechanism

• Smaller privacy budget epsilon
  – The system behaves more similarly on neighbor databases
  – Higher noise: the noise scale is inversely proportional to the privacy budget
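A small table makes the tradeoff concrete. The sketch below assumes the running example's sensitivity of 17 cm and a handful of arbitrarily chosen epsilon values. For each it reports the Laplace noise scale (which is also the expected absolute error) and e^epsilon, the largest factor by which the likelihood of any answer may differ between neighbor databases.

```python
import math


def noise_scale(sensitivity, epsilon):
    """Laplace noise scale; also the expected absolute error of the noisy answer."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


sensitivity_cm = 17.0  # from the running average-height example
tradeoff = [
    (epsilon, noise_scale(sensitivity_cm, epsilon), math.exp(epsilon))
    for epsilon in (0.1, 0.5, 1.0, 2.0)  # illustrative budgets
]

for epsilon, scale, ratio_bound in tradeoff:
    print(f"epsilon={epsilon:<4} noise scale={scale:7.1f} cm  likelihood ratio bound={ratio_bound:.2f}")
```

Reading the rows top to bottom, the noise shrinks while the ratio bound grows: accuracy is bought with privacy.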

Page 26:

Differential Privacy is Like a Reservoir

• Each query consumes privacy budget
• When privacy budget is used up, the database cannot be queried any more

(Figure: a reservoir labelled “Privacy Budget” is drained by the queries “Average Height?”, “Number of Eyes?” and “Average Weight?”; values shown: 47 cm and 5)
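The reservoir picture corresponds to a simple accountant that subtracts each query's epsilon from a fixed total and refuses queries once nothing is left. The sketch below is one hypothetical way to write it; the class name, the total of 1.0 and the per-query cost of 0.5 are assumptions for the example.

```python
class BudgetExhausted(Exception):
    """Raised when a query asks for more privacy budget than is left."""


class PrivacyBudget:
    """Tracks the remaining budget; spent budgets add up across queries."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.remaining = float(total_epsilon)

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise BudgetExhausted(f"requested {epsilon}, only {self.remaining} left")
        self.remaining -= epsilon


budget = PrivacyBudget(total_epsilon=1.0)  # hypothetical total
answered, refused = [], []
for question in ("Average Height?", "Number of Eyes?", "Average Weight?"):
    try:
        budget.spend(0.5)  # hypothetical cost per query
    except BudgetExhausted:
        refused.append(question)
    else:
        answered.append(question)

print(answered, refused, budget.remaining)
```

In a full system `spend` would be called immediately before the noisy answer is generated, so that no answer is ever released without being paid for.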

Page 27:

Why is Differential Privacy Popular?

• Almost the strongest background knowledge assumption
• Nice composition property
• Tradeoff between privacy and utility
• High Efficiency
  – Sensitivity: Pre-calculated
  – Epsilon: specified by the user
  – Noise generation: constant time

Page 28:

Agenda

• Basics of Differential Privacy
• Optimization and Use Cases
  – Histogram Publication
  – Counting Publication
  – Data Synthesis
• Limitations of Differential Privacy

Page 29:

Do we need optimization?

• Laplace mechanism is universally applicable, but
  – The sensitivity is sometimes too high, e.g. median

Record  Value
R1      0
R2      0
R3      100
R4      100

Median=50

Record  Value
R1      0
R2      0
R3      100

Median=0
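The jump in the median can be reproduced with the standard library. The sketch below uses the slide's two toy tables; the value range of 0 to 100 is an assumption read off the example.

```python
import statistics

full_database = [0, 0, 100, 100]   # R1..R4 from the slide
neighbor_database = [0, 0, 100]    # the same data with R4 removed

median_full = statistics.median(full_database)
median_neighbor = statistics.median(neighbor_database)

# Removing a single record moves the median by half of the value range,
# so noise calibrated to this sensitivity would swamp the answer.
median_change = abs(median_full - median_neighbor)
print(median_full, median_neighbor, median_change)
```

Compare this with a count, where removing one record changes the answer by at most 1.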

Page 30:

Do we need optimization?

• Laplace mechanism is universally applicable, but
  – Budget consumption is fast under multiple queries

User   Query                       Privacy budget
Alice  What’s the average height?  0.5
Bob    What’s the average height?  0.5
Chris  What’s the average height?  0.5

Total budget consumed: 1.5, although all three users ask the same question

Page 31:

Overview of General Tricks

• Decompose privacy budget over steps
  – Histogram publication
• Transform original query into new queries with smaller sensitivity
  – Learning tasks, e.g. classification
  – Group counting queries
• Data Transformation/Compression
  – Data Synthesis

Page 32:

Query Decomposition Trick

• Recall the composition property

(Figure: Query 1 and Query 2 each draw on the privacy budget)

Page 33:

Query Decomposition: Histogram Publication

• Histogram publication is widely used in data analysis, to support all sorts of statistical queries
• Once a histogram is constructed, we can answer max, min, median and range count, without additional budget consumption

Name   Age  HIV+
Frank  42   Y
Bob    31   Y
Mary   28   Y
Dave   43   N
…      …    …

Page 34:

Query Decomposition: Histogram Publication

• Xu et al. [12] propose a two-step solution
  – Find the structure of the histogram
  – Add Laplace noise into the bins
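The second step, adding Laplace noise to the bins, can be sketched on its own. The example below does not implement the structure-finding step of Xu et al.: it assumes fixed, hand-picked bin edges, uses the four ages from the example table, and takes the sensitivity to be 1 because adding or removing one person changes exactly one bin count by 1. All function names are illustrative.

```python
import random


def sample_laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(ages, bin_edges, epsilon, rng):
    """Count ages per bin [edge_i, edge_i+1) and add Laplace(1 / epsilon) noise to each bin."""
    counts = [0] * (len(bin_edges) - 1)
    for age in ages:
        for i in range(len(counts)):
            if bin_edges[i] <= age < bin_edges[i + 1]:
                counts[i] += 1
                break  # ages outside every bin are ignored
    scale = 1.0 / epsilon  # sensitivity 1: one person affects one bin by 1
    return [count + sample_laplace(scale, rng) for count in counts]


def range_count(noisy_counts, first_bin, last_bin):
    """Answer a range query from the released histogram; costs no extra budget."""
    return sum(noisy_counts[first_bin:last_bin + 1])


rng = random.Random(0)        # fixed seed only to make the demo repeatable
ages = [42, 31, 28, 43]       # from the example table
bin_edges = [20, 30, 40, 50]  # hypothetical fixed structure: 20-29, 30-39, 40-49
released = noisy_histogram(ages, bin_edges, epsilon=1.0, rng=rng)
people_30_to_49 = range_count(released, first_bin=1, last_bin=2)
print(released, people_30_to_49)
```

Any number of range counts can be computed from `released` afterwards, since they only post-process an already noisy output.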

Page 35:

Query Transformation: Group Counting

$$Q_1 = x_{NY} + x_{NJ} + x_{CA} + x_{WA}$$
$$Q_2 = x_{NY} + x_{NJ}$$
$$Q_3 = x_{WA}$$

Answer = Workload Matrix × Data:

$$\begin{pmatrix} Q_1 \\ Q_2 \\ Q_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_{NY} \\ x_{NJ} \\ x_{CA} \\ x_{WA} \end{pmatrix}$$

Sensitivity=2

Page 36:

Query Transformation: Group Counting

• Find an approximate decomposition on the workload matrix, to reduce sensitivity

$$\underbrace{\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}}_{\text{Workload Matrix: } W} = \underbrace{\begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}}_{\text{New Workload: } W'} \underbrace{\begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}}_{\text{Full Rank Strategy Matrix: } A}$$

Sensitivity=1 (for the strategy matrix A)

Page 37:

Query Transformation: Group Counting

• Generate results by adding noise to the product of the strategy matrix and the data vector

$$\begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_{NY} \\ x_{NJ} \\ x_{CA} \\ x_{WA} \end{pmatrix} + \text{Laplace Noise}$$

Sensitivity=1, so the Laplace noise has a smaller scale
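The whole group-counting pipeline can be sketched with plain lists. The matrices are the ones shown on the slides; the state counts, the helper names and the definition of sensitivity as the largest column sum of absolute values (one individual changes one count by 1) are assumptions stated for this example. In this toy case the decomposition W = W’A happens to be exact.

```python
import random


def sample_laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def l1_sensitivity(matrix):
    """Largest column sum of absolute values: the effect of changing one count by 1."""
    return max(sum(abs(row[j]) for row in matrix) for j in range(len(matrix[0])))


W = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]        # workload matrix
W_prime = [[1, 1, 1], [1, 0, 0], [0, 0, 1]]           # new workload
A = [[1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]        # strategy matrix


def answer_workload(counts, epsilon, rng):
    """Add noise to the strategy answers A x, then recombine them with W'."""
    scale = l1_sensitivity(A) / epsilon
    noisy_strategy = [value + sample_laplace(scale, rng) for value in mat_vec(A, counts)]
    return mat_vec(W_prime, noisy_strategy)


rng = random.Random(0)           # fixed seed only to make the demo repeatable
counts = [120, 80, 300, 60]      # hypothetical x_NY, x_NJ, x_CA, x_WA
noisy_answers = answer_workload(counts, epsilon=1.0, rng=rng)
print(l1_sensitivity(W), l1_sensitivity(A), noisy_answers)
```

Because the noise is drawn at scale 1/epsilon instead of 2/epsilon, Q2 and Q3 each carry a quarter of the noise variance they would get from perturbing W x directly, and Q1, which sums three noisy terms, still comes out ahead.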

Page 38:

Data Transformation: Data Synthesis

• Compress -> Noise Insertion -> Coefficient Cutting -> Decompress

(Figure: example transforms: 2D-Wavelet and Compressive Sensing)

Page 39:

Agenda

• Basics of Differential Privacy
• Optimization and Use Cases
• Limitations of Differential Privacy

Page 40:

Data Collection and Independence Assumption

• Implicit Assumption
  – All individuals are independent of each other
  – Counterexample: HIV, genetic diseases

Name   Age  HIV+
Frank  42   Y
Bob    31   Y
Mary   28   Y
Dave   43   N
…      …    …

Page 41:

High Sensitivity Computation

• Non-Convex Optimization
  – E.g. Deep Learning

Page 42:

Operations on Database

• It is difficult to update a database
  – Can we query the database again, after certain attributes are updated?
  – The general answer is no

Page 43:

Conclusion

• Differential Privacy is the most robust privacy model known so far
• Differential privacy is practical in certain application domains, e.g., histogram and counting queries
• The applicability of differential privacy remains limited
• It is actually too difficult to understand, even for computer scientists!

Page 44:

Q&A