O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data Webcast Case Studies and Methods to Get You Started Khaled El Emam & Luk Arbuckle

Authors: Khaled El Emam, Luk Arbuckle

How can health data be released to analysts and app developers who desperately want it? Under current legislation, the use and disclosure of health data for secondary purposes is limited: patients must either consent to have their data used, and consent is often difficult to get and can lead to bias, or the data needs to be de-identified. (There are some exceptions, but we won't address them in this webinar.) To ensure that end users get data that is anonymized and highly useful, we focus on the HIPAA Privacy Rule De-identification Standard. We've built our risk-based methodology for anonymizing data around the foundation created by HIPAA's Statistical Method. In this webcast we'll share several of the case studies that we've described in our O'Reilly book Anonymizing Health Data, which is devoted to examples of how we anonymized real-world data sets. In almost every case in which we've anonymized data, there have been new and interesting challenges to overcome.

Transcript of O'Reilly Webcast: Anonymizing Health Data

Page 1: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data Webcast

Case Studies and Methods to Get You Started

Khaled El Emam & Luk Arbuckle

Page 2: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Part 1 of Webcast: Intro and Methodology

Part 2 of Webcast: A Look at Our Case Studies

Part 3 of Webcast: Questions and Answers

Khaled El Emam & Luk Arbuckle

Page 3: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Part 1 of Webcast: Intro and Methodology

Khaled El Emam & Luk Arbuckle

Page 4: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

To Anonymize or not to Anonymize

Khaled El Emam & Luk Arbuckle

Page 5: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Consent needs to be informed.

To Anonymize or not to Anonymize

Khaled El Emam & Luk Arbuckle

Page 6: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Consent needs to be informed.

Not all health care providers are willing to share their patients’ PHI.

To Anonymize or not to Anonymize

Khaled El Emam & Luk Arbuckle

Page 7: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Consent needs to be informed.

Not all health care providers are willing to share their patients’ PHI.

Anonymization allows for the sharing of health information.

To Anonymize or not to Anonymize

Khaled El Emam & Luk Arbuckle

Page 8: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Consent needs to be informed.

Not all health care providers are willing to share their patients’ PHI.

Anonymization allows for the sharing of health information.

To Anonymize or not to Anonymize

Compelling financial case. Breach cost ~$200 per patient.

Khaled El Emam & Luk Arbuckle

Page 10: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Consent needs to be informed.

Not all health care providers are willing to share their patients’ PHI.

Anonymization allows for the sharing of health information.

To Anonymize or not to Anonymize

Privacy protective behaviors by patients.

Compelling financial case. Breach cost ~$200 per patient.

Khaled El Emam & Luk Arbuckle

Page 11: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Masking Standards

Khaled El Emam & Luk Arbuckle

Page 12: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Masking Standards

First name, last name, SSN.

Khaled El Emam & Luk Arbuckle

Page 13: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Masking Standards

Distortion of data—no analytics.

First name, last name, SSN.

Khaled El Emam & Luk Arbuckle

Page 14: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Masking Standards

Creating pseudonyms.

First name, last name, SSN.

Distortion of data—no analytics.

Khaled El Emam & Luk Arbuckle

Page 15: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Masking Standards

Removing a whole field.

Creating pseudonyms.

First name, last name, SSN.

Distortion of data—no analytics.

Khaled El Emam & Luk Arbuckle

Page 16: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Masking Standards

Removing a whole field.

Creating pseudonyms.

Replacing actual values with random ones.

First name, last name, SSN.

Distortion of data—no analytics.

Khaled El Emam & Luk Arbuckle

Page 17: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

De-identification Standards

Khaled El Emam & Luk Arbuckle

Page 18: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

De-identification Standards

Age, sex, race, address, income.

Khaled El Emam & Luk Arbuckle

Page 19: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Minimal distortion of data—for analytics.

Age, sex, race, address, income.

De-identification Standards

Khaled El Emam & Luk Arbuckle

Page 20: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Minimal distortion of data—for analytics.

Age, sex, race, address, income.

De-identification Standards

Safe Harbor in HIPAA Privacy Rule.

Khaled El Emam & Luk Arbuckle

Page 21: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

What’s “Actual Knowledge”?

Privacy Rule

Safe Harbor

Khaled El Emam & Luk Arbuckle

Page 22: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

What’s “Actual Knowledge”?

Info, alone or in combo, that could identify an individual.

Khaled El Emam & Luk Arbuckle

Page 23: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

What’s “Actual Knowledge”?

Info, alone or in combo, that could identify an individual.

Has to be specific to the data set—not theoretical.

Khaled El Emam & Luk Arbuckle

Page 24: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

What’s “Actual Knowledge”?

Info, alone or in combo, that could identify an individual.

Has to be specific to the data set—not theoretical.

Occupation Mayor of Gotham.

Khaled El Emam & Luk Arbuckle

Page 25: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Heuristics, or rules of thumb.

Minimal distortion of data—for analytics.

Age, sex, race, address, income.

Safe Harbor in HIPAA Privacy Rule.

De-identification Standards

Khaled El Emam & Luk Arbuckle

Page 26: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Heuristics, or rules of thumb.

Statistical method in HIPAA Privacy Rule.

Minimal distortion of data—for analytics.

Age, sex, race, address, income.

Safe Harbor in HIPAA Privacy Rule.

De-identification Standards

Khaled El Emam & Luk Arbuckle

Presenter
Presentation Notes
A risk-based methodology is consistent with contemporary standards from regulators and governments, and is the approach we present in our book.
Page 27: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

De-identification Myths

Khaled El Emam & Luk Arbuckle

Page 28: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

De-identification Myths

Myth: It’s possible to re-identify most, if not all, data.

Khaled El Emam & Luk Arbuckle

Page 29: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

De-identification Myths

Myth: It’s possible to re-identify most, if not all, data.

Using robust methods, evidence suggests risk can be very small.

Khaled El Emam & Luk Arbuckle

Page 30: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

De-identification Myths

Myth: It’s possible to re-identify most, if not all, data.

Myth: Genomic sequences are not identifiable, or are easy to re-identify.

Using robust methods, evidence suggests risk can be very small.

Khaled El Emam & Luk Arbuckle

Page 31: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

De-identification Myths

Myth: It’s possible to re-identify most, if not all, data.

Myth: Genomic sequences are not identifiable, or are easy to re-identify.

In some cases can re-identify, difficult to de-identify using our methods.

Using robust methods, evidence suggests risk can be very small.

Khaled El Emam & Luk Arbuckle

Page 32: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

A Risk-based De-identification Methodology

Khaled El Emam & Luk Arbuckle

Presenter
Presentation Notes
This is where things get heavy. We’ll start with some basic principles.
Page 33: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

A Risk-based De-identification Methodology

The risk of re-identification can be quantified.

Khaled El Emam & Luk Arbuckle

Page 34: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

A Risk-based De-identification Methodology

The risk of re-identification can be quantified.

The Goldilocks principle: balancing privacy with data utility.

Khaled El Emam & Luk Arbuckle

Page 35: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Khaled El Emam & Luk Arbuckle

Presenter
Presentation Notes
The Goldilocks Principle: the trade-off between perfect data and perfect privacy.
Page 36: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

A Risk-based De-identification Methodology

The risk of re-identification can be quantified.

The Goldilocks principle: balancing privacy with data utility.

The re-identification risk needs to be very small.

Khaled El Emam & Luk Arbuckle

Page 37: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

A Risk-based De-identification Methodology

The risk of re-identification can be quantified.

The Goldilocks principle: balancing privacy with data utility.

De-identification involves a mix of technical, contractual, and other measures.

The re-identification risk needs to be very small.

Khaled El Emam & Luk Arbuckle

Page 38: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Steps in the De-identification Methodology

Step 1: Select Direct and Indirect Identifiers

Step 2: Setting the Threshold

Step 3: Examining Plausible Attacks

Step 4: De-identifying the Data

Step 5: Documenting the Process

Khaled El Emam & Luk Arbuckle

Page 39: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Step 1: Select Direct and Indirect Identifiers

Khaled El Emam & Luk Arbuckle

Presenter
Presentation Notes
We use masking for direct identifiers, and de-identification for indirect identifiers.
Page 40: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Direct identifiers: name, telephone number, health insurance card number, medical record number.

Step 1: Select Direct and Indirect Identifiers

Khaled El Emam & Luk Arbuckle

Presenter
Presentation Notes
Masking
Page 41: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Direct identifiers: name, telephone number, health insurance card number, medical record number.

Indirect identifiers, or quasi-identifiers: sex, date of birth, ethnicity, locations, event dates, medical codes.

Step 1: Select Direct and Indirect Identifiers

Khaled El Emam & Luk Arbuckle

Presenter
Presentation Notes
De-identification
Page 42: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Step 2: Setting the Threshold

Khaled El Emam & Luk Arbuckle

Page 43: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Maximum acceptable risk for sharing data.

Step 2: Setting the Threshold

Khaled El Emam & Luk Arbuckle

Page 44: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Maximum acceptable risk for sharing data.

Needs to be quantitative and defensible.

Step 2: Setting the Threshold

Khaled El Emam & Luk Arbuckle

Page 45: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Maximum acceptable risk for sharing data.

Needs to be quantitative and defensible.

Is the data going to be in the public domain?

Step 2: Setting the Threshold

Khaled El Emam & Luk Arbuckle

Page 46: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Maximum acceptable risk for sharing data.

Needs to be quantitative and defensible.

Is the data going to be in the public domain?

Extent of invasion of privacy when the data was shared?

Step 2: Setting the Threshold

Khaled El Emam & Luk Arbuckle

Page 47: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Step 3: Examining Plausible Attacks

Khaled El Emam & Luk Arbuckle

Page 48: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Recipient deliberately attempts to re-identify the data.

Step 3: Examining Plausible Attacks

Khaled El Emam & Luk Arbuckle

Page 49: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Recipient deliberately attempts to re-identify the data.

Recipient inadvertently re-identifies the data. “Holy smokes, I know her!”

Step 3: Examining Plausible Attacks

Khaled El Emam & Luk Arbuckle

Page 50: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Recipient deliberately attempts to re-identify the data.

Recipient inadvertently re-identifies the data.

Data breach at recipient’s site, “data gone wild”.

Step 3: Examining Plausible Attacks

Khaled El Emam & Luk Arbuckle

Page 51: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Recipient deliberately attempts to re-identify the data.

Data breach at recipient’s site, “data gone wild”.

Adversary launches a demonstration attack on the data.

Step 3: Examining Plausible Attacks

Khaled El Emam & Luk Arbuckle

Recipient inadvertently re-identifies the data.

Presenter
Presentation Notes
Yahoo!
Page 52: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Step 4: De-identifying the Data

Khaled El Emam & Luk Arbuckle

Page 53: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Step 4: De-identifying the Data

Generalization: reducing the precision of a field. Dates converted to month/year, or year.

Khaled El Emam & Luk Arbuckle

Page 54: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Step 4: De-identifying the Data

Generalization: reducing the precision of a field.

Suppression: replacing a cell with NULL. Unique 55-year-old female in birth registry.

Khaled El Emam & Luk Arbuckle

Page 55: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Step 4: De-identifying the Data

Generalization: reducing the precision of a field.

Suppression: replacing a cell with NULL.

Sub-sampling: releasing a simple random sample. 50% of data set instead of all data.

Khaled El Emam & Luk Arbuckle
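
The three techniques named in Step 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (`sex`, `age`, `admitted`), and the sample values are hypothetical and do not come from the webcast, and the helper functions are not the authors' tooling.

```python
import random
from datetime import date

# Hypothetical records; field names and values are illustrative only.
records = [
    {"sex": "F", "age": 55, "admitted": date(2010, 3, 14)},
    {"sex": "F", "age": 31, "admitted": date(2010, 3, 2)},
    {"sex": "F", "age": 29, "admitted": date(2010, 7, 21)},
    {"sex": "F", "age": 31, "admitted": date(2011, 1, 9)},
]

def generalize_date(d, level="month"):
    """Generalization: reduce a date to month/year, or to year only."""
    return d.strftime("%m/%Y") if level == "month" else d.strftime("%Y")

def suppress(record, field):
    """Suppression: replace a single cell with NULL (None in Python)."""
    out = dict(record)
    out[field] = None
    return out

def subsample(rows, fraction, seed=0):
    """Sub-sampling: release a simple random sample of the rows."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return rng.sample(rows, round(len(rows) * fraction))

generalized = [dict(r, admitted=generalize_date(r["admitted"])) for r in records]
# The 55-year-old is unique on age in this toy set, so that cell is suppressed.
released = [suppress(r, "age") if r["age"] == 55 else r for r in generalized]
half = subsample(released, 0.5)
```

In practice the choice of which cells to suppress would follow from a risk measurement rather than a hard-coded value as in this toy example.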

Page 56: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Step 5: Documenting the Process

Khaled El Emam & Luk Arbuckle

Presenter
Presentation Notes
From a regulatory perspective, it’s important to document the process that was used to de-identify the data set, as well as the results of enacting that process.
Page 57: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Step 5: Documenting the Process

Process documentation—a methodology text.

Khaled El Emam & Luk Arbuckle

Presenter
Presentation Notes
From a regulatory perspective, it’s important to document the process that was used to de-identify the data set, as well as the results of enacting that process.
Page 58: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Step 5: Documenting the Process

Results documentation—data set, risk thresholds, assumptions, evidence of low risk.

Khaled El Emam & Luk Arbuckle

Process documentation—a methodology text.

Presenter
Presentation Notes
From a regulatory perspective, it’s important to document the process that was used to de-identify the data set, as well as the results of enacting that process.
Page 59: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

Page 60: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

T1: Deliberate Attempt

Measuring Risk Under Plausible Attacks

Pr(re-id, attempt) = Pr(attempt) × Pr(re-id | attempt)

Khaled El Emam & Luk Arbuckle

Presenter
Presentation Notes
The probability of an attack will depend on the controls in place to manage the data (mitigating controls).
Page 61: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

T1: Deliberate Attempt

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

T2: Inadvertent Attempt (“Holy smokes, I know her!”)
Pr(re-id, acquaintance) = Pr(acquaintance) × Pr(re-id | acquaintance)

Presenter
Presentation Notes
On average, people tend to have 150 friends. This is called Dunbar's number.
Page 62: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

T1: Deliberate Attempt

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

T2: Inadvertent Attempt (“Holy smokes, I know her!”)

T3: Data Breach (“data gone wild”)
Pr(re-id, breach) = Pr(breach) × Pr(re-id | breach)

Presenter
Presentation Notes
Based on recent credible evidence, we know that approximately 27% of providers that are supposed to follow the HIPAA Security Rule have a reportable breach every year.
Page 63: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

T1: Deliberate Attempt

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

T2: Inadvertent Attempt (“Holy smokes, I know her!”)

T3: Data Breach (“data gone wild”)

T4: Public Data (demonstration attack)
Pr(re-id), based on data set only

Presenter
Presentation Notes
We assume that there is an adversary who has background information that can be used to launch an attack.
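
The four attack formulas share one shape: the probability that the context arises, multiplied by the probability of re-identification given that context. A minimal Python sketch follows. The function name `overall_risk` and the conditional probability of 0.1 are hypothetical, and treating T4 as a context with probability 1 is our reading of "based on data set only".

```python
def overall_risk(pr_reid_given_context, pr_context=1.0):
    """Pr(re-id, context) = Pr(context) * Pr(re-id | context).

    For public data (T4) the context probability is taken to be 1, so the
    result is just the data-set risk. That reading is an assumption.
    """
    for p in (pr_reid_given_context, pr_context):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return pr_context * pr_reid_given_context

# Hypothetical conditional risk of 0.1 under each attack.
t1 = overall_risk(0.1, pr_context=0.4)   # deliberate attempt
t3 = overall_risk(0.1, pr_context=0.27)  # breach at the recipient's site
t4 = overall_risk(0.1)                   # public release, no context discount
```

The context probabilities 0.4 and 0.27 are the values used later in the webcast's first case study; they are reused here only to make the arithmetic concrete.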
Page 64: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Choosing Thresholds

Khaled El Emam & Luk Arbuckle

Presenter
Presentation Notes
So we can measure risk under plausible attacks, but how do we set an overall risk threshold?
Page 65: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Choosing Thresholds

Khaled El Emam & Luk Arbuckle

Many precedents going back multiple decades.

Page 66: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Choosing Thresholds

Khaled El Emam & Luk Arbuckle

Many precedents going back multiple decades. Recommended by regulators.

Page 67: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Choosing Thresholds

Khaled El Emam & Luk Arbuckle

Many precedents going back multiple decades. Recommended by regulators. All based on max risk though.

Presenter
Presentation Notes
Max risk is based on the record that has the highest probability of re-identification; average risk applies when the adversary is trying to re-identify someone they know or everyone in the data set.
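
To make the max-versus-average distinction concrete, here is a small Python sketch. It assumes a common formulation that the slides do not spell out: records that share the same quasi-identifier values form an equivalence class, each record's risk is 1 divided by its class size, max risk is the largest of those values, and average risk is their mean over all records. The function names and toy rows are hypothetical.

```python
from collections import Counter

def class_sizes(rows, quasi_identifiers):
    """Count records per combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)

def max_risk(sizes):
    """Risk of the most exposed record: 1 / smallest class size."""
    return 1 / min(sizes.values())

def average_risk(sizes):
    """Mean of 1/size over records, which equals classes / records."""
    return len(sizes) / sum(sizes.values())

# Hypothetical toy data: one class of 3 records and one class of 2.
rows = (
    [{"sex": "F", "birth_decade": "1980s"}] * 3
    + [{"sex": "M", "birth_decade": "1970s"}] * 2
)
sizes = class_sizes(rows, ["sex", "birth_decade"])
risk_max = max_risk(sizes)      # 1/2
risk_avg = average_risk(sizes)  # 2 classes / 5 records
```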
Page 68: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Choosing Thresholds

Khaled El Emam & Luk Arbuckle

Many precedents going back multiple decades. Recommended by regulators. All based on max risk though.

Presenter
Presentation Notes
To set the threshold, we can look at the sensitivity of the data and the consent mechanism that was in place (invasion of privacy).
Page 69: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Part 2 of Webcast: A Look at Our Case Studies

Khaled El Emam & Luk Arbuckle

Page 70: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Cross Sectional Data: Research Registries

Khaled El Emam & Luk Arbuckle

Page 71: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Cross Sectional Data: Research Registries

Khaled El Emam & Luk Arbuckle

Better Outcomes Registry & Network (BORN) of Ontario

Page 72: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Cross Sectional Data: Research Registries

Khaled El Emam & Luk Arbuckle

Better Outcomes Registry & Network (BORN) of Ontario

140,000 births per year.

Page 73: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Cross Sectional Data: Research Registries

Khaled El Emam & Luk Arbuckle

Better Outcomes Registry & Network (BORN) of Ontario

140,000 births per year.

Cross-sectional—mothers not traced over time.

Page 74: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Cross Sectional Data: Research Registries

Khaled El Emam & Luk Arbuckle

Better Outcomes Registry & Network (BORN) of Ontario

140,000 births per year.

Cross-sectional—mothers not traced over time.

Process of getting de-identified data from a research registry.

Page 76: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Researcher Ronnie wants data!

Khaled El Emam & Luk Arbuckle

Page 77: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Researcher Ronnie wants data!

Khaled El Emam & Luk Arbuckle

919,710 records from 2005-2011

Presenter
Presentation Notes
The data he wants...
Page 79: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Choosing Thresholds

Khaled El Emam & Luk Arbuckle

Page 80: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Choosing Thresholds

Khaled El Emam & Luk Arbuckle

Average risk of 0.1 for Researcher Ronnie (and the data he specifically requested).

Presenter
Presentation Notes
Based on detailed risk assessment.
Page 81: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Choosing Thresholds

Khaled El Emam & Luk Arbuckle

0.05 if there were highly sensitive variables (congenital anomalies, mental health problems).

Average risk of 0.1 for Researcher Ronnie

Page 82: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

Page 83: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

T1: Deliberate Attempt

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

Low motives and capacity

Page 84: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

T1: Deliberate Attempt

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

Low motives and capacity; low mitigating controls.

Page 85: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

T1: Deliberate Attempt

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

Pr(attempt) = 0.4

Page 86: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

T1: Deliberate Attempt

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

T2: Inadvertent Attempt (“Holy smokes, I know her!”)
119,785 births out of 4,478,500 women (119,785 / 4,478,500 ≈ 0.027)

Presenter
Presentation Notes
Worst case is 2008, with a prevalence of 0.027.
Page 87: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

T1: Deliberate Attempt

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

T2: Inadvertent Attempt (“Holy smokes, I know her!”)
Pr(acquaintance) = 1 - (1 - 0.027)^(150/2) = 0.87

Presenter
Presentation Notes
150/2 friends because only women considered.
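
The arithmetic behind the prevalence and the acquaintance probability can be checked with a short Python sketch. The input figures come from the slides; the function name `pr_acquaintance` is hypothetical, and the formula assumes each friend independently has the same chance of appearing in the data set.

```python
def pr_acquaintance(prevalence, friends):
    """Probability that at least one of `friends` people is in the data set,
    assuming each is included independently with probability `prevalence`."""
    return 1 - (1 - prevalence) ** friends

births = 119_785      # births in the worst-case year (from the slide)
women = 4_478_500     # women in the population (from the slide)
prevalence = births / women

friends = 150 // 2    # Dunbar's number, halved because only women are considered
p_acq = pr_acquaintance(prevalence, friends)

print(round(prevalence, 3), round(p_acq, 2))
```

Rounded to the precision used on the slides, this gives a prevalence of 0.027 and an acquaintance probability of 0.87.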
Page 88: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

T1: Deliberate Attempt

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

T2: Inadvertent Attempt (“Holy smokes, I know her!”)

T3: Data Breach (“data gone wild”)
Based on historical data.

Page 89: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

T1: Deliberate Attempt

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

T2: Inadvertent Attempt (“Holy smokes, I know her!”)

T3: Data Breach (“data gone wild”)
Pr(breach) = 0.27

Page 90: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

T1: Deliberate Attempt

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

T2: Inadvertent Attempt (“Holy smokes, I know her!”)

T3: Data Breach (“data gone wild”)

T4: Public Data (demonstration attack)

Page 91: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

T1: Deliberate Attempt

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

T2: Inadvertent Attempt (“Holy smokes, I know her!”)

T3: Data Breach (“data gone wild”)

Overall risk
Pr(re-id, T) = Pr(T) × Pr(re-id | T) ≤ 0.1

Page 92: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Measuring Risk Under Plausible Attacks

Khaled El Emam & Luk Arbuckle

T2: Inadvertent Attempt (“Holy smokes, I know her!”)
Pr(acquaintance) = 1 - (1 - 0.027)^(150/2) = 0.87

Overall risk
Pr(re-id, acquaintance) = 0.87 × Pr(re-id | acquaintance) ≤ 0.1
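
Rearranging the inequality gives the largest conditional risk the data set itself may carry. The Python sketch below does that division and then, as an assumption not stated on this slide, converts the bound into a minimum equivalence class size by treating the conditional risk as 1/k. Because the threshold here is an average risk, a k chosen this way is a conservative reading rather than the authors' stated rule.

```python
import math

threshold = 0.1          # overall risk threshold from the slide
pr_acquaintance = 0.87   # probability of knowing someone in the data set

# Largest allowed Pr(re-id | acquaintance).
max_conditional = threshold / pr_acquaintance

# Assumption: conditional risk is 1/k for equivalence classes of size k,
# so the smallest acceptable class size is the ceiling of 1 / bound.
k_min = math.ceil(1 / max_conditional)

print(round(max_conditional, 3), k_min)
```

With the slide's figures the conditional risk must stay at or below roughly 0.115, which under the 1/k assumption corresponds to classes of at least 9 records.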

Page 93: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

De-identifying the Data Set

Khaled El Emam & Luk Arbuckle

Page 94: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Meeting Thresholds: k-anonymity

Khaled El Emam & Luk Arbuckle

k
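
As a reminder of what k-anonymity requires: every combination of quasi-identifier values in the released data must be shared by at least k records. A minimal Python check is sketched below. The function names, field names, and toy rows are hypothetical and are not taken from the webcast.

```python
from collections import Counter

def small_classes(rows, quasi_identifiers, k):
    """Return the equivalence classes (with sizes) that have fewer than k records."""
    sizes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {key: n for key, n in sizes.items() if n < k}

def is_k_anonymous(rows, quasi_identifiers, k):
    """True when no equivalence class is smaller than k."""
    return not small_classes(rows, quasi_identifiers, k)

# Hypothetical toy data: three records share one class, one record stands alone.
rows = [
    {"sex": "F", "birth_decade": "1980s"},
    {"sex": "F", "birth_decade": "1980s"},
    {"sex": "F", "birth_decade": "1980s"},
    {"sex": "F", "birth_decade": "1970s"},
]
qi = ["sex", "birth_decade"]
violations = small_classes(rows, qi, k=2)
```

The classes returned by `small_classes` are the candidates for further generalization or suppression before release.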

Page 96: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

De-identifying the Data Set

Khaled El Emam & Luk Arbuckle

MDOB in 1-yy; BDOB in wk/yy; MPC of 1 char.

Page 97: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

De-identifying the Data Set

Khaled El Emam & Luk Arbuckle

MDOB in 1-yy; BDOB in wk/yy; MPC of 1 char.

MDOB in 10-yy; BDOB in qtr/yy; MPC of 3 chars.

Page 98: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

De-identifying the Data Set

Khaled El Emam & Luk Arbuckle

MDOB in 1-yy; BDOB in wk/yy; MPC of 1 char.

MDOB in 10-yy; BDOB in qtr/yy; MPC of 3 chars.

MDOB in 10-yy; BDOB in mm/yy; MPC of 3 chars.

Page 99: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Year on Year: Re-using Risk Analyses

Khaled El Emam & Luk Arbuckle

Page 100: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Year on Year: Re-using Risk Analyses

Khaled El Emam & Luk Arbuckle

In 2006 Researcher Ronnie asks for 2005.

Page 101: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Year on Year: Re-using Risk Analyses

Khaled El Emam & Luk Arbuckle

In 2006 Researcher Ronnie asks for 2005 (deleted).
In 2007 Researcher Ronnie asks for 2006.

Page 102: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Year on Year: Re-using Risk Analyses

Khaled El Emam & Luk Arbuckle

In 2006 Researcher Ronnie asks for 2005.
In 2007 Researcher Ronnie asks for 2006 (deleted).
In 2008 Researcher Ronnie asks for 2007.

Page 103: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Year on Year: Re-using Risk Analyses

Khaled El Emam & Luk Arbuckle

In 2006 Researcher Ronnie asks for 2005.
In 2007 Researcher Ronnie asks for 2006.
In 2008 Researcher Ronnie asks for 2007 (deleted).
In 2009 Researcher Ronnie asks for 2008.

Page 104: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Year on Year: Re-using Risk Analyses

Khaled El Emam & Luk Arbuckle

In 2006 Researcher Ronnie asks for 2005.
In 2007 Researcher Ronnie asks for 2006.
In 2008 Researcher Ronnie asks for 2007.
In 2009 Researcher Ronnie asks for 2008 (deleted).
In 2010 Researcher Ronnie asks for 2009.

Page 105: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Year on Year: Re-using Risk Analyses

Khaled El Emam & Luk Arbuckle

In 2006 Researcher Ronnie asks for 2005.
In 2007 Researcher Ronnie asks for 2006.
In 2008 Researcher Ronnie asks for 2007.
In 2009 Researcher Ronnie asks for 2008 (deleted).
In 2010 Researcher Ronnie asks for 2009.

Can we use the same de-identification scheme every year?

Page 108: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Year on Year: Re-using Risk Analyses

Khaled El Emam & Luk Arbuckle

BORN data pertains to very stable populations.

Page 109: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Year on Year: Re-using Risk Analyses

Khaled El Emam & Luk Arbuckle

BORN data pertains to very stable populations.

No dramatic changes in the number or characteristics of births from 2005-2010.

Page 110: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Year on Year: Re-using Risk Analyses

Khaled El Emam & Luk Arbuckle

BORN data pertains to very stable populations.

No dramatic changes in the number or characteristics of births from 2005-2010.

Revisit de-identification scheme every 18 to 24 months.

Page 111: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Year on Year: Re-using Risk Analyses

Khaled El Emam & Luk Arbuckle

BORN data pertains to very stable populations.

No dramatic changes in the number or characteristics of births from 2005-2010.

Revisit de-identification scheme every 18 to 24 months.

Revisit if any new quasi-identifiers are added or changed.

Page 112: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Longitudinal Discharge Abstract Data: State Inpatient Databases

Khaled El Emam & Luk Arbuckle

Page 113: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Longitudinal Discharge Abstract Data: State Inpatient Databases

Khaled El Emam & Luk Arbuckle

Linking a patient’s records over time.

Page 114: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Longitudinal Discharge Abstract Data: State Inpatient Databases

Khaled El Emam & Luk Arbuckle

Linking a patient’s records over time.

Need to be de-identified differently.

Page 115: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

Meeting Thresholds: k-anonymity?

Khaled El Emam & Luk Arbuckle

k?

Page 118: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

De-identifying Under Complete Knowledge

Khaled El Emam & Luk Arbuckle

Page 122: O'Reilly Webcast: Anonymizing Health Data

Anonymizing Health Data

State Inpatient Database (SID) of California

Khaled El Emam & Luk Arbuckle

Page 123: O'Reilly Webcast: Anonymizing Health Data

State Inpatient Database (SID) of California

Researcher Ronnie wants public data!

Page 124: O'Reilly Webcast: Anonymizing Health Data

State Inpatient Database (SID) of California

Researcher Ronnie wants public data!

Page 125: O'Reilly Webcast: Anonymizing Health Data

State Inpatient Database (SID) of California

Page 126: O'Reilly Webcast: Anonymizing Health Data

Measuring Risk Under Plausible Attacks

Page 127: O'Reilly Webcast: Anonymizing Health Data

Measuring Risk Under Plausible Attacks

T1: Deliberate Attempt

T2: Inadvertent Attempt (“Holy smokes, I know her!”)

T3: Data Breach (“data gone wild”)

T4: Public Data (demonstration attack)

Pr(re-id) ≤ 0.09 (maximum risk)

Presenter
Presentation Notes
k = 11
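
The threshold above can be read as a k-anonymity condition. Under maximum risk, the chance of re-identifying a record is at most 1 divided by the size of the smallest group of records sharing the same quasi-identifier values, so k = 11 gives 1/11 ≈ 0.0909, roughly the 0.09 on the slide. The sketch below is a minimal illustration of that check, not the authors' tooling; the function names, field names, and toy records are all hypothetical.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def max_reid_risk(records, quasi_identifiers):
    """Maximum risk: 1 / (size of the smallest equivalence class)."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return 1.0 / min(sizes.values())


def meets_k(records, quasi_identifiers, k):
    """True when every equivalence class contains at least k records."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return min(sizes.values()) >= k


# Hypothetical toy data: 11 records with identical quasi-identifier values.
QIS = ("BirthYear", "AdmissionYear")
records = [{"BirthYear": "1971-1975", "AdmissionYear": 2007} for _ in range(11)]

print(meets_k(records, QIS, k=11))          # smallest class has 11 records
print(round(max_reid_risk(records, QIS), 4))  # 1/11
```

A single record with a unique combination of values drives the maximum risk to 1.0, which is why outliers usually end up generalized further or suppressed.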
Page 128: O'Reilly Webcast: Anonymizing Health Data

De-identifying the Data Set

Page 129: O'Reilly Webcast: Anonymizing Health Data

De-identifying the Data Set

BirthYear in 5-yy (cut at 1910-);

AdmissionYear unchanged;

DaysSinceLastService in 28-dd (cut at 7-, 182+);

LengthOfStay same as DaysSinceLastService.

Page 130: O'Reilly Webcast: Anonymizing Health Data

De-identifying the Data Set

BirthYear in 5-yy (cut at 1910-);

AdmissionYear unchanged;

DaysSinceLastService in 28-dd (cut at 7-, 182+);

LengthOfStay same as DaysSinceLastService.
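
The generalizations listed on this slide can be sketched as simple binning functions. This is one possible reading of the notation, not the authors' implementation: it assumes "5-yy" means 5-year intervals with everything up to and including 1910 in one bottom category, and "28-dd (cut at 7-, 182+)" means 28-day intervals with values under 7 days and values of 182 days or more each collapsed into one category. Where the interval boundaries fall, and the function names, are assumptions.

```python
def generalize_birth_year(year, width=5, bottom=1910):
    """Map a birth year to a 5-year interval, bottom-coded at `bottom`."""
    if year <= bottom:
        return f"<={bottom}"
    # Intervals start the year after the bottom code: 1911-1915, 1916-1920, ...
    start = bottom + 1 + ((year - bottom - 1) // width) * width
    return f"{start}-{start + width - 1}"


def generalize_days(days, width=28, low=7, high=182):
    """Map a day count to a 28-day interval, with '<7' and '182+' end categories."""
    if days < low:
        return f"<{low}"
    if days >= high:
        return f"{high}+"
    start = low + ((days - low) // width) * width
    end = min(start + width - 1, high - 1)  # last interval stops before the top code
    return f"{start}-{end}"


# The slide applies the same scheme to LengthOfStay as to DaysSinceLastService.
print(generalize_birth_year(1975))
print(generalize_days(40))
```

AdmissionYear is left as is, so no function is needed for it.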

Presenter
Presentation Notes
Approximate complete knowledge
Page 131: O'Reilly Webcast: Anonymizing Health Data

Connected Variables

Page 132: O'Reilly Webcast: Anonymizing Health Data

Connected Variables

QI to QI

Page 133: O'Reilly Webcast: Anonymizing Health Data

Connected Variables

QI to QI

Similar QI? Same generalization and suppression.

Page 134: O'Reilly Webcast: Anonymizing Health Data

Connected Variables

QI to QI

Similar QI? Same generalization and suppression.

QI to non-QI

Page 135: O'Reilly Webcast: Anonymizing Health Data

Connected Variables

QI to QI

Similar QI? Same generalization and suppression.

QI to non-QI

Non-QI is revealing? Same suppression so both are removed.
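
The "same suppression" idea can be sketched as follows: when a quasi-identifier is suppressed in a record, any field connected to it is blanked in that record too, so the connected field cannot be used to infer the removed value. The field names (`DiagnosisCode`, `DrugName`) and the `CONNECTED` mapping are hypothetical examples, not taken from the webcast.

```python
# Hypothetical map from a quasi-identifier to the fields connected to it.
CONNECTED = {
    "DiagnosisCode": ("DrugName",),  # a drug name can reveal the diagnosis
}


def suppress_with_connections(record, field, connected=CONNECTED):
    """Return a copy of `record` with `field` and its connected fields set to None."""
    out = dict(record)
    for name in (field, *connected.get(field, ())):
        if name in out:
            out[name] = None
    return out


record = {"DiagnosisCode": "250.01", "DrugName": "insulin", "AdmissionYear": 2007}
print(suppress_with_connections(record, "DiagnosisCode"))
```

The original record is left untouched, which makes it easier to compare the data before and after suppression.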

Page 136: O'Reilly Webcast: Anonymizing Health Data

Other Issues Regarding Longitudinal Data

Page 137: O'Reilly Webcast: Anonymizing Health Data

Other Issues Regarding Longitudinal Data

Date shifting—maintaining order of records.

Page 138: O'Reilly Webcast: Anonymizing Health Data

Other Issues Regarding Longitudinal Data

Date shifting—maintaining order of records.

Long tails—truncation of records.

Page 139: O'Reilly Webcast: Anonymizing Health Data

Other Issues Regarding Longitudinal Data

Date shifting—maintaining order of records.

Long tails—truncation of records.

Adversary power—assumption of knowledge.
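
Two of the points above lend themselves to a short sketch. Shifting each date independently can reorder a patient's visits, so one simple way to keep the order is to apply a single random offset to all of a patient's dates. Long tails (a few patients with very many records) can be handled by keeping only the first records up to a cap. Both functions are illustrative assumptions, not the method described in the webcast, and the 30-day window and the cap are arbitrary.

```python
import random
from datetime import date, timedelta


def shift_patient_dates(dates, max_shift_days=30, rng=None):
    """Shift all of one patient's dates by the same random offset.

    Using one offset per patient preserves both the order of the records
    and the intervals between them.
    """
    rng = rng or random.Random()
    offset = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    return [d + offset for d in dates]


def truncate_long_tail(visits, max_visits):
    """Keep at most `max_visits` records for a patient, dropping the tail."""
    return visits[:max_visits]


visits = [date(2007, 1, 5), date(2007, 2, 2), date(2007, 6, 30)]
shifted = shift_patient_dates(visits, rng=random.Random(0))
print(shifted == sorted(shifted))  # order is preserved
```

Preserving exact intervals is a trade-off: it keeps the data useful for analysis, but an adversary who knows the gaps between visits can still use them, which is where the assumption about adversary knowledge comes in.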

Page 140: O'Reilly Webcast: Anonymizing Health Data

Other Concerns to Think About

Page 141: O'Reilly Webcast: Anonymizing Health Data

Other Concerns to Think About

Free-form text—anonymization.

Page 142: O'Reilly Webcast: Anonymizing Health Data

Other Concerns to Think About

Free-form text—anonymization.

Geospatial information—aggregation and geoproxy risk.

Page 143: O'Reilly Webcast: Anonymizing Health Data

Other Concerns to Think About

Free-form text—anonymization.

Geospatial information—aggregation and geoproxy risk.

Medical codes—generalization, suppression, shuffling (yes, as in cards).

Page 144: O'Reilly Webcast: Anonymizing Health Data

Other Concerns to Think About

Free-form text—anonymization.

Geospatial information—aggregation and geoproxy risk.

Medical codes—generalization, suppression, shuffling (yes, as in cards).

Secure linking—linking data through encryption before anonymization.
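
As a small illustration of generalizing medical codes, a hierarchical code can be rolled up to its parent category. The sketch assumes ICD-9-CM style codes stored with a decimal point, where the part before the point is the category; the function name and the sample codes are illustrative only, and suppression and shuffling are not shown.

```python
def generalize_code(code):
    """Roll a dotted diagnosis code up to its category (the part before the dot)."""
    return code.strip().split(".", 1)[0]


codes = ["250.01", "250.60", "410.9", "486"]
print(sorted({generalize_code(c) for c in codes}))
```

Rolling codes up this way shrinks the number of distinct values, which tends to make equivalence classes larger at the cost of clinical detail.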

Page 145: O'Reilly Webcast: Anonymizing Health Data

Part 3 of Webcast: Questions and Answers

Page 146: O'Reilly Webcast: Anonymizing Health Data

More Comments or Questions: Contact us!

Page 147: O'Reilly Webcast: Anonymizing Health Data

More Comments or Questions: Contact us!

Khaled El Emam: [email protected]

Luk Arbuckle: [email protected]