
A PRIVACY ANALYTICS WHITE PAPER

The De-identification Maturity Model

Authors: Khaled El Emam, PhD, and Waël Hassan, PhD


Table of Contents

The De-identification Maturity Model
  Introduction
  DMM Structure
  Key De-identification Practice Dimension
    P1 – Ad-hoc
    P2 – Masking
    P3 – Heuristics
    P4 – Risk-based
    P5 – Governance
  The Implementation Dimension
    I1 – The Initial Level
    I2 – The Repeatable Level
    I3 – The Defined Level
    I4 – The Measured Level
    The Cumulative Nature of the Implementation Dimension
  The Automation Dimension
    A1 – Home-grown Automation
    A2 – Standard Automation
Use of the De-identification Maturity Model
  The Practice Dimension and Compliance
  Scoring
  Process Improvement
  Ambiguous Cases
    Case 1
    Case 2
    Case 3
    Case 4
  Mapping to the Twelve Characteristics
  Quantitative Assessment Scheme
References
About Privacy Analytics


The De-identification Maturity Model

Introduction

Privacy Analytics has developed the “De-identification Maturity Model” or “DMM” as a formal framework for evaluating the maturity of anonymization services within an organization. The framework gauges the level of an organization’s readiness and experience with respect to anonymization in terms of people, processes, technologies and consistent measurement practices. The DMM is used as a measurement tool and enables the enterprise to implement a fact-based improvement strategy. The DMM is a successor to a previous analysis in which we identified twelve criteria for evaluating an anonymization methodology [1], and to an earlier maturity model which we developed to assess the identifiability of data [2]. The criteria used, in both instances, were based on contemporary standards. The set of twelve criteria, although useful for general evaluation purposes, poses two challenges regarding its application: (a) how can these criteria be used to evaluate the anonymization practices of organizations, and (b) do all of the twelve criteria need to be implemented at once?

We have now developed a maturity model based on these twelve criteria: the De-identification Maturity Model. The DMM is intended to serve a number of purposes: it can (a) be used by organizations as a yardstick to evaluate their de-identification practices, (b) provide a roadmap for improvement, helping organizations to determine what they need to do next in order to improve their de-identification practices and (c) allow different units or departments within a larger organization to compare their de-identification practices in a concise and objective way.

Organizations that have a higher maturity score on the DMM are considered to have better and more sophisticated de-identification practices. Higher maturity scores indicate that the organization is able to: (a) defensibly ensure that the risk of re-identification is “very small,” (b) meet regulatory and legal requirements, (c) share more data for secondary purposes using fewer resources (greater efficiency), (d) share higher quality data that better meets the analytical needs of the data recipients, (e) de-identify data through consistent practices, and (f) better estimate the resources and time required to de-identify a data set.

DMM Structure

The DMM has five maturity levels to describe the de-identification practices that an organization has in place, with level 1 being the lowest level of maturity and level 5 the highest.

Borrowing a term from the ISO/IEC 15504 international standard [1] on software process assessment, we will assume that the scope of the DMM is an “organizational unit” (or OU for short). An OU is a general term for entities of all sizes, from small units of a few people assigned to a specific project up to whole enterprises. We therefore deliberately do not define an OU because the definition will be case specific. It may be a particular business unit within a larger enterprise, or it can be a whole ministry of health.

The DMM is a descriptive model developed through discussions with, and experiences and observations of, more than 60 OUs over the last five years. The model is intended to capture the stages that an OU passes through as it implements de-identification services. Based on our experiences, the DMM describes the evolutionary stages through which OUs achieve greater levels of sophistication and efficiency.


The DMM has three dimensions, as illustrated in Figure 1. The first dimension is the nature of the de-identification methodology that the OU has in place: this is the Key De-identification Practice dimension. The second dimension captures how well these practices are being implemented. This Implementation dimension covers elements such as proper management of de-identification practices, documentation of practices, and their measurement. The third dimension of Automation assesses the degree of automation of the de-identification process. We will examine each of these dimensions in detail below.

Figure 1: The three dimensions of the De-identification Maturity Model.

The DMM can be represented in the form of a matrix as illustrated in Figure 2. This matrix also allows for the explicit scoring of an OU’s de-identification services.


Figure 2: The de-identification maturity matrix showing all three dimensions of the DMM.

Key De-identification Practice Dimension

P1 – Ad-hoc

At this level, an OU does not have any defined practices for de-identification. The OU may not even realize that de-identification is necessary. If it is recognized as necessary, it is left to database administrators or analysts to figure out what to do, without much guidance. The methods that are used to de-identify the data are either developed in-house (e.g., invented by a programmer) or picked up by browsing the Internet. At Practice Level 1 (P1), such methods are not proven to be rigorous or defensible. OUs at this level will often lack adequate advice from their legal counsel, have a weak or non-existent privacy or compliance office, and/or have poor communication with this office.

OUs at Practice Level 1 tend to have a lot of variability in how they de-identify data, as the type and amount of de-identification applied will depend on the analyst who is performing it, and that analyst’s experience and skill. They may apply various techniques on the data, such as rudimentary masking methods, or other non-reviewed approaches. The quality of the data coming out of the OU will also vary, as the extent of de-identification may be indiscriminately high or low.

Some OUs at Practice Level 1 recognize the low maturity of their de-identification capacities and err on the conservative side by not releasing any data. For OUs at this level which do release data, a data breach would almost certainly be considered notifiable in jurisdictions where there are breach notification laws in place.


P2 – Masking

OUs at this level only implement masking techniques. Masking techniques focus exclusively on direct identifiers such as name, phone number, health plan number, and so on. They include techniques such as pseudonymization, the suppression of fields, and randomization [3].
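As a minimal illustration of these techniques (a sketch only, not the methodology described in this paper), the example below suppresses a name, pseudonymizes a medical record number with a keyed hash, and randomizes a phone number. The field names, the key handling, and the specific transformations are assumptions made for the example; note that indirect identifiers such as the diagnosis are left untouched.

```python
import hashlib
import hmac
import random
import string

# Illustrative key only; in practice this would be generated and stored securely.
SECRET_KEY = b"replace-with-a-securely-managed-key"

def pseudonymize(value: str) -> str:
    """Replace a unique direct identifier (e.g., an MRN) with a keyed pseudonym."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def randomize_phone() -> str:
    """Replace a phone number with a random surrogate that preserves the format."""
    return "613-555-" + "".join(random.choices(string.digits, k=4))

record = {"name": "Jane Doe", "mrn": "A1234567",
          "phone": "613-555-0199", "diagnosis": "J45"}

masked = {
    "name": None,                        # suppression of the field
    "mrn": pseudonymize(record["mrn"]),  # pseudonymization
    "phone": randomize_phone(),          # randomization
    "diagnosis": record["diagnosis"],    # indirect identifiers are not protected by masking
}
print(masked)
```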

As we have documented elsewhere [1], masking is not sufficient to ensure that the risk of re-identification is “very small” for the data set. Masking is necessary, but not sufficient, to protect against identity disclosure. Even if good masking techniques are used, it is possible to produce a data set with a high risk of re-identification.

According to our observations, OUs often remain at Practice Level 2 for one or a combination of these three reasons: (a) masking tool vendors erroneously convince them that masking is sufficient to ensure that the risk of re-identification is small, (b) the OU needs to disclose unique identifiers (such as social security numbers or health insurance numbers) to facilitate data matching but is uncomfortable doing so and consequently chooses to implement pseudonymization, and (c) the individuals who are tasked with implementing de-identification lack knowledge of the area and do not know about residual risks from indirect identifiers.

In healthcare, and increasingly outside of healthcare, this method of de-identification alone will not meet the expected standards for protection of personal information (see the table mapping to existing standards below). Therefore, OUs at Practice Level 2 will also have to go through a notification process if they experience a breach, in jurisdictions where there are breach notification laws.


P3 – Heuristics

At this level, OUs have masking techniques in place, and have started to add heuristic methods for protecting indirect identifiers. Heuristic methods are rules-of-thumb that are used to de-identify data. For example, the commonly used “cell size of five” is a rule of thumb, as is the heuristic that “no geographic area with less than 20,000 residents will be released”. Oftentimes these heuristics are just copied from another organization that is believed to be reputable or is perceived to have good practices for managing privacy. A discussion of heuristics can be found elsewhere [4].
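To make this concrete, the sketch below applies fixed rules of this kind to a set of records: truncating postal codes, generalizing small regions, banding ages, and suppressing any combination of indirect identifiers that appears fewer than five times. It is a hypothetical example; the field names and the specific rules are assumptions, and, as with any heuristic, nothing in it measures the actual re-identification risk or considers the context of the release.

```python
from collections import Counter

MIN_CELL_SIZE = 5              # the "cell size of five" rule of thumb
MIN_REGION_POPULATION = 20000  # only release regions with at least 20,000 residents

def apply_heuristics(records, region_populations):
    """Apply fixed rules of thumb to indirect identifiers; no risk measurement."""
    out = []
    for r in records:
        r = dict(r)
        r["postal_code"] = r["postal_code"][:3]              # keep only 3 characters
        if region_populations.get(r["region"], 0) < MIN_REGION_POPULATION:
            r["region"] = "OTHER"                            # generalize small regions
        decade = (r["age"] // 10) * 10
        r["age"] = f"{decade}-{decade + 9}"                  # 10-year age bands
        out.append(r)
    # Suppress records whose indirect-identifier combination occurs fewer than 5 times.
    cells = Counter((r["postal_code"], r["region"], r["age"]) for r in out)
    return [r for r in out
            if cells[(r["postal_code"], r["region"], r["age"])] >= MIN_CELL_SIZE]
```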

This is a significant improvement over Practice Level 2 in that the OU is starting to look at ways to protect indirect identifiers, such as location, age, dates of service, and types of service, that can be used in combination to identify individuals. However, heuristics have two key disadvantages: (a) they do not ensure that the risk of re-identification is “very small” for the OU’s particular data sets, and (b) they may result in too much distortion of the data set. The primary reasons for these disadvantages are that heuristics do not rely on measurement of re-identification risk and do not take the context of the use or disclosure of the data into account.

P4 – Risk-based

Risk-based de-identification involves the use of empirically validated and peer-reviewed measures to determine the acceptable re-identification risk and to demonstrate that the actual risk in the data sets is at or below this acceptable risk level. In addition to measurement, there are specific techniques that take into account the context of the de-identification when deciding on an acceptable risk level.

Because risk can be measured, it is possible to perform only enough de-identification on the data to meet the risk threshold, and no more. In addition to risk measurement, OUs at this level quantify information loss, or the amount of change made to the data. By considering these two types of measures, the OU can ensure that the data experiences minimal change while still meeting the risk threshold requirement. Of course, masking techniques are still used at this level to protect direct identifiers in the data set. These are assumed to be carried over from lower levels on this dimension.
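The sketch below shows one simplified way to operationalize this idea, using the size of the smallest group of records sharing the same quasi-identifier values as the risk measure, and the fraction of changed quasi-identifier cells as the information-loss measure. These particular metrics and the 0.05 threshold are assumptions for the example, not the full peer-reviewed measures referred to above; the point is only that both risk and information loss are measured and compared against explicit targets.

```python
from collections import Counter

def max_reidentification_risk(records, quasi_identifiers):
    """Simplified risk measure: 1 / size of the smallest equivalence class
    formed by the quasi-identifier values."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1.0 / min(classes.values())

def meets_threshold(records, quasi_identifiers, threshold=0.05):
    """True if the measured risk is at or below the acceptable threshold
    (the threshold itself would be set from the context of the disclosure)."""
    return max_reidentification_risk(records, quasi_identifiers) <= threshold

def information_loss(original, transformed, quasi_identifiers):
    """Crude information-loss proxy: fraction of quasi-identifier cells changed."""
    changed = sum(o[q] != t[q]
                  for o, t in zip(original, transformed)
                  for q in quasi_identifiers)
    return changed / (len(original) * len(quasi_identifiers))
```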

At this level, if practices are implemented consistently, it would be relatively straightforward to make the case that no notification is required if there is a data breach of the de-identified data.

P5 – Governance

At the highest level of maturity, masking and risk-based de-identification are applied as described in Practice Level 4. However, now there is a governance framework in place, as well as practices to implement it. Governance practices include performing audits of data recipients, monitoring changes in regulations, and having a re-identification response process.

The Implementation Dimension

The de-identification practices described above can be implemented with different levels of rigor. We define four levels of implementation: Initial, Repeatable, Defined, and Measured. Note that there is no Implementation dimension for OUs at the P1 level because there are no de-identification practices to implement. Consequently these cells are crossed out in Figure 2.


I1 – The Initial Level

At the Initial level the de-identification practices are performed by an analyst with no documented process, no specific training, and no performance measurements in place. It is experiential and the quality of the de-identification that is performed will depend largely on the skills and effort of the analyst performing it. This leads to variability in how well and how quickly de-identification can be performed. It also means that the OU is at risk of losing a significant amount of its de-identification expertise if these analysts leave or retire; there is no institutional memory being built up to maintain the practices within the OU.

I2 – The Repeatable Level

At the Repeatable level the OU has basic project management practices in place to manage the de-identification service. This means that: (a) there are roles and responsibilities defined for performing de-identification, and (b) there is a known high-level process for receiving data, de-identifying it, and then releasing it to the data users. At this level there is a basic structure in place for de-identification.

Also critical at this level is the involvement of the privacy or compliance office in helping to shape de-identification practices. Since these staff would have a more intimate understanding of legislation and regulations, their inputs are advisable at the early stages of implementing de-identification practices.

I3 – The Defined Level

The Defined level of implementation means that the de-identification process is documented and that training on it is in place. Documentation is critical because it is a requirement in privacy standards. The HIPAA Privacy Rule Statistical Method, for example, explicitly mentions documentation of the process as a compliance requirement. Training ensures that the analysts performing the de-identification will be able to do it correctly and consistently.

In order to comply with regulatory requirements, the nature of the documentation also matters. The purpose of the documentation is to demonstrate to an auditor or an investigator, in the event of a potentially adversarial situation, the precise practices that are used to de-identify the data. An auditor or investigator will be looking at the process documentation for a number of reasons, such as: (a) there has been a breach and the regulator is investigating it; (b) there has been a patient complaint about data sharing and concerns have been expressed in the media, resulting in an audit being commissioned; and/or (c) a client of the OU has complained that data they are receiving from the OU are not de-identified properly. The documentation is then necessary to make the case strongly and convincingly that adequate methods were used to de-identify the data. Based on our experience, a two or three page policy document, for example, will generally not be considered to be sufficient.

I4 – The Measured Level

The Measured level of implementation means that performance measures of the de-identification process are collected and used. Measures can be based on tracking of the data sets that are released and of any data sharing agreements. For example, the OU can examine trends of data releases and their types over time to enable better resource allocation; for instance, overlaps in data requests could lead to the creation of standard data sets. Other measures can include data user satisfaction surveys and measures of response times and delays in getting data out.

The Cumulative Nature of the Implementation Dimension

The levels in the Implementation dimension are cumulative in that it is difficult to implement a higher level without having implemented a lower level. For example, meaningful performance measures will be difficult to collect without having a defined process.


The Automation Dimension

Automation is essential for scalability. Any data set that is not trivial in size can only be de-identified using automated tools. Automation becomes critical as data sets become larger and as de-identification needs to be performed regularly (as in data feeds). Without automation, it will be difficult to believe that there is a de-identification process in place.

A1 – Home-grown Automation

An OU may attempt to develop its own scripts and tools to de-identify data sets. Based on our experiences, solutions developed in-house tend to have fundamental weaknesses in them, or to be incomplete. For example, some OUs may try to develop their own algorithms for creating pseudonyms. We have sometimes seen OUs develop their own hashing schemes for pseudonyms. We strongly advise against this, because they will almost certainly not work correctly, or will be easy to re-identify (one should only use NIST-approved hashing schemes). But even for seemingly straightforward masking schemes, de-identification can be reversed if not constructed carefully [1]. It takes a considerable amount of expertise in this area to construct effective masking techniques.
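As a hypothetical illustration of why home-grown pseudonym schemes fail, the sketch below hashes a 6-digit medical record number with an unkeyed hash. Even though the hash function itself is standard, the identifier space is so small that an adversary can enumerate every possible value and invert the pseudonyms with a simple lookup table; a keyed construction (and careful overall design) is needed to prevent this. The identifier format is an assumption made for the example.

```python
import hashlib

# Hypothetical home-grown scheme: an unkeyed, unsalted hash of a 6-digit MRN.
def naive_pseudonym(mrn: str) -> str:
    return hashlib.sha256(mrn.encode()).hexdigest()

pseudonym = naive_pseudonym("004219")  # released in the "de-identified" data set

# An adversary who knows the identifier format enumerates the entire space and
# inverts the pseudonym by lookup; no cryptographic weakness is required.
lookup = {naive_pseudonym(f"{i:06d}"): f"{i:06d}" for i in range(1_000_000)}
print(lookup[pseudonym])  # recovers the original MRN: 004219
```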

We have also observed home grown de-identification scripts that distort the data too much, because they focus on modifying data to protect privacy without considering data utility.

A2 – Standard Automation

Standard automation means adopting tools that have been used more broadly by multiple organizations and have received scrutiny. These may be publicly available (open source) or commercial tools. The advantage of standard tools is transparency, in that their algorithms have likely been reviewed and evaluated by a larger community. Any weaknesses have been identified and flagged, and the developer(s) of the tools have been made aware of them and likely addressed them.

Data masking is not a trivial task. Developing pseudonymization or randomization schemes that are guaranteed not to be reversible (or at least to have a very low probability of reversal) requires significant knowledge in the area of disclosure control. The measurement of re-identification risk is an active area of research among statisticians, and the de-identification of indirect identifiers is a complex optimization problem. The metrics and algorithms of any tools adopted should be proven to be defensible.

Use of the De-identification Maturity Model

The Practice Dimension and Compliance

Across multiple jurisdictions, OUs at levels P1 and P2 in the Practice dimension are generally not compliant with de-identification regulations. Practice Level 3 OUs may pass an audit, but with major findings, because they will not be able to provide objective evidence that their de-identification techniques ensure that re-identification risk is “very small” (since there is no measurement in place). The outcome of an audit or an investigation will depend on how strict the auditor is, but in general, without risk measurement the data cannot defensibly be claimed to be de-identified.

In general, we recommend that Practice Level 3 should be a transitional stage to higher Practice levels.


Figure 3: An example of scoring P2-I3-A1 on the maturity matrix.

Scoring

An OU scored on the DMM has three scores: the Practice score, the Implementation score, and the Automation score. For example, an OU that has purchased a data masking tool and implemented it, and has documented the data masking process and its justifications thoroughly, would be scored at P2-I3-A2. If the masking tool was home-grown, but with the same level of documentation and training, then the score would be P2-I3-A1. This is illustrated on the maturity matrix in Figure 3. The check marks in the matrix indicate the three dimensions of the score.

The absolute minimum score to be able to make any defensible claim of compliance with current standards would be a P4-I3-A1 score.

We deliberately did not attach weights to the dimensions as we do not have a parsimonious way of doing so. Based on our experiences we would argue that an OU would be best off improving its P score first, and then focusing on Automation, followed by the Implementation. Without the appropriate de-identification practices (Practice dimension) in place, adequate privacy protection is impossible. The OU must first have reasonable de-identification practices in place. Then, automation is necessary for the de-identification of data sets of any substantial size. Finally, the Implementation dimension should receive attention. Therefore, the priority ranking is Practice > Automation > Implementation.

Improvements beyond the minimum required for compliance may be motivated by a drive to achieve certain outcomes, such as finding efficiencies, improving service quality, and reducing costs. For example, further improvements allow for the release of higher quality data, the release of larger volumes of data, faster data releases, and lower operating costs.


In large OUs there may be multiple departments and units performing de-identification, and each of these departments and units may have a different DMM score. These scores may be quite divergent, reflecting significant heterogeneity within the OU. This makes it challenging to compute or present an OU-wide score. There are three ways to approach such heterogeneity: (a) present a range for each of the three scores to reflect that heterogeneity, (b) present the average for each score, or (c) define multiple department or unit profiles and present a separate assessment for each.

The third approach requires some explanation. In a case where there are two extreme sets of practices within an OU, some very good and some very poor (according to the maturity model), it would be difficult to present a coherent picture of the whole OU. Those departments or units with high maturity scores would be represented, characterized, and described in a “Pioneers” profile. Those departments or units with low maturity scores would be represented by a “Laggards” profile. Each profile’s scores on the three DMM dimensions would be an average or a range of the scores of the departments or units represented.
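The sketch below illustrates these three reporting options on made-up department scores; the department names, the scores, and the cut-off used to separate “Pioneers” from “Laggards” are all assumptions for the example.

```python
from statistics import mean

# Hypothetical per-department scores on the three DMM dimensions (P, I, A).
department_scores = {
    "Research": {"P": 4, "I": 3, "A": 2},
    "Claims":   {"P": 2, "I": 1, "A": 1},
    "Quality":  {"P": 3, "I": 2, "A": 1},
}

for dim in ("P", "I", "A"):
    values = [s[dim] for s in department_scores.values()]
    # Option (a): report a range; option (b): report an average.
    print(f"{dim}: range {min(values)}-{max(values)}, average {mean(values):.1f}")

# Option (c): split departments into profiles and report each profile separately
# (here using a simple cut on the Practice score to define the two profiles).
pioneers = {d: s for d, s in department_scores.items() if s["P"] >= 4}
laggards = {d: s for d, s in department_scores.items() if s["P"] < 4}
```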

Process Improvement

The DMM provides a roadmap for an OU to improve its de-identification practices. The Practice dimension target for an OU that only occasionally uses and discloses data for secondary purposes would be Practice Level 4. The Practice dimension target for an OU that uses and discloses data for secondary purposes on quite a regular basis would be Practice Level 5.

Process improvement plans generally focus on all three dimensions. These may be performed simultaneously or staggered, depending on resources.

It is very reasonable for an OU to skip certain Practice Levels. Recall that the DMM characterizes the natural evolution of de-identification practices that an OU goes through. In a deliberate process improvement context an OU may skip Practice Levels to move directly to the targeted state. For example, an OU at Practice Level 1 (Ad-hoc) would not deliberately move to a still non-compliant Practice Level 2 (Masking), but directly to Practice Level 4 (Risk-based).

As noted above, the Implementation Levels (Initial, Repeatable, Defined, and Measured) are cumulative. This means that skipping Implementation Levels is not recommended (and will not work very well).

Ambiguous Cases

We consider below some grey areas or edge cases that have been encountered in practice. These are intended to help interpret the DMM.

Case 1

Assume that an OU is using a masking technique, but that technique is known to be reversible. For example, the method that is used for creating the pseudonym from a social security number or medical record number has weaknesses in that one can infer the original identifier from the pseudonym. Weaknesses with other masking techniques have been described elsewhere [1]. Would that OU be considered at Practice Level 1 (Ad-hoc) or Practice Level 2 (Masking)? In general we would consider this OU to still be at Practice Level 1, since the approach that is used for masking would be considered ad-hoc. To be a true Practice Level 2 OU, the masking methods must be known to be strong, in that they cannot be easily reverse engineered by an adversary.

Case 2

A Practice Level 2 (Masking) OU has implemented pseudonymization but has not implemented other forms of masking on direct identifiers. Would that OU still be considered at Practice Level 2? If all of the direct identifiers have not been determined and masked properly, then this OU is considered to be at Practice Level 1 (Ad-hoc). Pseudonymization by itself may not be sufficient. Also, sometimes OUs will create pseudonyms for some direct identifiers but not for others – a form of partial pseudonymization. Again, this would still keep the OU at Practice Level 1.


Case 3

A series of algorithms and checklists have been developed and implemented by an OU’s staff to de-identify indirect identifiers. The algorithms always de-identify the data sets the same way, and do not take the context into account. Because the context is not accounted for, this would be considered a Practice Level 3 (Heuristics) OU. Being able to adjust the parameters of the de-identification to account for the data use and disclosure context is important for the DMM definition of Practice Level 4 (Risk-based).

Case 4

An OU needs to disclose data to allow the data recipient to link the data set with another data set. The two data sets do not have a unique identifier in common, therefore probabilistic linkage is necessary. This means that indirect identifiers such as the patient’s date of birth, postal code, and date of admission need to be disclosed without any de-identification. The necessity of disclosing fields that are suitable for probabilistic linkage does not, by itself, mean that the data set has a “very small” risk of re-identification. Unless the data custodian can demonstrate that the measured risk is “very small,” the data set cannot be considered as having a “very small” risk of re-identification. Furthermore, there are alternative approaches one can use for secure probabilistic matching that do not require the sharing of the indirect identifiers.

Mapping to the Twelve Characteristics

In an earlier analysis we documented twelve characteristics of anonymization methodologies that are mentioned in contemporary standards, such as guidance and best practice documents from regulators [1], [5]. The following mapping indicates how the DMM maps to these criteria.

There are three conclusions that one can draw from this mapping. First, that the DMM covers the 12 characteristics and therefore it is consistent with existing standards. Second, that the DMM covers some practices that are not mentioned in the standards. There are a number of reasons for this:

• The standards describe a high maturity OU. The DMM covers low maturity as well as high maturity OUs. Hence, we discuss some of the practices we see in low maturity OUs.

• The standards do not cover some practices that we believe are critical for the effective implementation of de-identification, such as automation (the Automation dimension) and performance measurement (the I4 level). These are practical requirements that enable the scaling of de-identification. We have observed that large scale de-identification is becoming the norm because of the volume of health data that is being used and disclosed for secondary purposes.

Third, we see that the union of current standards describes a P5-I3-A1 OU. As noted above, we consider P5 practices most suitable for OUs that perform a significant amount of de-identification, and the standards do not discuss performance measurement and automation.


MAPPING THE DMM TO THE TWELVE CHARACTERISTICS

1. Is the methodology documented?
   Mapping: I3 – this is a requirement of the Implementation Defined Level.

2. Has the methodology received external or independent scrutiny?
   Mapping: P5 – this is part of governance.

3. Does the methodology require and have a process for the explicit identification of the data custodian and the data recipients?
   Mapping: P2 onwards – customizing the fields to mask or to de-identify can depend on the data recipient.

4. Does the methodology require and have a process for the identification of plausible adversaries and plausible attacks on the data?
   Mapping: P4 – this kind of practice would mostly be relevant in a risk-based de-identification approach.

5. Does the methodology require and have a process for the determination of direct identifiers and quasi-identifiers?
   Mapping: This is generally considered in the structure of the Practice dimension.

6. Does the methodology have a process for identifying mitigating controls to manage any residual risks?
   Mapping: P4 – this kind of practice would mostly be relevant in a risk-based de-identification approach.

7. Does the methodology require the measurement of actual re-identification risks for different attacks from the data?
   Mapping: P4 – this kind of practice would mostly be relevant in a risk-based de-identification approach.

8. Is it possible to set, in a defensible way, re-identification risk thresholds?
   Mapping: P4 – this kind of practice would mostly be relevant in a risk-based de-identification approach.

9. Is there a process and template for the implementation of the re-identification risk assessment and de-identification?
   Mapping: P4 – this kind of practice would mostly be relevant in a risk-based de-identification approach.

10. Does the methodology provide a set of data transformations to apply?
    Mapping: I3 – a documented methodology with appropriate training would explain the data transformations.

11. Does the methodology consider data utility?
    Mapping: P2 and P4 – usually masking methodologies have a subjective measure of data utility and risk-based ones a more objective measure of data utility.

12. Does the methodology address de-identification governance?
    Mapping: Elements of governance are covered in Practice Level 5 and also in Implementation Level 3.


Quantitative Assessment Scheme

We now discuss a scoring scheme for the maturity model. This is the scoring scheme that we have been using, and we have found that it provides results with face validity. Recall that the objective of using the DMM is often to identify weaknesses and put in place a roadmap for improvement, and a scoring scheme that supports that objective would be considered acceptable.

An assessment can be performed by the OU itself or by an external assessor. It consists of a series of questions around each of the dimensions. The response to a question is Yes (and gets a score of 4), No (and gets a score of zero), or is being planned (and gets a score of 1).

For each dimension in the DMM, start at the level 2 questions. Score the level 2 questions and take their average. If the average score is less than or equal to 3 then that OU is at level 1 on that dimension. If the OU has a score greater than 3, then proceed to the next level’s questions, and so on.
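A minimal sketch of this procedure is shown below, applied to one dimension; the function name and the example answers are assumptions, but the scoring values and the “average greater than 3” rule follow the scheme just described.

```python
from statistics import mean

# Scores from the assessment scheme: Yes = 4, No = 0, Planned = 1.
ANSWER_SCORE = {"yes": 4, "no": 0, "planned": 1}

def assess_dimension(answers_by_level):
    """Determine the maturity level on one dimension from questionnaire answers.

    `answers_by_level` maps each level (2, 3, ...) to the answers for that
    level's questions. Start at level 2; if the average score is 3 or less the
    OU stays at the level below, otherwise move on to the next level.
    """
    achieved = 1
    for level in sorted(answers_by_level):
        average = mean(ANSWER_SCORE[a] for a in answers_by_level[level])
        if average <= 3:
            break
        achieved = level
    return achieved

# Hypothetical answers for the Practice dimension.
practice_answers = {
    2: ["yes", "yes", "yes", "yes", "yes"],                   # masking fully in place
    3: ["yes", "yes", "yes", "yes", "yes", "no", "planned"],  # heuristics partially met
    4: ["no"] * 10,
}
print(assess_dimension(practice_answers))  # prints 2
```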

KEY DE-IDENTIFICATION PRACTICE DIMENSION

CRITERION (Yes / No / Planned)

Level 2 – Masking

Does the OU’s de-identification methodology require and have a process for the determination of direct identifiers and quasi-identifiers?

Does the OU have practices for either deleting direct identifiers or creating pseudonyms from unique direct identifiers (such as MRNs, health insurance numbers, SSNs or SINs)?

Does the OU have practices for the randomization of other non-unique direct identifiers?

Is data utility explicitly considered when deciding which direct identifiers to mask and how?

Does the OU require and have a process for the explicit identification of the data custodian and the data recipients?

Level 3 – Heuristics

Does the OU’s de-identification methodology require and have a process for the determination of direct identifiers and quasi-identifiers?

Does the OU have practices for either deleting direct identifiers or creating pseudonyms from unique direct identifiers (such as MRNs, health insurance numbers, SSNs or SINs)?

Does the OU have practices for the randomization of other non-unique direct identifiers?

Is data utility explicitly considered when deciding which direct identifiers to mask and how?


Does the OU require and have a process for the explicit identification of the data custodian and the data recipients?

Does the OU have a checklist of indirect identifiers to always remove or generalize (similar in concept to the US HIPAA Privacy Rule Safe Harbor list)?

Does the OU use general ‘rules-of-thumb’ to generalize or aggregate certain indirect identifiers (e.g., never release more than three characters of the postal code) that are always applied for all data releases?

Level 4 – Risk-based

Does the OU’s de-identification methodology require and have a process for the determination of direct identifiers and quasi-identifiers?

Does the OU have practices for either deleting direct identifiers or creating pseudonyms from unique direct identifiers (such as MRNs, health insurance numbers, SSNs or SINs)?

Does the OU have practices for the randomization of other non-unique direct identifiers?

Is data utility explicitly considered when deciding which direct identifiers to mask and how?

Does the OU require and have a process for the explicit identification of the data custodian and the data recipients?

Does the OU require and have a process for the identification of plausible adversaries and plausible attacks on the data?

Does the OU have a process for identifying mitigating controls to manage any residual risks?

Does the OU require the measurement of actual re-identification risks for different attacks from the data?

Is it possible to set, in a defensible way, re-identification risk thresholds?

Is there a process and template for the implementation of the re-identification risk assessment and de-identification?

Level 5 – Governance

Does the OU’s de-identification methodology require and have a process for the determination of direct identifiers and quasi-identifiers?


Does the OU have practices for either deleting direct identifiers or creating pseudonyms from unique direct identifiers (such as MRNs, health insurance numbers, SSNs or SINs)?

Does the OU have practices for the randomization of other non-unique direct identifiers?

Is data utility explicitly considered when deciding which direct identifiers to mask and how?

Does the OU require and have a process for the explicit identification of the data custodian and the data recipients?

Does the OU require and have a process for the identification of plausible adversaries and plausible attacks on the data?

Does the OU have a process for identifying mitigating controls to manage any residual risks?

Does the OU require the measurement of actual re-identification risks for different attacks from the data?

Is it possible to set, in a defensible way, re-identification risk thresholds?

Is there a process and template for the implementation of the re-identification risk assessment and de-identification?

Does the OU conduct audits of data recipients (or require third-party audits) to ensure that the conditions of data release are being satisfied?

Does the OU have an explicit process for monitoring changes in relevant regulations and precedents (e.g., court cases or privacy commissioner orders)?

Is there a process for dealing with an attempted or successful re-identification of a released data set?

Has the OU subjected its de-identification practices to external review and scrutiny to ensure that they are defensible?

Does the OU commission “re-identification testing” to ensure that realistic re-identification attacks will have a “very small” probability of success?

Does the OU monitor multiple data releases to the same recipients for overlapping variables that may increase the risk of re-identification?


IMPLEMENTATION DIMENSION

CRITERION (Yes / No / Planned)

Level 2 – Repeatable

Does the OU have someone responsible for de-identification?

Does the OU have clearly defined expected inputs and outputs for the de-identification?

Are there templates for data requests and de-identification reports, certificates, and agreements?

Is the privacy or compliance function involved in defining or reviewing the de-identification practices?

Level 3 – Defined

Is the de-identification process documented?

Does the OU have evidence the documented process is followed?

Are all analysts who need to perform or advise on de-identification activities receiving appropriate training?

Level 4 – Measured

Does the OU collect and store performance data, for example, on the number and size of de-identified data sets, the types of fields, the types of data recipients, how long de-identification takes?

Is trend analysis performed on the collected performance data to understand how performance is changing over time?

Are actions taken by the OU to improve performance based on the trend analysis?

Are satisfaction surveys of data recipients conducted and analyzed?

AUTOMATION DIMENSION

CRITERION (Yes / No / Planned)

Level 2 – Standard Automation

Does the OU use off-the-shelf data masking tools?

Does the OU use off-the-shelf data de-identification tools?

Is the functioning of the masking tools transparent (i.e., documented and reviewed)?

Is the functioning of the de-identification tools transparent (i.e., documented and reviewed)?


References

1. International Organization for Standardization, http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=60555, accessed April 2013.
2. K. El Emam, Risky Business: Sharing Health Data while Protecting Privacy. Trafford, 2013.
3. K. El Emam, “Risk-based De-identification of Health Data,” IEEE Security and Privacy, vol. 8, no. 3, pp. 64–67, 2010.
4. K. El Emam, Guide to the De-identification of Personal Health Information. CRC Press (Auerbach), 2013.
5. K. El Emam, “Heuristics for De-identifying Health Data,” IEEE Security and Privacy, pp. 72–75, 2008.
6. K. El Emam, “The Twelve Characteristics of a De-identification Methodology,” Privacy Analytics Inc.

Contact Us

Privacy Analytics, Inc.
800 King Edward Avenue, Suite 3042
Ottawa, ON, K1N 6N5

Telephone: +1.613.369.4313
Email: [email protected]

Copyright © 2013 Privacy Analytics, Inc. All Rights Reserved.

About Privacy Analytics

Privacy Analytics Inc. is a world-renowned developer of data anonymization solutions. Its proprietary, integrated de-identification and masking software PARAT enables secondary users of personal information to have granular de-identified data that retains a high level of usefulness. For health information, PARAT operationalizes the HIPAA Privacy Rule De-Identification Standard, enabling users to quickly and efficiently anonymize large quantities of data while complying with legal standards.

Privacy Analytics resulted from the commercialization of the research efforts of the Electronic Health Information Laboratory (EHIL) of the University of Ottawa, which over the past decade has produced more than 150 peer-reviewed papers. The work of EHIL has been influential in the development of regulations and guidance worldwide. With PARAT software, organizations can realize the value of the sensitive data they hold, while still protecting the privacy of individuals when conducting critical research and complex analytics. They can facilitate innovation for the improvement of society and still meet the most stringent legal, privacy and compliance regulations.

Privacy Analytics offers education dedicated to data anonymization and re-identification risk management—Sharing Personal Health Information for Secondary Purposes: An Enterprise Risk Management Framework. Privacy Analytics is proud to be recognized by the Ontario Information and Privacy Commissioner as a Privacy by Design Organizational Ambassador. The company is committed to the Privacy by Design objectives of “ensuring privacy and gaining personal control over one’s information and, for organizations, gaining a sustainable competitive advantage.”
