
© 2015 The MITRE Corporation. All rights reserved. Approved for Public Release. Case No. 15-3409

Hiding in Plain Sight:

De-identification of Clinical Narrative

Lynette Hirschman

The MITRE Corporation

October 26, 2015

Harvard Research Data Access and Innovation Symposium


Outline

■ Why De-identified Data?

■ Hiding in Plain Sight: Automated De-identification

■ Balancing Quality, Cost and Scale

■ Lessons learned



Acknowledgements


■ MITRE:

– John Aberdeen, Sam Bayer, Cheryl Clark, Ben Wellner

■ Vanderbilt:

– Brad Malin, Reyin Yeniterzi, Muqun Li

■ Group Health:

– David Carrell, D.T. Tran, David Cronkite

■ Michigan:

– David Hanauer

■ i2b2 NLP Challenge Organizers:

– Ozlem Uzuner, Peter Szolovits, and many more

The work described here is the result of long-term collaborations among MITRE, Vanderbilt, Group Health, and University of Michigan.

We also gratefully acknowledge the critical contributions of the i2b2 NLP Challenge Evaluations.


MITRE Research Focus: Unlocking Information in Free Text

■ Critical medical observations are locked in narrative (free text)

– Patient and hospital records

– Biomedical literature

– Drug indications & adverse events

■ We need to “unlock” this info

– To combine data from multiple structured & unstructured sources

– To support population level studies

■ This will enable:

– Discovery of correlations between patient genotype (genetic variation) and phenotype (e.g., disease, drug response)

– Support for precision medicine, patient safety, drug adverse events


The Strategy to Unlock the Patient Record

■ Extract Critical Medical Information from Medical Records

– Handling negation, uncertainty, temporal information

– Example with negation cues: "she is never without pain"

■ Address Privacy and Data Sharing through Automated De-identification

– MIST: MITRE Identification Scrubber Toolkit

■ Create Corpora of Medical Records for Research and Evaluation

– Shared corpora, shared evaluations (i2b2¹, SHARPn², BioCreative³)

■ Partner to Gain Access to Medical Use Cases

¹ Informatics for Integrating Biology & the Bedside (i2b2)

² Strategic Health IT Advanced Research Projects (SHARP), HHS Office of the National Coordinator for Health IT

³ Critical Assessment of Information Extraction in Biology


The Clinical Data Sharing Wall

Records with protected health information (PHI) cannot be shared due to privacy constraints

Medical record de-identification is the rate-limiting step for secondary use applications

[Diagram: unstructured medical records containing PHI sit on one side of a wall; secondary use (population studies, NLP research) sits on the other]


Protecting Privacy through De-identification

■ Find PHI using Natural Language Processing plus pattern matching techniques

■ Transform PHI

– Redact using [TYPE] replacements

– Resynthesize using surrogates => Hiding in Plain Sight

Example note (PHI slots numbered):

HISTORY OF PRESENT ILLNESS: The patient is a 77-year-old woman with long standing hypertension who presented as a walk-in to me at the (1) on (2). Recently had been started q.o.d. on Clonidine since (3) to taper off of the drug. Was told to start Zestril 20 mg. q.d. again. The patient was sent to the (4) for direct admission for cardioversion and anticoagulation, with the Cardiologist, Dr. (5) to follow.

Slot | Original PHI | Redacted | Resynthesized surrogate

(1) | Oak Valley Health Center | [FACILITY XX XX XX XX] | Sun Hill Medical Center

(2) | July 9th | [DATE - ZZ] | August 12th

(3) | May 5th | [DATE - YY] | June 8th

(4) | Smith Cardiac Unit | [FACILITY XX XX XX] | Jones Cardiac Unit

(5) | Pearson | [DOCTOR XX] | Faulkner


What Gets Redacted: HIPAA Identifiers

■ 1. Names

■ 2. Geographic subdivisions smaller than a state

■ 3. All elements of dates (except year) for dates directly related to an individual; all ages > 89

■ 4. Telephone numbers

■ 5. Fax numbers

■ 6. Electronic mail addresses

■ 7. Social security numbers

■ 8. Medical record numbers

■ 9. Health plan beneficiary numbers

■ 10. Account numbers

■ 11. Certificate/license numbers

■ 12. Vehicle identifiers and serial numbers

■ 13. Device identifiers and serial numbers

■ 14. Web Universal Resource Locators (URLs)

■ 15. Internet Protocol (IP) address numbers

■ 16. Biometric identifiers, including finger and voice prints

■ 17. Full face photographic images & comparable images

■ 18. Any other unique identifying number, characteristic, or code


What Else Gets Redacted and Other Issues

■ Person names

– Must de-identify names of patient and family members

■ Important to preserve coreference within and across records

– Not HIPAA requirement but often redacted: provider names

– Initials

■ Ages

– For longitudinal data, may be necessary to redact or “fuzz” ages

■ Dates

– Must de-identify day/month but dates can be shifted by a random (but consistent) offset, to preserve relative timing

– This becomes more complicated for longitudinal data

■ Locations

– Patient information: institution, room number, address

– Provider information: institution, address
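The date-shifting idea above (a random but consistent offset that preserves relative timing) can be sketched as follows. This is a minimal illustration, not the method used by any system in this talk: deriving the offset from a keyed hash of a patient identifier is an assumption made here for consistency across records, and the names `patient_offset`, `shift_date`, and `demo-secret` are hypothetical.

```python
import hashlib
from datetime import date, timedelta


def patient_offset(patient_id: str, secret: str, max_days: int = 365) -> timedelta:
    """Derive a stable, non-zero day offset for one patient (hypothetical scheme)."""
    digest = hashlib.sha256(f"{secret}:{patient_id}".encode()).digest()
    days = int.from_bytes(digest[:4], "big") % max_days + 1  # 1..max_days, never 0
    return timedelta(days=days)


def shift_date(d: date, patient_id: str, secret: str) -> date:
    """Shift a date by the patient's offset; every date for that patient moves equally."""
    return d + patient_offset(patient_id, secret)


admitted = date(2000, 7, 12)
discharged = date(2000, 7, 26)
shifted_admitted = shift_date(admitted, "patient-001", "demo-secret")
shifted_discharged = shift_date(discharged, "patient-001", "demo-secret")

# The length of stay is unchanged even though both dates moved.
print(shifted_discharged - shifted_admitted)
```

Because the same offset is applied to every date for a patient, intervals such as length of stay survive the shift, which is the property the slide calls preserving relative timing.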


Automated De-identification: Two Approaches

■ Pattern Matching approach

– System applies institution-specific name lists and hand-crafted pattern rules to produce automatically de-identified notes

– Large maintenance effort requiring skilled developers

■ Machine Learning-based NLP approach

– Trainer builds a model from annotated training notes; decoder applies the model to produce automatically de-identified notes

– Need to develop annotated training data
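A minimal sketch of the pattern matching approach, assuming two hand-written regular expressions and a tiny name list. The patterns, the `DOCTOR_NAMES` list, and the `redact` function are illustrative assumptions, not rules taken from any real system.

```python
import re

# Hand-crafted pattern rules (illustrative only).
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE": re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? "
        r"\d{1,2}(?:st|nd|rd|th)?\b"
    ),
}

# Stand-in for an institution-specific name list.
DOCTOR_NAMES = ["Pearson", "Faulkner"]


def redact(text: str) -> str:
    """Replace every pattern or name-list match with a [TYPE] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    for name in DOCTOR_NAMES:
        text = re.sub(rf"\b{re.escape(name)}\b", "[DOCTOR]", text)
    return text


sample = "Seen on July 9th by Dr. Pearson; call 555-123-4567."
redacted = redact(sample)
print(redacted)
```

Even this toy version hints at the maintenance burden: every new date format or staff change means another rule or list entry.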


Third Approach: Extract Clinical Info, Leave Behind PHI

■ Berman¹:

– Redact everything except stop words, UMLS concepts

■ Morrison²:

– Apply MedLEE clinical NLP system to extract medical concepts

■ eMERGE phenotype extraction

– Pool patients with a specific phenotype for statistical power

– PheKB³ has tools to enable cross-site collaboration for algorithm development, validation, and sharing for reuse

[Diagram: a medical concept extraction system, backed by a medical thesaurus, outputs the medical concepts associated with each patient (e.g., hypertension, COPD, diabetes)]

¹ Berman JJ. Arch Pathol Lab Med 2003, 680-6

² Morrison FP, et al. J Am Med Inform Assoc 2009, 16(1):37-9

³ https://www.phekb.org/
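The "keep only stop words and medical concepts" idea can be illustrated with a toy sketch. The two word sets below are tiny hypothetical stand-ins (a real system would consult a stop-word list and a resource such as UMLS), and `keep_only_safe_tokens` is a name invented for this example.

```python
import re

# Toy vocabularies; assumptions for illustration only.
STOP_WORDS = {"the", "is", "a", "with", "of", "and", "to", "was", "for", "by"}
MEDICAL_TERMS = {"hypertension", "copd", "diabetes", "cardioversion", "anticoagulation"}


def keep_only_safe_tokens(text: str, placeholder: str = "***") -> str:
    """Keep punctuation, stop words, and known medical terms; mask everything else."""
    kept = []
    for token in re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", text):
        if not token.isalnum():
            kept.append(token)  # punctuation passes through
        elif token.lower() in STOP_WORDS or token.lower() in MEDICAL_TERMS:
            kept.append(token)
        else:
            kept.append(placeholder)
    return " ".join(kept)


scrubbed = keep_only_safe_tokens("Mr. Smith was seen for hypertension and diabetes.")
print(scrubbed)
```

The design inverts the usual problem: instead of finding PHI, it whitelists what is known to be safe, so unknown words (including names) are masked by default.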


MITRE Identification Scrubber Toolkit (MIST)

■ Machine-learning approach: learn models from annotated sample notes

■ De-identification system built by annotating training examples and building models

– Contrasts with pattern matching approaches which require hand-crafting of patterns and manual maintenance

■ System includes a resynthesis component

– Replaces PHI with surrogate (made-up) identifiers of the appropriate type for “Hiding in Plain Sight”

[Diagram: trainer => model => decoder => automatically de-identified notes]


Sources of Evidence for Identifying PHI

■ Carafe Conditional Random Fields

– Probabilistic framework for assigning a label sequence to a sequence of observations

■ Sources of evidence for Carafe

– Lexical features (the words themselves)

– Contextual features (nearby words, punctuation, other tags)

■ “Dr.” and “Attending:” good evidence that next word begins a DOCTOR phrase

– Frequency

■ “of” is a high frequency word, and never appears as part of DOCTOR in training data

Example: Dr. Right of the City Hospital. Attending: LIYANI RIGHT , M.D. IH85 YF132/0184

[Figure: tokens in the example are labeled with the kind of evidence they supply: lexical, contextual, or frequency]
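The three kinds of evidence above can be sketched as a per-token feature function of the sort a sequence labeler consumes. This is not Carafe's feature set: the feature names and `token_features` are hypothetical, and the frequency here is counted within the sample sentence rather than over a training corpus.

```python
from collections import Counter


def token_features(tokens: list, i: int, freq: Counter) -> dict:
    """Build lexical, contextual, and frequency features for tokens[i]."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<START>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "<END>"
    return {
        "lexical.word": word.lower(),
        "lexical.is_capitalized": word[:1].isupper(),
        "context.prev": prev_word.lower(),
        "context.next": next_word.lower(),
        # "Dr." and "Attending:" suggest the next word starts a DOCTOR phrase.
        "context.prev_is_title_cue": prev_word in {"Dr.", "Attending:"},
        "frequency.count": freq[word.lower()],
    }


tokens = "Dr. Right of the City Hospital . Attending: LIYANI RIGHT , M.D.".split()
freq = Counter(t.lower() for t in tokens)
features = [token_features(tokens, i, freq) for i in range(len(tokens))]

print(features[1])  # features for "Right", which follows "Dr."
```

A conditional random field would learn weights over features like these jointly with the neighboring labels, rather than deciding each token in isolation.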


MIST: MITRE Identification Scrubber Toolkit


Released 2008 as Open Source: http://mist-deid.sourceforge.net/

The MIST team: John Aberdeen, Sam Bayer, Cheryl Clark,

Lynette Hirschman, Ben Wellner


Resynthesis

[Diagram: Find Redactions => Determine Type => Randomly Generate Replacement => Transform; replacement sources: Census (People), Gazetteer (Places), Institution Names]

■ Problem: Because of privacy considerations, developers only have access to scrubbed data in many instances

– Often the scrubbed data only has category placeholders, e.g., [DATE], [DOCTOR], [PATIENT], etc. where protected health information (PHI) appeared in the original data

■ To create usable training data, it is necessary to resynthesize PHI in a manner such that the resulting data is realistic

■ Generate data using pattern distribution from original corpus, external distribution of patterns, input PHI, or feature probabilities
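The resynthesis steps (find redactions, determine type, randomly generate a replacement) can be sketched as below. This is a simplified illustration, not MIST's resynthesis engine: the `SURROGATES` lists are tiny stand-ins for census, gazetteer, and institution-name resources, and `resynthesize` is a hypothetical name.

```python
import random
import re

# Tiny stand-ins for census, gazetteer, and institution lists (assumptions).
SURROGATES = {
    "DOCTOR": ["Faulkner", "Pearson", "Jones"],
    "FACILITY": ["Sun Hill Medical Center", "Oak Valley Health Center"],
    "DATE": ["August 12th", "June 8th", "May 5th"],
}


def resynthesize(text: str, rng: random.Random) -> str:
    """Replace each [TYPE] placeholder with a random surrogate of that type."""

    def replace(match: re.Match) -> str:
        choices = SURROGATES.get(match.group(1))
        # Unknown placeholder types are left untouched.
        return rng.choice(choices) if choices else match.group(0)

    return re.sub(r"\[([A-Z]+)\]", replace, text)


scrubbed = "Seen at [FACILITY] on [DATE] by Dr. [DOCTOR]."
filled = resynthesize(scrubbed, random.Random(0))
print(filled)
```

Passing in a seeded `random.Random` keeps the sketch reproducible; sampling uniformly from short lists is exactly the kind of regularity that can make resynthesized text less varied than real notes.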


How to Build a De-identification System

■ 1. Mark PHI by hand in initial documents

■ 2. Train model from initial documents

■ 3. Mark PHI automatically in more documents using model

■ 4. Hand-correct automatically marked documents

■ 5. Train (better) model from (more) documents; repeat steps 3 to 5 as needed

■ 6. Redact or resynthesize marked documents


MIST 2.0 User Interface

[Screenshot of the MIST 2.0 interface with panels: Automatically Tagged Document, Redacted Document, Candidate Replacements, Legend]


Balancing Redaction and Interpretability

Protect Privacy AND Retain Clinical Value

[Figure: the example note shown with original PHI (e.g., May 5th, Smith Cardiac Unit, Pearson) and with redaction tags ([FACILITY XX XX XX XX], [DATE ZZ], [DATE YY], [FACILITY XX XX XX], [DOCTOR XX], [NAME XX])]


De-Identification Metrics

■ Standard measures from the NLP community

– Precision (positive predictive value): # correctly tagged phrases returned / # phrases returned

– Recall (sensitivity): # correctly tagged phrases returned / # phrases present

– Balanced F-measure: harmonic mean of precision & recall: 2*P*R / (P + R)

■ Other useful metrics

– Leaks: non-redacted PHI remaining after de-identification: # false negatives / # phrases present

– Token level accuracy (computed over words or tokens): (# true positives + # true negatives) / # total tokens
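The metric definitions above translate directly into a few lines of arithmetic. The sketch below assumes counts of true positives, false positives, false negatives, and true negatives are already available; `deid_metrics` and the example counts are made up for illustration.

```python
def deid_metrics(tp: int, fp: int, fn: int, tn: int = 0) -> dict:
    """Compute the slide's metrics from raw counts."""
    returned = tp + fp  # phrases the system tagged
    present = tp + fn   # phrases that are truly PHI
    total = tp + fp + fn + tn
    precision = tp / returned if returned else 0.0
    recall = tp / present if present else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "leak_rate": fn / present if present else 0.0,  # equals 1 - recall
        "token_accuracy": (tp + tn) / total if total else 0.0,
    }


m = deid_metrics(tp=90, fp=10, fn=10, tn=890)
print(m)
```

Note that token-level accuracy is dominated by the many non-PHI tokens, so it can look high even when the leak rate is not small.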


How Good is Automated De-identification?

■ This depends on:

– Note type

– PHI density

– Amount of training data

– Alignment of training data with note types to be redacted

– Desired balance of recall and precision

– Available budget to create training data

■ How good does automated de-identification need to be?

– This depends on use case, e.g.,

■ For unlimited data release, no leaks (high recall)!

■ For limited data release, may prefer high precision


Results from i2b2/UT Health Corpus 2014 De-Identification Evaluation

Aggregate statistics for token-based evaluation of all submissions from 10 teams, HIPAA-identified PHI categories

Evaluated on 1304 longitudinal medical records describing 296 patients

Metric | Minimum | Mean | Median | Maximum | Std. deviation

Micro precision | 0.744 | 0.941 | 0.967 | 0.989 | 0.068

Micro recall | 0.108 | 0.807 | 0.882 | 0.963 | 0.211

Micro F1 | 0.193 | 0.853 | 0.922 | 0.976 | 0.184

Stubbs A, Uzuner O. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. J Biomed Inform. 2015 Aug 28. pii: S1532-0464(15)00182-3. doi: 10.1016/j.jbi.2015.07.020.


Results from Deleger et al:

Results from tests on an in-house de-identification system at Cincinnati Children’s Hospital

Corpus of 250 notes from Cincinnati Children’s Hospital Medical Center (22 note types)

PHI Type | P | R | F

AGE | 96.69 | 90 | 93.22

DATE | 97.95 | 96.98 | 97.46

ID | 97.27 | 95.7 | 96.48

INST | 93.25 | 85.26 | 89.08

LOC | 87.38 | 69.95 | 77.7

NAME | 94.47 | 94.56 | 94.52

OTHER | 84.68 | 73.87 | 78.91

PHONE | 94.42 | 90.87 | 92.61

All | 95.08 | 91.92 | 93.48

L. Deleger, T. Lingren, Y. Ni, M. Kaiser, L. Stoutenborough, K. Marsolo, M. Kouril, K. Molnar, I. Solti, Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research, J. Biomed. Inform. 50 (2014) 173–183, http://dx.doi.org/10.1016/j.jbi.2014.01.014.


Variation due to Note Type

Lab Vital Signs:

■ Short

■ Minimal text

■ Highly structured

– Mostly header info

Discharge Summary:

■ Much more free, narrative text

■ PHI elements throughout

■ Many, varied contexts for PHI elements

DISCHARGE SUMMARY

ATTENDING PHYSICIAN

**NAME[XXX WWW], M.D.

PATIENT: **NAME[AAA, BBB] MR NO:**ID-NUM

ADMITTED: **DATE[Jul 12 2000] DISCHARGED:**DATE[Jul 26 2000]

SERVICE: Neurosurgery

PRINCIPAL DIAGNOSIS: Malesuada quis, egestas quis, wisi. Donec ac sapien.

SECONDARY DIAGNOSIS: Ut orci.

PRINCIPAL PROCEDURES: Duis ultricies, metus a feugiat porttitor.

REASON FOR HOSPITALIZATION: Morbi lorem mi **AGE[in 60s]-year-old female who presented to the E.R.

on **DATE[Jul 12 00] eget purus vitae eros ornare adipiscing at 8:00 p.m. that evening. Vestibulum imperdiet

nonummy sem from 8:00-9:00 p.m. Fusce urna magna,neque eget lacus. Maecenas felis nunc, aliquam ac

sapien. Ut orci. Duis ultricies, metus a feugiat porttitor, dolor mauris convallis est, quis mattis lacus ligula eu

augue. Sed facilisis. Morbi lorem mi, tristique vitae, sodales eget, hendrerit sed, erat. Vestibulum. Fusce urna

magna day thirteen, **DATE[Jul 26 00], tincidunt quis Malesuada quis.

DISCHARGE INSTRUCTIONS: Curabitur nunc eros, euismod in, convallis at, vehicula sed consectetuer

posuere, eros mauris dignissim diam, pretium sed pede suscipit. Fusce urna magna,lorem neque eget lacus.

DISCHARGE MEDICATIONS: Steroid taper, Pepcid 20 mg b.i.d. while she is taking her steroids, nimodipine

60 mg p.o. q. 4 hr for an additional 9 days and Dilantin 200 mg p.o. t.i.d.

**NAME[AAA, BBB]

MR NO:**ID-NUM

FOLLOWUP APPOINTMENTS: Morbi lorem miwith Dr. **NAME[WWW] in the **INSTITUTION in one month.

DW/dts/E/0/323 , M.D.

T: **DATE[Jul 27 2000] 1955 **NAME[ZZZ YYY], M.D.

**NAME[M]: **DATE[Jul 27 2000] 1611 Resident, **NAME[VVV UUU TTT]

Job #: **ID-NUM

MD#:**ID-NUM

:E_O_R:

LAB VITAL SIGNS

S_O_H

Counters Report Type

355,XLl+ApzVmFgU LAB

E_O_H

[Report de-identified (Safe-harbor compliant)

by De-ID v.6.12]

:ADM:01H

:UNIQ:0chs11eo0

:TYP:LAB

:STYP:VS

:ID:**ID-NUM

:NA:**NAME[AAA, BBB CCC]

:DAT:**DATE[Jul 02 2001]

:TIM:17:40

:ACC:1

:PQ:resp get.data

:DOB:**DATE[Nov 13 1938]

:DOC:**NAME[ZZZ, YYY]

:LOC:1123-0

:LAB:BAT: 0 VSign **ID-NUM

DAT: 0 Pulse 0 bpm . 99

DAT: 0 RespRt 0 bpm . 21

DAT: 0 CollBy 0 . . **NAME[XXX WWW], CRT

:E_O_R:


Effect of Training Data on Performance

■ Note type portability

– Four types: Letter, Discharge, Order, Lab

– Train individual models for each note type

– Evaluate all models against all note types

■ Hybrid model

– Train one large model using all four note types

– Evaluate hybrid model against all note types

[Diagrams: (1) training docs of each type (Letter, Discharge, Order, Lab) yield one model per type, each tested on test docs of all four types; (2) training docs of all four types yield a single hybrid model, tested on test docs of all four types]

Building the Right Model

■ Best results when model is trained from notes of matching type

■ Hybrid model is comparable to matched model¹

– Combines notes from all types

– Has more data

¹ R. Yeniterzi, J. Aberdeen, S. Bayer, B. Wellner, L. Hirschman, and B. Malin, Effects of personal identifier resynthesis on clinical text de-identification, JAMIA, vol. 17, no. 2, p. 159, 2010.


Writing Complexity (Vanderbilt)*

■ Motivation:

– How to deal with hundreds of distinct note types

■ Idea:

– Automatically cluster notes based on writing complexity, and train

models for these clusters (rather than for individual note types)

■ Vanderbilt experiments

– 4500 Vanderbilt patient notes of various types

– Average F-measure results:

■ 0.917 for writing complexity clusters

■ 0.911 for note type clusters; range from 0.842 (Family History) to 0.964 (Clinical Communication)

■ 0.811 for random clusters

– Published in the International Journal of Medical Informatics

*Li, M., Carrell, D., Aberdeen, J., Hirschman, L. and Malin, BA. De-identification of Clinical Narratives through Writing Complexity Measures, Int J Med Inform. 2014 Oct;83(10):750-67. doi: 10.1016/j.ijmedinf.2014.07.002. Epub 2014 Jul 24. http://www.ncbi.nlm.nih.gov/pubmed/?term=Li+Malin+Aberdeen


Effect of Resynthesis on Model Performance

■ MIST evaluated against Vanderbilt hospital records

■ Explore effect of resynthesis on model portability

– Train/Test with all combinations of original (O) x resynthesized (R) data

[Diagram: original notes (real IDs) are de-identified and resynthesized; O models are trained on original notes and R models on resynthesized notes, then tested in matched conditions (O => O, R => R) and in mismatched conditions. The original data was tested at Vanderbilt and was unavailable to MITRE.]

R. Yeniterzi, J. Aberdeen, S. Bayer, B. Wellner, L. Hirschman, and B. Malin, Effects of personal identifier resynthesis on clinical text de-identification, JAMIA, vol. 17, no. 2, p. 159, 2010.


Recall Results from Different Models

[Bar chart: recall (0 to 1) for the DS, LAB, LETTER, ORDER, and HYBRID models under four train => test conditions: O => O and R => R (matched), O => R and R => O (mismatched); a reference line marks 95% recall]

Results are dramatic:

Recall for matched models > 95% across note types

Model based on resynthesized data works BADLY for real data!

Resynthesis regularizes data, resulting in a poorer match against real data


Experiment: Training Size

■ Models trained from larger numbers of documents tend to outperform models trained on fewer documents

■ Question: how many training items are needed to obtain good performance?

[Diagram: training docs => models => test docs]


Training Items and F-measure

[Line chart: token F-measure (0 to 100) versus number of training items (log scale, 1 to 10000), one curve per PHI type: AGE, DATE, ID-NUM, NAME, PLACE]


How Good Are Humans?

■ Measure 1: Leakage

How much PHI is leaked with human de-identification?

– Humans are not perfect: recall rates in the 94-98% range

■ Measure 2: Inter-annotator agreement

How well do humans agree when they tag PHI?

– i2b2/UT corpus 2014¹: Tag-level based F: 0.90

– Deleger et al.²: Tag-level based F: 91.76

Human de-identification is in the same range as automated de-identification

To improve human de-identification, use more humans!

To improve automated de-identification, use more training data

¹ Stubbs A, Uzuner O. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. J Biomed Inform. 2015 Aug 28. pii: S1532-0464(15)00182-3. doi: 10.1016/j.jbi.2015.07.020.

² L. Deleger, T. Lingren, Y. Ni, M. Kaiser et al., Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research, J. Biomed. Inform. 50 (2014) 173–183.


Human Annotation: Counts of Missed PHI¹

[Bar chart: number of leaked (overlooked) PHI instances in 100 Family Practice notes containing 1,093 PHI instances, when reviewed by individuals, pairs, or trios of reviewers]

¹ From: Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification. D. S. Carrell; D. J. Cronkite; B. A. Malin; J. S. Aberdeen; L. Hirschman, submitted to Meth of Info in Medicine


Manual De-identification of Clinical Notes¹

■ Task: De-identification of free text in clinical notes of electronic health records

– Removal of Protected Health Information (PHI): 18 classes of information per US HIPAA regulations

– Annotators identify and redact all types of PHI in a patient note

■ Corpus

– 100 clinical records were de-identified by 4 annotators

– 1093 PHI instances total (~10 instances per note)

■ Estimated steady state cost

– Cost: $7.50/patient note/annotator; $0.70/annotation

– Quality: 95% recall (single annotator)

– Cost for 99% recall: $15.00/patient note (2 annotators)

¹ From: Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification. D. S. Carrell; D. J. Cronkite; B. A. Malin; J. S. Aberdeen; L. Hirschman, submitted to Meth of Info in Medicine

Results from i2b2/UT Health Corpus 2014

■ Longitudinal corpus of 301 patients

– Average: ~617 tokens (words) per record

■ Total time: 310 h for double annotation

– 30 min/patient for single annotation

■ Adjudication: 2 months part time

■ Estimated cost @ $30/hour for curation:

– $15.00 per record for double annotation

– Consistent with results from Carrell et al.

Stubbs A, Uzuner O. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. J Biomed Inform. 2015 Aug 28. pii: S1532-0464(15)00182-3. doi: 10.1016/j.jbi.2015.07.020.


Estimating the Cost of De-identification

■ Manual: Cost of de-identification scales linearly with

– # of annotators

– # of notes

■ Automated: Cost of de-identification depends on

– Quantity of (matched) training data used

– Target recall/precision levels

■ E.g., for recall >90%, need 100-1000 exemplars of each type

■ However, single annotator is generally good enough

■ Example:

– De-identify 1000 notes at recall > 95%

■ Manual cost: 1000 notes * $7.50/note/annotator = $7500

■ Automated cost: 250 notes * $7.50/note/annotator = $1875
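The cost comparison above is simple arithmetic, sketched here with the slide's figures. The function names are hypothetical, and the sketch assumes, as the example does, that the only automated cost is hand-annotating the training notes.

```python
COST_PER_NOTE_PER_ANNOTATOR = 7.50  # dollars, from the manual de-identification study


def manual_cost(n_notes: int, n_annotators: int = 1) -> float:
    """Manual cost grows linearly with both notes and annotators."""
    return n_notes * n_annotators * COST_PER_NOTE_PER_ANNOTATOR


def automated_cost(n_training_notes: int, n_annotators: int = 1) -> float:
    """Automated cost is the cost of annotating the training notes only."""
    return n_training_notes * n_annotators * COST_PER_NOTE_PER_ANNOTATOR


manual = manual_cost(1000)        # every one of the 1000 notes is hand-annotated
automated = automated_cost(250)   # only 250 training notes are hand-annotated
print(manual, automated)
```

The gap widens with scale: the manual figure keeps growing with the number of notes, while the automated figure stays fixed once the training set is annotated.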


Bootstrapping a System: Results¹

■ Experiment performed by one of our medical partners using MIST

■ Small amount of hand-labeling (~8 hours)

– F-measure: 0.953

■ At $250/hr for physician time, this is a $2000 effort

■ Performance improves with more hand-labeling

[Line charts: token F-measure (0 to 100) versus cumulative person-hours (0 to 9), with curves for all PHI types, NAME, and DATE]

¹ D. Hanauer, J. Aberdeen, S. Bayer, B. Wellner, et al., Bootstrapping a de-identification system for narrative patient records: Cost-performance tradeoffs, International Journal of Medical Informatics, vol. 82, no. 9, pp. 821–831, Sep. 2013


Resynthesis and Hiding in Plain Sight

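[Figure: the example note shown with its original PHI (Oak Valley Health Center, July 9th, May 5th, Smith Cardiac Unit, Pearson) and with resynthesized surrogates (Sun Hill Medical Center, August 12th, June 8th, Faulkner). “Smith Cardiac Unit” remains in the resynthesized version, labeled: Resynth Error (but who can tell?)]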


Preliminary Study of HIPS Detection1

■ Experiment

– De-identify 100 family practice notes using a de-identification system

– Replace PHI with realistic surrogates (“Hiding in Plain Sight”)

– Ask reviewers to:

■ Identify PHI in the surrogatized notes

■ Guess which PHI instances are “leaks”

■ Getting enough leaks to test hypothesis

– Originally not enough residuals; used model trained for different note type to increase residuals

■ Measured ability of reviewers to detect HIPS PHI

– Detection rate is fairly low: 5-10% of residual leaks found

– Reduced risk of disclosure yielded effective de-id rates >99%, better than manual methods

¹ D. Carrell, B. Malin, J. Aberdeen, S. Bayer, C. Clark, B. Wellner, and L. Hirschman, Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text, JAMIA, vol. 20 (2), pp. 342–348, 2012


Hiding In Plain Sight: Example



Guessing the Leaks after HIPS



Results from Initial HIPS Experiment

Rates of residual PHI detection in 50 family practice notes de-identified by HIPS

Test corpus columns: # PHI instances, # residual PHI. Reviewer columns: Predictions, Correct, Recall, Precision.

PHI type | # PHI instances | # residual PHI | Reviewer #1 (abstractor): Predictions | Correct | Recall | Precision | Reviewer #3 (informaticist): Predictions | Correct | Recall | Precision

Patient Name | 59 | 32 | 2 | 1 | 0.03 | 0.5 | 5 | 4 | 0.13 | 0.8

Date | 228 | 34 | 0 | 0 | 0 | 0 | 8 | 2 | 0.06 | 0.25

ALL | 287 | 66 | 2 | 1 | 0.02 | 0.5 | 13 | 6 | 0.09 | 0.46

In this experiment, more than 85% of leaks were protected by HIPS

Carrell D, Malin B, Aberdeen J, Bayer S, Clark C, Wellner B, Hirschman L. Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text. J Am Med Inform Assoc. 2013 Mar-Apr;20(2):342-8. doi: 10.1136/amiajnl-2012-001034. Epub 2012 Jul 6.


How PHI Leaks Were Detected

Each item gives the explanation of why the PHI was detected, followed by the mitigation strategy.

(A) PHI = patient first name. The first name was the salutation of a formal letter pasted into the chart, and other instances of the patient's full name didn't match.

– Mitigation: Propagate name surrogates to all instances

(B) PHI = patient first name. Chart says "Patient is East Indian." Last name (only) was replaced with a European surrogate, making the first name stand out.

– Mitigation: Match name surrogates by national origin

(C) PHI = patient age. Patient age appeared twice, both times = 32. Surrogate generator replaced one with 26, the other with 32 (random); reviewer guessed.

– Mitigation: Use non-zero offsets in surrogates & apply consistently

(D) PHI = smoking quit date. Date format was "Sep 2010", which reviewer guessed was human (not machine) generated.

– Mitigation: Add spelling and formatting errors to surrogates


Replicating Human Detection Experiment

■ Ground truth data set

– Group Health Family Practice notes were tagged for PHI (reviewed and adjudicated) to create a “ground truth” data set

■ Preparation of surrogatized corpus

– The same set of notes was run through an automated de-identification system (MIST) to produce surrogatized output with leaks

■ Human readers were asked to find leaks in the surrogatized corpus

– Separate teams of readers from two sites (Group Health, Vanderbilt) tagged PHI in surrogatized documents

– The annotators were then asked to guess which PHI might be leaks


Results from Human Detection Experiment

Readers from Vanderbilt, Group Health reviewed tagged, surrogatized notes and guessed which PHI were leaks

Reader | Guess | Correct | # PII | # Leak | Guess Precision | Leak Recall | Effective Recall w HIPS

R1 | 37 | 32 | 909 | 109 | 0.86 | 0.29 | 0.97

R6 | 130 | 48 | 909 | 109 | 0.37 | 0.44 | 0.95

R7 | 28 | 21 | 909 | 109 | 0.75 | 0.19 | 0.98

R8 | 127 | 52 | 909 | 109 | 0.41 | 0.48 | 0.94

R9 | 35 | 17 | 909 | 109 | 0.49 | 0.16 | 0.98

Base system recall was 88%

Results show HIPS is ~50% to 80% effective in disguising leaks, depending on reader

HIPS can cut leak rate in half!
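The reader statistics above can be recomputed from the raw counts. The sketch below is a reconstruction, not the authors' published calculation: in particular, treating effective recall as the share of PHI that was either removed or left undetected by the reader is an assumption, and values computed this way may differ from the slide's in the last rounded digit. `hips_summary` is a hypothetical name.

```python
def hips_summary(guesses: int, correct: int, total_phi: int, leaks: int) -> dict:
    """Summarize one reader's attempt to spot leaks in a HIPS corpus."""
    return {
        # Share of the reader's guesses that were real leaks.
        "guess_precision": correct / guesses,
        # Share of the real leaks that the reader found.
        "leak_recall": correct / leaks,
        # Recall of the de-identification system before surrogates are considered.
        "base_recall": (total_phi - leaks) / total_phi,
        # Assumed definition: PHI that was removed or went undetected.
        "effective_recall": (total_phi - correct) / total_phi,
    }


r7 = hips_summary(guesses=28, correct=21, total_phi=909, leaks=109)
print({name: round(value, 2) for name, value in r7.items()})
```

Under this reading, a leak only counts against the system when a reader actually identifies it, which is why effective recall sits well above the 88% base recall.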


Machine Attack Experiments

■ Objective:

– Assess the ability of a highly-motivated hostile attacker to

detect residual PHI in a HIPS de-identified corpus

■ Scenario:

– Publisher creates and releases a HIPS de-identified corpus

– Attacker divides corpus into train and test, and trains a model

■ Attacker (manually) annotates all PHI found in the training corpus

■ Attacker applies model to test set

■ Attacker manually inspects output – assumes PHI NOT tagged by

model are real leaks

■ Experiments underway at Vanderbilt and GHC

– Effectiveness of a machine attack is being evaluated through

simulation and comparison with actual corpus statistics



Observations

■ All de-identification methods “leak”

– Human de-identification recall is in the same range as automated de-identification

■ Automated de-identification has advantages

– It can scale cheaply

■ Cost is in preparation of training data

■ Costs estimated to be on the order of a few thousand dollars, depending on diversity of note types

– It provides protection comparable to manual de-identification, especially with use of “Hiding in Plain Sight”

■ But automated de-identification is still not widely used

– Only 2 out of 8 IRBs surveyed relied solely on automated de-identification

– Only 2 of 8 respondents said they would approve use of de-identification software if the system removed 99% of PHI


How to Scale the Data Wall

■ Change the cost-quality balance for automated de-identification

– Hiding in Plain Sight reduces “leakage” at no extra cost

– Ensemble taggers (combining output from multiple taggers) may

also reduce leaks at no extra cost

■ Lower the barriers to IRB approval

– Develop metrics relevant to IRB concerns: current metrics

overestimate leakage -- not all leaks are the same

■ Understand the cost of NOT sharing data

– Not sharing impedes precision medicine, population based

studies, pharmacovigilance

– Lack of clinical data impedes development of algorithms to mine

narrative data in patient medical records




Thank you!

Questions?
