© 2015 The MITRE Corporation. All rights reserved. Approved for Public Release. Case No. 15-3409
Hiding in Plain Sight:
De-identification of Clinical Narrative
Lynette Hirschman
The MITRE Corporation
October 26, 2015
Harvard
Research Data Access
and Innovation
Symposium
Outline
■ Why De-identified Data?
■ Hiding in Plain Sight: Automated De-identification
■ Balancing Quality, Cost and Scale
■ Lessons learned
Acknowledgements
■ MITRE:
– John Aberdeen, Sam Bayer, Cheryl Clark, Ben Wellner
■ Vanderbilt:
– Brad Malin, Reyin Yeniterzi, Muqun Li
■ Group Health:
– David Carrell, D.T. Tran, David Cronkite
■ Michigan:
– David Hanauer
■ i2b2 NLP Challenge Organizers:
– Ozlem Uzuner, Peter Szolovits, and many more
The work described here is the result of long-term collaborations
among MITRE, Vanderbilt, Group Health, and the University of Michigan.
We also gratefully acknowledge the critical contributions of
the i2b2 NLP Challenge Evaluations.
MITRE Research Focus: Unlocking Information in Free Text
■ Critical medical observations are locked in narrative (free text):
– Patient and hospital records
– Biomedical literature
– Drug indications & adverse events
■ We need to “unlock” this info
– To combine data from multiple structured & unstructured sources
– To support population level studies
■ This will enable:
– Discovery of correlations between patient genotype (genetic variation) and phenotype (e.g., disease, drug response)
– Support for precision medicine, patient safety, drug adverse events
The Strategy to Unlock the Patient Record
■ Extract Critical Medical Information from Medical Records
– Handling negation, uncertainty, temporal information
– Example (negation cues): "she is never without pain"
■ Address Privacy and Data Sharing through Automated De-identification
– MIST: MITRE Identification Scrubber Toolkit
■ Create Corpora of Medical Records for Research and Evaluation
– Shared corpora, shared evaluations (i2b2¹, SHARPn², BioCreative³)
■ Partner to Gain Access to Medical Use Cases
¹ Informatics for Integrating Biology & the Bedside (i2b2)
² Strategic Health IT Advanced Research Projects (SHARP), HHS Office of the National Coordinator for Health IT
³ Critical Assessment of Information Extraction in Biology
The Clinical Data Sharing Wall
Records with protected health information (PHI) cannot be shared due to privacy constraints
Medical record de-identification is the rate-limiting step for secondary use applications
[Figure: a wall separates unstructured medical records containing PHI from secondary use (population studies, NLP research)]
Protecting Privacy through De-identification
■ Find PHI using Natural Language Processing plus pattern matching techniques
■ Transform PHI
– Redact using [TYPE] replacements
– Resynthesize using surrogates => Hiding in Plain Sight
Example note (original PHI shown in place):
HISTORY OF PRESENT ILLNESS: The patient is a 77-year-old-woman
with long standing hypertension who presented as a walk-in to me at
the Oak Valley Health Center on July 9th. Recently had been
started q.o.d. on Clonidine since May 5th to taper off of the drug.
Was told to start Zestril 20 mg. q.d. again. The patient was sent to
the Smith Cardiac Unit for direct admission for cardioversion and
anticoagulation, with the Cardiologist, Dr. Pearson to follow.
Original PHI              Redacted                 Resynthesized surrogate
Oak Valley Health Center  [FACILITY XX XX XX XX]   Sun Hill Medical Center
July 9th                  [DATE - ZZ]              August 12th
May 5th                   [DATE - YY]              June 8th
Smith Cardiac Unit        [FACILITY XX XX XX]      Jones Cardiac Unit
Pearson                   [DOCTOR XX]              Faulkner
What Gets Redacted: HIPAA Identifiers
■ 1. Names
■ 2. Geographic subdivisions smaller than a state
■ 3. All elements of dates (except year) for dates directly related to an individual; all ages > 89
■ 4. Telephone numbers
■ 5. Fax numbers
■ 6. Electronic mail addresses
■ 7. Social security numbers
■ 8. Medical record numbers
■ 9. Health plan beneficiary numbers
■ 10. Account numbers
■ 11. Certificate/license numbers
■ 12. Vehicle identifiers and serial numbers
■ 13. Device identifiers and serial numbers
■ 14. Web Universal Resource Locators (URLs)
■ 15. Internet Protocol (IP) address numbers
■ 16. Biometric identifiers, including finger and voice prints
■ 17. Full face photographic images & comparable images
■ Any other unique identifying number, characteristic, or code
What Else Gets Redacted and Other Issues
■ Person names
– Must de-identify names of patient and family members
  ■ Important to preserve coreference within and across records
– Not HIPAA requirement but often redacted: provider names
– Initials
■ Ages
– For longitudinal data, may be necessary to redact or “fuzz” ages
■ Dates
– Must de-identify day/month, but dates can be shifted by a random (but consistent) offset to preserve relative timing
– This becomes more complicated for longitudinal data
■ Locations
– Patient information: institution, room number, address
– Provider information: institution, address
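The "random (but consistent) offset" for dates can be sketched as follows. This is a minimal illustration, not the method used by MIST: the keyed-hash scheme, the `secret` value, and the function names are all assumptions introduced here. The point is only that one patient always gets the same non-zero shift, so intervals between that patient's dates survive.

```python
import hashlib
from datetime import date, timedelta


def patient_offset(patient_id: str, secret: str, max_days: int = 365) -> timedelta:
    """Derive a stable, non-zero day offset for one patient (hypothetical scheme)."""
    digest = hashlib.sha256(f"{secret}:{patient_id}".encode("utf-8")).digest()
    days = int.from_bytes(digest[:4], "big") % max_days + 1  # 1..max_days, never zero
    return timedelta(days=days)


def shift_date(d: date, patient_id: str, secret: str) -> date:
    """Shift a date by the patient's offset; the same patient always gets the same shift."""
    return d + patient_offset(patient_id, secret)


# Illustrative values only.
admitted = date(2000, 7, 12)
discharged = date(2000, 7, 26)
shifted_in = shift_date(admitted, "patient-001", "site-secret")
shifted_out = shift_date(discharged, "patient-001", "site-secret")

# The length of stay is unchanged even though both dates moved.
print(shifted_out - shifted_in == discharged - admitted)
```

For longitudinal data the same offset would have to be applied across every record of a patient, which is one reason the slide calls that case more complicated.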
Automated De-identification: Two Approaches
■ Pattern Matching approach
– System uses institution-specific name lists and hand-crafted pattern rules to produce automatically de-identified notes
– Large maintenance effort requiring skilled developers
■ Machine Learning-based NLP approach
– Trainer builds a model from annotated training notes; decoder applies the model to produce automatically de-identified notes
– Need to develop annotated training data
Third Approach: Extract Clinical Info, Leave Behind PHI
■ Berman¹:
– Redact everything except stop words, UMLS concepts
■ Morrison²:
– Apply MedLEE clinical NLP system to extract medical concepts
■ eMERGE phenotype extraction
– Pool patients with a specific phenotype for statistical power
– PheKB³ has tools to enable cross-site collaboration for algorithm development, validation, and sharing for reuse
[Diagram: a medical concept extraction system, using a medical thesaurus, outputs the medical concepts associated with each patient (e.g., hypertension, COPD, diabetes)]
¹ Berman JJ. Arch Pathol Lab Med 2003, 680-6
² Morrison FP, et al. J Am Med Inform Assoc 2009, 16(1):37-9
³ https://www.phekb.org/
MITRE Identification Scrubber Toolkit (MIST)
■ Machine-learning approach: learn models from annotated sample notes
■ De-identification system built by annotating training examples and building models
– Contrasts with pattern matching approaches, which require hand-crafting of patterns and manual maintenance
■ System includes a resynthesis component
– Replaces PHI with surrogate (made-up) identifiers of the appropriate type for “Hiding in Plain Sight”
[Diagram: trainer → model → decoder → automatically de-identified notes]
Sources of Evidence for Identifying PHI
■ Carafe Conditional Random Fields
– Probabilistic framework for assigning a label sequence to a sequence of observations
■ Sources of evidence for Carafe
– Lexical features (the words themselves)
– Contextual features (nearby words, punctuation, other tags)
  ■ “Dr.” and “Attending:” are good evidence that the next word begins a DOCTOR phrase
– Frequency
  ■ “of” is a high-frequency word and never appears as part of DOCTOR in training data
Example: Dr. Right of the City Hospital. Attending: LIYANI RIGHT , M.D. IH85 YF132/0184
[In the figure, tokens in this example are labeled with the kind of evidence they supply: lexical, contextual, or frequency]
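The three kinds of evidence can be illustrated with a per-token feature function of the sort a sequence tagger consumes. This is a sketch under stated assumptions: the feature names, the cue list, the frequency threshold, and the toy counts are invented for illustration and are not Carafe's actual feature set.

```python
from collections import Counter

# Hypothetical cue words; the slide names only "Dr." and "Attending:".
DOCTOR_CUES = {"dr.", "attending:"}


def token_features(tokens, i, freq, high_freq=3):
    """Build a feature dict for tokens[i] from lexical, contextual, and frequency evidence."""
    tok = tokens[i]
    prev_tok = tokens[i - 1] if i > 0 else "<BOS>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "<EOS>"
    return {
        "lex.word": tok.lower(),                                  # lexical: the word itself
        "lex.capitalized": tok[:1].isupper(),
        "ctx.prev": prev_tok.lower(),                             # contextual: neighbors
        "ctx.next": next_tok.lower(),
        "ctx.prev_is_doctor_cue": prev_tok.lower() in DOCTOR_CUES,
        "freq.high": freq[tok.lower()] >= high_freq,              # frequency evidence
    }


tokens = "Dr. Right of the City Hospital . Attending: LIYANI RIGHT , M.D.".split()
# Toy frequency table standing in for counts over a training corpus.
freq = Counter({"of": 50, "the": 80, "right": 1})
features = [token_features(tokens, i, freq) for i in range(len(tokens))]
print(features[1])
```

In a real system these dictionaries would feed a CRF trainer, which learns how much weight each piece of evidence deserves for each label.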
MIST: MITRE Identification Scrubber Toolkit
Released 2008 as Open Source: http://mist-deid.sourceforge.net/
The MIST team: John Aberdeen, Sam Bayer, Cheryl Clark,
Lynette Hirschman, Ben Wellner
Resynthesis
[Diagram: Find Redactions → Determine Type → Randomly Generate Replacement → Transform, drawing on Census (people), Gazetteer (places), and Institution Names lists]
■ Problem: Because of privacy considerations, developers
only have access to scrubbed data in many instances
– Often the scrubbed data only has category placeholders, e.g.,
[DATE], [DOCTOR], [PATIENT], etc. where protected health
information (PHI) appeared in the original data
■ To create usable training data, it is necessary to resynthesize
PHI in a manner such that the resulting data is realistic
■ Generate data using pattern
distribution from original
corpus, external distribution
of patterns, input PHI, or
feature probabilities
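A minimal sketch of resynthesis from category placeholders, assuming bracketed tags such as [DATE - ZZ] or [DOCTOR XX]. The surrogate lists, the regular expression, and the function name are illustrative assumptions, not MIST's implementation; a realistic generator would sample from census, gazetteer, and institution lists and match the formats seen in the original corpus.

```python
import random
import re

# Tiny illustrative surrogate lists (hypothetical stand-ins for census/gazetteer data).
SURROGATES = {
    "FACILITY": ["Sun Hill Medical Center", "Jones Cardiac Unit"],
    "DATE": ["August 12th", "June 8th"],
    "DOCTOR": ["Faulkner"],
}

# Matches a bracketed placeholder and captures its leading category name.
PLACEHOLDER = re.compile(r"\[([A-Z]+)[^\]]*\]")


def resynthesize(text, rng):
    """Replace each known [TYPE ...] placeholder with a randomly chosen surrogate."""
    def pick(match):
        choices = SURROGATES.get(match.group(1))
        # Unknown categories are left untouched rather than guessed at.
        return rng.choice(choices) if choices else match.group(0)
    return PLACEHOLDER.sub(pick, text)


scrubbed = "Seen at the [FACILITY XX XX XX XX] on [DATE - ZZ] by Dr. [DOCTOR XX]; see [PATIENT]."
result = resynthesize(scrubbed, random.Random(0))
print(result)
```

Note that independent random choices do not keep repeated mentions of the same person or date consistent; preserving coreference would need a per-document mapping from original value to surrogate.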
How to Build a De-identification System
1. Mark PHI by hand in initial documents
2. Train model from initial documents
3. Mark PHI automatically in more documents using model
4. Hand-correct automatically marked documents
5. Train (better) model from (more) documents; repeat steps 3-5 as needed
6. Redact or resynthesize marked documents
MIST 2.0 User Interface
[Screenshot: automatically tagged document, redacted document, candidate replacements, and legend]
Balancing Redaction and Interpretability
Protect Privacy AND Retain Clinical Value
Example note (original PHI shown in place):
HISTORY OF PRESENT ILLNESS: The patient is a 77-year-old-woman
with long standing hypertension who presented as a walk-in to me at
the Oak Valley Health Center on July 9th. Recently had been
started q.o.d. on Clonidine since May 5th to taper off of the drug.
Was told to start Zestril 20 mg. q.d. again. The patient was sent to
the Smith Cardiac Unit for direct admission for cardioversion and
anticoagulation, with the Cardiologist, Dr. Pearson to follow.
Redaction tags shown on the slide: [FACILITY XX XX XX XX], [DATE ZZ], [DATE YY], [FACILITY XX XX XX], [DOCTOR XX], [NAME XX]
De-Identification Metrics
■ Standard measures from the NLP community
– Precision (positive predictive value): # correctly tagged phrases returned / # phrases returned
– Recall (sensitivity): # correctly tagged phrases returned / # phrases present
– Balanced F-measure: harmonic mean of precision & recall: 2*P*R / (P + R)
■ Other useful metrics
– Leaks: non-redacted PHI remaining after de-identification: # false negatives / # phrases present
– Token level accuracy (computed over words or tokens): (# true positives + # true negatives) / # total tokens
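The metric definitions translate directly into code. The sketch below assumes the counts (true positives, false positives, false negatives, true negatives) have already been obtained by comparing system output with a gold standard; the function names and example counts are made up for illustration.

```python
def phrase_metrics(tp, fp, fn):
    """Phrase-level precision, recall, balanced F-measure, and leak rate."""
    returned = tp + fp          # phrases the system tagged
    present = tp + fn           # phrases in the gold standard
    precision = tp / returned if returned else 0.0
    recall = tp / present if present else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    leak_rate = fn / present if present else 0.0   # equals 1 - recall
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "leak_rate": leak_rate,
    }


def token_accuracy(tp, tn, total_tokens):
    """Token-level accuracy: correctly labeled tokens over all tokens."""
    return (tp + tn) / total_tokens


# Illustrative counts only.
m = phrase_metrics(tp=90, fp=5, fn=10)
acc = token_accuracy(tp=180, tn=9800, total_tokens=10000)
print(m, acc)
```

Because most tokens in a note are not PHI, token-level accuracy stays high even when recall is poor, which is why the leak rate is the more telling number for privacy.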
How Good is Automated De-identification?
■ This depends on:
– Note type
– PHI density
– Amount of training data
– Alignment of training data with note types to be redacted
– Desired balance of recall and precision
– Available budget to create training data
■ How good does automated de-identification need to be?
– This depends on use case, e.g.,
  ■ For unlimited data release, no leaks (high recall)!
  ■ For limited data release, may prefer high precision
Results from i2b2/UT Health Corpus 2014 De-Identification Evaluation
                 Minimum  Mean   Median  Maximum  Std. deviation
Micro precision  0.744    0.941  0.967   0.989    0.068
Micro recall     0.108    0.807  0.882   0.963    0.211
Micro F1         0.193    0.853  0.922   0.976    0.184
Stubbs A, Uzuner O. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth
corpus. J Biomed Inform. 2015 Aug 28. pii: S1532-0464(15)00182-3. doi: 10.1016/j.jbi.2015.07.020.
Aggregate statistics for token-based evaluation of
all submissions from 10 teams –
HIPAA-identified PHI categories
Evaluated on 1304 longitudinal medical records
describing 296 patients
Results from Deleger et al:
PHI Type  P      R      F
AGE       96.69  90     93.22
DATE      97.95  96.98  97.46
ID        97.27  95.7   96.48
INST      93.25  85.26  89.08
LOC       87.38  69.95  77.7
NAME      94.47  94.56  94.52
OTHER     84.68  73.87  78.91
PHONE     94.42  90.87  92.61
All       95.08  91.92  93.48
L. Deleger, T. Lingren, Y. Ni, M. Kaiser, L. Stoutenborough, K. Marsolo, M. Kouril, K. Molnar, I. Solti,
Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research,
J. Biomed. Inform. 50 (2014) 173–183, http://dx.doi.org/10.1016/j.jbi.2014.01.014.
Results from tests on an in-house de-identification system at
Cincinnati Children’s Hospital
Corpus of 250 notes from Cincinnati Children’s Hospital Medical
Center (22 note types)
Variation due to Note Type
■ Lab Vital Signs
– Short
– Minimal text
– Highly structured; mostly header info
■ Discharge Summary
– Much more free, narrative text
– PHI elements throughout
– Many, varied contexts for PHI elements
DISCHARGE SUMMARY
ATTENDING PHYSICIAN
**NAME[XXX WWW], M.D.
PATIENT: **NAME[AAA, BBB] MR NO:**ID-NUM
ADMITTED: **DATE[Jul 12 2000] DISCHARGED:**DATE[Jul 26 2000]
SERVICE: Neurosurgery
PRINCIPAL DIAGNOSIS: Malesuada quis, egestas quis, wisi. Donec ac sapien.
SECONDARY DIAGNOSIS: Ut orci.
PRINCIPAL PROCEDURES: Duis ultricies, metus a feugiat porttitor.
REASON FOR HOSPITALIZATION: Morbi lorem mi **AGE[in 60s]-year-old female who presented to the E.R.
on **DATE[Jul 12 00] eget purus vitae eros ornare adipiscing at 8:00 p.m. that evening. Vestibulum imperdiet
nonummy sem from 8:00-9:00 p.m. Fusce urna magna,neque eget lacus. Maecenas felis nunc, aliquam ac
sapien. Ut orci. Duis ultricies, metus a feugiat porttitor, dolor mauris convallis est, quis mattis lacus ligula eu
augue. Sed facilisis. Morbi lorem mi, tristique vitae, sodales eget, hendrerit sed, erat. Vestibulum. Fusce urna
magna day thirteen, **DATE[Jul 26 00], tincidunt quis Malesuada quis.
DISCHARGE INSTRUCTIONS: Curabitur nunc eros, euismod in, convallis at, vehicula sed consectetuer
posuere, eros mauris dignissim diam, pretium sed pede suscipit. Fusce urna magna,lorem neque eget lacus.
DISCHARGE MEDICATIONS: Steroid taper, Pepcid 20 mg b.i.d. while she is taking her steroids, nimodipine
60 mg p.o. q. 4 hr for an additional 9 days and Dilantin 200 mg p.o. t.i.d.
**NAME[AAA, BBB]
MR NO:**ID-NUM
FOLLOWUP APPOINTMENTS: Morbi lorem miwith Dr. **NAME[WWW] in the **INSTITUTION in one month.
DW/dts/E/0/323 , M.D.
T: **DATE[Jul 27 2000] 1955 **NAME[ZZZ YYY], M.D.
**NAME[M]: **DATE[Jul 27 2000] 1611 Resident, **NAME[VVV UUU TTT]
Job #: **ID-NUM
MD#:**ID-NUM
:E_O_R:
Lab Vital Signs
S_O_H
Counters Report Type
355,XLl+ApzVmFgU LAB
E_O_H
[Report de-identified (Safe-harbor compliant)
by De-ID v.6.12]
:ADM:01H
:UNIQ:0chs11eo0
:TYP:LAB
:STYP:VS
:ID:**ID-NUM
:NA:**NAME[AAA, BBB CCC]
:DAT:**DATE[Jul 02 2001]
:TIM:17:40
:ACC:1
:PQ:resp get.data
:DOB:**DATE[Nov 13 1938]
:DOC:**NAME[ZZZ, YYY]
:LOC:1123-0
:LAB:BAT: 0 VSign **ID-NUM
DAT: 0 Pulse 0 bpm . 99
DAT: 0 RespRt 0 bpm . 21
DAT: 0 CollBy 0 . . **NAME[XXX WWW], CRT
:E_O_R:
Effect of Training Data on Performance
■ Note type portability
– Four types: Letter, Discharge, Order, Lab
– Train individual models for each note type
– Evaluate all models against all note types
■ Hybrid model
– Train one large model using all four note types
– Evaluate hybrid model against all note types
[Diagrams: (1) training docs of each note type (Letter, Discharge, Order, Lab) yield separate models, each tested on test docs of all four types; (2) training docs of all four types yield one hybrid model, tested on test docs of all four types]
Building the Right Model
■ Best results when model is trained from notes of matching type
■ Hybrid model is comparable to matched model¹
– Combines notes from all types
– Has more data
¹ R. Yeniterzi, J. Aberdeen, S. Bayer, B. Wellner, L. Hirschman, and B. Malin, Effects of personal
identifier resynthesis on clinical text de-identification, JAMIA, vol. 17, no. 2, p. 159, 2010.
Writing Complexity (Vanderbilt)*
■ Motivation:
– How to deal with hundreds of distinct note types
■ Idea:
– Automatically cluster notes based on writing complexity, and train
models for these clusters (rather than for individual note types)
■ Vanderbilt experiments
– 4500 Vanderbilt patient notes of various types
– Average F-measure results:
■ 0.917 for writing complexity clusters
■ 0.911 for note type clusters, ranging from 0.842 (Family History) to 0.964 (Clinical Communication)
■ 0.811 for random clusters
– Published in the International Journal of Medical Informatics
*Li, M., Carrell, D., Aberdeen, J., Hirschman, L. and Malin, BA. De-identification of Clinical Narratives
through Writing Complexity Measures, Int J Med Inform. 2014 Oct;83(10):750-67. doi:
10.1016/j.ijmedinf.2014.07.002. Epub 2014 Jul 24.
Effect of Resynthesis on Model Performance
■ MIST evaluated against Vanderbilt hospital records
■ Explore effect of resynthesis on model portability
– Train/test with all combinations of original (O) × resynthesized (R) data
[Diagram: original training docs (real IDs) yield O models; de-identified and resynthesized training docs yield R models. Each model type is tested on original and on resynthesized test docs; O => O and R => R are the matched cases. The original data were tested at Vanderbilt and were unavailable to MITRE.]
R. Yeniterzi, J. Aberdeen, S. Bayer, B. Wellner, L. Hirschman, and B. Malin, Effects of personal
identifier resynthesis on clinical text de-identification, JAMIA, vol. 17, no. 2, p. 159, 2010.
Recall Results from Different Models
[Bar chart: recall (0 to 1) by note type (DS, LAB, LETTER, ORDER, HYBRID) for four train => test conditions: O => O and R => R (matched), O => R and R => O (mismatched); a reference line marks 95% recall]
Results are dramatic:
Recall for matched models > 95% across note types
Model based on resynthesized data works BADLY for real data!
Resynthesis regularizes data, resulting in a poorer match against real data
Experiment: Training Size
■ Models trained from larger numbers of documents tend to outperform models trained on fewer documents
■ Question: how many training items are needed to obtain good performance?
[Diagram: training docs → models → test docs]
Training Items and F-measure
[Chart: token F-measure (0 to 100) versus number of training items (log scale, 1 to 10000), with one curve per PHI type: AGE, DATE, ID-NUM, NAME, PLACE]
How Good Are Humans?
■ Measure 1: Leakage. How much PHI is leaked with human de-identification?
– Humans are not perfect: recall rates in the 94-98% range
■ Measure 2: Inter-annotator agreement. How well do humans agree when they tag PHI?
– i2b2/UT corpus 2014¹: Tag-level based F: 0.90
– Deleger et al²: Tag-level based F: 91.76 (on a 0-100 scale)
Human de-identification is in the same range as automated de-identification
To improve human de-identification, use more humans!
To improve automated de-identification, use more training data
¹ Stubbs A, Uzuner O. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. J
Biomed Inform. 2015 Aug 28. pii: S1532-0464(15)00182-3. doi: 10.1016/j.jbi.2015.07.020.
² L. Deleger, T. Lingren, Y. Ni, M. Kaiser et al., Preparing an annotated gold standard corpus to share with extramural
investigators for de-identification research, J. Biomed. Inform. 50 (2014) 173–183.
Human Annotation: Counts of Missed PHI¹
[Chart: number of leaked (overlooked) PHI instances in 100 Family Practice notes containing 1,093 PHI instances, when reviewed by individuals, pairs, or trios of reviewers]
¹ From: Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification.
D. S. Carrell; D. J. Cronkite; B. A. Malin; J. S. Aberdeen; L. Hirschman, submitted to Meth of Info in Medicine
Manual De-identification of Clinical Notes¹
■ Task: De-identification of free text in clinical notes of electronic health records
– Removal of Protected Health Information: 18 classes of information per US HIPAA regulations
– Annotators identify and redact all types of PHI in a patient note
■ Corpus
– 100 clinical records were de-identified by 4 annotators
– 1093 PHI instances total (~10 instances per note)
■ Estimated steady state cost
– Cost: $7.50/patient note/annotator; $0.70/annotation
– Quality: 95% recall (single annotator)
– Cost for 99% recall: $15.00/patient note (2 annotators)
¹ From: Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification.
D. S. Carrell; D. J. Cronkite; B. A. Malin; J. S. Aberdeen; L. Hirschman, submitted to Meth of Info in Medicine
Results from i2b2/UT Health Corpus 2014
■ Longitudinal corpus of 301 patients
– Average: ~617 tokens (words) per record
■ Total time: 310 h for double annotation
– 30 min/patient for single annotation
■ Adjudication: 2 months part time
■ Estimated cost @ $30/hour for curation:
– $15.00 per record for double annotation
– Consistent with results from Carrell et al.
Stubbs A, Uzuner O. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth
corpus. J Biomed Inform. 2015 Aug 28. pii: S1532-0464(15)00182-3. doi: 10.1016/j.jbi.2015.07.020.
Estimating the Cost of De-identification
■ Manual: Cost of de-identification scales linearly with
– # of annotators
– # of notes
■ Automated: Cost of de-identification depends on
– Quantity of (matched) training data used
– Target recall/precision levels
  ■ E.g., for recall >90%, need 100-1000 exemplars of each type
  ■ However, a single annotator is generally good enough
■ Example:
– De-identify 1000 notes at recall > 95%
  ■ Manual cost: 1000 notes * $7.50/note/annotator = $7500
  ■ Automated cost: 250 notes * $7.50/note/annotator = $1875
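The cost example can be reproduced with a few lines of arithmetic. The sketch below uses the $7.50 per note per annotator figure quoted in these slides; the function names are illustrative, and the model deliberately ignores everything except annotation labor (an assumption, since software setup and review time are not costed on the slide).

```python
COST_PER_NOTE_PER_ANNOTATOR = 7.50  # dollars, steady-state figure quoted in the slides


def manual_cost(n_notes, n_annotators=1, unit_cost=COST_PER_NOTE_PER_ANNOTATOR):
    """Manual de-identification: every note is read by every annotator."""
    return n_notes * n_annotators * unit_cost


def automated_cost(n_training_notes, n_annotators=1, unit_cost=COST_PER_NOTE_PER_ANNOTATOR):
    """Automated de-identification: only the training notes are hand-annotated."""
    return n_training_notes * n_annotators * unit_cost


manual = manual_cost(1000)                    # 1000 notes, one annotator
automated = automated_cost(250)               # 250 annotated training notes
double = manual_cost(1000, n_annotators=2)    # double annotation of the same 1000 notes
print(manual, automated, double)
```

The manual cost keeps growing with corpus size, while the automated cost is fixed once enough matched training data exists, which is the scaling argument the slide makes.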
Bootstrapping a System: Results¹
■ Experiment performed by one of our medical partners using MIST
■ Small amount of hand-labeling (~8 hours)
– F-measure: 0.953
■ At $250/hr for physician time, this is a $2000 effort
■ Performance improves with more hand-labeling
[Charts: token F-measure (0 to 100) versus cumulative person-hours (0 to 9), labeled All PHI Types, NAME, and DATE]
¹ D. Hanauer, J. Aberdeen, S. Bayer, B. Wellner, et al., Bootstrapping a de-identification system for narrative patient
records: Cost-performance tradeoffs, International Journal of Medical Informatics, vol. 82, no. 9, pp. 821–831, Sep. 2013
Resynthesis and Hiding in Plain Sight
Original note:
HISTORY OF PRESENT ILLNESS: The patient is a 77-year-old-woman
with long standing hypertension who presented as a walk-in to me at
the Oak Valley Health Center on July 9th. Recently had been
started q.o.d. on Clonidine since May 5th to taper off of the drug.
Was told to start Zestril 20 mg. q.d. again. The patient was sent to
the Smith Cardiac Unit for direct admission for cardioversion and
anticoagulation, with the Cardiologist, Dr. Pearson to follow.
Resynthesized note:
HISTORY OF PRESENT ILLNESS: The patient is a 77-year-old-woman
with long standing hypertension who presented as a walk-in to me at
the Sun Hill Medical Center on August 12th. Recently had been
started q.o.d. on Clonidine since June 8th to taper off of the drug.
Was told to start Zestril 20 mg. q.d. again. The patient was sent to
the Smith Cardiac Unit for direct admission for cardioversion and
anticoagulation, with the Cardiologist, Dr. Faulkner to follow.
Resynth Error (but who can tell?): "Smith Cardiac Unit" was left unchanged, yet it is indistinguishable from the surrogates around it.
Preliminary Study of HIPS Detection¹
■ Experiment
– De-identify 100 family practice notes using a de-identification system
– Replace PHI with realistic surrogates (“Hiding in Plain Sight”)
– Ask reviewers to:
  ■ Identify PHI in the surrogatized notes
  ■ Guess which PHI instances are “leaks”
■ Getting enough leaks to test hypothesis
– Originally not enough residuals; used a model trained for a different note type to increase residuals
■ Measured ability of reviewers to detect HIPS PHI
– Detection rate is fairly low: 5-10% of residual leaks found
– Reduced risk of disclosure yielded effective de-id rates >99%, better than manual methods
¹ D. Carrell, B. Malin, J. Aberdeen, S. Bayer, C. Clark, B. Wellner, and L. Hirschman, Hiding in plain sight: use of realistic
surrogates to reduce exposure of protected health information in clinical text, JAMIA, vol. 20 (2), pp. 342–348, 2013
Hiding In Plain Sight: Example
Guessing the Leaks after HIPS
Results from Initial HIPS Experiment
Rates of residual PHI detection in 50 family practice notes de-identified by HIPS
              Test Corpus                      Reviewer #1 (abstractor)                 Reviewer #3 (informaticist)
PHI type      # PHI instances  # residual PHI  Predictions  Correct  Recall  Precision  Predictions  Correct  Recall  Precision
Patient Name  59               32              2            1        0.03    0.5        5            4        0.13    0.8
Date          228              34              0            0        0       0          8            2        0.06    0.25
ALL           287              66              2            1        0.02    0.5        13           6        0.09    0.46
In this experiment, more than 85% of leaks were protected by HIPS
Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text.
Carrell D, Malin B, Aberdeen J, Bayer S, Clark C, Wellner B, Hirschman L. J Am Med Inform Assoc. 2013 Mar-
Apr;20(2):342-8. doi: 10.1136/amiajnl-2012-001034. Epub 2012 Jul 6.
How PHI Leaks Were Detected
(A) PHI = patient first name. The first name was the salutation of a formal letter pasted into the chart, and other instances of the patient's full name didn't match.
    Mitigation strategy: Propagate name surrogates to all instances
(B) PHI = patient first name. Chart says "Patient is East Indian." Only the last name was replaced, with a European surrogate, making the first name stand out.
    Mitigation strategy: Match name surrogates by national origin
(C) PHI = patient age. The patient's age appeared twice, both times as 32. The surrogate generator replaced one with 26 and the other with 32 (random); the reviewer guessed.
    Mitigation strategy: Use non-zero offsets in surrogates & apply consistently
(D) PHI = smoking quit date. The date format was "Sep 2010", which the reviewer guessed was human (not machine) generated.
    Mitigation strategy: Add spelling and formatting errors to surrogates
Replicating Human Detection Experiment
■ Ground truth data set
– Group Health Family Practice notes were tagged for PHI (reviewed and adjudicated) to create a “ground truth” data set
■ Preparation of surrogatized corpus
– The same set of notes was run through an automated de-identification system (MIST) to produce surrogatized output with leaks
■ Human readers were asked to find leaks in the surrogatized corpus
– Separate teams of readers from two sites (Group Health, Vanderbilt) tagged PHI in surrogatized documents
– The annotators were then asked to guess which PHI might be leaks
Results from Human Detection Experiment
Reader  Guess  Correct  # PII  # Leak  Guess Precision  Leak Recall  Effective Recall w HIPS
R1      37     32       909    109     0.86             0.29         0.97
R6      130    48       909    109     0.37             0.44         0.95
R7      28     21       909    109     0.75             0.19         0.98
R8      127    52       909    109     0.41             0.48         0.94
R9      35     17       909    109     0.49             0.16         0.98
Readers from Vanderbilt, Group Health reviewed tagged, surrogatized notes and guessed which PHI were leaks
Base system recall was 88%
Results show HIPS is ~50% to 80% effective in disguising leaks, depending on reader
HIPS can cut leak rate in half!
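The columns of the human detection table can be related to one another with simple ratios. The formulas below are one plausible reading, inferred from the numbers rather than stated on the slide: guess precision as correct guesses over guesses, leak recall as correct guesses over leaks, and effective recall as one minus the share of all PHI that a reader correctly exposed. Under that assumption the results agree with the table to within about 0.01.

```python
TOTAL_PHI = 909   # "# PII" column
LEAKS = 109       # "# Leak" column

# (guesses, correct guesses) per reader, copied from the table.
READERS = {
    "R1": (37, 32),
    "R6": (130, 48),
    "R7": (28, 21),
    "R8": (127, 52),
    "R9": (35, 17),
}


def hips_rates(guesses, correct, total_phi=TOTAL_PHI, leaks=LEAKS):
    """Assumed formulas relating the table's columns (not stated in the source)."""
    return {
        "guess_precision": correct / guesses,
        "leak_recall": correct / leaks,
        "effective_recall": 1 - correct / total_phi,
    }


base_recall = 1 - LEAKS / TOTAL_PHI   # recall of the base system before HIPS
rates = {reader: hips_rates(g, c) for reader, (g, c) in READERS.items()}
for reader, r in rates.items():
    print(reader, {name: round(value, 2) for name, value in r.items()})
```

On this reading, a leak only counts against effective recall when a reader actually picks it out from the surrogates, which is why every reader's effective recall exceeds the 88% base recall.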
Machine Attack Experiments
■ Objective:
– Assess the ability of a highly-motivated hostile attacker to
detect residual PHI in a HIPS de-identified corpus
■ Scenario:
– Publisher creates and releases a HIPS de-identified corpus
– Attacker divides corpus into train and test, and trains a model
■ Attacker (manually) annotates all PHI found in the training corpus
■ Attacker applies model to test set
■ Attacker manually inspects output – assumes PHI NOT tagged by
model are real leaks
■ Experiments underway at Vanderbilt and GHC
– Effectiveness of a machine attack is being evaluated through
simulation and comparison with actual corpus statistics
Observations
■ All de-identification methods “leak”
– Human de-identification recall is in the same range as automated de-identification
■ Automated de-identification has advantages
– It can scale cheaply
  ■ Cost is in preparation of training data
  ■ Costs estimated to be on the order of a few thousand dollars, depending on diversity of note types
– It provides protection comparable to manual de-identification, especially with use of “Hiding in Plain Sight”
■ But automated de-identification is still not widely used
– Only 2 out of 8 IRBs surveyed relied solely on automated de-identification
– Only 2 of 8 respondents said they would approve use of de-identification software if the system removed 99% of PHI
How to Scale the Data Wall
■ Change the cost-quality balance for automated de-
identification
– Hiding in Plain Sight reduces “leakage” at no extra cost
– Ensemble taggers (combining output from multiple taggers) may
also reduce leaks at no extra cost
■ Lower the barriers to IRB approval
– Develop metrics relevant to IRB concerns: current metrics
overestimate leakage -- not all leaks are the same
■ Understand the cost of NOT sharing data
– Not sharing impedes precision medicine, population based
studies, pharmacovigilance
– Lack of clinical data impedes development of algorithms to mine
narrative data in patient medical records
Thank you!
Questions?