-
Panel: Automatic Clinical Text De-Identification: Is It Worth It, and Could It Work for Me?
Hercules Dalianis
Clinical Text Mining Group, Department of Computer and Systems Sciences (DSV)
-
Background
• Starting 2007
• Karolinska University Hospital, Stockholm
• Greater Stockholm (City Council): 2 million inhabitants
• 1800 beds/inpatients
• 550 clinical units
Hercules Dalianis, MEDINFO 2013 2
-
TakeCare EPR system
• Swedish electronic patient record system, now owned by CompuGroup Medical
• Centralized, text file based
• Built on the APL programming language
• Data transferred to a MySQL database to make it manageable (Intelligence)
-
Ethical permission
• What type of research will be carried out
• How it will be carried out
• No social security numbers
• No personal names
• Safeguarding of the data
-
Encryption and safeguarding
• Encrypted server
• Password protected
• Locked into an alarmed room
• Server locked to a rack
• No Internet connection
• Few people have access to this server (and they have to sign a security agreement)
=> Probably safer than at the hospital
-
Trust, Trust and more Trust
• Good contacts with hospital management
• They decide for the whole hospital/all clinical units
• No psychiatric or venereal diseases, no undocumented refugees
-
Stockholm EPR Corpus
• We obtained 1 million patient records from 550 clinical units, covering the years 2006-2010
• Delivered in several extracts, which are still continuing
• Each patient has a unique social security number, from birth to death; it was replaced by a serial number
• All patient names were removed
• The rest, including sensitive text, is still present
-
DEID work
• Yes, we did it, also to obtain an overview of the problems that may occur
• We followed HIPAA*) but adapted it to Swedish conditions
*) Health Insurance Portability and Accountability Act
-
The Stockholm EPR PHI*) corpus
• 100 electronic patient records (EPRs) in Swedish
• Five clinics: Neurology, Orthopaedics, Infection, Dental Surgery and Nutrition
• 20 patients from each clinic, 50% men, 50% women
• 380 000 tokens
• Three annotators annotated the whole corpus
*) Protected Health Information
-
28 PHI-classes
• Account_Number, Age, Age_Over_89, Biometric_Identifier, Date_Part, Full_Date, Year,
First_Name, Last_Name, Patient_First_Name,
Patient_Last_Name, Relative_First_Name,
Relative_Last_Name, Clinician_First_Name,
Clinician_Last_Name, Location, Country, Municipality,
Organization, Street_Address, Town, Health_Care_Unit,
Device_Identifier_and_Serial_Number, Ethnicity,
Fax_Number, Phone_Number, Relation, Uncertain
-
Consensus eight annotation classes
• Age
• Date_Part
• Full_Date
• First_Name
• Last_Name
• Health_Care_Unit
• Location
• Phone_Number
-
Annotation classes and instances
• Age: 56
• Full date: 710
• Date part: 500
• First name: 923
• Last name: 928
• Location: 1 021
• Health care unit: 148
• Phone number: 135
Sum: 4 421
-
• 380 000 tokens
• 4 421 sensitive instances
• ~1 percent sensitive information
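The proportion quoted here can be rechecked from the two counts on the slide. The snippet below is only an arithmetic sketch; the variable names are ours, and since one PHI instance may span several tokens the ratio is an approximation of the token-level share.

```python
# Counts taken from the slide: corpus size and annotated PHI instances.
total_tokens = 380_000
phi_instances = 4_421

# Ratio of PHI instances to tokens (an instance may cover more than one token,
# so this only approximates the share of sensitive tokens).
phi_share = phi_instances / total_tokens
print(f"{phi_share:.2%}")
```

The result is a little above one percent, which matches the rounded figure given on the slide.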
-
Training and testing on the eight annotation classes using the Stanford NER CRF
-
Conditional Random Fields à la Stanford NER
• Precision 0.74-0.95
• Recall 0.36-0.83
• F-score 0.49-0.90
• The 8 annotation classes and the words
• The rest is a black box
– Window breadth
– Distance between words, etc.
-
Research on Stockholm EPR Corpus
• DEID and resynthesis
• Factuality level detection of diagnoses
• Negation detection
• Detecting the amount of hospital-acquired infections (HAI)
• Detection of adverse drug events
• Comorbidities
-
Conclusion
• It is preferable to work on the original data
• De-identifying the data is too costly and difficult
• It is not safe enough
• De-identification makes the data too noisy
-
References
• Velupillai, S., H. Dalianis, M. Hassel and G. H. Nilsson. 2009. Developing a standard for de-identifying electronic patient records written in Swedish: precision, recall and F-measure in a manual and computerized annotation trial. International Journal of Medical Informatics (2009), doi:10.1016/j.ijmedinf.2009.04.005
• Dalianis, H. and S. Velupillai. 2010. De-identifying Swedish Clinical Text - Refinement of a Gold Standard and Experiments with Conditional Random Fields, Journal of Biomedical Semantics 2010, 1:6 (12 April 2010)
• Alfalahi, A., S. Brissman and H. Dalianis. 2012. Pseudonymisation of person names and other PHIs in an annotated clinical Swedish corpus. In the Proceedings of the Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012) held in conjunction with LREC 2012, May 26, Istanbul, pp 49-54
-
Comorbidities in Comorbidity-view
• Which ICD-10 codes co-occur with which other ones
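A co-occurrence view of this kind can be approximated by counting, per patient, every unordered pair of assigned codes. The sketch below is only an illustration under our own assumptions: the function name `cooccurrence_counts` and the toy patient data are hypothetical and do not describe how the Comorbidity View is actually implemented.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(patients):
    """Count how often each unordered pair of ICD-10 codes is assigned to the same patient."""
    counts = Counter()
    for codes in patients:
        # Deduplicate and sort so that each pair has one canonical order.
        for pair in combinations(sorted(set(codes)), 2):
            counts[pair] += 1
    return counts

# Hypothetical toy data: one list of ICD-10 codes per patient.
patients = [
    ["I509", "I110", "J189"],
    ["I509", "I110"],
    ["I509", "E119"],
]
counts = cooccurrence_counts(patients)
print(counts.most_common(1))
```

Counting per patient rather than per note is a design choice; counting per care episode would answer a slightly different question.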
-
Comorbidity View (screenshots not reproduced)
-
Example record (manually anonymized)
123 H - IVA 322916614D 2008-08-21 9:12 1944 Kvinna Anamnesis
Kvinna med hjärtsvikt, förmaksflimmer, angina pectoris. Ensamstående änka. Tidigare CVL med sequelae högersidig hemipares och afasi. Tidigare vårdad för krampanfall misstänkt apoplektisk. Inkommer nu efter att ha blivit hittad på en stol och sannolikt suttit så över natten. Inkommer nu för utredning. Sonen Johan är med.
-
123 H - IVA 322916614D 2008-08-21 10:54 1944 Kvinna Bedömning
Grav hjärtsvikt efter hjärtinfarkt x 2 inklusive episod med asystoli och HLR. EF 20-25%. Neurologisk påverkan med högersidig svaghet. Blodprov. Odlingar tas i blod och urin. Remiss skickas pulm-rtg enl dr Svenssons anteckning. Atelektaser. Pneumoni, I110. Hjärtinsufficiens, ospecificerad, I509
-
(English translation)
123 H - IVA 322916614D 2008-08-21 9:12 1944 Woman Anamnesis
Woman with heart failure, atrial fibrillation, and angina pectoris. Single widow. Former CVL with sequelae: right-sided hemiparesis and aphasia. Prior hospital care for seizures, suspected to be apoplectic. Arrives at the hospital after being found in a chair, where she had probably been sitting overnight. Arrives for further investigation and care. Accompanied by her son Johan.
-
123 H - IVA 322916614D 2008-08-21 10:54 1944 Woman Assessment/Plan
Severe heart failure after two myocardial infarctions, including an episode of asystole and CPR. Ejection fraction (EF) 20-25%. Neurological symptoms with right-sided hemiparesis. Blood samples. Cultures taken from blood and urine. Referral for pulmonary X-ray according to Dr Svensson's notes. Atelectases. Pneumonia, I110. Heart failure, unspecified, I509.
-
Automatic Clinical Text De-Identification: Is It Worth It, and Could It Work for Me?
Stéphane M. Meystre, Biomedical Informatics, University of Utah, USA
Hercules Dalianis, Computer and Systems Sciences, Stockholm University, Sweden
Pierre Zweigenbaum, ILES, LIMSI-CNRS, France
Medinfo 2013, Copenhagen, August 23, 2013
-
De-identification
Privacy and confidentiality of clinical data
• In the U.S., the HIPAA (Health Insurance Portability and Accountability Act) protects the confidentiality of patient data.
• The Common Rule protects the confidentiality of research subjects.
• These laws typically require the informed consent of the patient and approval of the IRB to use data for research purposes, but these requirements are waived if data are de-identified.
De-identification means that explicit identifiers are hidden or removed.
• Often used interchangeably with anonymization, but the latter implies that the data cannot be linked to identify the patient (i.e., de-identified is often far from anonymous).
• Scrubbing is also sometimes used as a synonym of de-identification.
-
De-identification (cont.)
According to the HIPAA, the Safe Harbor Methodology requires the following PHI to be removed:
1. Names
2. All geo-subdivisions smaller than a State
3. All elements of dates (except year)
4. Phone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social Security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators
15. Internet Protocol address numbers
16. Biometric identifiers, including finger and voice prints
17. Full face photographic images and any comparable images
18. Any other unique identifying number, characteristic, or code
-
De-identification (cont.)
Manual text de-identification is a lengthy and costly process (about 90 s per document).
NLP can be used to automatically de-identify electronic clinical documents.
Several NLP-based applications have been developed for clinical text de-identification, but:
• they are developed for one or a few clinical note types,
• in a specific institution or specialty,
• to detect and remove/hide certain categories of PHI only...
Overall, their generalizability is a problem, but a problem that can be improved.
-
Presenters
Hercules Dalianis, PhD
Professor in Computer and Systems Sciences at Stockholm University, Sweden.
De-identifying Swedish health records
Pierre Zweigenbaum, PhD
Director of Research at the CNRS, in the LIMSI, Orsay, France.
De-identification of French clinical records
Stéphane Meystre, MD, PhD
Assistant Professor in Biomedical Informatics at the University of Utah, USA.
De-identification of clinical documents at the U.S. VHA, and issues related to de-identification (impact, risk of re-identification)
-
Automatic VHA Clinical Text De-Identification
Stéphane M. Meystre, Biomedical Informatics, University of Utah
Medinfo 2013, Copenhagen, August 23, 2013
-
VA clinical data de-identification
VA Center for Healthcare Informatics Research (CHIR) de-identification project:
National project to advance the methodology for automated de-identification of patient data with a systematic approach of evaluating existing de-identification systems, exploring innovative methods and techniques for de-identification, and combining the best-performing ones in a best-of-breed application.
Also includes the evaluation of the level of anonymity of de-identified clinical notes, and the impact of text de-identification on subsequent uses of the clinical notes.
-
Existing data de-identification evaluation
Literature review of related publications:
• Large variety of PHI categories detected
• Large variety of methods used
-
Existing data de-identification evaluation
"Out-of-the-box" evaluation: text de-identification systems
Rule-based systems:
• HMS Scrubber (Beckwith et al., 2006)
• MeDS (Friedlin and McDonald, 2008)
• MIT deid system (Neamatullah et al., 2008)
Machine learning-based systems:
• MITRE Identification Scrubber Toolkit (MIST) (Aberdeen et al., 2010)
• Health Information DE-identification (HIDE) system (Gardner and Xiong, 2009)
Traditional NER system:
• Stanford NER system (Finkel et al., 2005)
-
Existing data de-identification evaluation
"Out-of-the-box" evaluation (cont.)
Training:
• Rule-based systems were run "out-of-the-box"
• Machine learning-based systems were trained with a separate corpus of 225 randomly selected VHA clinical documents, manually annotated for PHI (names)
• The Stanford NER system was run with the trained models available with its distribution
Testing: corpus of 50 randomly selected VHA clinical documents, manually annotated for PHI (names)
-
Existing data de-identification evaluation
"Out-of-the-box" evaluation results:
System         Precision  Recall  F2-measure
HMS Scrubber   0.150      0.675   0.397
MeDS           0.149      0.768   0.419
MIT deid       0.636      0.893   0.826
MIST           0.865      0.319   0.356
HIDE           0.975      0.376   0.429
Stanford NER   0.692      0.723   0.716
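The F2-measure reported in these results weights recall more heavily than precision, which suits de-identification, where a missed identifier is worse than a false alarm. A minimal sketch of the standard F-beta formula follows; the function name `f_beta` is ours, and the two calls simply reuse precision and recall values from the results table.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall as reported in the results table.
print(round(f_beta(0.150, 0.675, beta=2), 3))  # HMS Scrubber
print(round(f_beta(0.636, 0.893, beta=2), 3))  # MIT deid
```

With beta = 2 recall counts four times as much as precision in the denominator, so a system with low precision but high recall can still score reasonably.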
-
Our "best-of-breed" approach (BoB)
1. Pre-processing
2. High-sensitivity extraction component
3. False positives filtering component
Ferrandez O, South BR, Shen S, Friedlin FJ, Samore MH, Meystre SM. BoB, a best-of-breed automated text de-identification system for VHA clinical documents. JAMIA. 2012 Sep 4.
-
Our "best-of-breed" approach
NLP pre-processing:
• Sentence segmentation (adapted from OpenNLP and retrained models with VHA clinical text)
• Tokenization
• Part-of-speech tagging (adapted from OpenNLP and cTAKES trained models)
• Phrase chunking (adapted from OpenNLP and cTAKES trained models)
• LVG normalization (NLM development)
-
Our "best-of-breed" approach
High-sensitivity extraction component:
• Mostly based on rules (context keywords and regex patterns) and dictionary lookups (Lucene with common English words and frequently occurring names from the 1990 U.S. Census).
• Dependent on the quality of the patterns and on dictionary completeness: PHI formats and instances that are not supported will be missed!
• Adds machine learning based on sequence labeling (a traditional NER task): Stanford CoreNLP library (CRF) trained to recognize person names (using our VHA training corpus).
• The goal is to maximize recall, even if precision is altered.
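The rule and dictionary side of such a component can be illustrated with a very small sketch. Everything in it is an assumption made for illustration: the date pattern (modeled on the VHA-style format '09/09/09@1200' mentioned in the evaluation notes), the two tiny word sets standing in for the name and common-word dictionaries, and the function name `candidate_phi`. It is not BoB's actual rule set, and it uses plain Python sets rather than Lucene.

```python
import re

# Hypothetical pattern and lexicons, for illustration only.
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{2}(?:@\d{4})?")
NAME_LEXICON = {"svensson", "johan"}                         # stand-in for a census name list
COMMON_WORDS = {"the", "patient", "seen", "by", "dr", "on"}  # stand-in for common English words

def candidate_phi(text):
    """Return sorted (start, end, category) candidates, favouring recall over precision."""
    candidates = []
    for m in DATE_PATTERN.finditer(text):
        candidates.append((m.start(), m.end(), "Date"))
    for m in re.finditer(r"[A-Za-z]+", text):
        word = m.group().lower()
        # Flag dictionary names, and also any capitalized word that is not a common word.
        if word in NAME_LEXICON or (m.group()[0].isupper() and word not in COMMON_WORDS):
            candidates.append((m.start(), m.end(), "Name"))
    return sorted(candidates)

text = "Patient seen by Dr Svensson on 09/09/09@1200"
for start, end, category in candidate_phi(text):
    print(category, text[start:end])
```

Flagging every unknown capitalized word is deliberately over-inclusive; in a design like the one described here, a later filtering step is expected to discard the resulting false positives.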
-
Our "best-of-breed" approach
False positives filtering component: based on machine learning classifiers
• Classifies candidate annotations as true or false positives
• Support Vector Machine classifier (LIBSVM, RBF kernel) with various features (lexical, morphological, syntactic, and the method used to detect the PHI (name) candidate)
Training the model (example: person names):
• Positive training examples: correct annotations derived from the high-sensitivity extraction component
• Negative training examples: incorrect annotations derived from the high-sensitivity extraction component
-
Evaluation of BoB
Summative evaluation with a reference standard of 800 VHA clinical notes (500 training, 300 testing). R = recall, P = precision.

PHI category                MIT     Rules   CRF     Rules+CRF  BoB full  BoB full
                            R       R       R       R          R         P
Patient Name                0.590   0.972   0.953   0.992      0.980     0.707
Relative Name               0.600   0.960   0.960   0.960      0.920     0.707
Healthcare Provider Name    0.319   0.920   0.898   0.963      0.943     0.707
Other Person Name           0.111   1       0.667   1          0.888     0.707
Street/City                 0.828   0.962   0.872   0.974      0.943     0.679
State/Country               0.689   0.953   0.757   0.973      0.878     0.751
Deployment                  0.057   1       --      1          0.887     0.859
ZIP Code                    1       1       --      1          1         1
Healthcare Units            0.008   0.832   0.755   0.914      0.811     0.836
Other Organizations         0.033   0.824   0.549   0.912      0.725     0.578
Date                        0.399   0.963   0.917   0.977      0.971     0.934
Age > 89                    0.250   1       --      1          1         0.8
Phone Number                0.494   0.989   --      0.989      0.956     1
Electronic Address          1       1       --      1          1         1
SSN                         1       1       --      1          1         0.964
Other ID Number             0.117   0.978   --      0.978      0.917     0.831
Overall macro-averaged      0.468   0.960   --      0.977      0.926     0.841

Overall micro-averaged      MIT     Rules   CRF     Rules+CRF  BoB full
Precision                   0.311   0.362   --      0.346      0.836
Recall                      0.350   0.928   --      0.961      0.922
F1-measure                  0.329   0.521   --      0.509      0.877
F2-measure                  0.341   0.707   --      0.709      0.904

• MIT: some PHI categories have very low recall because of missing rules/patterns or dictionary entries.
• Rules: rules/patterns and dictionary entries specific to VHA clinical notes were required (e.g., a date pattern for formats like ‘09/09/09@1200’), and dictionary fuzzy-matches were also added.
• CRF: CRFs allowed detecting PHI missing in rules/patterns or dictionaries, but added significant noise.
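The evaluation reports both macro-averaged and micro-averaged figures, and the two can differ considerably when PHI categories have very different frequencies. The sketch below shows the difference for recall; the per-category counts are hypothetical and only chosen to make the effect visible.

```python
def macro_recall(per_category):
    """Unweighted mean of per-category recall; every PHI type counts equally."""
    recalls = [tp / (tp + fn) for tp, fn in per_category.values()]
    return sum(recalls) / len(recalls)

def micro_recall(per_category):
    """Recall over pooled counts; frequent PHI types dominate."""
    tp = sum(tp for tp, _ in per_category.values())
    fn = sum(fn for _, fn in per_category.values())
    return tp / (tp + fn)

# Hypothetical (true positives, false negatives) per PHI category.
counts = {"Date": (90, 10), "Patient Name": (45, 5), "Deployment": (1, 3)}
print(round(macro_recall(counts), 3), round(micro_recall(counts), 3))
```

A rare category with poor recall lowers the macro average sharply while barely moving the micro average, which is why both are worth reporting.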
-
Acknowledgments
Oscar Ferrandez Escamez (University of Utah, now Nuance)
Brett South (University of Utah and SLC VA)
Shuying Shen (University of Utah and SLC VA)
Jeffrey Friedlin (Regenstrief Institute)
Matthew Maw (SLC VA)
Matthew Samore (University of Utah and SLC VA)
Funding by VA HSR&D (CHIR; HIR 08-374)
Thank you!
Questions and comments:
-
Quality of De-Identification, and Impact on Clinical Information
Stéphane M. Meystre, Biomedical Informatics, University of Utah
Medinfo 2013, Copenhagen, August 23, 2013
-
Generalizability of de-identification
PHI content varies significantly between clinical corpora.
-
Generalizability of de-identification
De-identification applications tested “out-of-the-box” with our VHA corpus: low performance!
• Rule-based systems reach 32-26% recall and 14-42% precision (fully-contained matches, one overall PHI category)
• Machine learning-based systems reach 28-30% recall and 56-58% precision (trained with the i2b2 deid corpus)
-
Generalizability evaluation
The VHA training and testing corpora
• Variety of clinical notes (stratified random sample)
• Annotated for all HIPAA categories, some VHA-specific categories (deployment locations, units), and eponyms
• 500 documents for training, 300 documents for testing
The 2006 i2b2 de-identification challenge corpus
• Discharge summaries from Partners Healthcare, de-identified and PHI resynthesized with "± realistic" surrogates
• Selection of PHI categories is a subset of HIPAA (Patient, Doctor, Hospital, IDs, Dates, Phone numbers, Ages)
• 669 documents for training, 220 documents for testing
-
Generalizability evaluation (cont.)
Applications training and testing (training corpus, testing corpus):
• VHA, VHA: all / some / no dictionaries*
• i2b2, i2b2: no dictionaries*
• Cross-corpus (VHA and i2b2): no dictionaries*
*Dictionaries used by MIST and HIDE
-
Generalizability evaluation (cont.)
Results (VHA corpus)
                                      MIST*   HIDE**  BoB
Overall micro-avg (PHI level)
  Precision                           0.926   0.933   0.836
  Recall                              0.888   0.863   0.922
  F1-measure                          0.907   0.897   0.877
  F2-measure                          0.895   0.877   0.904
Overall macro-avg (PHI-type level)
  Recall                              0.737   0.729   0.926
* Best MIST configuration, with no dictionaries
** Best HIDE configuration, with selected dictionaries
-
Generalizability evaluation (cont.)
Results (VHA/i2b2 corpora)
Training with our VHA corpus, and testing with the i2b2 corpus (MIST and HIDE with no dictionaries)
                                      MIST    HIDE    BoB
Overall micro-avg (PHI level)
  Precision                           0.705   0.712   0.691
  Recall                              0.749   0.576   0.820
  F1-measure                          0.726   0.637   0.750
  F2-measure                          0.740   0.599   0.790
Overall macro-avg (PHI-type level)
  Recall                              0.610   0.461   0.664
-
Generalizability evaluation (cont.)
i2b2 corpus and combination with VHA evaluation:
• Training and testing with the i2b2 corpora allows for good performance, even though dictionaries are less useful (BoB's CRF-based NER helped here).
• Generalizability remains an issue for all systems when training with one corpus type and testing with another one. Not one system achieved good results (overall macro-averaged recall 46-66%).
• BoB’s design still reaches our goal, with the highest recall among the three systems and similar precision.
-
Impact on Clinical Information
Some clinical information is more likely to be mistakenly considered as PHI.
Eponyms, for example, could easily be considered as person names. In our corpus, they represent various categories of clinical information:
• Procedures and signs (40% of eponyms): Hartmann, Nissen, Roux, Whipple, Apgar, Babinski, etc.
• Diseases (36%): Alzheimer, Addison, Asperger, Basedow, Crohn, Cushing, Graves, Hodgkin, Parkinson, Raynaud, etc.
• Devices (18%): Adson, Foley, Kelly, Swan-Ganz, etc.
• Anatomical structures (6%): Achilles, His, Langerhans, etc.
-
Impact on Clinical Information (cont.)
Overlap of 2010 i2b2 challenge concepts and BoB PHI annotations: 849 concepts overlapped partly with PHI annotations, reaching an average of 1.78% of all concept annotations.
Partial overlap:
i2b2 category  i2b2 annotations  PHI overlap (#)  Eponyms  Overlap (%)
Problem        19667             187              18       0.95
Test           13833             180              41       1.30
Treatment      14185             482              53       3.40
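The overlap percentages can be recomputed directly from the annotation counts given for the three i2b2 categories. The snippet below is a small arithmetic check; the dictionary layout and the helper `percent` are ours.

```python
# Counts from the slide: (i2b2 concept annotations, annotations partly overlapping PHI).
overlap = {
    "Problem": (19667, 187),
    "Test": (13833, 180),
    "Treatment": (14185, 482),
}

def percent(part, whole):
    """Percentage rounded to two decimals."""
    return round(100 * part / whole, 2)

per_category = {name: percent(hit, total) for name, (total, hit) in overlap.items()}
total_annotations = sum(total for total, _ in overlap.values())
total_overlap = sum(hit for _, hit in overlap.values())
overall = percent(total_overlap, total_annotations)
print(per_category, total_overlap, overall)
```

The pooled figure reproduces the 849 overlapping concepts and the 1.78% average quoted on the slide.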
-
Impact on Clinical Information (cont.)
Partial overlap details:
PHI category             Problem  Test   Treatment  No match
Clinical Eponyms         18       41     53         156
Person Names             162      103    383        3074
Street or City           2        1      3          433
State or Country         12       12     18         905
Deployment                                          0
ZIP code                                            0
Healthcare Unit Name              17     53         1289
Other Organization Name           9      15         196
Date                     4        20     1          5436
Age > 89                                            13
Phone Number                                        153
Electronic Address                                  0
SSN                                                 0
Other ID Number          7        18     9          919
Total matches            187      180    482        0
No match                 19466    13626  13675
-
Impact on Clinical Information (cont.)
Partial overlap details (cont.): most overlap happened with Person Names annotations.
PHI            i2b2 category  Overlap (%)
Person Names   Treatment      45.11
Person Names   Problem        19.08
Person Names   Test           12.13
Together: 76.33% overall
Most frequent overlap examples:
• Person Names - Treatment: Colace, Lopressor, Senna, Contin...
• Person Names - Problem: MR, E.Coli, Pseudomonas, Addison...
• Person Names - Test: Apgars, Papanicolaou, SP Stickney, Hct...
-
Impact on Clinical Information (cont.)
• Even an efficient text de-identification system can mistakenly consider clinical information as PHI. This overlap is only 1.78%, even when partial matches are counted.
• Another study by Deleger et al. compared automated medication extraction from clinical text before and after de-identification. They found no significant difference.
• When comparing SNOMED-CT concept annotations by cTAKES before and after de-identification, we found 1.2-3% of concepts lost, depending on de-identification accuracy (partly significant difference). Most concepts “lost” were false positives (e.g., “VA” recognized as “vertebral artery”).
Deleger L, Molnar K, Savova G, Xia F, Lingren T, Li Q, et al. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. JAMIA. 2012 Aug 2.
-
• All methods to assess the risk of re-identification have been applied to a small number of structured and coded data elements (demographics, location), not to narrative text (work by Malin, El Emam, etc.).
• Clinical notes are rich in clinical and social information that can be unique and could be used to re-identify a patient.
• This risk is significant (23% of 2010 i2b2 corpus documents have unique ICD-9-CM or CPT codes), but it is limited by the need for access to other identified data sets with clinical codes.

How to limit this risk?
• Exclude narrative text from de-identified data sets?
• Require controlled access and data use agreements?
• Apply anonymization techniques to non-PHI content?
Risk of re-identification
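As an illustration of how a figure such as "23% of documents have unique codes" could be measured, here is a minimal sketch. It assumes that "unique" means the set of codes attached to a document occurs in no other document; the slides do not define the measure, so this reading, the function name, and the toy data are all hypothetical.

```python
from collections import Counter


def fraction_with_unique_codes(doc_codes):
    """Share of documents whose code set occurs in no other document.

    doc_codes maps a document id to the list of codes assigned to it.
    ASSUMPTION: "unique" is read as "unique combination of codes".
    """
    # frozenset makes each code set hashable and independent of code order
    signatures = [frozenset(codes) for codes in doc_codes.values()]
    counts = Counter(signatures)
    unique = sum(1 for sig in signatures if counts[sig] == 1)
    return unique / len(signatures) if signatures else 0.0


# Hypothetical toy data, not taken from the i2b2 corpus
docs = {
    "doc1": ["428.0", "250.00"],
    "doc2": ["250.00", "428.0"],  # same set as doc1, so neither is unique
    "doc3": ["410.71"],
    "doc4": ["486", "428.0"],
}
print(fraction_with_unique_codes(docs))
```

A document flagged this way is only re-identifiable if an attacker also holds an identified data set carrying the same codes, which is the limitation noted on the slide.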
-
De-identification of French Clinical Texts: The LIMSI Experiments
Cyril Grouin, Pierre Zweigenbaum
LIMSI-CNRS, Orsay, France
MEDINFO 2013 Panel on De-Identification
Copenhagen, 23/8/2013
-
De-Identification of French Clinical Texts
Previous Work
• Ruch et al. (2000)
• Grouin (2002)
• Grouin et al. (2009)
• Proux et al. (2009)
-
LIMSI Experiments in De-Identification
Expert-based methods
• Localization of DE-ID to process French
• MEDINA

Machine-Learning methods
• CRF-based entity detection

Cross-corpus experiments
• Cardiology (discharge reports, hospital 1)
• Fetopathology (multiple report types, OCR'ed, hospital 2)
• Mixed (multiple report types, hospital 3)
-
Expert-based Methods: Localization
De-id (Neamatullah et al., 2008)
• Starting from DE-ID, which de-identifies English clinical texts
  • Lexicons
  • Patterns
• Translated the lexicons
• Started to translate the patterns, but
  • too much dependence on language (word order, etc.)
  • program not written with localization in mind
• Decided to stop and to develop a new system
-
Expert-based Methods
MEDINA (Grouin et al., 2009)
• Lexicons
  • General lexicon: inflected forms, lemma, POS
  • Specific lexicons:
    • towns
    • first names
    • last names
  • Apply through exact match
• Patterns
  • Character properties
  • Trigger words
  • Neighborhood of already (de-)identified entities
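The combination of lexicons and patterns listed above can be sketched as follows. This is not MEDINA's code: the lexicons, trigger words, labels, and the `tag_tokens` function are hypothetical stand-ins chosen to show one rule of each kind (exact lexicon match, character property, trigger word, neighborhood of an already identified entity).

```python
import re

# Hypothetical mini-lexicons; a real system would load much larger lists
FIRST_NAMES = {"pierre", "marie"}
LAST_NAMES = {"martin", "durand"}
TOWNS = {"orsay", "paris"}
TRIGGERS = {"dr", "mme", "pr"}  # words that often precede a person name


def tag_tokens(text):
    """Return (token, label) pairs; label is None when no rule fires."""
    tokens = text.split()
    labels = []
    for i, tok in enumerate(tokens):
        core = tok.lower().strip(".,;:")  # ignore attached punctuation
        prev = tokens[i - 1].lower().strip(".,;:") if i > 0 else ""
        label = None
        if core in FIRST_NAMES:                 # lexicon, exact match
            label = "FIRST_NAME"
        elif core in LAST_NAMES:
            label = "LAST_NAME"
        elif core in TOWNS:
            label = "TOWN"
        elif re.fullmatch(r"\d{5}", core):      # character property
            label = "ZIP"
        elif prev in TRIGGERS and tok[:1].isupper():   # trigger word
            label = "LAST_NAME"
        elif i > 0 and labels[i - 1] == "FIRST_NAME" and tok[:1].isupper():
            label = "LAST_NAME"                 # neighborhood of a found entity
        labels.append(label)
    return list(zip(tokens, labels))


for token, label in tag_tokens("Dr. Lefebvre a vu Marie Dupont à Orsay 91400."):
    print(token, label)
```

Note how "Lefebvre" and "Dupont" are absent from the lexicons and are only caught by the trigger-word and neighborhood rules, which is why patterns complement exact lexicon matching.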
-
Machine-learning Methods
Conditional Random Fields (see Grouin, MEDINFO 2013)
• Linear-chain CRF
  • Wapiti (Lavergne et al., 2010)
  • http://wapiti.limsi.fr/
• Features:
  • surface features: token, capitalization, digit, punctuation, length
  • morpho-syntactic: POS via TreeTagger
  • semantic types: lexicon, CUI via UMLS
  • distributional analysis: clustering via Brown et al.'s (1992) algorithm
• Automatic feature selection: L1 regularization
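To make the surface and lexicon features concrete, here is a minimal sketch of a per-token feature extractor. It is an illustration only: the feature names and the `token_features` function are hypothetical, it does not use Wapiti's pattern syntax, and the POS, UMLS CUI, and Brown-cluster features are omitted because they depend on external tools and resources.

```python
import string


def token_features(tokens, i, lexicon=frozenset()):
    """Build a feature dictionary for the token at position i.

    ASSUMPTION: feature names are illustrative; a CRF toolkit would
    receive them in its own input format.
    """
    tok = tokens[i]
    return {
        # surface features
        "token": tok.lower(),
        "capitalized": tok[:1].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "is_punct": bool(tok) and all(c in string.punctuation for c in tok),
        "length": len(tok),
        # semantic feature: membership in a name or location lexicon
        "in_lexicon": tok.lower() in lexicon,
        # context features: neighboring tokens, with boundary markers
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


tokens = ["Dr", ".", "Smith", "seen", "12/03"]
print(token_features(tokens, 2, lexicon={"smith"}))
```

With L1 regularization, features that carry little weight for the training data are driven to zero, so a generous feature set like this one can be pruned automatically during training.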
-
Evaluation: Cardiology and Fetopathology

Cardiology Corpus

             P      R      F      Confidence
Rule-based   0.855  0.830  0.843  [0.821, 0.864]
CRF          0.909  0.858  0.883  [0.864, 0.901]

Fetopathology Corpus (OCR'ed, no adaptation)

             P      R      F      Confidence
Rule-based   0.678  0.684  0.681  [0.633, 0.729]
CRF          0.732  0.565  0.638  [0.585, 0.692]
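The F column is consistent with the balanced F-measure (harmonic mean of precision and recall), although the slides do not name the formula, so that is an assumption. A minimal sketch for recomputing it from the reported P and R:

```python
def f_measure(precision, recall, beta=1.0):
    """F-measure; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing was found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# CRF on the cardiology corpus: P = 0.909, R = 0.858
print(round(f_measure(0.909, 0.858), 3))
```

Because the slide values are already rounded, recomputing from them can differ from the reported F in the last digit (the rule-based cardiology row gives 0.842 this way, against 0.843 on the slide).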
-
Cardiology Corpus (details)
                   Rule-based             CRF
                   P      R      F        P      R      F
Dates (238)        0.920  0.874  0.897    0.987  0.946  0.966
Last names (205)   0.903  0.907  0.905    0.892  0.883  0.887
First names (109)  0.777  0.927  0.845    0.822  0.890  0.855
Hospital (43)      0.500  0.372  0.427    0.931  0.628  0.750
Town (22)          0.688  0.500  0.579    0.632  0.545  0.585
Zip codes (8)      1.000  1.000  1.000    1.000  0.750  0.857
Phone (8)          1.000  1.000  1.000    0.857  0.750  0.800
-
Cardiology vs New, Varied Corpus
                  P      R      F
MEDINA-Rules
  Detection       0.862  0.825  0.846
  Typing          0.846  0.804  0.824
CRF-other
  Detection       0.929  0.798  0.858
  Typing          0.529  0.428  0.473
CRF-test 10×cv
  Detection       0.991  0.934  0.962
  Typing          0.959  0.876  0.916
-
Limitations
• Size of annotated corpora
  • More precisely, number of training examples
• Should handle "boilerplate" material differently
  • Address in header
  • Signature in footer
• Lexicons are always incomplete
  • Lexicon features may however receive high confidence
    • which may prevent classifier from learning features with better generalization power
-
Types of features
Generalization power

• Current token is a clue: learn specific names, locations, etc.
  Example: Smith
• Current token is in a lexical class: lexicons of names, locations, etc.
  Example: Michael|Paul|Laura|...
• Context of current token is a clue
  Examples: Dr. xxxxx; xxxxx , Ph.D.; xxxxx has undergone
• Current token belongs to a class
  Examples: xxxxx (Capitalized); xxxxx (NNP); xxxxx (drug, see also lexicon)
• Context of current token belongs to a class
  Example: xxxxx (NNP) xxxxx
-
De-Identification and Loss of Information
• A recurring comment / question during presentations
  • Does de-identification remove information?
• Removing identifying pieces of information
  • Pseudonymization
  • Date shifting
• Different goals for de-identification
  • Perform Natural Language Processing research
  • Publish case report
  • ...
• Inside hospital information system
  • Extracted information should be handled as other structured information
  • Apply standard procedures for structured data
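Date shifting is one way to remove identifying dates while keeping the intervals between events, which limits the loss of information. The sketch below is an illustration under stated assumptions, not a method described in the slides: it assumes dates written as day/month/year, a single constant offset per patient, and an offset derived from a secret key; the function names and parameters are hypothetical.

```python
import hashlib
import re
from datetime import datetime, timedelta


def patient_offset(patient_id, secret, max_days=365):
    """Derive a stable per-patient shift (1..max_days days) from a keyed hash."""
    digest = hashlib.sha256(f"{secret}:{patient_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % max_days + 1


def shift_dates(text, patient_id, secret):
    """Shift every dd/mm/yyyy date in text by the patient's offset."""
    delta = timedelta(days=patient_offset(patient_id, secret))

    def repl(match):
        try:
            date = datetime.strptime(match.group(0), "%d/%m/%Y")
        except ValueError:
            return match.group(0)  # not a valid calendar date: leave as-is
        return (date + delta).strftime("%d/%m/%Y")

    return re.sub(r"\b\d{1,2}/\d{1,2}/\d{4}\b", repl, text)


note = "Admitted 01/03/2012, discharged 08/03/2012."
print(shift_dates(note, "patient-42", "example-secret"))
```

Because the same offset is applied to all dates of a patient, the length of stay in the example (7 days) is preserved after shifting, while the absolute dates no longer match the record.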
-
Thank you