De-identifying Pathology Reports for Pathology Informatics
James Gardner, Li Xiong
Department of Math and Computer Science
Fusheng Wang, Andrew Post, Joel Saltz
Center for Comprehensive Informatics
Introduction
• The HIPAA Privacy Rule regulates the use and disclosure of Protected Health Information (PHI)
• De-identification of pathology reports is critical for enabling secondary use of medical records in research
• HIDE (Health Information DE-identification) is an open-source de-identification tool based on advanced statistical de-identification techniques
HIPAA Identifiers
1. Names;
2. All geographical subdivisions smaller than a state;
3. All elements of dates (except year);
4. Phone numbers;
5. Fax numbers;
6. Electronic mail addresses;
7. Social Security numbers;
8. Medical record numbers;
9. Health plan beneficiary numbers;
10. Account numbers;
11. Certificate/license numbers;
12. Vehicle identifiers and serial numbers;
13. Device identifiers and serial numbers;
14. Web Universal Resource Locators (URLs);
15. Internet Protocol (IP) address numbers;
16. Biometric identifiers, including finger and voice prints;
17. Full face photographic images or comparable images; and
18. Any other unique identifying number, characteristic, or code
• These identifiers have to be removed, or
• Based on the opinion of a qualified statistical expert, the risk of identifying an individual is very small
HIDE Overview
• Utilizes the state-of-the-art named entity recognition technique, Conditional Random Fields, for extracting PHI
− Previous tools such as DE-ID and HMS scrubber use rule-based approaches, which are labor-intensive and not portable
• Provides flexible de-identification options including full de-identification and state-of-the-art statistical de-identification
− Previous tools allow simple removal or substitution of the PHI
• Provides an easy-to-use web-based interface built on current web technologies
• Integrated with caTIES and caTissue (in progress)
PHI Extraction
• Utilizes a state-of-the-art NLP technique, Conditional Random Fields
− High accuracy, easy to train, portable
• Combines different feature sets and sampling techniques
− Feature sets: dictionary, affix, regular expression and context
• Can use default models or custom trained models
− Web interface for annotating and training custom models
− A set of reports is loaded and manually labeled
− The labeled documents are used to generate a trained model for automatically de-identifying new reports
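To make the four feature sets concrete, here is a minimal Python sketch of per-token feature extraction of the kind a CRF tagger could consume. It is not HIDE's actual feature code: the dictionary contents, the regular expressions, the feature names, and the function names `token_features` and `sentence_features` are all assumptions chosen for illustration.

```python
import re

# Hypothetical dictionary of known first names; a real system would load
# much larger lexicons (names, cities, hospitals, and so on).
NAME_DICTIONARY = {"james", "mary", "john"}

# Hypothetical regular-expression features for a few PHI-like patterns.
REGEX_FEATURES = {
    "looks_like_date": re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),
    "looks_like_phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "all_digits": re.compile(r"^\d+$"),
}


def token_features(tokens, i, affix_len=3):
    """Build a feature dict for tokens[i] from the four feature sets."""
    token = tokens[i]
    lower = token.lower()
    features = {
        # Dictionary feature: is the token a known name?
        "in_name_dictionary": lower in NAME_DICTIONARY,
        # Affix features: leading and trailing characters.
        "prefix": lower[:affix_len],
        "suffix": lower[-affix_len:],
        # Context features: neighbouring tokens.
        "prev_token": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_token": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }
    # Regular-expression features: one boolean per pattern.
    for name, pattern in REGEX_FEATURES.items():
        features[name] = bool(pattern.match(token))
    return features


def sentence_features(tokens):
    """Feature dicts for every token in a sentence, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

For the tokens `["Patient", "James", "seen", "on", "03/14/2009"]`, the second token gets the dictionary feature set and the last token matches the date pattern. A sequence of such feature dicts, paired with manually assigned labels, is the usual training input for a CRF library.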
HIDE: De-identification Options
• Full de-identification
− Safe harbor: all 18 HIPAA identifiers removed or substituted
• Partial de-identification
− Limited data set: all direct HIPAA identifiers removed or substituted (except dates and address elements other than street/P.O. box)
• Configurable de-identification
− A configurable set of identifiers removed or substituted
• Statistical de-identification
− Advanced anonymization that guarantees rigorous, statistically acceptable privacy while preserving the utility of the data
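The difference between the full, partial, and configurable options can be sketched as a choice of which PHI types to substitute once the PHI spans have been extracted. The Python below is an illustrative sketch only, not HIDE's implementation: the PHI type names, the two preset sets, and the `deidentify` function are assumptions, and the "limited" preset is simplified to "keep dates".

```python
# Hypothetical PHI type labels; a real tagger would define its own label set.
SAFE_HARBOR_TYPES = {"NAME", "DATE", "MRN", "PHONE", "ADDRESS"}
# Simplified limited-data-set preset: same as above, but dates are kept.
LIMITED_TYPES = SAFE_HARBOR_TYPES - {"DATE"}


def deidentify(text, spans, remove_types):
    """Replace each span whose type is in remove_types with a placeholder.

    spans is a list of non-overlapping (start, end, phi_type) tuples
    using Python slice offsets into text.
    """
    pieces = []
    cursor = 0
    for start, end, phi_type in sorted(spans):
        pieces.append(text[cursor:start])       # text before the span
        if phi_type in remove_types:
            pieces.append(f"[{phi_type}]")      # substitute the PHI
        else:
            pieces.append(text[start:end])      # keep the original text
        cursor = end
    pieces.append(text[cursor:])                # trailing text
    return "".join(pieces)
```

Passing `SAFE_HARBOR_TYPES` substitutes every tagged span, passing `LIMITED_TYPES` leaves dates in place, and passing any other set corresponds to the configurable option.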
Statistical De-identification Example
De-identification satisfying k-anonymity (k=2): every record is indistinguishable from the other records in a group of size greater than or equal to k
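As a rough illustration of the k-anonymity condition, the Python sketch below checks whether a set of records is k-anonymous over chosen quasi-identifier fields and shows how generalizing a value (here, age into a range) can make the condition hold. The field names, the sample records, and both function names are assumptions for illustration; HIDE's own anonymization algorithm is not shown here.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs >= k times."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


def generalize_age(record, width=10):
    """Return a copy of record with the exact age replaced by a range."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical records: exact ages make every record unique.
records = [
    {"age": 34, "diagnosis": "carcinoma"},
    {"age": 37, "diagnosis": "adenoma"},
    {"age": 52, "diagnosis": "carcinoma"},
    {"age": 58, "diagnosis": "melanoma"},
]
generalized = [generalize_age(r) for r in records]
```

With exact ages, `is_k_anonymous(records, ["age"], 2)` is false because each age appears once. After generalization the ages fall into the ranges `30-39` and `50-59`, each shared by two records, so the check passes for k=2.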
Study 1: PHI Extraction on Emory Pathology Reports
(100 reports, 10-fold cross validation)
• Precision: true positives over the sum of true positives and false positives
• Recall (sensitivity): true positives over total actual positives
• F1: combination: 2*precision*recall/(precision + recall)
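The three metrics defined above can be computed directly from counts of true positives, false positives, and false negatives. The following Python sketch does so; the function name and the example counts are illustrative assumptions and are not results from the study.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from raw counts.

    Each ratio is defined as 0.0 when its denominator is zero.
    """
    predicted = true_pos + false_pos      # everything the tagger marked as PHI
    actual = true_pos + false_neg         # everything that really is PHI
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with 90 true positives, 10 false positives, and 30 false negatives (made-up counts), precision is 0.9, recall is 0.75, and F1 is roughly 0.818.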
Study 2: PHI Extraction on i2b2 Reports
• Based on 669 discharge summaries, 10-fold cross validation
• Good precision and recall for most individual PHI identifiers
• Good overall precision and recall for PHI extraction
Study 3: Impact of Different Feature Sets
Dictionary (d), affix (a), regular expression (r) and context (c) features are listed in order of increasing importance for statistical CRF-based PHI extraction
Integrating HIDE with caTIES
• caTIES (cancer Text Information Extraction System) provides tools for de-identification and automated coding of free-text pathology reports
• caTIES supports de-id extensibility: a new de-identifier is added by implementing its CaTIES_DeIdentifier interface
• HIDE implements this interface as HIDEDeIdentifier, which calls the HIDE client API
• Added a HIDE de-id option to the caTIES installer
• HIDE has been bundled with caTIES since release v3.7 (May 2010)
Integrating HIDE with caTissue (in Progress)
• caTissue uses caTIES v2.x, refactored into caTissue’s workflow
• HIDE integration with caTissue is similar to the integration with caTIES
• Implementation and evaluation are ongoing
• Goal: Integration of pathology reports into caTissue installation at Winship Cancer Institute at Emory University
Ongoing Development
• Continue development on HIDE/caTissue integration
• Usability improvement: simplified installation process
• System improvements
− Efficiency and scalability of the system
− Support for multiple file formats
− Additional statistical de-identification options