An NLP Ecosystem for Development and Use of Natural Language Processing in the Clinical Domain Wendy...

Post on 17-Dec-2015


An NLP Ecosystem for Development and Use of Natural Language Processing in the Clinical Domain

Wendy W. Chapman, PhD

Division of Biomedical Informatics
University of California, San Diego

Integrating Data for Analysis, Anonymization, and Sharing

Overview

• The promise of natural language processing (NLP)

• Challenges of developing NLP in the clinical domain

• Challenges in applying NLP in the clinical domain

• iDASH

• Opportunities for sharing and collaboration in NLP

NLP Success

“Fresh off its butt-kicking performance on Jeopardy!, IBM’s supercomputer "Watson" has enrolled in medical school at Columbia University,” New York Daily News, February 18, 2011

“IBM's computer could very well herald a whole new era in medicine.” ComputerWorld, February 17, 2011

Dr. Watson??

Clinical NLP Since the 1960s

Why has clinical NLP had little impact on clinical care?

Barriers to Development

• Sharing clinical data is difficult
  – Have not had shared datasets for development and evaluation
  – Modules trained on general English not sufficient
• Insufficient common conventions and standards for annotations
  – Data sets are unique to a lab
  – Not easily interchangeable
• Limited collaboration
  – Clinical NLP applications are silos and black boxes
  – Have not had open source applications
• Reproducibility is formidable
  – Open source release not always sufficient
  – Software engineering quality not always great
  – Mechanisms for reproducing results are sparse

Overview

• The promise of natural language processing (NLP)

• Challenges of developing NLP in the clinical domain

• Challenges in applying NLP in the clinical domain

• Developing an NLP ecosystem on iDASH

Security & Privacy Concerns

• Clinical texts have many patient identifiers
  – 18 HIPAA identifiers
    • Names
    • Addresses
• Items not regulated by HIPAA
  – “Tight end for the Steelers”
• Unique cases
  – Woman in her 50s who is pregnant
• Sensitive information
  – HIV status

Institutions are reluctant to share data

Lack of user-centered development and scalability
  – Perceived cost of applying NLP outweighs the perceived benefit (Len D’Avolio)


iDASH

• integrating Data
• Analysis
• Anonymization
• Sharing

Data

Computational Resources

Software/Tools

Disincentives to Share

• ‘Scooping’ by faster analysts
• Exposure of potential errors in data
• Resources for preparing data submissions
• Maintaining data
• Interacting with potential users takes time
• Threat of privacy breach when human subjects are involved
  – Do not have policies in place
  – Fallible de-identification, anonymization algorithms

iDASH aims to minimize these disincentives

nlp-ecosystem.ucsd.edu

Privacy preserving
• Access control
• De-identification
• Query counts
• Artificial data generators

Digital Informed Consent

HIPAA &/or FISMA Compliant Cloud

Customizable DUAs

Informed Consent Registry

2011 summer internship program funded by NIH U54HL108460

NLP Ecosystem

[Diagram: NLP Ecosystem components. Labels: Data; MT Samples; Tools & Services; Collaborative Development Tools; Virtual Machines; Evaluation Workbench; Education; Bibliography; Tutorials; Research Resources; Guidelines; Schemas; De-Identification; UCSD Clinical Data; TextVect; Annotation Admin & eHOST; Registry]

[Diagram: Tools & Services. Components: Collaborative Knowledge Authoring; Virtual Machines; Evaluation Workbench; De-Identification; TextVect; Annotation Environment. Goals: Increase access to NLP; Decrease burden of developing NLP]

Collaborative Effort to Build Ecosystem

Registry

orbit

Increase ability to find NLP tools

Registry: orbit.nlm.nih.gov

Len D’Avolio, Dina Demner-Fushman

De-identification service

Increase access to clinical text

De-identification

• Several available de-identification modules
• Need to adapt to local text
  – Efficient
  – Secure
• Customizable ensemble de-identification system
  – Build a de-identified corpus
  – Incorporate existing de-id modules
  – Launch as virtual machine
  – Iterative training, evaluation, and modification by user
    • Correct mistakes
    • Add regular expressions
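
A minimal Python sketch of the ensemble idea on this slide: several independent detectors each propose identifier spans, the spans are merged, and a user can add a regular expression for local text. The detector patterns, labels, and function names here are illustrative assumptions, not the actual iDASH de-identification modules.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)
Detector = Callable[[str], List[Span]]


def regex_detector(pattern: str, label: str) -> Detector:
    """Wrap a regular expression as a detector that returns labeled spans."""
    compiled = re.compile(pattern)

    def detect(text: str) -> List[Span]:
        return [(m.start(), m.end(), label) for m in compiled.finditer(text)]

    return detect


def ensemble_deidentify(text: str, detectors: List[Detector]) -> str:
    """Union the spans proposed by every detector and mask them in the text."""
    spans: List[Span] = []
    for detect in detectors:
        spans.extend(detect(text))

    # Merge overlapping spans; the earliest span's label wins.
    spans.sort(key=lambda s: (s[0], -s[1]))
    merged: List[Span] = []
    for start, end, label in spans:
        if merged and start < merged[-1][1]:
            prev_start, prev_end, prev_label = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_label)
        else:
            merged.append((start, end, label))

    # Replace from right to left so earlier offsets stay valid.
    for start, end, label in reversed(merged):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


# Hypothetical stand-ins for existing de-id modules.
DETECTORS = [
    regex_detector(r"\b\d{3}-\d{2}-\d{4}\b", "SSN"),
    regex_detector(r"\b\d{1,2}/\d{1,2}/\d{4}\b", "DATE"),
    regex_detector(r"\bMRN:?\s*\d+\b", "MRN"),
]

note = "Seen on 03/14/2011, MRN: 483920. SSN 123-45-6789."
print(ensemble_deidentify(note, DETECTORS))
# prints: Seen on [DATE], [MRN]. SSN [SSN].
```

A user adapting the system to local text would append a detector, for example `DETECTORS + [regex_detector(r"\bSteelers\b", "TEAM")]`, and re-run the evaluation.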

Brett South, Stephane Meystre, Oscar Fernandez, Danielle Mowery

TextVect

Increase access to textual features

TextVect

NLM: Abhishek Kumar

Collaborative Knowledge Authoring Support Service (cKASS)

Decrease the Burden of Customizing an NLP Application

Customizing an IE App

User’s Concepts: Cough, Dyspnea, Infiltrate on CXR, Wheezing, Fever, Cervical Lymphadenopathy

Map

IE Output

Customizing an IE App

User’s Concepts: Cough, Dyspnea, Infiltrate on CXR, Wheezing, Fever, Cervical Lymphadenopathy

IE Output: Dry cough, Productive cough, Cough, Hacking cough, Bloody cough

Which concepts?

Customizing an IE App

User’s Concepts: Cough, Dyspnea, Infiltrate on CXR, Wheezing, Fever, Cervical Lymphadenopathy

IE Output: Temp 38.0C, Low-grade temperature

What is a fever?

Customizing an IE App

User’s Concepts: Cough, Dyspnea, Infiltrate on CXR, Wheezing, Fever, Cervical Lymphadenopathy

IE Output
NECK: no adenopathy
Disorder: adenopathy
Negation: negated

Section mapping
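
The three customization questions above (which concepts, what counts as a fever, and how section context changes a concept) can be sketched as a small mapping layer between generic IE output and the user's concepts. This is a hedged illustration only: the synonym table, the 38.0 °C threshold, the section rule, and all names are assumptions a user would set for their own task, not part of any existing IE application.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class IEAnnotation:
    """One concept mention as a generic IE application might emit it."""
    text: str
    negated: bool = False
    section: Optional[str] = None


# Which concepts? The user decides which IE outputs count as "Cough".
# "Bloody cough" is deliberately left out here as one possible choice.
SYNONYMS = {
    "cough": "Cough",
    "dry cough": "Cough",
    "productive cough": "Cough",
    "hacking cough": "Cough",
}

# What is a fever? An assumed, user-chosen threshold in degrees Celsius.
FEVER_THRESHOLD_C = 38.0
TEMP_PATTERN = re.compile(r"temp\w*\s+(\d+(?:\.\d+)?)\s*c\b", re.IGNORECASE)

# Section mapping: adenopathy found in the NECK section is cervical.
SECTION_RULES = {("NECK", "adenopathy"): "Cervical Lymphadenopathy"}


def map_to_user_concept(ann: IEAnnotation) -> Optional[Tuple[str, bool]]:
    """Return (user concept, is_present), or None if nothing maps."""
    key = ann.text.strip().lower()
    present = not ann.negated
    if ann.section is not None and (ann.section, key) in SECTION_RULES:
        return SECTION_RULES[(ann.section, key)], present
    if key in SYNONYMS:
        return SYNONYMS[key], present
    match = TEMP_PATTERN.search(ann.text)
    if match:
        is_fever = float(match.group(1)) >= FEVER_THRESHOLD_C
        return "Fever", present and is_fever
    return None


print(map_to_user_concept(IEAnnotation("Dry cough")))
print(map_to_user_concept(IEAnnotation("Temp 38.0C")))
print(map_to_user_concept(IEAnnotation("adenopathy", negated=True, section="NECK")))
```

Unmapped outputs such as "Low-grade temperature" return `None`, which is where a user would have to make an explicit decision.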

KOS-IE: Knowledge Organization Systems for Information Extraction

Compile information helpful for IE

User KB

NLP Tools

Users: Physician, Radiologist, Nurse, Clinical Researcher, Knowledge Engineer

Decision Support System

Shared KB

External KB

Collaborative Knowledge Base Development: cKASS

LQ Wang, M Conway, F Fana, M Tharp, D Hillert

Knowledge Authoring

Augment user KB with lexical variants, synonyms, and related concepts

• User-driven authoring
  – Top-down: Provide access to external knowledge sources
    • UMLS, Specialist Lexicon, Bioportal
  – Bottom-up: Annotate to derive synonyms
• Recommendation-based authoring
  – Generate lexical variants
  – Mine external knowledge sources
  – Mine patient records
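
As a rough illustration of the "generate lexical variants" step, the sketch below expands each term in a user knowledge base with a few rule-based variants (case folding, hyphen/space alternation, a naive plural). The rules and function names are assumptions chosen for brevity; a real service would more likely draw on resources such as the Specialist Lexicon and present the candidates to the user as recommendations.

```python
from typing import Dict, List, Set


def lexical_variants(term: str) -> Set[str]:
    """Generate simple candidate variants of a term for the user to review."""
    base = " ".join(term.lower().split())  # fold case, normalize whitespace
    variants = {base, base.replace("-", " "), base.replace(" ", "-")}
    # Naive pluralization: only appends "s"; irregular plurals are not handled.
    for variant in list(variants):
        if not variant.endswith("s"):
            variants.add(variant + "s")
    return variants


def augment_user_kb(user_kb: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return a copy of the KB in which every concept carries its variants."""
    augmented = {}
    for concept, terms in user_kb.items():
        expanded: Set[str] = set()
        for term in terms:
            expanded |= lexical_variants(term)
        augmented[concept] = sorted(expanded)
    return augmented


user_kb = {"Infiltrate on CXR": ["Chest X-ray infiltrate"], "Cough": ["Dry cough"]}
for concept, terms in augment_user_kb(user_kb).items():
    print(concept, "->", terms)
```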

Evaluation workbench

Decrease the Burden of Evaluation & Error Analysis

Evaluation Workbench

• Compare the output of two NLP annotators on clinical text
  – NLP system vs human annotation
• View annotations
• Calculate outcome measures
• Drill down to all levels of annotation
  – Document-level
• Perform error analysis
  – Future versions will support formal error analysis
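
To make "calculate outcome measures" concrete, here is a minimal sketch that compares a system's annotations against a human reference using exact span-and-label matching, and keeps the mismatches for error analysis. Precision, recall, and F1 are common choices assumed here; the slides do not say which measures the Evaluation Workbench reports, and the data below is made up.

```python
from typing import Dict, NamedTuple, Set


class Annotation(NamedTuple):
    doc_id: str
    start: int
    end: int
    label: str


def compare_annotators(reference: Set[Annotation],
                       candidate: Set[Annotation]) -> Dict[str, object]:
    """Score candidate annotations against a reference (exact match only)."""
    true_pos = reference & candidate
    false_pos = candidate - reference
    false_neg = reference - candidate

    predicted = len(true_pos) + len(false_pos)
    actual = len(true_pos) + len(false_neg)
    precision = len(true_pos) / predicted if predicted else 0.0
    recall = len(true_pos) / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Kept for drill-down and error analysis.
        "false_positives": sorted(false_pos),
        "false_negatives": sorted(false_neg),
    }


human = {
    Annotation("r1", 0, 5, "Cough"),
    Annotation("r1", 10, 15, "Fever"),
    Annotation("r2", 3, 9, "Dyspnea"),
}
system = {
    Annotation("r1", 0, 5, "Cough"),
    Annotation("r2", 3, 9, "Dyspnea"),
    Annotation("r2", 20, 28, "Wheezing"),
}
result = compare_annotators(human, system)
print(result["precision"], result["recall"], result["f1"])
```

Exact matching is the strictest option; a partial-overlap criterion would be a reasonable alternative for span annotations.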

Levels of Annotation

• Document – Report classified as Shigellosis

• Group – Section classified as Past Medical History Section

• Utterance – Group of text classified as Sentence

• Snippet – “chest pain” classified as CUI 058273

• Word – “pain” classified as noun

• Token – “.” classified as EOS marker
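
One way to picture these levels is as a single annotation record with a level field, so that "drilling down" means selecting finer-grained annotations inside a coarser one's character span. The class names, the example text, and the offsets below are illustrative assumptions rather than the workbench's actual data model.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Level(IntEnum):
    """Annotation levels, ordered from coarsest to finest."""
    DOCUMENT = 0
    GROUP = 1
    UTTERANCE = 2
    SNIPPET = 3
    WORD = 4
    TOKEN = 5


@dataclass(frozen=True)
class Annotation:
    level: Level
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    classification: str


def drill_down(annotations: List[Annotation], parent: Annotation) -> List[Annotation]:
    """Return finer-level annotations fully contained in the parent's span."""
    return [
        a for a in annotations
        if a.level > parent.level and parent.start <= a.start and a.end <= parent.end
    ]


text = "PMH: chest pain."
annotations = [
    Annotation(Level.DOCUMENT, 0, 16, "Shigellosis"),
    Annotation(Level.GROUP, 0, 16, "Past Medical History Section"),
    Annotation(Level.UTTERANCE, 0, 16, "Sentence"),
    Annotation(Level.SNIPPET, 5, 15, "CUI 058273"),
    Annotation(Level.WORD, 11, 15, "noun"),
    Annotation(Level.TOKEN, 15, 16, "EOS marker"),
]

snippet = annotations[3]
for child in drill_down(annotations, snippet):
    print(child.level.name, repr(text[child.start:child.end]), child.classification)
```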


[Screenshot: Evaluation Workbench. Panels: Document & annotations; Outcome Measures for Selected Annotations; Select Classifications to View; Report List; Attributes for Selected Annotation; Relationships for Selected Annotation]

VA and ONC SHARP: Christensen, Murphy, Frabetti, Rodriguez, Savova

Annotation Environment

Decrease the Burden of Annotation

Challenges to Annotating

• Time consuming
  – Recruiting & training annotators for high agreement
• Expensive
  – Domain experts especially expensive
  – Need for annotation by multiple people
• Challenging to design annotation task
  – How many annotators?
  – How should I quantify quality of annotations?
• Logistically challenging
  – Managing files and batches of reports
  – Setting up annotation tool
• Reinventing the wheel
  – Hasn’t someone created a schema for this before?
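
For the question of how to quantify annotation quality, one widely used option (an assumption here, since the slides do not name a measure) is chance-corrected agreement between two annotators, Cohen's kappa: (observed agreement − expected agreement) / (1 − expected agreement). A minimal sketch over made-up labels:

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels for ten reports: P = concept present, A = absent.
annotator_1 = ["P", "P", "P", "P", "P", "A", "A", "A", "A", "A"]
annotator_2 = ["P", "P", "P", "P", "A", "A", "A", "A", "A", "P"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Here the annotators agree on 8 of 10 items and chance agreement is 0.5, so kappa comes out at about 0.6.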

How can we reduce the burden of annotation?

iDASH Annotation Environment

Annotation Admin: web application on the iDASH cloud

eHOST: client app on your computer

VA, SHARP, and NIGMS : S Duvall, B South, G Savova, N Elhadad, H Hochheiser

Goal: provide an environment to decrease the burden of annotation for research and application

Annotator Registry

Annotator Registry

• Enlist for annotation
• Certify for annotation tasks
  – Personal health information
  – Part-of-speech tagging
  – UMLS mapping
• Set pay rate
• Searchable
• Available for inclusion in new annotation task

http://idash.ucsd.edu/nlp-annotator-registry

Annotation Admin: Intended Users & Uses

Users
• NLP researchers
• Annotation administrators

Uses
• Manage annotation projects – who annotates what
  – Currently done with hundreds of files on hard drive
• Integrate with annotation tool (eHOST)
  – Download batches of raw reports to annotators
  – Upload and store annotated reports
• Manage simple annotation projects
• Facilitate distributed annotation
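
The "who annotates what" bookkeeping can be pictured as splitting reports into batches and rotating each batch across annotators so that every report is annotated by more than one person. This is a sketch under assumed names and a simple round-robin policy; it is not how Annotation Admin is actually implemented.

```python
from typing import Dict, List


def assign_batches(report_ids: List[str],
                   annotators: List[str],
                   batch_size: int,
                   annotators_per_report: int = 2) -> Dict[str, List[List[str]]]:
    """Split reports into batches and give each batch to several annotators."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not 1 <= annotators_per_report <= len(annotators):
        raise ValueError("annotators_per_report must fit the annotator pool")

    batches = [report_ids[i:i + batch_size]
               for i in range(0, len(report_ids), batch_size)]

    assignments: Dict[str, List[List[str]]] = {name: [] for name in annotators}
    for index, batch in enumerate(batches):
        # Rotate the starting annotator so the workload stays balanced.
        for offset in range(annotators_per_report):
            name = annotators[(index + offset) % len(annotators)]
            assignments[name].append(batch)
    return assignments


reports = ["r1", "r2", "r3", "r4", "r5", "r6"]
plan = assign_batches(reports, ["ann", "bo", "cy"], batch_size=2)
for annotator, batches in plan.items():
    print(annotator, batches)
```

Double annotation of every report is what makes agreement statistics possible later, at the cost of doubling the annotation effort.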

1. Assign annotators to a task

Annotation Admin

2. Create a Schema

Comment (Brett South): eHOST does this too, so there is some redundancy.

3. Assign users and set time expectations

4. Keep track of progress

[Diagram: Tools & Services. Components: Collaborative Knowledge Authoring; Virtual Machines; Evaluation Workbench; De-Identification; TextVect; Annotation Environment; Registry. Goals: Increase access to NLP; Decrease burden of developing NLP]

Collaborative Effort to Build Resources

Conclusion

• More demand for EHR data
  – NLP has potential to extend value of narrative clinical reports
• There have been many barriers
  – To development
  – To deployment
• Recent developments facilitate collaboration & sharing
  – Common annotation conventions
  – Privacy algorithms
  – Shared datasets
  – Hosted environments
• iDASH hopes to facilitate
  – Development of NLP
  – Application of NLP

Questions | Discussion

Division of Biomedical Informatics
University of California, San Diego

Integrating Data for Analysis, Anonymization, and Sharing

wwchapman@ucsd.edu

iDASH/ShARe Workshop on Annotation
September 29, 2012
La Jolla, CA