Session 35: Text Analytics: You Need More than...

39
Session 35: Text Analytics: You Need More than NLP Eric Just Senior Vice President Health Catalyst

Transcript of Session 35: Text Analytics: You Need More than...

Page 1: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Session 35:

Text Analytics: You Need More than NLP

Eric JustSenior Vice PresidentHealth Catalyst

Page 2: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Learning Objectives

• Why text search is an important part of clinical text analytics

• The fundamentals of how search works

• How clinical text search can be refined with natural language processing (NLP) and other techniques

2

Page 3: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Poll Question #1For my organization, I see text analytics as:

a) Completely unnecessary for analyticsb) A “nice to have” for analyticsc) Very important for a few key areas of analytics d) Mission critical across nearly all areas of analyticse) Unsure or not applicable

Page 4: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

High-Risk Population:Peripheral Arterial Disease PAD

4

ICD/CPTN=9,592

Natural Language Processing (NLP)N=41,741

• PAD affects over 3 million patients per year

• Narrowed arteries reduce blood flow to limbs

• Patients with PAD are considered high risk

Peripheral artery diseaseClaudicationRest painIschemic Limb

Duke J, Chase M, Ring N, Martin J, Fuhr R, Chatterjee A, Hirsh AT. Use of natural language processing of unstructured data significantly increases the detection of peripheral arterial disease in observational data. American College of Cardiologists Scientific Session. Chicago, IL, April 2016.

For organizations trying to understand their risk, not being able to find high-risk patients is a problem.

Page 5: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Analytics has a problem

5

Most organizations ignore text analytics – because it is expensive and difficult

Most text analytics requires advanced technical skillsets

Up to 80% of clinical data stored in text

Page 6: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Typical Scenario“As a healthcare system administrator, I want to understand my high-risk population better. I want to find all patients with peripheral arterial disease (PAD). I know there are more patients than I was able to find by simply querying diagnosis and procedure codes.”

6

Data scientist develops PAD text mining algorithm

Algorithm validated The patient cohort is defined Results returned to investigator

Page 7: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Better Scenario“As a healthcare system administrator, I want to understand my high-risk population better. I want to find all patients with peripheral arterial disease (PAD). I know there are more patients than I was able to find by simply querying diagnosis and procedure codes.”

7

Data scientist develops PAD algorithm

Algorithm validatedPAD algorithm run nightly and stored in data warehouse

PAD algorithm output combined with coded data to create PAD registry

Page 8: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

How to Best Leverage Text Analytics

8

PADAlgorithm

DiabetesAlgorithm

Ejection Fraction

Algorithm

Pre-DiabetesAlgorithm

CHFAlgorithm

Hyper-tension

Breast Cancer

Page 9: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Poll Question #2I see text analytics being most important in the area of: (Choose 3, if applicable)

a) Clinical care improvementb) Regulatory reportingc) Researchd) Operational improvemente) Financial analyticsf) Unsure or not applicable

Page 10: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Google

10

Page 11: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Why do we love Google?

11

Simple, effective interface Fast Accurate

Page 12: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

c

12

How Would You Build ‘Google’ For Clinical Text?

Page 13: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

The Basis of Text Search: The Inverted Index

13

Patient is a 67 year old female with NIDDM and hypertension.

Words Document Inverted Index

67 0 {(0,3)}

diabet 1,2 {(1,4),(2,3),(2,7)}

female 0 {(0,6)}

hyperten 0,1 {(0,8),(1,6)}

mother 2 {(2,1)}

niddm 0 {(0,8)}

no 1 {(1,3)}

old 0 {(0,5)}

patient 0,1,2 {(0,0),(1,1), (2,0),(2,4)}

sister 2 {(2,5)}

year 0 {(0,4)}

The patient has no diabetes or hypertension.

Patient’s mother is diabetic. Patient’s sister is diabetic.

Document 0

Document 1

Document 2

Page 14: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Tools To Quickly Index Text and Provide Search Capability

14

• Create index• Maintain index• Search index• Hit ranking• Result sorting• .. Much more

Originally written in 1999Open-Source Java API

Originally written in 2004Open Source Enterprise SearchBuilt on LuceneScalability (distributed indexes)REST APIsPlugin architectureAdditional features over Lucene

Originally written in 2010Open Source Enterprise SearchBuilt on LuceneScalability (distributed indexes)REST APIsPlugin architectureAdditional features over Lucene

Provides the foundation for more advanced search engine capabilities. Most users use through SOLR or ElasticSearch. Used directly by Twitter.

Page 15: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

diabetes Go

Results: 2 records, 0.0 ms

Document 2: Patient’s mother is diabetic. Patient’s sister is diabetic.

Document 1: The patient has no diabetes or hypertension.

Found both diabetes and diabetic (word stemming)

Missed mention of NIDDM(synonyms)

Neither result is relevant to a medical cohort query for diabetics (context)

Page 16: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

What Works?

16

• Simple, familiar interface

• Using inverted index means fast results

What Doesn’t• Results display not optimized for use cases

Need better ability to view aggregate results

• Want more results!

Medical language has many synonyms. (How do we find NIDDM?)

• Want less results!

Context matters for different search types (How do we exclude ‘no diabetes’)

Page 17: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Showing the results

17

• Many users are more interested in exploring aggregate results than reviewing individual records

• Aggregating results opens up to users without access to PHI

Page 18: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

When you say ‘diabetes’ what do you really mean?

"diabetes" OR "diabetes mellitus" OR "diabetic" OR "brittle diabetes" OR "diabetes brittle" OR "diabetes mellitus insulin-dependent" OR "diabetes mellitus juvenile onset" OR "iddm" OR "insulin dependent diabetic" OR "insulin-dependent diabetes mellitus" OR "juvenile diabetes" OR "ketosis-prone diabetes mellitus" OR "type i diabetes mellitus" OR "type i diabetes mellitus without mention of complication" OR "type 1 diabetes mellitus" OR "diabetes mellitus maturity onset" OR "diabetes mellitus non insulin-dep" OR "diabetes mellitus non-insulin-dependent" OR "maturity onset diabetes" OR "maturity-onset diabetes of the young" OR "niddm" OR "non-insulin-dependent diabetes mellitus" OR "type ii diabetes mellitus" OR "type ii diabetes mellitus without mention of complication" OR "type 2 diabetes mellitus"

Get More Results: Synonyms

18

Page 19: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Leveraging Medical Terminologies

19

Page 20: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Expanding Search With Terminologies

20

Page 21: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

A more complex example:Diabetic patients who are on an ACE/ARB or who had their microalbumin checked during the calendar year

Queries free text for all reports that contain “Diabetes” AND “(ace OR arb)” AND “microalbumin”

Filtered for reports within the last year

note: terms are selected by synonym finder, or grouped terms of all trade name, generic name, or active medication ingredients

("diabetes" OR "diabetes mellitus" OR "diabetic" OR "brittle diabetes" OR "diabetes brittle" OR "diabetes mellitus insulin-dependent" OR "diabetes mellitus juvenile onset" OR "iddm" OR "insulin dependent diabetic" OR "insulin-dependent diabetes mellitus" OR "juvenile diabetes" OR "ketosis-prone diabetes mellitus" OR "type i diabetes mellitus" OR "type i diabetes mellituswithout mention of complication" OR "type 1 diabetes mellitus" OR "diabetes mellitus maturity onset" OR "diabetes mellitus non insulin-dep" OR "diabetes mellitus non-insulin-dependent" OR "maturity onset diabetes" OR "maturity-onset diabetes of the young"OR "niddm" OR "non-insulin-dependent diabetes mellitus" OR "type ii diabetes mellitus" OR "type ii diabetes mellitus without mention of complication" OR "type 2 diabetes mellitus") AND ( ("benazepril" OR "lotensin" OR "captopril" OR "capoten" OR "enalapril" OR "vasotec" OR "epaned" OR "fosinopril" OR "monopril" OR "lisinopril" OR "prinivil" OR "zestril" OR "moexipril" OR "univasc" OR "perindopril" OR "aceon" OR "quinapril" OR "accupril" OR "ramipril" OR "altace" OR "trandolapril" OR "mavik") OR("azilsartan" OR "edarbi" OR "candesartan" OR "atacand" OR "eprosartan" OR "teveten" OR "irbesartan" OR "avapro" OR "telmisartan" OR "micardis" OR "valsartan" OR "diovan" OR "losartan" OR "cozaar" OR "olmesartan" OR "benicar") ) AND ("albumin urine" OR "urine microalbumin" OR "urine microalbumin present”)

21

Page 22: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Get Less Results: ConText Matters

22

ConText is a NLP pattern matching algorithm published in 2009

To be useful for clinical applications such as looking for genotype/phenotype correlations, retrieving patients eligible for a clinical trial, or identifying disease outbreaks, simply identifying clinical conditions in the text is not sufficient—information described in the context of the clinical condition is critical for understanding the patient’s state.

J Biomed Inform. 2009 Oct; 42(5): 839–851.

Detects conditions and whether they are

• Negated (e.g., “ruled out pneumonia”)

• Historical (“past history of pneumonia”)

• Experienced by someone else (e.g., “family history of pneumonia”)

Page 23: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

ConText Algorithm

23

Wendy W. Chapman, David Chu, John N. DowlingJ Biomed Inform. 2009 Oct; 42(5): 839–851.

Chest tightnessNegation: affirmed

Experiencer: patient

Temporality: recent

”No history of chest tightness but family history of CHF.”

ConTextCHF

Negation: affirmed

Experiencer: patient

Temporality: recent

Chest tightnessNegation: negated

Experiencer: patient

Temporality: historical

CHFNegation: affirmed

Experiencer: other

Temporality: historical

Negation trigger Historical trigger Termination

Termination Historical trigger ConditionCondition

Other experiencer trigger

Page 24: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

ConText: Negation

24

The patient had no diabetes or hypertension.

Clinical conditions TerminationNegation trigger Experiencer

DiabetesNegation: negated

Experiencer: patient

Temporality: recent

HypertensionNegation: negated

Experiencer: patient

Temporality: recent

Page 25: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

ConText: Experiencer

25

Patient’s mother has diabetes.

Clinical conditions TerminationExperiencer

Patient’s sister has hypertension.

Clinical conditions TerminationExperiencer

DiabetesNegation: affirmed

Experiencer: other

Temporality: recent

HypertensionNegation: affirmed

Experiencer: other

Temporality: recent

Page 26: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

How?

26

• Analysis of context uses a sentence as an operand

• Identifying sentences in clinical text is not straightforward

Have you ever seen punctuation in a clinical note?

• An NLP analysis pipeline ties it all together

Search results Sentence detection

Entity recognition(i.e. diabetes)

Context Algorithm

Present user with additional filters

NLP Pipeline Frameworks

• Apache Unstructured Information Management Architecture (UIMA)• General Architecture for Text Engineering (GATE)• Natural Language Toolkit (NLTK)

Page 27: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

ConText: Apply to Search Results

27

Filter Diabetes Results

Page 28: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

28

Page 29: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Other Pieces to the NLP Pipeline: Extract Values

29

ef_phrase qualifiers ef_low ef_high ef_mid ef_word

ejection fraction is at least 70-75 is at least 70 75 72.5 NULL

ejection fraction of about 20 of about 20 20 20 NULL

ejection fraction of 60 of 60 60 60 NULL

ejection fraction of greater than 65 of greater than 65 65 65 NULL

ejection fraction of 55 of 55 55 55 NULL

ejection fraction by visual inspection is 65 by visual inspection is 65 65 65 NULL

LVEF is normal is NULL NULL NULL normal

\b(((LV)?EF)|(Ejection\s+Fraction))\s+(?<qualifiers>([^\s\d]+\s+){0,5})\(?(((?<ef_low>\d+)-(?<ef_high>\d+))|(?<ef_mid_txt>\d+)|(?<ef_word>([^\s]*?normal)|(moderate)|(severe)))

Page 30: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Other Extraction Projects• Aortic Root Size

• Blood Pressure

• Breast Cancer ER Biomarker

• Cancer Staging, TNM, and stage

• Abdominal fistula

• Height/Weight/BMI

• Hypoglycemia with low blood sugars

• Microalbumin

• Ankle Brachial Index

30

Page 31: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

High-risk Population:Peripheral Arterial Disease PAD

31

ICD/CPTN=9,592

Natural Language Processing (NLP)N=41,741

• PAD affects over 3 million patients per year

• Narrowed arteries reduce blood flow to limbs

• Patients with PAD are considered high risk

• Measured by Ankle Brachial Index (ABI)

Peripheral artery diseaseClaudicationRest painIschemic Limb

ABI < 0.9N=4,349

Duke J, Chase M, Ring N, Martin J, Fuhr R, Chatterjee A, Hirsh AT. Use of natural language processing of unstructured data significantly increases the detection of peripheral arterial disease in observational data. American College of Cardiologists Scientific Session. Chicago, IL, April 2016.

This is a precise patient registry!

Page 32: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Validation

32

• Build studies to review results of query

• Assign to team members to review results

• Randomly selects records to represent study

• Highlights key words for easy chart review

Page 33: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Text Analytics Must Be Interoperable!

33

Diabetes Cohort

PAD Cohort

Ejection Fraction

Tumor Sizes

Validated Text Analytics

Data Warehouse

• Population Analytics• Care Improvement• Operational Improvement• Financial Improvement• Research

Page 34: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

A ‘Late Binding’ Approach to Text Analytics

34

Search:Easy starting

point

Synonym Finding

• Uses terminologies

• Allow user to find synonyms

Context Filtering

• Excludes negated concepts

• Good for cohort queries

Regular Expression

• Extraction of discrete values:• Ejection

Fractions• ABI

Validation

• Expert review of algorithm output

• Performance measurement

Integration

• Operationalize algorithm

• Incorporate into analytics

Many More Techniques

• Section tagging• Entity recognition• N-gram analysis• Document clustering

Page 35: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Final Thought

36

To leverage the power of text analytics…

Make the data accessible first!

Page 36: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Lessons Learned

37

• Using search technology for clinical text is an engaging and accessible entry point for text analytics problems. Searching clinical text is powered by an inverted index that catalogs words present in the documents, which documents they are present in, and their position in the documents.

• Medical terminologies provide a dictionary of relevant terms, synonyms, and logical structures that can enhance clinical text exploration.

• NLP algorithms that are based on the context surrounding clinical terms can identify when the term is negated (“no evidence of pneumonia”) or applies to another person (“patient's grandmother had breast cancer”). Regular expressions can be applied to text to identify patterns and extract discrete values, like ejection fraction and ankle brachial index, that are stored in text.

• Text analytics should be validated and integrated with an enterprise data warehouse where the information extracted from text can be combined with discrete, coded data.

Page 37: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Analytic Insights

AQuestions &

Answers

38

Page 38: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

What You Learned…

39

Write down the key things you’ve learned related to each of the learning objectives after attending

this session

Page 39: Session 35: Text Analytics: You Need More than NLPhasummit.com/wp-content/uploads/2016/05/35-Text-Analytics-You-N… · Analytics has a problem. 5. Most organizations ignore text

Thank You

40