Session 35: Text Analytics: You Need More than...
Transcript of Session 35: Text Analytics: You Need More than...
Session 35:
Text Analytics: You Need More than NLP
Eric JustSenior Vice PresidentHealth Catalyst
Learning Objectives
• Why text search is an important part of clinical text analytics
• The fundamentals of how search works
• How clinical text search can be refined with natural language processing (NLP) and other techniques
2
Poll Question #1For my organization, I see text analytics as:
a) Completely unnecessary for analyticsb) A “nice to have” for analyticsc) Very important for a few key areas of analytics d) Mission critical across nearly all areas of analyticse) Unsure or not applicable
High-Risk Population:Peripheral Arterial Disease PAD
4
ICD/CPTN=9,592
Natural Language Processing (NLP)N=41,741
• PAD affects over 3 million patients per year
• Narrowed arteries reduce blood flow to limbs
• Patients with PAD are considered high risk
Peripheral artery diseaseClaudicationRest painIschemic Limb
Duke J, Chase M, Ring N, Martin J, Fuhr R, Chatterjee A, Hirsh AT. Use of natural language processing of unstructured data significantly increases the detection of peripheral arterial disease in observational data. American College of Cardiologists Scientific Session. Chicago, IL, April 2016.
For organizations trying to understand their risk, not being able to find high-risk patients is a problem.
Analytics has a problem
5
Most organizations ignore text analytics – because it is expensive and difficult
Most text analytics requires advanced technical skillsets
Up to 80% of clinical data stored in text
Typical Scenario“As a healthcare system administrator, I want to understand my high-risk population better. I want to find all patients with peripheral arterial disease (PAD). I know there are more patients than I was able to find by simply querying diagnosis and procedure codes.”
6
Data scientist develops PAD text mining algorithm
Algorithm validated The patient cohort is defined Results returned to investigator
Better Scenario“As a healthcare system administrator, I want to understand my high-risk population better. I want to find all patients with peripheral arterial disease (PAD). I know there are more patients than I was able to find by simply querying diagnosis and procedure codes.”
7
Data scientist develops PAD algorithm
Algorithm validatedPAD algorithm run nightly and stored in data warehouse
PAD algorithm output combined with coded data to create PAD registry
How to Best Leverage Text Analytics
8
PADAlgorithm
DiabetesAlgorithm
Ejection Fraction
Algorithm
Pre-DiabetesAlgorithm
CHFAlgorithm
Hyper-tension
Breast Cancer
Poll Question #2I see text analytics being most important in the area of: (Choose 3, if applicable)
a) Clinical care improvementb) Regulatory reportingc) Researchd) Operational improvemente) Financial analyticsf) Unsure or not applicable
10
Why do we love Google?
11
Simple, effective interface Fast Accurate
c
12
How Would You Build ‘Google’ For Clinical Text?
The Basis of Text Search: The Inverted Index
13
Patient is a 67 year old female with NIDDM and hypertension.
Words Document Inverted Index
67 0 {(0,3)}
diabet 1,2 {(1,4),(2,3),(2,7)}
female 0 {(0,6)}
hyperten 0,1 {(0,8),(1,6)}
mother 2 {(2,1)}
niddm 0 {(0,8)}
no 1 {(1,3)}
old 0 {(0,5)}
patient 0,1,2 {(0,0),(1,1), (2,0),(2,4)}
sister 2 {(2,5)}
year 0 {(0,4)}
The patient has no diabetes or hypertension.
Patient’s mother is diabetic. Patient’s sister is diabetic.
Document 0
Document 1
Document 2
Tools To Quickly Index Text and Provide Search Capability
14
• Create index• Maintain index• Search index• Hit ranking• Result sorting• .. Much more
Originally written in 1999Open-Source Java API
Originally written in 2004Open Source Enterprise SearchBuilt on LuceneScalability (distributed indexes)REST APIsPlugin architectureAdditional features over Lucene
Originally written in 2010Open Source Enterprise SearchBuilt on LuceneScalability (distributed indexes)REST APIsPlugin architectureAdditional features over Lucene
Provides the foundation for more advanced search engine capabilities. Most users use through SOLR or ElasticSearch. Used directly by Twitter.
diabetes Go
Results: 2 records, 0.0 ms
Document 2: Patient’s mother is diabetic. Patient’s sister is diabetic.
Document 1: The patient has no diabetes or hypertension.
Found both diabetes and diabetic (word stemming)
Missed mention of NIDDM(synonyms)
Neither result is relevant to a medical cohort query for diabetics (context)
What Works?
16
• Simple, familiar interface
• Using inverted index means fast results
What Doesn’t• Results display not optimized for use cases
Need better ability to view aggregate results
• Want more results!
Medical language has many synonyms. (How do we find NIDDM?)
• Want less results!
Context matters for different search types (How do we exclude ‘no diabetes’)
Showing the results
17
• Many users are more interested in exploring aggregate results than reviewing individual records
• Aggregating results opens up to users without access to PHI
When you say ‘diabetes’ what do you really mean?
"diabetes" OR "diabetes mellitus" OR "diabetic" OR "brittle diabetes" OR "diabetes brittle" OR "diabetes mellitus insulin-dependent" OR "diabetes mellitus juvenile onset" OR "iddm" OR "insulin dependent diabetic" OR "insulin-dependent diabetes mellitus" OR "juvenile diabetes" OR "ketosis-prone diabetes mellitus" OR "type i diabetes mellitus" OR "type i diabetes mellitus without mention of complication" OR "type 1 diabetes mellitus" OR "diabetes mellitus maturity onset" OR "diabetes mellitus non insulin-dep" OR "diabetes mellitus non-insulin-dependent" OR "maturity onset diabetes" OR "maturity-onset diabetes of the young" OR "niddm" OR "non-insulin-dependent diabetes mellitus" OR "type ii diabetes mellitus" OR "type ii diabetes mellitus without mention of complication" OR "type 2 diabetes mellitus"
Get More Results: Synonyms
18
Leveraging Medical Terminologies
19
Expanding Search With Terminologies
20
A more complex example:Diabetic patients who are on an ACE/ARB or who had their microalbumin checked during the calendar year
Queries free text for all reports that contain “Diabetes” AND “(ace OR arb)” AND “microalbumin”
Filtered for reports within the last year
note: terms are selected by synonym finder, or grouped terms of all trade name, generic name, or active medication ingredients
("diabetes" OR "diabetes mellitus" OR "diabetic" OR "brittle diabetes" OR "diabetes brittle" OR "diabetes mellitus insulin-dependent" OR "diabetes mellitus juvenile onset" OR "iddm" OR "insulin dependent diabetic" OR "insulin-dependent diabetes mellitus" OR "juvenile diabetes" OR "ketosis-prone diabetes mellitus" OR "type i diabetes mellitus" OR "type i diabetes mellituswithout mention of complication" OR "type 1 diabetes mellitus" OR "diabetes mellitus maturity onset" OR "diabetes mellitus non insulin-dep" OR "diabetes mellitus non-insulin-dependent" OR "maturity onset diabetes" OR "maturity-onset diabetes of the young"OR "niddm" OR "non-insulin-dependent diabetes mellitus" OR "type ii diabetes mellitus" OR "type ii diabetes mellitus without mention of complication" OR "type 2 diabetes mellitus") AND ( ("benazepril" OR "lotensin" OR "captopril" OR "capoten" OR "enalapril" OR "vasotec" OR "epaned" OR "fosinopril" OR "monopril" OR "lisinopril" OR "prinivil" OR "zestril" OR "moexipril" OR "univasc" OR "perindopril" OR "aceon" OR "quinapril" OR "accupril" OR "ramipril" OR "altace" OR "trandolapril" OR "mavik") OR("azilsartan" OR "edarbi" OR "candesartan" OR "atacand" OR "eprosartan" OR "teveten" OR "irbesartan" OR "avapro" OR "telmisartan" OR "micardis" OR "valsartan" OR "diovan" OR "losartan" OR "cozaar" OR "olmesartan" OR "benicar") ) AND ("albumin urine" OR "urine microalbumin" OR "urine microalbumin present”)
21
Get Less Results: ConText Matters
22
ConText is a NLP pattern matching algorithm published in 2009
To be useful for clinical applications such as looking for genotype/phenotype correlations, retrieving patients eligible for a clinical trial, or identifying disease outbreaks, simply identifying clinical conditions in the text is not sufficient—information described in the context of the clinical condition is critical for understanding the patient’s state.
J Biomed Inform. 2009 Oct; 42(5): 839–851.
Detects conditions and whether they are
• Negated (e.g., “ruled out pneumonia”)
• Historical (“past history of pneumonia”)
• Experienced by someone else (e.g., “family history of pneumonia”)
ConText Algorithm
23
Wendy W. Chapman, David Chu, John N. DowlingJ Biomed Inform. 2009 Oct; 42(5): 839–851.
Chest tightnessNegation: affirmed
Experiencer: patient
Temporality: recent
”No history of chest tightness but family history of CHF.”
ConTextCHF
Negation: affirmed
Experiencer: patient
Temporality: recent
Chest tightnessNegation: negated
Experiencer: patient
Temporality: historical
CHFNegation: affirmed
Experiencer: other
Temporality: historical
Negation trigger Historical trigger Termination
Termination Historical trigger ConditionCondition
Other experiencer trigger
ConText: Negation
24
The patient had no diabetes or hypertension.
Clinical conditions TerminationNegation trigger Experiencer
DiabetesNegation: negated
Experiencer: patient
Temporality: recent
HypertensionNegation: negated
Experiencer: patient
Temporality: recent
ConText: Experiencer
25
Patient’s mother has diabetes.
Clinical conditions TerminationExperiencer
Patient’s sister has hypertension.
Clinical conditions TerminationExperiencer
DiabetesNegation: affirmed
Experiencer: other
Temporality: recent
HypertensionNegation: affirmed
Experiencer: other
Temporality: recent
How?
26
• Analysis of context uses a sentence as an operand
• Identifying sentences in clinical text is not straightforward
Have you ever seen punctuation in a clinical note?
• An NLP analysis pipeline ties it all together
Search results Sentence detection
Entity recognition(i.e. diabetes)
Context Algorithm
Present user with additional filters
NLP Pipeline Frameworks
• Apache Unstructured Information Management Architecture (UIMA)• General Architecture for Text Engineering (GATE)• Natural Language Toolkit (NLTK)
ConText: Apply to Search Results
27
Filter Diabetes Results
28
Other Pieces to the NLP Pipeline: Extract Values
29
ef_phrase qualifiers ef_low ef_high ef_mid ef_word
ejection fraction is at least 70-75 is at least 70 75 72.5 NULL
ejection fraction of about 20 of about 20 20 20 NULL
ejection fraction of 60 of 60 60 60 NULL
ejection fraction of greater than 65 of greater than 65 65 65 NULL
ejection fraction of 55 of 55 55 55 NULL
ejection fraction by visual inspection is 65 by visual inspection is 65 65 65 NULL
LVEF is normal is NULL NULL NULL normal
\b(((LV)?EF)|(Ejection\s+Fraction))\s+(?<qualifiers>([^\s\d]+\s+){0,5})\(?(((?<ef_low>\d+)-(?<ef_high>\d+))|(?<ef_mid_txt>\d+)|(?<ef_word>([^\s]*?normal)|(moderate)|(severe)))
Other Extraction Projects• Aortic Root Size
• Blood Pressure
• Breast Cancer ER Biomarker
• Cancer Staging, TNM, and stage
• Abdominal fistula
• Height/Weight/BMI
• Hypoglycemia with low blood sugars
• Microalbumin
• Ankle Brachial Index
30
High-risk Population:Peripheral Arterial Disease PAD
31
ICD/CPTN=9,592
Natural Language Processing (NLP)N=41,741
• PAD affects over 3 million patients per year
• Narrowed arteries reduce blood flow to limbs
• Patients with PAD are considered high risk
• Measured by Ankle Brachial Index (ABI)
Peripheral artery diseaseClaudicationRest painIschemic Limb
ABI < 0.9N=4,349
Duke J, Chase M, Ring N, Martin J, Fuhr R, Chatterjee A, Hirsh AT. Use of natural language processing of unstructured data significantly increases the detection of peripheral arterial disease in observational data. American College of Cardiologists Scientific Session. Chicago, IL, April 2016.
This is a precise patient registry!
Validation
32
• Build studies to review results of query
• Assign to team members to review results
• Randomly selects records to represent study
• Highlights key words for easy chart review
Text Analytics Must Be Interoperable!
33
Diabetes Cohort
PAD Cohort
Ejection Fraction
Tumor Sizes
Validated Text Analytics
…
Data Warehouse
• Population Analytics• Care Improvement• Operational Improvement• Financial Improvement• Research
A ‘Late Binding’ Approach to Text Analytics
34
Search:Easy starting
point
Synonym Finding
• Uses terminologies
• Allow user to find synonyms
Context Filtering
• Excludes negated concepts
• Good for cohort queries
Regular Expression
• Extraction of discrete values:• Ejection
Fractions• ABI
Validation
• Expert review of algorithm output
• Performance measurement
Integration
• Operationalize algorithm
• Incorporate into analytics
Many More Techniques
• Section tagging• Entity recognition• N-gram analysis• Document clustering
Final Thought
36
To leverage the power of text analytics…
Make the data accessible first!
Lessons Learned
37
• Using search technology for clinical text is an engaging and accessible entry point for text analytics problems. Searching clinical text is powered by an inverted index that catalogs words present in the documents, which documents they are present in, and their position in the documents.
• Medical terminologies provide a dictionary of relevant terms, synonyms, and logical structures that can enhance clinical text exploration.
• NLP algorithms that are based on the context surrounding clinical terms can identify when the term is negated (“no evidence of pneumonia”) or applies to another person (“patient's grandmother had breast cancer”). Regular expressions can be applied to text to identify patterns and extract discrete values, like ejection fraction and ankle brachial index, that are stored in text.
• Text analytics should be validated and integrated with an enterprise data warehouse where the information extracted from text can be combined with discrete, coded data.
Analytic Insights
AQuestions &
Answers
38
What You Learned…
39
Write down the key things you’ve learned related to each of the learning objectives after attending
this session
Thank You
40