Semantic Analysis of Online Health Information Seeking for Cardiovascular Diseases
Ashutosh Jadhav
AMIA 2014 Annual Symposium
Washington, DC
Disclosure
• Speaker discloses that he has no relationships with commercial interests.
Collaborators
Prof. Amit Sheth (PhD Advisor)
Kno.e.sis Center, Wright State
University, OH, USA
Dr. Jyotishman Pathak (Mentor)
Mayo Clinic, Rochester, MN, USA
Internet Users in the World
• Around 3 billion users (40% of the world population)
• Around 300 million users (87% of the US population)
Source: http://www.internetlivestats.com/internet-users/
Online Health Information Seeking
Online Health Resources
Online Health Information Seeking
According to a Pew survey, approximately 8 in 10 online health inquiries start at a search engine.
Fox S, Duggan M. Health Online 2013. Pew Internet & American Life Project; 2013.
Cardiovascular Diseases (CVD) Use-case
• According to the Centers for Disease Control and Prevention, in the United States
– CVD is one of the most common chronic diseases
– CVD is the leading cause of death (1 in every 4 deaths)
• CVD is common across all socioeconomic groups and demographics
• Online health resources are a “significant information supplement” for patients with chronic conditions
Motivation
• Although cardiovascular diseases (CVD) affect a large
percentage of the population, few studies have
investigated what and how users search for CVD related
information online
• Such knowledge can be applied to improve the online
health search experience as well as to develop more
advanced next-generation knowledge and content
delivery systems
Methods Overview
• Data:
– CVD related search queries
– Limited to United States
• Data timeframe:
– September 2011 to August 2013
• Data collection tool:
– IBM NetInsight On Demand
(Web Analytics tool)
• Dataset size:
– 10 million CVD-related search queries (SQ)
– A large dataset for a single class of diseases
Dataset Creation
Top CVD Search Queries
Top 1-5 Queries | Top 6-10 Queries
heart attack symptom | congestive heart failure
blood pressure chart | low blood pressure
how to lower blood pressure | stroke symptoms
heart rate | normal blood pressure
broken heart syndrome | high blood pressure symptoms
Health Categories
• Selected 14 consumer-oriented health categories representing health information needs
• Methods
– Focus group study (published in JMIR)
– Online health information seeking literature
– Empirical data analysis
– Health categories on popular health websites
• The health categories and the classification scheme were reviewed and validated by Mayo Clinic clinicians and domain experts.
Health Categories
No. | Health Category | No. | Health Category
1 | Symptoms | 8 | Living with
2 | Causes | 9 | Prevention
3 | Risks & Complications | 10 | Side effects
4 | Drugs and Medications | 11 | Medical devices
5 | Treatments | 12 | Diseases and conditions
6 | Tests and Diagnosis | 13 | Age-group References
7 | Food and Diet | 14 | Vital signs
Health Categories Example
Drugs and Medications: tylenol raise blood pressure, ibuprofen heart rate, dextromethorphan blood pressure, medications pulmonary hypertension
Search Query | Health Categories
Heart palpitations with headache | Symptoms
Tylenol and blood pressure | Medication, Vital sign
Pump for pulmonary hypertension | Medical device, Disease
Red wine heart disease | Food, Disease
Bypass surgery | Treatment
Classification: Possible Approaches
• Statistical machine learning algorithms
– Require training data
– A multiclass classification problem with 14 classes needs a lot of training data
– Training data
• is expensive to create, as it must be labeled manually by domain experts
• has limited coverage
– Do not consider the semantics of queries
Domain Constraint
A classifier trained for one disease may not work for other diseases, because symptoms, treatments, drugs, and medications vary by disease.
Background Knowledge
• UMLS (Unified Medical Language System)
– Comprises over 1 million biomedical concepts and 5 million concept names
– Incorporates a variety of medical vocabularies and concepts, and maps each concept to semantic types
– Contains the Consumer Health Vocabulary (CHV)
• Hair loss => Alopecia
– Updated quarterly with new concepts
Semantic Analysis
• UMLS Semantic Type
– Example: symptom or sign, disease or syndrome
• UMLS Concepts
– Example: blood pressure, heart rate
• UMLS MetaMap
– Tool for recognizing UMLS concepts in the text
MetaMap Usage Challenge and Solution
Hadoop-MapReduce framework with 16 Nodes
Functional overview of a mapper
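The slide names a Hadoop-MapReduce setup but the transcript does not preserve the mapper diagram. The following is only a minimal sketch of what a per-query mapper could look like, assuming each mapper receives one search query per line and emits the query with its annotations. `toy_annotator` is a hypothetical stand-in with a two-entry lexicon; it is not MetaMap and does not reproduce MetaMap's output format.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical annotator type: maps a query string to a list of annotations.
Annotator = Callable[[str], List[Dict[str, str]]]


def toy_annotator(query: str) -> List[Dict[str, str]]:
    """Stand-in for a concept recognizer (NOT MetaMap): tiny hard-coded lexicon."""
    lexicon = {
        "blood pressure": "blood pressure",
        "heart rate": "heart rate",
    }
    text = query.lower()
    return [
        {"phrase": phrase, "concept": concept}
        for phrase, concept in lexicon.items()
        if phrase in text
    ]


def run_mapper(lines: Iterable[str], annotate: Annotator) -> Iterator[str]:
    """Emit one 'query<TAB>json-annotations' record per non-empty input line."""
    for line in lines:
        query = line.strip()
        if not query:
            continue  # skip blank lines
        yield f"{query}\t{json.dumps(annotate(query), sort_keys=True)}"


# Small in-memory demo; a streaming-style job would iterate over sys.stdin instead.
sample = ["how to lower blood pressure", "", "heart rate"]
records = list(run_mapper(sample, toy_annotator))
for record in records:
    print(record)
```

Because each query is annotated independently, the work can be split across nodes without coordination, which is the property a 16-node cluster would exploit.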
Gold Standard Dataset Creation
• Randomly selected 2000 search queries from the analysis dataset.
• Two domain experts manually annotated the 2000 search queries, labeling each query with zero, one, or more than one health category.
• The annotators first discussed and agreed upon the annotation scheme.
• To reduce the probability of human error and subjectivity, the two annotators discussed and annotated each query together, creating a gold standard dataset of 2000 search queries.
• The gold standard dataset was then divided into training and testing datasets of 1000 search queries each.
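The sampling and the 1000/1000 split described above could be sketched as follows. The slides do not say how the split was drawn, so the random split, the fixed seed, and the function name `sample_and_split` are assumptions made for illustration; the manual annotation step is not modeled.

```python
import random
from typing import List, Sequence, Tuple


def sample_and_split(
    queries: Sequence[str], sample_size: int = 2000, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Draw `sample_size` queries without replacement and split them in half."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the sketch reproducible
    sample = rng.sample(list(queries), sample_size)
    half = sample_size // 2
    return sample[:half], sample[half:]


# Synthetic stand-in for the analysis dataset.
all_queries = [f"query {i}" for i in range(10_000)]
train, test = sample_and_split(all_queries)
print(len(train), len(test))
```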
Health Category: Drugs and Medications
Categorization Rule
• ST: ORCH|PHSU, CLND, PHSU
• CC: medication, medicine, drugs, dose, dosage, tablet, pill
• KW: meds
• without CC: alcohol, caffeine, fruit, prevent
Example
• Tylenol raise blood pressure
• Medications pulmonary hypertension
• ibuprofen heart rate
• Dextromethorphan blood pressure
Intent classes and their rules: UMLS Semantic Types (ST), UMLS Concepts (CC), and Keywords (KW)
• Symptoms
– ST: SOSY; CC: symptoms, signs
• Causes
– CC: cause, reason
• Risks & Complications
– CC: risk, complications
• Drugs and Medications
– ST: ORCH|PHSU, CLND, PHSU; CC: medication, medicine, drugs, dose, dosage, tablet, pill; KW: meds (without CC: alcohol, caffeine, fruit, prevent)
• Treatments
– ST: TOPP, FTCN (treatment, surgery), CNCE (treatment); CC: remedy, remediate (without CC: prevention and ‘Drugs and Medication’ queries)
• Tests and Diagnosis
– ST: DIAP, LBPR, LBTR; CC: Test, diagnosis (without ST: DIAP|TOPP; CC: alcohol, blood caffeine)
• Food and Diet
– ST: FOOD; CC: caffeine, recipe, meal, menu, diet, eat, breakfast, lunch, dinner, alcohol, drink
• Living with
– CC: control, manage, reduce, lower, coping, cure, recover; KW: living with, bring down, low down
• Prevention
– CC: prevent, avoidance, low risk
• Side effects
– CC: side effect; KW: side effect
• Medical devices
– ST: MEDD
• Diseases and conditions
– ST: DSYN
• Age-group References
– ST: AGGP
• Vital signs
– CC: blood pressure, heart rate, pulse rate, temperature, Heart beat, blood glucose (without high/low blood pressure, as we considered them under ‘Diseases and Conditions’)
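The rules above can be read as "assign a category if any listed semantic type, concept, or keyword matches, unless an excluded concept is present". A minimal sketch of that reading follows, covering only five of the 14 classes. The names `Rule`, `AnnotatedQuery`, `RULES`, and `classify` are hypothetical; the annotations in the examples are hand-written for illustration rather than MetaMap output; `ORCH|PHSU` is treated as one literal label, which is an assumption; and the high/low blood pressure exception for Vital signs is not implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class Rule:
    semantic_types: FrozenSet[str] = frozenset()
    concepts: FrozenSet[str] = frozenset()
    keywords: FrozenSet[str] = frozenset()           # single-word keywords only
    excluded_concepts: FrozenSet[str] = frozenset()  # the "without CC" lists


# Subset of the rule table, copied from the slide.
RULES: Dict[str, Rule] = {
    "Symptoms": Rule(
        semantic_types=frozenset({"SOSY"}),
        concepts=frozenset({"symptoms", "signs"}),
    ),
    "Drugs and Medications": Rule(
        semantic_types=frozenset({"ORCH|PHSU", "CLND", "PHSU"}),
        concepts=frozenset(
            {"medication", "medicine", "drugs", "dose", "dosage", "tablet", "pill"}
        ),
        keywords=frozenset({"meds"}),
        excluded_concepts=frozenset({"alcohol", "caffeine", "fruit", "prevent"}),
    ),
    "Medical devices": Rule(semantic_types=frozenset({"MEDD"})),
    "Diseases and conditions": Rule(semantic_types=frozenset({"DSYN"})),
    "Vital signs": Rule(
        concepts=frozenset(
            {"blood pressure", "heart rate", "pulse rate",
             "temperature", "heart beat", "blood glucose"}
        ),
    ),
}


@dataclass
class AnnotatedQuery:
    text: str
    semantic_types: Set[str] = field(default_factory=set)
    concepts: Set[str] = field(default_factory=set)


def classify(query: AnnotatedQuery) -> List[str]:
    """Return every category whose rule matches; a query may get 0..n labels."""
    words = set(query.text.lower().split())
    labels = []
    for category, rule in RULES.items():
        if rule.excluded_concepts & query.concepts:
            continue  # an excluded concept vetoes this category
        if (rule.semantic_types & query.semantic_types
                or rule.concepts & query.concepts
                or rule.keywords & words):
            labels.append(category)
    return labels


q = AnnotatedQuery("tylenol raise blood pressure", {"ORCH|PHSU"}, {"blood pressure"})
print(classify(q))  # -> ['Drugs and Medications', 'Vital signs']
```

Keeping the rules as data rather than as branching code makes it straightforward to add the remaining classes or to adapt the scheme to another disease area without touching `classify`.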
Evaluation: Micro-Average Precision and Recall
• Classified the 1000 search queries in the testing dataset using the rule-based classifier
• On this test set, the classification approach achieved the following micro-averaged scores
– Precision: 0.8842
– Recall: 0.8642
– F-score: 0.8723
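For a multi-label task like this one, micro-averaging pools true positives, false positives, and false negatives over all queries and categories before computing the scores. The sketch below shows that computation on a toy example; the function name `micro_prf` is hypothetical, and taking the F-score as the harmonic mean of precision and recall is an assumption, since the slides do not define it.

```python
from typing import List, Set, Tuple


def micro_prf(gold: List[Set[str]], predicted: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F-score over per-query label sets."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(len(g & p) for g, p in zip(gold, predicted))  # labels correctly assigned
    fp = sum(len(p - g) for g, p in zip(gold, predicted))  # labels wrongly assigned
    fn = sum(len(g - p) for g, p in zip(gold, predicted))  # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data: three queries with gold and predicted label sets.
gold = [{"Symptoms"}, {"Drugs and Medications", "Vital signs"}, set()]
pred = [{"Symptoms"}, {"Vital signs"}, {"Diseases and conditions"}]
precision, recall, f_score = micro_prf(gold, pred)
print(precision, recall, f_score)
```

Here 2 labels are correct, 1 is spurious, and 1 is missed, so precision and recall are both 2/3.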
Evaluation: Precision and Recall Analysis for Each Health Category
Results
No. | Intent Class | Total Queries | Percentage Distribution
1 | Diseases | 4,232,398 | 40.66
2 | Vital signs | 3,455,809 | 33.20
3 | Symptoms | 1,422,826 | 13.67
4 | Living with | 1,178,756 | 11.32
5 | Treatments | 955,701 | 9.18
6 | Food and Diet | 779,949 | 7.49
7 | Med Devices | 665,484 | 6.39
8 | Drugs and Medications | 603,905 | 5.80
9 | Causes | 599,895 | 5.76
10 | Tests & Diagnosis | 344,747 | 3.31
11 | Risks and Complication | 277,294 | 2.66
12 | Prevention | 136,428 | 1.31
13 | Age-group References | 87,929 | 0.84
14 | Side effects | 25,655 | 0.25
Total | | 10,408,921 | 100
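The percentages in this table are consistent with each class count being divided by the total number of queries (10,408,921), not by the sum of the class counts. Because a query can belong to several classes, the class counts add up to more than the total and the percentages add up to more than 100. A short check, using the figures from the table (variable names are illustrative):

```python
# Figures copied from the results table above.
total_queries = 10_408_921
category_counts = {
    "Diseases": 4_232_398,
    "Vital signs": 3_455_809,
    "Symptoms": 1_422_826,
    "Living with": 1_178_756,
    "Treatments": 955_701,
    "Food and Diet": 779_949,
    "Med Devices": 665_484,
    "Drugs and Medications": 603_905,
    "Causes": 599_895,
    "Tests & Diagnosis": 344_747,
    "Risks and Complication": 277_294,
    "Prevention": 136_428,
    "Age-group References": 87_929,
    "Side effects": 25_655,
}

# Share of all queries that fall into each class, in percent.
shares = {
    name: round(100 * count / total_queries, 2)
    for name, count in category_counts.items()
}

print(shares["Diseases"])                            # -> 40.66
print(sum(category_counts.values()) > total_queries)  # -> True (multi-label overlap)
```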
Results
Distribution of search queries by number of intent classes in which they are categorized
Number of intent classes | Share of queries
0 | 8%
1 | 48%
2 | 40%
3 | 4%
4 and 5 | 0%
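A distribution like this one can be derived from the classifier output by counting how many labels each query received. The sketch below assumes the output is available as one list of labels per query; the function name `label_count_distribution` and the toy data are illustrative only.

```python
from collections import Counter
from typing import Dict, List, Sequence


def label_count_distribution(labels_per_query: Sequence[List[str]]) -> Dict[int, float]:
    """Map 'number of labels' to the percentage of queries with that many labels."""
    if not labels_per_query:
        return {}
    counts = Counter(len(labels) for labels in labels_per_query)
    n = len(labels_per_query)
    return {k: 100 * v / n for k, v in sorted(counts.items())}


# Toy classifier output for four queries.
toy_output = [
    [],
    ["Symptoms"],
    ["Symptoms"],
    ["Drugs and Medications", "Vital signs"],
]
distribution = label_count_distribution(toy_output)
print(distribution)  # -> {0: 25.0, 1: 50.0, 2: 25.0}
```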
Data Analysis Results
• The average search query length for CVD is 3.88 words and 22.22 characters
• Around 80% of the CVD search queries have 3 or more words
• CVD search queries are longer than previously reported for both non-medical and medical queries
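Query-length statistics of this kind can be computed directly from the raw query strings. The sketch below is illustrative: the function name `query_length_stats` is hypothetical, words are split on whitespace, and characters are counted including spaces, which is an assumption because the slides do not say how characters were counted.

```python
from typing import Dict, Sequence


def query_length_stats(queries: Sequence[str]) -> Dict[str, float]:
    """Mean words, mean characters, and share of queries with 3 or more words."""
    if not queries:
        raise ValueError("queries must not be empty")
    word_counts = [len(q.split()) for q in queries]  # whitespace tokenization
    char_counts = [len(q) for q in queries]          # spaces included (assumption)
    n = len(queries)
    return {
        "mean_words": sum(word_counts) / n,
        "mean_chars": sum(char_counts) / n,
        "share_3plus_words": sum(1 for w in word_counts if w >= 3) / n,
    }


# Three queries taken from the top-queries table.
stats = query_length_stats(
    ["heart rate", "blood pressure chart", "how to lower blood pressure"]
)
print(stats)
```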
Discussion and Conclusion
• We found the use of MetaMap with UMLS concepts and semantic types to be an effective approach for customized health categorization (micro-average F-score of 0.8723 on the test set).
• The top searched health categories for CVD are ‘Diseases and Conditions’, ‘Vital Signs’, ‘Symptoms’, and ‘Living with’.
• Most of the queries (around 88%) are categorized into either one or two health categories.
• To the best of our knowledge, little research exists on online health information searching for chronic diseases, and for CVD in particular.
• This study addresses this knowledge gap and extends what is known about online health information search behavior.