Data Mining and Machine Learning in Population Health Studies
DATA MINING AND MACHINE LEARNING IN POPULATION HEALTH STUDIES
Marina Sokolova
Dept of ECM and School of EECS, University of Ottawa
Institute for Big Data Analytics
Data Mining
Science and technology that discover new knowledge in large data sets
Vast amount of accumulated data
  XXX,XXX,XXX records from health insurance companies in the NY state alone
  => automated methods
Ever-changing data
  New drugs, tests change the problem
  => adaptive methods
Beyond human processing capacities
Sept 25, 2014
Structured Data
id      patient_nbr  race  gender  age      …  metformin-rosiglitazone  metformin-pioglitazone  change  diabetesMed  readmitted
1       8222157      C     Female  [0-10)   …  No                       No                      No      No           NO
2       55629189     C     Female  [10-20)  …  No                       No                      Ch      Yes          >30
3       86047875     A     Female  [20-30)  …  No                       No                      No      Yes          NO
…
        1277391171   A     Male    [60-70)  …  No                       No                      Ch      Yes          <30
…
101765  41088789     C     Male    [70-80)  …  No                       No                      Ch      Yes          NO
101766  31693671     C     Female  [80-90)  …  No                       No                      Ch      Yes          NO
101767  1.75E+08     C     Male    [70-80)  …  No                       No                      No      No           NO
Databases, mostly organizational
Unstructured Data
Text
  He had an uncomplicated postoperative course and he was transferred. Advanced his diet on postop day # 4 to a transitional diet ...
  Experts fear that Ebola will mutate and become spreadable via cough or sneeze ...
Images
Privacy Protection
Individuals cannot be uniquely identified from the data set
Mandatory for health data custodians and human subject studies (HIPAA, PHIPA, etc.)
Privacy-preserving methods:
  De-identification, i.e., severing a data set from the identity of the data contributor; it may retain identifying information that a trusted party could re-link in certain situations
  Anonymization, i.e., irreversibly severing a data set from the identity of the data contributor
Data Mining Process
Step 1: Data pre-processing
  - Sample selection
  - Noise reduction
  - Unstructured-to-structured transformation
  - Privacy protection
Step 2: Information processing
  - Record classification
  - Clustering
  - Association rule mining
Step 3: Evaluation
  - Performance assessment
  - Result interpretation
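As a minimal sketch of the three steps on a toy readmission table: the records, the cleaning rule, and the majority-class baseline below are illustrative placeholders, not the methods used in the lecture.

```python
# Sketch of the three-step process on a toy readmission table.
from collections import Counter

records = [
    {"age": "[60-70)", "diabetesMed": "Yes", "readmitted": "<30"},
    {"age": "[70-80)", "diabetesMed": "Yes", "readmitted": "NO"},
    {"age": "[80-90)", "diabetesMed": "Yes", "readmitted": "NO"},
    {"age": "[20-30)", "diabetesMed": "No",  "readmitted": "NO"},
]

# Step 1: pre-processing -- drop records with missing values.
clean = [r for r in records if all(v is not None for v in r.values())]

# Step 2: information processing -- here, a majority-class baseline.
majority = Counter(r["readmitted"] for r in clean).most_common(1)[0][0]
predictions = [majority for _ in clean]

# Step 3: evaluation -- accuracy of the baseline.
accuracy = sum(p == r["readmitted"] for p, r in zip(predictions, clean)) / len(clean)
print(majority, accuracy)  # NO 0.75
```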
Machine Learning
Ability of algorithms to discover properties in previously unseen data, based on known properties found in training data
Algorithmic “muscles” of Data Mining
Common tasks:
  Classification of instances
  Clustering of instances
More on ML tasks
Classification/supervised learning
  An algorithm assigns data items into pre-defined categories (e.g., No, <30, >30)
  Categories do not overlap
  Binary classification is the most common
  There can be more than one category for an item (multi-labelled classification): C + Female + [10-20)
Clustering/unsupervised learning
  Grouping data items according to their similarities
  Clusters usually do not overlap
Essential Parts of ML
Learning modes
Training and test stages
Model selection (validation and testing, cross-validation, leave-one-out)
Algorithms (e.g., K-NN, Naïve Bayes, Support Vector Machines)
Performance evaluation
Learning Modes
Classification/Supervised
  Data items are labelled
  One page of professionally annotated text from a medical domain - $10,000
  600 personal health records - $1,500 for de-identification, and 1-2 months for an experienced Research Assistant to extract relevant information ($4,500 + overhead). Note that we usually need thousands of records!
  The most accurate results
Clustering/Unsupervised
  Data items are not labelled
  Plenty of such data
  Hard to evaluate, usually approximate results
Semi-supervised
  A mixture of labelled and unlabelled data
Training and Test Stages
Training and test data
  Data sets are split into non-overlapping parts
  Training sets are usually bigger than test sets
An algorithm is applied on the training set; its results are verified either automatically (supervised learning) or manually (unsupervised learning)
The algorithm parameters are adjusted depending on the results
The model with the best results is applied on the test set
Errors are counted on the test set only!
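A minimal sketch of such a non-overlapping split; the function name, test fraction, and random seed are illustrative choices, not prescribed by the lecture.

```python
# Sketch of a non-overlapping train/test split.
import random

def train_test_split(items, test_fraction=0.25, seed=42):
    """Shuffle and split so the training and test sets never overlap."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    # Training part is the bigger one, as the slide notes.
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))   # 75 25
print(set(train) & set(test))  # set() -- no overlap
```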
The Model Selection
Validation and test
  Divide the initial set into 3 parts (training, validation, test)
  Use 1 part for training and 1 part for validation
  Apply on the test part
Cross-validation
  Divide the initial set into 5 (10) parts
  Use 4 (9) parts for training and 1 part for test
  Repeat 5 (10) times for a new set of training and test parts
Leave-one-out
  Use all items but one for training
  Apply the algorithm on the remaining item
  Repeat for all data items
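The cross-validation scheme above can be sketched as index generation in pure Python; the fold count and toy size are illustrative.

```python
# Sketch of k-fold cross-validation index generation.
def k_fold_indices(n_items, k=5):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

splits = list(k_fold_indices(10, k=5))
print(len(splits))  # 5
# Every item appears in exactly one test fold:
tested = sorted(j for _, test_idx in splits for j in test_idx)
print(tested)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With k equal to the number of items, the same generator degenerates into leave-one-out.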
Algorithms
Probability-based (Naïve Bayes)
Prototype-based (K-NN)
Optimization-based (SVM)
Decision-based (Decision Trees)
Performance Measures
Accuracy = (tp + tn)/(tp + tn + fp + fn)
Precision (Pr) = tp/(tp + fp)
Recall (R) = tp/(tp + fn)
F-score = 2PrR/(Pr + R)
                 Positive (Algorithm)  Negative (Algorithm)
Positive (Data)  tp                    fn
Negative (Data)  fp                    tn
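The four measures follow directly from the confusion-matrix counts; a minimal sketch, with illustrative toy counts:

```python
# Compute the four performance measures from confusion-matrix counts.
def evaluate(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Toy counts: 8 true positives, 5 true negatives,
# 2 false positives, 2 false negatives.
acc, pr, r, f = evaluate(tp=8, tn=5, fp=2, fn=2)
print(acc, pr, r, f)  # accuracy ~ 0.765, precision = recall = F-score = 0.8
```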
Accuracy of Disease Diagnostics
Data                       Accuracy (%)  Algorithm
Pima Diabetics             76.30         Naive Bayes
Heart Disease              83.70         Naive Bayes
Breast cancer (Haberman)   71.20         K-means
Liver patients             68.78         Decision Trees
Breast cancer (Wisconsin)  97.50         Decision Trees
New Frontiers: Personal Health Information on the Web
Infodemiology studies the determinants and distribution of health information on the Internet (Gunther Eysenbach, 2004)
  Google Trends
  BioCaster
19% - 28.5% of all Internet users participate in online health-related discussions
Growth of the Internet of Things is expected to significantly increase sharing of personal health information
  Privacy protection has to be adjusted/re-developed
Privacy Protection in Big Data Analytics
Personal Health Information
Personal health information (PHI) is information about one’s health discussed by a patient in a clinical setting
PHI is the most vulnerable private information posted online:
  I have a family history of Alzheimer's disease. I have seen what it does and its sadness is a part of my life. I am already burdened with the knowledge that I am at risk.
  We're going for the basic blood tests, the NT scan, and the "Ashkenazi panel" since both XX and I are Jewish from E. European descent.
14/07/2014
Research Questions
Q1. Do people talk about health?
Q2. How do people talk about health?
Q3. What emotions can be found in health discussions?
Challenges of PHI Retrieval (Information Extraction)
General health information: they are promoting cancer awareness particularly lung cancer
Personal health information: I had a rare condition and half of my lung had to be removed
Irrelevant: I saw a guy chasing someone and screaming at the top of his lungs
Terminology: the transfer went well - my RE did it himself which was comforting. 2 embies (grade 1 but slow in development) so I am not holding my breath for a positive
Technical terms: Someone with 50 DB hearing aid gain with a total loss of 70 DB may not know that the place is producing 107 DB since it may not appear too loud to him since he only perceives 47 DB
Challenges of PHI Understanding (Semantic Analysis)
Sentiment: I am sickened by the thought …
Ailment: I feel sick for awhile; should see my physician
Opinion: I think it is evident that …
Improvement: The benefit is usually evident within a few days of starting it
Humor: don't forget that it's better for your health to enjoy your steak than to resent your sprouts
Complaint: After that my health deteriorated …
Challenges of Medical Electronic Resources
Electronic medical dictionaries are developed to analyze scientific publications
  the Medical Dictionary for Regulatory Activities (MedDRA): 8,561 unique terms / 86 PHI terms
  the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT): 44,802 unique terms / 108 PHI terms
Our Approach
Humans in the loop – manual annotation of data samples (supervised learning)
Advanced methods in data pre-processing
  Sentence splitting, tokenization, part-of-speech tagging, lemmatization for nouns and verbs
PHI resource building (e.g., ontology of PHI terms, HealthAffect lexicon)
Use of robust algorithms
  Naive Bayes
Appropriate evaluation methods
  fn estimation
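The first two pre-processing steps can be roughly sketched with regular expressions; the sample post and the patterns are illustrative, and a real pipeline would add POS tagging and lemmatization (e.g., with NLTK).

```python
# Rough sketch of sentence splitting and tokenization.
import re

post = "I had a rare condition. Half of my lung had to be removed."

# Sentence splitting on terminal punctuation followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", post)

# Tokenization: lower-cased word tokens per sentence.
tokens = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
print(len(sentences))  # 2
print(tokens[1][:4])   # ['half', 'of', 'my', 'lung']
```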
Data Sources
Online medical forums
  IVF
  Hearing loss
  Newborn screening for rare diseases
Social networks
  MySpace
  Twitter
  Facebook
Q1. Do people talk about health?
In 1000 randomly selected tweet threads, 15% of threads revealed personal health information
In 11,800 randomly selected MySpace posts, 6% of posts discussed personal health
On IVF forums, participants (95% women) mostly talk about health
Q1: It all depends on the context
On HL forums, participants talk about health and quality of life/life style
On newborn screening for rare diseases, parents often discuss privacy and physical hurt; at the same time, they seldom talk about health
In a student network on Facebook, participants do NOT talk about health
Q2: How do people talk about health?
Simple language
  For me the laser treatment had unpleasant side-effects.
  …got a huge bump on my forehead, fractured my nose.
Basic concepts
  Concussion, thyroid, asthma, fracture, hypothermia
  Cold, flu, injury, headache
Exception: Hearing Loss discussions involve more specific terms than other discussions
Q3. What emotions can be found in health discussions?
Range of emotions depends on the content of health issues
  Positive/negative/neutral on Twitter and HL forums
  Gratitude, encouragement, endorsement, confusion on IVF forums
Strength of emotional disclosure varies
  Outspoken emotional posts on newborn screening and IVF
  Muted emotions on MySpace
Performance Evaluation
We detect PHI:
  False negatives on social networks (11,800 messages) – 0.003 / baseline 0.031
  False negatives on peer-to-peer networks (2,300 documents) – 0.000 / baseline 0.031
We recognize PHI:
  Precision on Twitter (1000 threads) – 0.770 / baseline 0.419
We identify PHI-related opinions:
  F-score on HL forums (3515 sentences) – 0.685 / baseline 0.584
Data Sets Used in Population Health Studies
Indian Liver Patient Dataset
  http://archive.ics.uci.edu/ml/datasets/ILPD+%28Indian+Liver+Patient+Dataset%29
Breast Cancer Wisconsin (Diagnostic) Data Set
  http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Haberman's Survival Data Set (breast cancer, 1999)
  http://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival
Many more:
  http://archive.ics.uci.edu/ml/datasets.html?format=&task=cla&att=&area=life&numAtt=&numIns=&type=&sort=nameUp&view=table
Useful links
Weka 3: Data Mining Software – open source!
  http://www.cs.waikato.ac.nz/ml/weka/
Support Vector Machine (SVMlight) – open source!
  http://svmlight.joachims.org/
Andrew Ng’s (Stanford) video lectures on ML
  http://www.academicearth.org/courses/machine-learning
Benchmark data sets repository
  http://archive.ics.uci.edu/ml/
Probability-based: Naïve Bayes
Assumes that all the informative features are independent AND identically distributed.
Both assumptions are generally not true.
Pr(C | X) ∝ Pr(C) ∏_{k=1}^{n} Pr(x_k | C)
Being Optimistic Does not Hurt
Naïve Bayes can outperform sophisticated classifiers!

[Worked example: a small training set and test set of labelled binary feature vectors]
Pr(pos | x_1) = 1 · 1 · 1/2 · 1/2 = 1/4
Pr(neg | x_1) = 0 · … · 1/2 = 0
Pr(pos | x_2) = Pr(neg | x_2) = 0  →  human intervention is needed
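The computation above can be sketched for binary feature vectors; the training vectors are illustrative toy data, and smoothing is deliberately omitted so the zero-probability ties that require human intervention can occur, as in the worked example.

```python
# Sketch of Naive Bayes on binary feature vectors (no smoothing).
from collections import defaultdict

train = [([1, 0, 1, 1], 1), ([0, 0, 0, 1], 1),
         ([1, 1, 0, 1], 0), ([0, 1, 1, 0], 0)]

def fit(data):
    """Per-class prior and per-feature Pr(x_k = 1 | C)."""
    by_class = defaultdict(list)
    for x, c in data:
        by_class[c].append(x)
    model = {}
    for c, xs in by_class.items():
        prior = len(xs) / len(data)
        feat = [sum(col) / len(xs) for col in zip(*xs)]
        model[c] = (prior, feat)
    return model

def score(model, x):
    """Pr(C) * prod_k Pr(x_k | C) for each class."""
    out = {}
    for c, (prior, feat) in model.items():
        p = prior
        for xk, pk in zip(x, feat):
            p *= pk if xk == 1 else 1 - pk
        out[c] = p
    return out

model = fit(train)
print(score(model, [1, 0, 0, 1]))  # class 1 wins; class 0 scores zero
```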
Prototype-based: K-nearest neighbor
Uses the observations in the training set T closest in the input space to the entry x to form the conclusion Y.
Y can be a predicted class label of x.
Useful in practical applications
A closer look at K neighbors

Y(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i

Labels for the test example:
1. 2-NN: Green
2. 3-NN: Green
3. 4-NN: Ambiguous
4. 5-NN: Red
5. 6-NN: Red
6. 7-NN: Red
Good/bad things about KNN
Only two adjustable parameters:
  Number of neighbors
  Closeness (i.e., the distance between neighbors)
The output is easy to understand
Highly depends on the training data, i.e., the population sample
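A minimal sketch of K-NN majority voting in two dimensions; the labelled points and Euclidean distance are illustrative choices.

```python
# Sketch of K-NN classification by majority vote among nearest points.
from math import dist  # Euclidean distance, Python 3.8+

train = [((1.0, 1.0), "red"), ((1.2, 0.8), "red"),
         ((3.0, 3.0), "green"), ((3.2, 2.9), "green"), ((2.9, 3.1), "green")]

def knn_label(x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda item: dist(item[0], x))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

print(knn_label((2.8, 2.8), k=3))  # green
print(knn_label((1.1, 1.0), k=3))  # red
```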
Optimization-based algorithms: Support Vector Machines
Highly accurate classifiers
Extremely popular in publications
Seldom used in practice
Support Vector Machines
Hyper-planes in action:
  • various dimensions
  • linear hyper-planes differ by soft margins
Labels for the test example (three candidate hyper-planes L1, L2, L3):
  L1: neg
  L2: pos
  L3: neg
Good/bad things about SVM
Several adjustable parameters:
  Dimensions of discriminative hyper-planes
  Kernel functions
  Soft margin
Every parameter matters
  Choosing them is almost a random choice
Decision-based algorithms
Decision Trees
Decision Lists
Can beat SVM when efficiency is as important as effectiveness!