Data Mining and Machine Learning in Population Health Studies
DATA MINING AND MACHINE LEARNING IN POPULATION HEALTH STUDIES
Marina Sokolova
Dept of ECM and School of EECS, University of Ottawa
Institute for Big Data Analytics
Data Mining
Science and technology that discover new knowledge in large data sets
Vast amount of accumulated data
  XXX,XXX,XXX records from health insurance companies in the NY state alone
  => automated methods
Ever-changing data
  New drugs, tests change the problem
  => adaptive methods
Beyond human processing capacities
Sept 25, 2014
Structured Data
id      patient_nbr  race  gender  age      …  metformin-rosiglitazone  metformin-pioglitazone  change  diabetesMed  readmitted
1       8222157      C     Female  [0-10)   …  No                       No                      No      No           NO
2       55629189     C     Female  [10-20)  …  No                       No                      Ch      Yes          >30
3       86047875     A     Female  [20-30)  …  No                       No                      No      Yes          NO
…
        1277391171   A     Male    [60-70)  …  No                       No                      Ch      Yes          <30
…
101765  41088789     C     Male    [70-80)  …  No                       No                      Ch      Yes          NO
101766  31693671     C     Female  [80-90)  …  No                       No                      Ch      Yes          NO
101767  1.75E+08     C     Male    [70-80)  …  No                       No                      No      No           NO
Databases, mostly organizational
Unstructured Data
Text
  He had an uncomplicated postoperative course and he was transferred. Advanced his diet on postop day # 4 to a transitional diet ...
  Experts fear that Ebola will mutate and become spreadable via cough or sneeze ...
Images
Privacy Protection
Individuals cannot be uniquely identified from the data set
Mandatory for health data custodians and human subject studies (HIPAA, PHIPA, etc.)
Privacy-preserving methods:
  De-identification, i.e., severing a data set from the identity of the data contributor; it may retain identifying information that a trusted party could re-link in certain situations
  Anonymization, i.e., irreversibly severing a data set from the identity of the data contributor
Data Mining Process
Step 1: Data pre-processing
  - Sample selection
  - Noise reduction
  - Unstructured-to-structured transformation
  - Privacy protection
Step 2: Information processing
  - Record classification
  - Clustering
  - Association rule mining
Step 3: Evaluation
  - Performance assessment
  - Result interpretation
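As a minimal sketch of the three steps on a toy readmission table: the records, the cleaning rule, and the majority-class baseline below are illustrative placeholders, not the methods used in the lecture.

```python
# Sketch of the three-step process on a toy readmission table.
from collections import Counter

records = [
    {"age": "[60-70)", "diabetesMed": "Yes", "readmitted": "<30"},
    {"age": "[70-80)", "diabetesMed": "Yes", "readmitted": "NO"},
    {"age": "[80-90)", "diabetesMed": "Yes", "readmitted": "NO"},
    {"age": "[20-30)", "diabetesMed": "No",  "readmitted": "NO"},
]

# Step 1: pre-processing -- drop records with missing values.
clean = [r for r in records if all(v is not None for v in r.values())]

# Step 2: information processing -- here, a majority-class baseline.
majority = Counter(r["readmitted"] for r in clean).most_common(1)[0][0]
predictions = [majority for _ in clean]

# Step 3: evaluation -- accuracy of the baseline.
accuracy = sum(p == r["readmitted"] for p, r in zip(predictions, clean)) / len(clean)
print(majority, accuracy)  # NO 0.75
```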
Machine Learning
Ability of algorithms to discover properties in previously unseen data, based on known properties found in training data
Algorithmic “muscles” of Data Mining
Common tasks:
  Classification of instances
  Clustering of instances
More on ML tasks
Classification/supervised learning
  An algorithm assigns data items into pre-defined categories (e.g., No, <30, >30)
  Categories do not overlap
  Binary classification is the most common
  There can be more than one category for an item (multi-labelled classification): C + Female + [10-20)
Clustering/unsupervised learning
  Grouping data items according to their similarities
  Clusters usually do not overlap
Essential Parts of ML
Learning modes
Training and test stages
Model selection (validation and testing, cross-validation, leave-one-out)
Algorithms (e.g., K-NN, Naïve Bayes, Support Vector Machines)
Performance evaluation
Learning Modes
Classification/Supervised
  Data items are labelled
  One page of professionally annotated text from a medical domain - $10,000
  600 personal health records - $1,500 for de-identification, and 1-2 months for an experienced Research Assistant to extract relevant information ($4,500 + overhead). Note that we usually need thousands of records!
  The most accurate results
Clustering/Unsupervised
  Data items are not labelled
  Plenty of such data
  Hard to evaluate, usually approximate results
Semi-supervised
  A mixture of labelled and unlabelled data
Training and Test Stages
Training and test data
  Data sets are split into non-overlapping parts
  Training sets are usually bigger than test sets
An algorithm is applied on the training set; its results are verified either automatically (supervised learning) or manually (unsupervised learning)
The algorithm parameters are adjusted depending on the results
The model with the best results is applied on the test set
Errors are counted on the test set only!
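A minimal sketch of such a non-overlapping split; the function name, test fraction, and random seed are illustrative choices, not prescribed by the lecture.

```python
# Sketch of a non-overlapping train/test split.
import random

def train_test_split(items, test_fraction=0.25, seed=42):
    """Shuffle and split so the training and test sets never overlap."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    # Training part is the bigger one, as the slide notes.
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))   # 75 25
print(set(train) & set(test))  # set() -- no overlap
```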
The Model Selection
Validation and test
  Divide the initial set into 3 parts (training, validation, test)
  Use 1 part for training and 1 part for validation
  Apply on the test part
Cross-validation
  Divide the initial set into 5 (10) parts
  Use 4 (9) parts for training and 1 part for test
  Repeat 5 (10) times for a new set of training and test parts
Leave-one-out
  Use all items but one for training
  Apply the algorithm on the remaining item
  Repeat for all data items
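The cross-validation scheme above can be sketched as index generation in pure Python; the fold count and toy size are illustrative.

```python
# Sketch of k-fold cross-validation index generation.
def k_fold_indices(n_items, k=5):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

splits = list(k_fold_indices(10, k=5))
print(len(splits))  # 5
# Every item appears in exactly one test fold:
tested = sorted(j for _, test_idx in splits for j in test_idx)
print(tested)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With k equal to the number of items, the same generator degenerates into leave-one-out.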
Algorithms
Probability-based (Naïve Bayes)
Prototype-based (K-NN)
Optimization-based (SVM)
Decision-based (Decision Trees)
Performance Measures
Accuracy = (tp + tn)/(tp + tn + fp + fn)
Precision (Pr) = tp/(tp + fp)
Recall (R) = tp/(tp + fn)
F-score = 2PrR/(Pr + R)
                 Positive (Algorithm)  Negative (Algorithm)
Positive (Data)  tp                    fn
Negative (Data)  fp                    tn
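The four measures follow directly from the confusion-matrix counts; a minimal sketch, with illustrative toy counts:

```python
# Compute the four performance measures from confusion-matrix counts.
def evaluate(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Toy counts: 8 true positives, 5 true negatives,
# 2 false positives, 2 false negatives.
acc, pr, r, f = evaluate(tp=8, tn=5, fp=2, fn=2)
print(acc, pr, r, f)  # accuracy ~ 0.765, precision = recall = F-score = 0.8
```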
Accuracy of Disease Diagnostics
Data                       Accuracy (%)  Algorithm
Pima Diabetics             76.30         Naive Bayes
Heart Disease              83.70         Naive Bayes
Breast cancer (Haberman)   71.20         K-means
Liver patients             68.78         Decision Trees
Breast cancer (Wisconsin)  97.50         Decision Trees
New Frontiers: Personal Health Information on the Web
Infodemiology studies the determinants and distribution of health information on the Internet (Gunther Eysenbach, 2004)
  Google Trends
  BioCaster
19% - 28.5% of all Internet users participate in online health-related discussions
Growth of the Internet of Things is expected to significantly increase sharing of personal health information
  Privacy protection has to be adjusted/re-developed
Privacy Protection in Big Data Analytics
Personal Health Information
Personal health information (PHI) is information about one’s health discussed by a patient in a clinical setting
PHI is the most vulnerable private information posted online:
  I have a family history of Alzheimer's disease. I have seen what it does and its sadness is a part of my life. I am already burdened with the knowledge that I am at risk.
  We're going for the basic blood tests, the NT scan, and the "Ashkenazi panel" since both XX and I are Jewish from E. European descent.
14/07/2014
Research Questions
Q1. Do people talk about health?
Q2. How do people talk about health?
Q3. What emotions can be found in health discussions?
Challenges of PHI Retrieval (Information Extraction)
General health information: they are promoting cancer awareness particularly lung cancer
Personal health information: I had a rare condition and half of my lung had to be removed
Irrelevant: I saw a guy chasing someone and screaming at the top of his lungs
Terminology: the transfer went well - my RE did it himself which was comforting. 2 embies (grade 1 but slow in development) so I am not holding my breath for a positive
Technical terms: Someone with 50 DB hearing aid gain with a total loss of 70 DB may not know that the place is producing 107 DB since it may not appear too loud to him since he only perceives 47 DB
Challenges of PHI Understanding (Semantic Analysis)
Sentiment: I am sickened by the thought …
Ailment: I feel sick for awhile; should see my physician
Opinion: I think it is evident that …
Improvement: The benefit is usually evident within a few days of starting it
Humor: don't forget that it's better for your health to enjoy your steak than to resent your sprouts
Complaint: After that my health deteriorated …
Challenges of Medical Electronic Resources
Electronic medical dictionaries are developed to analyze scientific publications
  the Medical Dictionary for Regulatory Activities (MedDRA): 8,561 unique terms / 86 PHI terms
  the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT): 44,802 unique terms / 108 PHI terms
Our Approach
Humans in the loop – manual annotation of data samples (supervised learning)
Advanced methods in data pre-processing
  Sentence splitting, tokenization, part-of-speech tagging, lemmatization for nouns and verbs
PHI resource building (e.g., ontology of PHI terms, HealthAffect lexicon)
Use of robust algorithms
  Naive Bayes
Appropriate evaluation methods
  fn estimation
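The first two pre-processing steps can be roughly sketched with regular expressions; the sample post and the patterns are illustrative, and a real pipeline would add POS tagging and lemmatization (e.g., with NLTK).

```python
# Rough sketch of sentence splitting and tokenization.
import re

post = "I had a rare condition. Half of my lung had to be removed."

# Sentence splitting on terminal punctuation followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", post)

# Tokenization: lower-cased word tokens per sentence.
tokens = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
print(len(sentences))  # 2
print(tokens[1][:4])   # ['half', 'of', 'my', 'lung']
```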
Data Sources
Online medical forums
  IVF
  Hearing loss
  Newborn screening for rare diseases
Social networks
  MySpace
  Twitter
  Facebook
Q1. Do people talk about health?
In 1000 randomly selected tweet threads, 15% of threads revealed personal health information
In 11,800 randomly selected MySpace posts, 6% of posts discussed personal health
On IVF forums, participants (95% women) mostly talk about health
Q1: It all depends on the context
On HL forums, participants talk about health and quality of life/life style
On newborn screening for rare diseases, parents often discuss privacy and physical hurt; at the same time, they seldom talk about health
In a student network on Facebook, participants do NOT talk about health
Q2: How do people talk about health?
Simple language
  For me the laser treatment had unpleasant side-effects.
  …got a huge bump on my forehead, fractured my nose.
Basic concepts
  Concussion, thyroid, asthma, fracture, hypothermia
  Cold, flu, injury, headache
Exception: Hearing Loss discussions involve more specific terms than other discussions
Q3. What emotions can be found in health discussions?
Range of emotions depends on the content of health issues
  Positive/negative/neutral on Twitter and HL forums
  Gratitude, encouragement, endorsement, confusion on IVF forums
Strength of emotional disclosure varies
  Outspoken emotional posts on newborn screening and IVF
  Muted emotions on MySpace
Performance Evaluation
We detect PHI:
  False negatives on social networks (11,800 messages) – 0.003 / baseline 0.031
  False negatives on peer-to-peer networks (2,300 documents) – 0.000 / baseline 0.031
We recognize PHI:
  Precision on Twitter (1000 threads) – 0.770 / baseline 0.419
We identify PHI-related opinions:
  F-score on HL forums (3515 sentences) – 0.685 / baseline 0.584
Data Sets Used in Population Health Studies
Indian Liver Patient Dataset
  http://archive.ics.uci.edu/ml/datasets/ILPD+%28Indian+Liver+Patient+Dataset%29
Breast Cancer Wisconsin (Diagnostic) Data Set
  http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Haberman's Survival Data Set (breast cancer, 1999)
  http://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival
Many more:
  http://archive.ics.uci.edu/ml/datasets.html?format=&task=cla&att=&area=life&numAtt=&numIns=&type=&sort=nameUp&view=table
Useful links
Weka 3: Data Mining Software – open source!
  http://www.cs.waikato.ac.nz/ml/weka/
Support Vector Machine (SVMlight) – open source!
  http://svmlight.joachims.org/
Andrew Ng’s (Stanford) video lectures on ML
  http://www.academicearth.org/courses/machine-learning
Benchmark data sets repository
  http://archive.ics.uci.edu/ml/
Probability-based: Naïve Bayes
Assumes that all the informative features are independent AND identically distributed.
Both assumptions are generally not true.
Pr(C | X) ∝ Pr(C) ∏_{k=1}^{n} Pr(x_k | C)
Being Optimistic Does not Hurt
Naïve Bayes can outperform sophisticated classifiers!

[Worked example: a small training set and test set of labelled binary feature vectors]
Pr(pos | x_1) = 1 · 1 · 1/2 · 1/2 = 1/4
Pr(neg | x_1) = 0 · … · 1/2 = 0
Pr(pos | x_2) = Pr(neg | x_2) = 0  →  human intervention is needed
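The computation above can be sketched for binary feature vectors; the training vectors are illustrative toy data, and smoothing is deliberately omitted so the zero-probability ties that require human intervention can occur, as in the worked example.

```python
# Sketch of Naive Bayes on binary feature vectors (no smoothing).
from collections import defaultdict

train = [([1, 0, 1, 1], 1), ([0, 0, 0, 1], 1),
         ([1, 1, 0, 1], 0), ([0, 1, 1, 0], 0)]

def fit(data):
    """Per-class prior and per-feature Pr(x_k = 1 | C)."""
    by_class = defaultdict(list)
    for x, c in data:
        by_class[c].append(x)
    model = {}
    for c, xs in by_class.items():
        prior = len(xs) / len(data)
        feat = [sum(col) / len(xs) for col in zip(*xs)]
        model[c] = (prior, feat)
    return model

def score(model, x):
    """Pr(C) * prod_k Pr(x_k | C) for each class."""
    out = {}
    for c, (prior, feat) in model.items():
        p = prior
        for xk, pk in zip(x, feat):
            p *= pk if xk == 1 else 1 - pk
        out[c] = p
    return out

model = fit(train)
print(score(model, [1, 0, 0, 1]))  # class 1 wins; class 0 scores zero
```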
Prototype-based: K-nearest neighbor
Uses the observations in the training set T closest in the input space to the entry x to form the conclusion Y.
Y can be a predicted class label of x.
Useful in practical applications
A closer look at K neighbors

Y(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i

Labels for the test example:
1. 2-NN: Green
2. 3-NN: Green
3. 4-NN: Ambiguous
4. 5-NN: Red
5. 6-NN: Red
6. 7-NN: Red
Good/bad things about KNN
Only two adjustable parameters:
  Number of neighbors
  Closeness (i.e., the distance between neighbors)
The output is easy to understand
Highly depends on the training data, i.e., the population sample
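A minimal sketch of K-NN majority voting in two dimensions; the labelled points and Euclidean distance are illustrative choices.

```python
# Sketch of K-NN classification by majority vote among nearest points.
from math import dist  # Euclidean distance, Python 3.8+

train = [((1.0, 1.0), "red"), ((1.2, 0.8), "red"),
         ((3.0, 3.0), "green"), ((3.2, 2.9), "green"), ((2.9, 3.1), "green")]

def knn_label(x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda item: dist(item[0], x))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

print(knn_label((2.8, 2.8), k=3))  # green
print(knn_label((1.1, 1.0), k=3))  # red
```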
Optimization-based algorithms: Support Vector Machines
Highly accurate classifiers
Extremely popular in publications
Seldom used in practice
Support Vector Machines
Hyper-planes in action:
  • various dimensions
  • linear hyper-planes differ by soft margins
Labels for the test example (three candidate hyper-planes L1, L2, L3):
  L1: neg
  L2: pos
  L3: neg
Good/bad things about SVM
Several adjustable parameters:
  Dimensions of discriminative hyper-planes
  Kernel functions
  Soft margin
Every parameter matters
  Choosing them is almost a random choice
Decision-based algorithms
Decision Trees
Decision Lists
Can beat SVM when efficiency is as important as effectiveness!