Day 2 - Thursday · Title: Day 2 - Thursday.pptx · Author: Bobak Mortazavi · Created Date: 9/4/2017 6:37:25 PM



Page 1

Machine Learning for Clinical Care Overview

Fall 2017 · Thursday, August 31

 

Outline

• Describe modeling
• Introduce learning goals
• Supervised Machine Learning
• Unsupervised Machine Learning
• R or Python (2.7 – Anaconda)

Lesson's Goals

• Question(s):
  – What is Machine Learning?
  – How is it used in Clinical Data?
  – How can it be used in systems?

• Goal(s):
  – Learn the difference between Supervised and Unsupervised Machine Learning
  – Learn the metrics that are used to evaluate model effectiveness
  – Learn what packages are available in R and Python

What is Machine Learning?

• Algorithms/methods for statistical learning from data

• In the context of medical analytics:
  – Pattern recognition to calculate probability/risk of certain events
  – Match similar individuals to compare outcomes

Valuable References

• TL: Reference for tools with WEKA
• BL: Good reference on metrics and classifiers
• TR: Used in 633 – good mathematical reference
• BR: Great statistical machine learning text – theory based

Supervised vs. Unsupervised

• Unsupervised: use the data to determine some form of underlying organization or structure

• Supervised: pattern recognition/regression. Knowing a specific value:
  – Y = f(X)
  – Being able to predict p(y' | X') to generate y' = f(X')

Page 2

Unsupervised Machine Learning

• Clustering
• Determining similarity
• Learning a meaningful representation of data from a large, noisy set of data
• CSCE 633 – Machine Learning, or a similar course
• Examples:
  – K-Means, Ward's hierarchical clustering, Expectation Maximization
  – Packages: R: mclust, hclust; Python: scikit-learn
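The scikit-learn side of this can be sketched in a few lines; the two-cluster data below are synthetic stand-ins for patient feature vectors, not anything from the course:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic groups standing in for patient features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

# K-Means recovers the underlying grouping without any labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_            # cluster assignment per row
centers = km.cluster_centers_  # one centroid per cluster
```

The R packages named on the slide (mclust, hclust) fill the same role for model-based and hierarchical clustering.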

Supervised Machine Learning

• Know the difference between two classes of objects (multiclass extends from this)
• Want to find the function that defines one versus the other
• (Regression is slightly different)

Supervised Machine Learning: an example from Duda/Stork

Supervised Machine Learning: Finding the right input feature

Supervised Machine Learning: Finding the right input feature

Supervised Machine Learning: using multiple input features

Page 3

Supervised Machine Learning: non-linear models and overfitting

Machine Learning Pipeline

• Collect input data (e.g. body-wearable sensors)
• Segment appropriately (e.g. w.r.t. time)
• Extract input features (reduce dimensionality)
• Train machine learning model
• Validate
• Adjust decision threshold based upon costs
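The training and validation steps of the pipeline chain naturally; a minimal scikit-learn sketch on synthetic data (the collection, segmentation, and feature-extraction steps are collapsed into ready-made features here, purely for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # stand-in for extracted features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # hypothetical binary outcome

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Train the model, then validate on held-out data
model = Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression())])
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
```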

Inputting Data

• Data might be of various types
• Too many dimensions – how do you reduce?

Dimensionality Reduction: PCA
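This slide is a figure in the original deck; as a hedged sketch of the idea, PCA projects high-dimensional data onto the directions of largest variance. The data below are synthetic, built so that almost all variance lies along a single direction:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 samples in 10 dimensions, with variance concentrated in one direction
base = rng.normal(size=(100, 1))
X = base @ rng.normal(size=(1, 10)) + 0.01 * rng.normal(size=(100, 10))

pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)                   # now 100 x 2
var_explained = pca.explained_variance_ratio_  # variance fraction per component
```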

Dimensionality Reduction: Filter Methods

• Use statistical tests to pick the most important variables
• Likelihood ratio test (and p-value)
• Information gain
• Correlation
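A minimal sketch of a filter method using one such statistical test (scikit-learn's univariate F-test; the likelihood-ratio and information-gain variants follow the same score-then-select pattern). The data are synthetic:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = rng.normal(size=(200, 6))
X[:, 0] += 2 * y                 # only column 0 actually carries signal

# Score every feature independently with the test, keep the k best
selector = SelectKBest(f_classif, k=2).fit(X, y)
kept = selector.get_support(indices=True)
```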

Dimensionality Reduction: Wrapper Methods

• Build machine learning models with various sets of variables and select the one with the best accuracy

• How do you measure accuracy?
• How do you add features? In what order?
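One common answer to the ordering question is backward elimination: fit the model, drop the weakest feature, repeat. Scikit-learn's RFE implements that wrapper strategy; a sketch on synthetic data where only two columns are informative:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # columns 0 and 1 carry the signal

# Repeatedly refit the model and discard the lowest-weight feature
rfe = RFE(LogisticRegression(), n_features_to_select=2).fit(X, y)
kept = rfe.get_support(indices=True)
```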

Page 4

Data Setup

• What happens when features are:
  – 0/1 binary patient histories?
  – 1/2/3/4 categorical variables with no ordinal nature?
  – Continuous values in different numeric ranges?

Inputting Data: Normalization

• Is it best to leave features in their own numeric ranges?
• Put everything in a simple [0,1] interval? [-1,1]?
• Normalize by standard centering and scaling?
• Convert all categorical variables to one-hot binary encoding?
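Each of those options maps to a one-liner in scikit-learn; a sketch on toy values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder

X = np.array([[1.0], [5.0], [9.0]])

X01 = MinMaxScaler().fit_transform(X)   # rescale into [0, 1]
Xz = StandardScaler().fit_transform(X)  # center to mean 0, unit variance

# Categorical values become one-hot binary columns
cats = np.array([["A"], ["B"], ["A"]])
onehot = OneHotEncoder().fit_transform(cats).toarray()
```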

Inputting Data: Missing Data

• Data will often be missing for a variety of reasons
• Machine learning algorithms will need complete data sets
• Can impute in a variety of ways:
  – Median value
  – Mean value
  – Create a machine learning function from complete data to estimate the value
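The median/mean options above are exactly what scikit-learn's SimpleImputer provides; a sketch on a toy matrix with missing entries:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Each missing cell is filled from the observed values in its own column
X_median = SimpleImputer(strategy="median").fit_transform(X)
X_mean = SimpleImputer(strategy="mean").fit_transform(X)
```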

Outputs

• Supervised learning: p(y' | X')
• Often, classification algorithms will output 0/1 (-1/1) – what are they doing?
• Decision threshold at 50%
• This class is concerned with the probability values

Logistic Regression

• Linear, direct regression
• What does this do to restrict inputs?
• How does it optimize the coefficients?
• F(x) is a probability
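A sketch of the two halves of the answer: the logistic (sigmoid) link squashes any linear score into (0, 1), and the fitted scikit-learn classifier exposes those probabilities directly (synthetic data, purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # Maps any real-valued linear score into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
p = clf.predict_proba(X)[:, 1]   # P(y = 1 | x), not just a 0/1 label
labels = (p >= 0.5).astype(int)  # the default 50% decision threshold
```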

Regression: Loss function

Page 5

Regression: Loss function: https://courses.cs.washington.edu/courses/cse547/16sp/slides/logistic-SGD.pdf


Ensemble Methods

• Multiple learners for the same problem
• Converge onto the right learner
• Can select features while it builds
• Regularization! (Lasso and Elastic Net)

Regression: Loss function: https://courses.cs.washington.edu/courses/cse547/16sp/slides/logistic-SGD.pdf

LR with Lasso

• Great starting point for all work
• Important parameter to learn: lambda
• Python:
  – scikit-learn: LogisticRegression, SGDClassifier, SGDRegressor (slight issue with sample weights)

• R package: glmnet
  – Use cv.glmnet to find the right model + variables
  – coef to get the predicted coefficients
  – parallel=TRUE option important for R (with the doParallel and foreach packages)
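On the Python side, the glmnet-style L1 fit is a one-liner; note that scikit-learn's C is the inverse of glmnet's lambda, so smaller C means stronger shrinkage. Synthetic data with only two informative columns:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] - X[:, 1] > 0).astype(int)  # only columns 0 and 1 matter

# L1 (lasso) penalty drives uninformative coefficients exactly to zero;
# C plays the role of 1/lambda from the glmnet parameterization
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
n_nonzero = int(np.count_nonzero(clf.coef_))
```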

Random Forest

Source: http://www.iis.ee.ic.ac.uk/icvl/iccv09_tutorial.html

Page 6

Random Forest: Advantages

• No need to normalize variables
• No need to standardize variables
• Handles categorical variables well
• Multi-class classification
• Probabilities based upon leaf nodes
• Internally tests variable importance and generates a ranking
• Packages:
  – Python: scikit-learn
  – R: randomForest, importance=TRUE
  – Ranking of variables by Gini or by mean decrease in accuracy
  – Important parameter: number of trees
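A sketch of the scikit-learn side, showing the importance ranking and leaf-based probabilities mentioned above (synthetic data; only one column is informative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(int)  # only column 0 drives the label

# n_estimators is the number-of-trees parameter highlighted on the slide
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = rf.feature_importances_  # Gini-based importance ranking
proba = rf.predict_proba(X)[:, 1]      # fraction of trees voting for class 1
```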

Gradient Descent Boosting

Source: http://www.iis.ee.ic.ac.uk/icvl/iccv09_tutorial.html

Gradient Descent Boosting

• Package: XGBoost (R and Python)
• Understanding variable importance is tricky with xgboost
• Important parameters:
  – Number of trees
  – Learning rate (eta)
  – Max depth of each tree
  – Objective: 'binary:logistic'
  – nthread (parallelization)
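xgboost itself may not be installed everywhere, so the sketch below uses scikit-learn's GradientBoostingClassifier as a stand-in; it exposes the same knobs under different names (n_estimators = number of trees, learning_rate = eta, max_depth as-is). Synthetic data with a non-linear target:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # interaction a single stump misses

# Same roles as xgboost's number of rounds, eta, and max_depth
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=0).fit(X, y)
train_acc = gb.score(X, y)
```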

Other Popular Methods

• Neural Networks
• Support Vector Machines
• Nearest-neighbor classifiers
• Other boosting methods
• Hierarchical classifiers

Support Vector Machine

Source: http://docs.opencv.org/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html

SVM

• e1071 in R; scikit-learn/libSVM in Python
• Requires prior feature selection techniques (otherwise ripe for overfitting)
• Not good for calibration
• A number of hyperparameters to set
• Kernels dependent on data type and data size
• Important classifier to learn
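A minimal scikit-learn sketch: the RBF kernel handles a non-linear boundary, and C and gamma are examples of the hyperparameters that must be set (synthetic ring-shaped data, for illustration only):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # ring-shaped boundary

# C and gamma are hyperparameters to tune; kernel choice depends on the data
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
acc = svm.score(X, y)
```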

Page 7

Cross-Validation

• How can you test if your models are accurate?
• Cannot train and test on the same people (except in small, specific circumstances)
• Often do not have a second data set with the same variables to test against externally
• Internal validation by splitting up the cohort

K-fold stratified cross-validation

[Flow diagram: load the data set; split it into Bleeds and Non-Bleeds; divide each class into five 20% folds (Bleeds 20% × 5, Non-Bleeds 20% × 5).]
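The stratified split sketched above maps directly onto scikit-learn's StratifiedKFold: every fold holds 20% of each class, so the bleed/non-bleed ratio is preserved. A sketch with a made-up imbalanced label vector:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 20 + [0] * 80)  # imbalanced: 20 "bleeds", 80 "non-bleeds"

# Stratification keeps the class ratio identical in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_bleeds = [int(y[test].sum()) for _, test in skf.split(X, y)]
fold_sizes = [len(test) for _, test in skf.split(X, y)]
```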

Grid Search

[Flow diagram: load the data set and build a 5-fold cross-validation. For each fold: select the training fold and the testing fold; create a 5-fold cross-validation of the training set; use the 5-fold CV of the training set to select features in a hold-out environment; generate a feature ranking and build a model on the fold; validate the model and view feature importance in the holdout; generate statistics and confidence intervals on the CV. Repeat for each fold.]
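The key point of the flow is that feature and parameter selection happen inside an inner CV on the training folds only. A sketch of that nested structure with scikit-learn's GridSearchCV (synthetic data; the C grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# Inner 5-fold CV tunes C on the training folds; wrapping it in
# cross_val_score adds the outer loop, so the fold used for the final
# statistics never influences the hyperparameter choice
inner = GridSearchCV(LogisticRegression(penalty="l1", solver="liblinear"),
                     {"C": [0.01, 0.1, 1.0]}, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=StratifiedKFold(n_splits=5))
```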

Clinical Survival Analysis

• Clinical prediction models are often logistic regression (or some variation of that, or Poisson regression)
• Clinical datasets often come with one extra factor – time to event
• Thus, some clinical models want to present the probability of an event + the likelihood of that event within given windows of time

Kaplan-Meier Survival

• At any given point in time, the probability of survival is the ratio of surviving patients to patients at risk
• At any time, if a patient has already died or dropped out of being monitored, they are no longer considered in the ratio
• R: survival
• Python: lifelines? scikit-survival?
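Since the Python packages above are left as open questions, here is a dependency-free sketch of the estimator exactly as this slide defines it: at each death, the survival probability is multiplied by the ratio of survivors to patients at risk, and patients who died or dropped out leave the risk set. The example data are made up:

```python
def kaplan_meier(times, events):
    """Product-limit estimator: times = follow-up time per patient,
    events: 1 = death observed at that time, 0 = censored (dropped out)."""
    at_risk = len(times)
    survival, curve = 1.0, []
    # Process deaths before censorings at tied times (standard convention);
    # censored patients leave the risk set without changing the estimate
    for t, e in sorted(zip(times, events), key=lambda te: (te[0], -te[1])):
        if e == 1:
            survival *= (at_risk - 1) / at_risk
            curve.append((t, survival))
        at_risk -= 1
    return curve

km_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
# After the death at t=1, 3 of 4 at-risk patients survive (S = 0.75);
# the dropout at t=2 shrinks the risk set without changing S
```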

Kaplan-Meier Survival

[Figure: Kaplan-Meier survival curves for four groups (labeled 1–4); x-axis: time (days), 0–350; y-axis: survival, 0.0–1.0.]

Page 8

Cox Proportional Hazards

• Another survival method
• Models the rate of failure
• The coefficients give comparative effectiveness information when comparing test vs. control
• More on evaluating the outputs next lecture
• Packages:
  – R: survival
  – Python: lifelines?

Outputs

• Probabilities vs. classification label
• Odds ratios
• Hazard ratios
• Variable importance
• Decision threshold + associated metrics