An Introduction to Clinical Natural … Questions addressed in this ½ day tutorial …

Post on 11-Apr-2018


Leonard D’Avolio, Dina Demner-Fushman, Wendy W. Chapman

An Introduction to Clinical Natural Language Processing

Questions addressed in this ½ day tutorial

• What is natural language processing (NLP)?
• Why does it matter?
• How is it being used?
• What are the basic approaches to it?
• What considerations are there in using it?
• How should you evaluate it?
• Where is the field today?
• Where is it headed?
• How can I learn more?

Format

• Focus on clinical NLP
• Some discussion of literature & phenotyping
• 70% basic, 30% intermediate
• A lot of material covered at a high level
• PLEASE interrupt with questions
• Planned 15 minute break
• Don’t forget your survey
• Part 2 in Jefferson East

Outline

1. What is NLP and how is it used in medicine? (Dina)
2. Goals and challenges of clinical NLP (Wendy)
3. The methods of NLP (Leonard)
4. Annotation & evaluation (Dina)
5. Implementation considerations (Wendy)
6. Current state, future progress, available resources (Leonard)

Dina Demner-Fushman

What is NLP and how is it being used in medicine?

Why natural language processing?

• Increasing amounts of biomedical literature
– Extracting facts, relations, events into knowledge repositories (text mining)
– Model organism database curation
– Question answering (TREC Genomics track)
– Literature based discovery
• Increasing demands for use of EMR data
– Phenotyping for genomic-related analysis
– Linking evidence for evidence-based medicine
– Biosurveillance
– Quality measures
• Majority of EMR data is free text

What is natural language processing?

Electronic Medical Records, MEDLINE Articles / Abstracts → Natural Language Processing → Structured Data (machine interpretable)

• Classify
• Extract
• Summarize

Examples of Uses of Clinical NLP

• Classify: classify a chief complaint into a syndrome category, e.g. “SOB/cough” = Respiratory
• Extract: # of lymph nodes removed during colorectal cancer surgery
• Summarize: from an H&P note, list chronic conditions; summarize family history of prostate cancer

BioNLP Examples

• Classify: triage of articles likely to have experimental evidence; find evidence to assign top-level GO terms
• Extract: extract bio-molecular events, e.g. phosphorylation of TRAF2 -> (Type:Phosphorylation, Theme:TRAF2)
• Summarize: summarize full text documents; Gene Reference into Function (GeneRIF)
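The chief-complaint example above (“SOB/cough” = Respiratory) can be sketched as a keyword lookup. This is only an illustration: the `SYNDROME_KEYWORDS` table and the function name are hypothetical, and a working syndromic classifier would need a much larger lexicon or a trained model.

```python
import re

# Hypothetical keyword-to-syndrome table; entries are illustrative only.
SYNDROME_KEYWORDS = {
    "Respiratory": {"sob", "cough", "wheezing", "dyspnea"},
    "Gastrointestinal": {"vomiting", "diarrhea", "nausea"},
}


def classify_chief_complaint(text):
    """Return the sorted syndrome categories whose keywords appear in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        syndrome
        for syndrome, keywords in SYNDROME_KEYWORDS.items()
        if tokens & keywords
    )
```

Calling `classify_chief_complaint("SOB/cough")` returns the single category `Respiratory`; a complaint with no known keyword returns an empty list rather than a guess.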

Biomedical Usage

Wendy Chapman

Goals and Challenges of Clinical NLP

 

Detect Nosocomial Infections

Antibiotic Assistant* (LDS Hospital)

Inputs:
• temperature
• white blood cell count
• infiltrate compatible with pneumonia (from chest x-ray report)
• . . .

Outputs:
1) Alert physician: patient might need anti-infective therapy
2) Suggest type and dose of antibiotic
- allergies
- insurance
- age
- renal function

* Evans RS, et al. N Engl J Med 1998

Phenotyping

Identify symptoms that co-occur with lung cancer

ED Report → NLP System → Feature 1, Feature 2, …, Feature n → Classifier → Predictive of Lung Cancer / Not

Two Simple NLP Tasks

1. Find all relevant phrases in ED Report
2. Map individual phrases to standard features

Your Task

• Highlight every instance of features in sample report
• Mark most specific instance
– E.g., “chest pain” preferred over “pain”
• Do not mark time, negation, or uncertainty

Find Relevant Features in ED Report

• Productive cough
• Dyspnea
• Sinusitis
• Pneumonia
• Wheezing
• Tachypnea
• Fever
• Rales
• Cervical adenopathy

Possible values: acute, historical, absent

How did you do?

Why is NLP Difficult?

• Named entity recognition
– Linguistic variation
– Polysemy
– Finding validation
– Implication
• Contextual attribute assignment
– Negation
– Uncertainty
– Temporality
• Discourse processing
– Report structure
– Coreference

Linguistic Variation
Different Words with the Same Meaning

Derivation
mediastinal = mediastinum

Inflection
opacity = opacities; cough = coughed

Synonymy
Addison’s Disease: Addison melanoderma, adrenal insufficiency, adrenocortical insufficiency, asthenia pigmentosa, bronzed disease, melasma addisonii, …

Chest wall tenderness: chest wall did demonstrate some slight tenderness when the patient had pressure applied to the right side of the thoracic cage

Polysemy
One Word With Multiple Meanings

General polysemy
Patient was prescribed codeine upon discharge
The discharge was yellow and purulent

Acronyms and Abbreviations
APC: activated protein c, adenomatosis polyposis coli, adenomatous polyposis coli, antigen presenting cell, aerobic plate count, advanced pancreatic cancer, age period cohort, alfalfa protein concentrated, allophycocyanin, anaphase promoting complex, anoxic preconditioning, anterior piriform cortex, antibody producing cells, atrial premature complex, …

Negation

Approximately half of all clinical concepts in dictated reports are negated*

Explicit negation
“The mediastinum is not widened”
Mediastinal widening: absent

Implied absence without negation
“Lungs are clear upon auscultation”
Rales/crackles: absent
Rhonchi: absent
Wheezing: absent

*Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. Evaluation of negation phrases in narrative clinical reports. Proc AMIA Symp. 2001:105-9.
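A minimal sketch of explicit negation detection, loosely in the spirit of trigger-phrase approaches such as NegEx but not a reimplementation of it. The trigger list, the five-token window, and the function name are assumptions made for illustration.

```python
import re

# Illustrative trigger list only; real lexicons are far larger.
NEGATION_TRIGGERS = {"no", "not", "denies", "without"}
WINDOW = 5  # tokens after a trigger treated as negated (an assumption)


def is_negated(sentence, concept):
    """Return True if the concept appears within WINDOW tokens after a trigger."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    concept_tokens = concept.lower().split()
    n = len(concept_tokens)
    for i, token in enumerate(tokens):
        if token in NEGATION_TRIGGERS:
            scope = tokens[i + 1 : i + 1 + WINDOW]
            for j in range(len(scope) - n + 1):
                if scope[j : j + n] == concept_tokens:
                    return True
    return False
```

This handles explicit cases such as “not widened” but, as the slide notes, it cannot recover implied absence: “Lungs are clear upon auscultation” contains no trigger, so rales, rhonchi, and wheezing would go undetected as absent.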

Uncertainty

Unsure
treated for a presumptive sinusitis

Reasoning
It was felt that the patient probably had a cerebrovascular accident involving the left side of the brain. Other differentials entertained were perhaps seizure and the patient being post-ictal when he was found, although this consideration is less likely

Reason for exam
R/O out pneumonia.

Temporality
Clinical reports tell a story

Past medical history
History of CHF presenting with shortness of left-sided chest pain.

Hypothetical or non-specific mentions
He should return for fever or increased shortness of breath.

Temporal course of disease
Patient presents with chest pain … After administration of nitroglycerin, the chest pain resolved.

Finding Validation

Mention of a finding in the text does not guarantee the patient has the finding
She received her influenza vaccine
His temperature was taken in the ED

Some findings require values
Fever
– Temperature 38.5C
Oxygen desaturation
– Oxygen saturation low
– Oxygen saturation 85% on room air
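The point that some findings require values can be sketched as follows. The thresholds and the function name are illustrative assumptions, not clinical guidance, and the patterns cover only the two phrasings shown on the slide.

```python
import re

# Thresholds are illustrative assumptions, not clinical guidance.
FEVER_THRESHOLD_C = 38.0
DESATURATION_THRESHOLD_PCT = 90


def findings_from_values(text):
    """Assert a finding only when a measured value supports it."""
    findings = set()
    temp = re.search(r"temperature\s+(\d+(?:\.\d+)?)\s*C\b", text, re.IGNORECASE)
    if temp and float(temp.group(1)) >= FEVER_THRESHOLD_C:
        findings.add("fever")
    sat = re.search(r"oxygen saturation\s+(\d+)\s*%", text, re.IGNORECASE)
    if sat and int(sat.group(1)) < DESATURATION_THRESHOLD_PCT:
        findings.add("oxygen desaturation")
    return findings
```

A bare mention such as “His temperature was taken in the ED” yields no finding, because no value follows the word “temperature”.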

Implication

Audience for patient reports is physicians
Lay people less accurate at determining if a chest x-ray report shows evidence of pneumonia
Pneumonia not mentioned in 2/3 of positive reports

Sentence level inference
“There were hazy opacities in the lower lobes” → Localized infiltrate

Report level inference
Localized infiltrates → Probable pneumonia

Report Structure

Anatomic location sometimes in section header
NECK: no adenopathy.

Some sections carry more weight
IMPRESSION: atelectasis

Some reports contain pasted text that is difficult to process
Cardiovascular: [ ] Angina [ ] MI [x ] HTN [ ] CHF [ ] PVD [ ] DVT [ ] Arrhythmias [ ] Previous PTCA [ ] Previous Cardiac Surgery [ ] Negative - Denies CV problems
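Pasted checkbox text like the Cardiovascular line above is hard for sentence-oriented NLP but can sometimes be handled with a dedicated parser. The sketch below assumes the `[ ]` / `[x ]` convention shown on the slide; the function name is hypothetical.

```python
import re


def parse_checkbox_line(line):
    """Split 'Header: [ ] A [x ] B' into (header, checked, unchecked)."""
    header, _, body = line.partition(":")
    checked, unchecked = [], []
    # Each item is a bracket pair followed by text up to the next '[' or end of line.
    for mark, label in re.findall(r"\[\s*([xX]?)\s*\]\s*([^\[]+)", body):
        (checked if mark else unchecked).append(label.strip())
    return header.strip(), checked, unchecked
```

Only the checked items should be treated as asserted; the unchecked labels are mentions that do not apply to the patient.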

Coreference

Chest x-ray again shows a well-circumscribed nodule located in the left upper lobe. The tumor has increased in size since the last exam with a diameter of approximately 2 cm.

How big is the nodule?
Has the nodule increased in size?
Where is the tumor?

References

Mutalik PG, Deshpande A, Nadkarni PM. Use of general-purpose negation detection to augment concept indexing of medical documents: a quantitative study using the UMLS. J Am Med Inform Assoc. 2001 Nov-Dec;8(6):598-609.

Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform. 2001 Oct;34(5):301-10.

Uzuner O, Zhang X, Sibanda T. Machine learning and rule-based approaches to assertion classification. J Am Med Inform Assoc. 2009 Jan-Feb;16(1):109-15.

Sneiderman CA, Rindflesch TC, Aronson AR. Finding the findings: identification of findings in medical literature using restricted natural language processing. Proc AMIA Annu Fall Symp. 1996:239-43.

Fiszman M, Chapman WW, Aronsky D, Evans RS, Haug PJ. Automatic detection of acute bacterial pneumonia from chest X-ray reports. J Am Med Inform Assoc. 2000 Nov-Dec;7(6):593-604.

Zhou L, Melton GB, Parsons S, Hripcsak G. A temporal constraint structure for extracting temporal information from clinical narrative. J Biomed Inform. 2005 Sep 15.

Harkema H, Dowling JN, Thornblade T, Chapman WW. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform. 2009 Oct;42(5):839-51.

Leonard D’Avolio

The Methods of Clinical NLP

how this stuff works

The NLP Process

Developing / using NLP is a process:
• Find the right documents
• Create the “gold standard”
• Train the system
• Evaluate the system

Methods of NLP

• A number of approaches have evolved
– Simple rules-based
– Symbolic, grammatical NLP
– Machine learning
• NLP can be considered a series of transforms

Think “PIPELINE”
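The “series of transforms” idea can be sketched as a list of functions where each stage consumes the previous stage's output. The three stages here are deliberately naive placeholders (the sentence splitter simply splits on periods); all names are hypothetical.

```python
from functools import reduce


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and line feeds into single spaces."""
    return " ".join(text.split())


def split_sentences(text):
    """Naive splitter on periods; an assumption for illustration only."""
    return [s.strip() for s in text.split(".") if s.strip()]


def tokenize(sentences):
    """Lower-case each sentence and split it on whitespace."""
    return [sentence.lower().split() for sentence in sentences]


def run_pipeline(data, stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda value, stage: stage(value), stages, data)


PIPELINE = [normalize_whitespace, split_sentences, tokenize]
```

Because each stage is an independent function, one stage can be swapped for a better component without touching the others, which is the practical appeal of the pipeline view.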

Research Scenario

Positive margins after RRP = 2 x 4 times risk of cancer recurrence

Goal: EXTRACT MARGIN STATUS

Simple Rules-Based Approach

Heuristics, probabilities, combination of the two

Pros
• Simple
• Regular expressions included in many programming languages
• Great for semi-structured (consistently formatted) targets

Cons
• Patterns must consider all possible configurations.
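A rules-based extractor for the margin-status goal might look like the sketch below. The patterns are hypothetical and cover only a few phrasings; they are not taken from the tutorial's own system.

```python
import re

# Hypothetical patterns for a few phrasings; real reports vary far more widely.
NEGATIVE = re.compile(
    r"margins?[:\s]+(?:are\s+|is\s+)?(?:negative|free|uninvolved)"
    r"|no\s+tumor\s+at\s+(?:the\s+)?margins?",
    re.IGNORECASE,
)
POSITIVE = re.compile(
    r"margins?[:\s]+(?:are\s+|is\s+)?(?:positive|involved)"
    r"|tumor\s+(?:is\s+)?present\s+at\s+(?:the\s+)?margins?",
    re.IGNORECASE,
)


def extract_margin_status(report):
    """Return 'negative', 'positive', or 'unknown' for a pathology report."""
    # Check negative patterns first so "no tumor at margin" is not read as positive.
    if NEGATIVE.search(report):
        return "negative"
    if POSITIVE.search(report):
        return "positive"
    return "unknown"
```

The “Cons” bullet shows up immediately: a report that says the carcinoma “extends to the inked resection edge” matches neither pattern and comes back as `unknown` until someone writes a rule for that phrasing.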

Symbolic or Grammatical NLP Approach

Many of the same components… …plus…

POS tagging & phrase chunking are active areas of research

Sometimes called “concept mapping”

Pros
• Robust – reduces complexity by mapping to standard terms
• Great for mapping large numbers of concepts

Cons
• Complex – more steps, more opportunities to introduce error
• Which controlled vocabulary?
• Can be slow

Machine Learning Approach

Classification Model

Several open source ML packages available (decision trees, SVMs, neural nets)

Machine Learning: Which ‘features’ to learn from?

Pros
• Targeted approach = high accuracy
• Capable of learning from examples
• Great for extracting few predetermined targets

Cons
• Requires manual training
• New target = new training effort
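To make “which features to learn from?” concrete, the sketch below uses unigram and bigram counts as features and a small Naive Bayes classifier with add-one smoothing. This is one possible choice, not the method used in the tutorial; the four training sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Unigram and bigram counts; one possible answer to 'which features?'."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


def train(examples):
    """examples: list of (text, label). Returns the counts Naive Bayes needs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for text, label in examples:
        label_counts[label] += 1
        feature_counts[label].update(features(text))
    vocabulary = {f for counts in feature_counts.values() for f in counts}
    return label_counts, feature_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, feature_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total = sum(feature_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for feat, count in features(text).items():
            if feat in vocabulary:
                # Add-one smoothing keeps unseen label/feature pairs above zero.
                p = (feature_counts[label][feat] + 1) / (total + len(vocabulary))
                score += count * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training sentences, for illustration only.
TRAINING = [
    ("margins are positive", "positive"),
    ("tumor present at margin", "positive"),
    ("margins are negative", "negative"),
    ("margins free of tumor", "negative"),
]
```

The “Cons” above apply directly: every label in `TRAINING` had to be assigned by hand, and a new extraction target would need its own labeled examples.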

Machine Learning Approach
Not limited to a step in the pipeline

Also used increasingly in POS tagging & mapping to ontologies

The Hybrid Approach

What if RegExs don’t cut it?

Swap them out for Grammatical NLP approach

References

Natural language processing: Manning & Schütze. Foundations of Statistical Natural Language Processing. MIT Press. 1999

Regular Expressions: Java Tutorial, http://java.sun.com/docs/books/tutorial/essential/regex/

Machine learning: Witten & Frank. Data Mining, Practical Machine Learning Tools and Techniques with Java Implementations. Academic Press. 2001

   

Dina Demner-Fushman

Annotation and Evaluation

Manual Annotation

• Purposes
• Levels
• Guidelines
• Methods (manual, assisted)
• Tools
• Format (embedded/standoff)
• Collection size (number needed, representative sample)
• Annotators (linguists, domain experts, crowdsourcing)
• Annotator agreement
• Preservation/dissemination/reporting

Annotation purposes

• System development
– Rule generation (manual or automatic)
– Statistical modeling – supervised machine learning (training + validation/optimization)
• Evaluation
– Testing on a held-out set or cross-validation
• Clinical data quality assurance
• Reusable collection (corpus)

Annotation levels

• Meta – information about the corpus
• Document – type, relevancy to topic, quality, structure
• Pragmatic – purpose of a sentence interpreted in context using world knowledge, involves inference
• Discourse – contextual features, links between instances of concepts, or concepts across sentences
• Semantics – formal representation of meaning using concepts, frames
• Syntax – part of speech, phrases, relations between phrases
• Lexical

Annotation levels example

Meta: XYZ hospital, respiratory problems, …
Document: Patient #13, Discharge summary # 1, …

Annotation guidelines

• Define task and annotation purpose
• Be clear
• Be concise
• Avoid bias
• Iteratively refine using representative sample
• Come to consensus
• Finalize before annotating reference standard

Annotation methods

Trade-off of manual vs assisted annotation
• Bias
• Accuracy / consistency
• Speed-up
• Training

Annotation tools

• Read and write formatted text (markup language)
• Allow defining / linking an annotation schema
• Provide for span selection & markup (color-coding)
• Minimize annotation steps and navigation
• Link ontologies
• Compute inter-annotator agreement
• Provide for reconciliation of annotator disagreement
• Provide web-service/API
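Inter-annotator agreement, listed among the tool requirements above, is often reported with a chance-corrected statistic. The sketch below computes Cohen's kappa for two annotators; choosing kappa here is an assumption, since the slides do not say which agreement measure a tool should compute.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa needs every item to carry a label from both annotators, so it fits tasks like assigning acute / historical / absent to a fixed list of findings better than free span selection.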

References

Linguistic annotation: Wynne M (editor). 2005. Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. Available from http://ahds.ac.uk/linguistic-corpora/

Issues: Hovy E, Lavid J. Corpus annotation tutorial. http://www.lrec-conf.org/lrec2008/IMG/pdf/Corpus_annotation.Tutorial-outline.pdf

Clinical text: AMIA NLP-SIG annotation project. Available from http://understandit.net/r02.01.11/index.php?title=AnnotationProjectAnnotationSchema

Annotator agreement: Hripcsak G, Rothschild AS. Agreement, the f-measure, and reliability in information retrieval. J Am Med Inform Assoc. 2005 May-Jun;12(3):296-8. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1090460/

Dina Demner-Fushman

Evaluating NLP

• Overview
• Judges
• Metrics
• Evaluation methods
• Large-scale evaluations

Evaluation roots

Human/biomedical studies
• Subjects
• Outcomes/Statistics

Software/NLP evaluation
• Quality of the algorithm
• Quality of implementation
• Quality of results

Human-computer interaction
• Usability testing
– Heuristic
– User-centered
– Scenario-based

What is evaluated?

Software
• System components
• Black/Glass-box (results/algorithm and implementation)
• Task-specific (intrinsic/extrinsic)
• Manual/automatic

Application
• Interface (HCI)
• Qualitative/quantitative
• Access (API, service)

Impact/Outcome
• Healthcare process
• Patient’s experience

Judges: who is evaluating?

• Experts vs. convenience population vs. end-users
• How many?
• Consensus (reliability, agreement) vs. pyramid
• Capturing judgments in reusable test collections

Evaluation Metrics

                      Reference Standard
NLP output        positive      negative
positive          a (TP)        b (FP)
negative          c (FN)        d (TN)

Recall (Sensitivity) = a / (a + c)
Precision (PPV) = a / (a + b)
Fall-out (1 - Specificity) = b / (b + d) = 1 - d / (b + d)

Evaluation Metrics

F-measure: harmonic mean of precision and recall

What if enumerating all true positive/negative examples is not possible or practical?

Mean average precision, binary preference
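The metrics above follow directly from the four cells of the contingency table (a = TP, b = FP, c = FN, d = TN). The sketch below uses the balanced F-measure and assumes every denominator is non-zero; the function name is hypothetical.

```python
def evaluation_metrics(a, b, c, d):
    """a=TP, b=FP, c=FN, d=TN, following the contingency table on the slide."""
    recall = a / (a + c)
    precision = a / (a + b)
    fall_out = b / (b + d)
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "fall_out": fall_out,
        "f_measure": f_measure,
    }
```

Note that the F-measure never uses d: it can be computed even when true negatives cannot be enumerated, which is the situation the slide raises.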


Software evaluation

• Establish strong baseline
– For extraction of patient-oriented outcomes from MEDLINE abstracts, selecting the 3 last sentences achieves 75% accuracy
• Select evaluation metrics appropriate for the task
– Utility measure for text categorization (Genomics track)
– “This measure contains coefficients for the utility of retrieving a relevant and retrieving a nonrelevant document normalized by the best possible score” http://ir.ohsu.edu/genomics/2005protocol.html

End-user evaluation

Use
• Log files
• Observation
• Surveys

Impact
• Cost / time
• Outcomes

Community-wide evaluations

Format
• Post-hoc (TREC)
• Gold standard provided

Pros/cons
• Cost vs. coverage

Clinical
• i2b2
• CMC

References

Friedman CP, Wyatt JC. Evaluation Methods in Biomedical Informatics. 2nd ed., New York: Springer, 2006.

Sparck-Jones K, Galliers JR. Evaluating Natural Language Processing Systems. Springer, 1996.

van Rijsbergen CJ. Information Retrieval, 2nd ed. London: Butterworths, 1979. http://www.dcs.gla.ac.uk/Keith/pdf/Chapter7.pdf

Hripcsak G, Wilcox A. Reference standards, judges, and comparison subjects: roles for experts in evaluating system performance. J Am Med Inform Assoc. 2002 Jan-Feb;9(1):1-15. Available online from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC349383

Passonneau RJ, Nenkova A. Evaluating Content Selection in Human- or Machine-Generated Summaries: The Pyramid Scoring Method. http://www1.cs.columbia.edu/~library/TR-repository/reports/reports-2003/cucs-025-03.pdf

Wendy Chapman

Implementation Considerations

Implementing NLP

• Getting an NLP system up and running
• Case study

Preprocessing → NLP System → Post-processing

The devil is in the details

Preprocessing
• Remove extraneous characters
– control characters
– foreign characters (é)
• Remove extra line feeds, etc.
– pul-_monary
• Preserve/enhance section labels
– “IMPRESSION:_”
• Reformat to improve readability
• De-identify
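A few of the preprocessing steps above can be sketched with regular expressions. This is an assumption-laden illustration: it treats any run of three or more capital letters followed by a colon as a section label, and it does not attempt de-identification.

```python
import re


def preprocess(raw):
    """Clean raw report text before handing it to an NLP system."""
    # Drop control characters except newline and tab.
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", raw)
    # Rejoin words split across a line feed, e.g. "pul-\nmonary" -> "pulmonary".
    # Caveat: this also merges genuinely hyphenated words broken at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Put upper-case section labels such as "IMPRESSION:" on their own line.
    text = re.sub(r"\s*\b([A-Z]{3,}):\s*", r"\n\1:\n", text)
    # Collapse runs of blank lines left over from the source system.
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()
```

Keeping section labels on their own line preserves them for later stages, which matters because some sections (such as IMPRESSION) carry more weight than others.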

 

Preprocessing
• Obtain source feeds
• Assess completeness
• De-duplicate
• Clean, “sectionize,” format
• De-identify
• Load database
• Sample
• Hand-off to NLP system
• Quality assurance

A lot of tasks and a lot of people
• Human Subjects/IRB
• Source system manager
• Network/database administrator
• Programmer
• Investigator
• Informatics/NLP expert
• Clinician (“domain expert”)
• Chart abstractor

Slide courtesy David Carrell


Post-processing

Map NLP output to your vocabulary and your task
• Which CUIs map to Productive Cough?
• Which combination of radiological findings & attributes = evidence of acute bacterial pneumonia?
• Does the patient have a recurrent breast cancer?

• Instance
• Report
• Patient
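Moving from instance-level output to report-level and patient-level answers requires a rule for combining conflicting mentions. The sketch below assumes a simple precedence (acute over historical over absent); that ordering and the function name are illustrative assumptions, and a real study would define its own rule.

```python
# Assumed precedence when mentions conflict: "acute" outranks "historical",
# which outranks "absent". This ordering is an illustration only.
PRECEDENCE = {"absent": 0, "historical": 1, "acute": 2}


def roll_up(instances):
    """instances: iterable of (patient_id, report_id, finding, value) tuples.

    Returns (report_level, patient_level) dicts keyed by
    (patient_id, report_id, finding) and (patient_id, finding).
    """
    report_level = {}
    patient_level = {}
    for patient_id, report_id, finding, value in instances:
        for table, key in (
            (report_level, (patient_id, report_id, finding)),
            (patient_level, (patient_id, finding)),
        ):
            current = table.get(key)
            if current is None or PRECEDENCE[value] > PRECEDENCE[current]:
                table[key] = value
    return report_level, patient_level
```

The same instance list can therefore give different answers at each level, so the level that matches the study question should be fixed before any accuracy is reported.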

Case Study

Case-control observational GWAS study

Hypothesis
Biomarkers in patients with prostate cancer can be used to predict time to survival, informing course of care

Targeted Phenotype

• Prostate cancer
• Co-morbidities
• Basic demographics
• Disease characteristics
– TNM Staging
– Gleason score
– PSA
• Treatments administered
– Surgery
– Chemo
– Watchful waiting

The NLP Process

Defining the target

Data challenges

Prostate cancer
• ICD-9 codes don’t cut it
– VA Boston: 18% of path reports 60 days before / after 1st ICD-9 were prostate cancer related
• No standardized titles on path reports
– Biopsy?
– Post-op?

Into NLP just to find the right documents
Phenotyping with NLP is really several projects in one

The NLP Process

Extracting key variables

Annotation challenges

TNM Staging, Gleason score, PSA
• Gleason score different on post-op than biopsy
– Pathological vs. estimate
• Patient level vs. document level
– PSA at 4 visits
– Conflicting Gleason scores

   

• Start with a wish list and whittle down
– Cost vs. benefit will become clear
• Define categorical variables
– Versus highlighting strings
• Create clear instructions
– Training
– Pilot, pilot, pilot
• Plan for several iterations

The NLP Process

Creating your training / test sets

Designing your “gold standard”

• Several variables = several measures of accuracy
• What if tumor staging F-measure is .97 but co-morbidities is .6?
• Effects must be accounted for in study design

The NLP Process

Extracting key variables

NLP algorithm development

• Are you reinventing the wheel?*
• Is it important that it scale?
– Other projects?
– Beyond your institution?

*Pratt, AW. 1969. “Automated processing of medical English.” International Conference on Computational Linguistics, Sweden

Current state, future progress, available resources

Leonard D’Avolio

State of the Science

NLP is not “off the shelf”
• Opportunity to reduce effort

Several approaches can yield similar performance
• i2b2 challenge

First increase in open source components
• Weka, MMTx, Stanford parser
• Lots of ‘glue code’

Now increase in open source frameworks
• GATE, UIMA

End-to-end information retrieval using open source frameworks
• ARC

Progression of Field: More Resources

“Closed” Concept Mapping Systems
• MedLEE
• Knowledge Map
• MVCS

Open Components
• Stanford Parser
• IBM Parser
• OpenNLP
• Weka (ML)
• MALLET (ML)
• UMLS
• NegEx (negation)

Open Frameworks
• UIMA
• GATE

Open Concept Mapping Systems
• MetaMap
• HITEx (GATE)
• Topaz (GATE)
• cTAKES (UIMA)
• MedKAT (UIMA)

Open Corpora
• Cincinnati
• Pittsburgh NLP Repository
• i2b2
• MIMIC 1 & 2

Open IR Systems
• ARC (UIMA + MALLET)

Tools Registries
• RDS
• ORBIT
• Eagle-I

Hosted Environments
• iDASH
• VINCI

Future of NLP

Information quality – context is key
• Error propagates in pipelines
• Information not captured for our secondary uses
• Scrap idealized test sets

Greater code reuse
• Less glue code
• Will allow focus on improving specific components

Increase in open source data sets & shared task challenges

Drive adoption of NLP
• More data driving greater demand / new uses
• Reduce current dependency on system developers

Current Process

What it should be

D’Avolio et al. “Evaluation of a generalizable approach to information retrieval using the Automated Retrieval Console.” 2011. 17(4)

Best approach to NLP?

VS.

Worst approach to NLP?

Resources

WEKA: http://www.cs.waikato.ac.nz/ml/weka/
MALLET: http://mallet.cs.umass.edu/
MetaMap: http://mmtx.nlm.nih.gov/
UMLS: http://www.nlm.nih.gov/research/umls/
OpenNLP: http://opennlp.sourceforge.net/
HITEx (hosted by i2b2): https://www.i2b2.org/resrcs/hive.html
cTAKES: https://cabig-kc.nci.nih.gov/Vocab/KC/index.php/OHNLP_Documentation_and_Downloads
UIMA: http://incubator.apache.org/uima/
GATE: http://gate.ac.uk/
ARC: http://research.maveric.org/mig/arc.html

 

 

 

Resources (cont)

Topaz: http://www.dbmi.pitt.edu/blulab/resources.asp#Topaz
NegEx: http://code.google.com/p/negex/
ConText: http://www.dbmi.pitt.edu/chapman/ConText.html
Cincinnati Pediatric Corpus: http://www.computationalmedicine.org/project/cpc.php
Pittsburgh NLP Repository: http://www.dbmi.pitt.edu/blulab/nlprepository.html
MIT MIMIC Repository (structured and unstructured): http://mimic.mit.edu/mimic-ii-database.html
ORBIT Project: http://orbit.nlm.nih.gov/
iDASH: http://iDash.ucsd.edu

 

 

 

Contact information

Leonard D’Avolio, PhD
Associate Center Director for Biomedical Informatics
MAVERIC, VA Boston Healthcare System
Leonard.davolio@va.gov

Wendy Chapman, PhD
Division of Biomedical Informatics
University of CA, San Diego
wwchapman@UCSD.edu

Dina Demner-Fushman, MD, PhD
National Library of Medicine
ddemner@mail.nih.gov