Bulk Learning on EHR Data

Po-Hsiang Chiu, George Hripcsak

Department of Biomedical Informatics

Columbia University

In a Nutshell…

Bulk Learning is a batch-phenotyping method/framework that uses multiple diseases collectively (i.e. the bulk learning set) as a substrate for model learning and evaluation. Model stacking is used to construct an abstract feature representation of low sample complexity in order to reduce training requirements.

Phenotyping

•  Definition

source: http://www.evolution.berkeley.edu/

•  Diseases and subtypes
•  Concept-driven disease cohorts
    –  100 infectious diseases as the domain of study (i.e. bulk learning set)
    –  Phenotypic models associated with lab tests, medicinal prescriptions
•  Dimensionality reduction

Bulk Learning Basics I

•  A batch-phenotyping method/framework
•  Addresses two central issues in predictive analytical approach to computational phenotyping
    –  Feature engineering
        •  Medical ontology for feature decomposition
        •  E.g. MED (http://med.dmi.columbia.edu)
    –  Data annotation
        •  Ensemble learning (e.g. stacked generalization [Wolpert 1992])
        •  Feature abstraction for dimensionality reduction

Bulk Learning Basics II

•  Uses diagnostic codes (e.g. ICD-9) as surrogate labels to establish “approximate predictive models.”
•  Why surrogate labels (e.g. ICD-9)?
    –  Features extracted from EHR can be large
    –  More compact representation of the training data
    –  “Free” supervised signals that are sufficiently close but can be obtained without extra work
•  Objective: Build statistical models in abstract feature space
    –  Create a small annotation set (i.e. gold standard) that serves as a proxy dataset for downstream model evaluations
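A minimal sketch of the surrogate-label idea above: a recorded ICD-9 code is treated as a silver-standard label, and a small manually annotated subset is used to check how closely it tracks the truth. The patient IDs, code sets, and gold labels are hypothetical, made up for illustration only.

```python
# Hypothetical toy records: patient id -> set of ICD-9 codes found in the chart.
coded_diagnoses = {
    "p1": {"053.9", "009.1"},
    "p2": {"117.9"},
    "p3": set(),
    "p4": {"053.9"},
}

def surrogate_labels(records, target_code):
    """Silver-standard label: 1 if the target ICD-9 code was recorded, else 0."""
    return {pid: int(target_code in codes) for pid, codes in records.items()}

# Hypothetical gold standard from manual review of a small subset of patients.
gold = {"p1": 1, "p3": 0, "p4": 0}

silver = surrogate_labels(coded_diagnoses, "053.9")

# Agreement between the free surrogate signal and the annotated subset.
agreement = sum(silver[pid] == y for pid, y in gold.items()) / len(gold)
print(silver)
print(f"agreement on annotated subset: {agreement:.2f}")
```

The surrogate labels cost nothing to obtain for every patient, while the gold labels exist only for the annotated few; the disagreement on "p4" stands in for an incorrectly coded chart.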

Bulk Learning Basics III

•  Why inspect multiple (infectious) diseases?
    –  Using multiple diseases as substrate and identifying their common elements
    –  Example stacking architecture (under stacked generalization method)

[Figure: example stacking architecture]

•  Level 0: Antibiotic Measure, Urinary Chemistry Measure, Intravenous Chemistry Measure, Microbiology Measure, Other Phenotypic Measures (e.g. Antiviral)
•  Level 1: Attributes: Level-0 Probabilities and Indicators; Target: Diagnostic Codes (Silver Standard)
•  Level 2: Attributes: Level-1 Probabilities and ICD-9; Target: True Labels (Gold Standard)
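The tiers of this architecture can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the cohort is synthetic, the hand-written logistic regression stands in for whatever probabilistic classifier is chosen, two-fold out-of-fold prediction is one common way to produce level-0 probabilities for stacking, and the level-0 "indicators" are omitted.

```python
import math
import random

random.seed(0)
GROUPS = ["microbiology", "antibiotic", "blood_test", "urine_test"]

def predict(w, x):
    """Logistic unit: w[0] is the bias, w[1:] the feature weights."""
    z = w[0] + sum(wj * v for wj, v in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def fit_logistic(X, y, epochs=300, lr=0.5):
    """Logistic regression fitted by plain batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

# Synthetic cohort (assumption): true status, a noisy ICD-9 flag that is wrong
# about 15% of the time, and two raw features per feature group.
n = 200
truth = [int(random.random() < 0.5) for _ in range(n)]
icd9 = [t if random.random() > 0.15 else 1 - t for t in truth]
raw = {g: [[t + random.gauss(0, 1.0) for _ in range(2)] for t in truth]
       for g in GROUPS}

# Level 0: one base model per feature group, trained on the silver standard.
# Each patient's probability comes from a model that did not see that patient.
folds = [list(range(0, n, 2)), list(range(1, n, 2))]
level0 = [[0.0] * len(GROUPS) for _ in range(n)]
for k, g in enumerate(GROUPS):
    for held_out, train in (folds, folds[::-1]):
        w = fit_logistic([raw[g][i] for i in train], [icd9[i] for i in train])
        for i in held_out:
            level0[i][k] = predict(w, raw[g][i])

# Level 1: attributes are level-0 probabilities, target is the diagnostic code.
# (Fitted and applied on the same rows here, to keep the sketch short.)
w1 = fit_logistic(level0, icd9)
level1 = [predict(w1, row) for row in level0]

# Level 2: attributes are the level-1 probability plus the ICD-9 flag; the
# target is the true label, available only for a small annotated subset.
annotated = list(range(40))
held_out_eval = list(range(40, n))
X2 = [[level1[i], float(icd9[i])] for i in range(n)]
w2 = fit_logistic([X2[i] for i in annotated], [truth[i] for i in annotated])

def accuracy(idx, score):
    """Share of patients in idx whose thresholded score matches the truth."""
    return sum((score(i) >= 0.5) == bool(truth[i]) for i in idx) / len(idx)

print("ICD-9 alone:", accuracy(held_out_eval, lambda i: icd9[i]))
print("level 2    :", accuracy(held_out_eval, lambda i: predict(w2, X2[i])))
```

The point of the design is that the final model has only two inputs, so a few dozen gold-standard labels are enough to fit it, while the many raw features are handled by models trained on the freely available codes.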

[Figure: stacking diagram. Labels: raw features (f11, f12, …, f1j; f21, …, f2j; f31, …, f3j; f41) grouped as microbiology, antibiotic, blood test, urine test; logistic units m1, a1, b1, u1; Level 0 and Level 1; per-disease units m1(i), a1(i), b1(i), u1(i) for diseases (i-1), (i), (i+1); global unit global2 with inputs m1g, a1g, b1g, u1g. Caption: Four Example Base Models]

•  127.4 Enterobiasis
•  047.8 (Other) viral meningitis
•  009.1 Gastroenteritis ...
•  053.9 Herpes zoster
•  117.9 Mycoses

Moving Forward

•  Summary
    –  Bulk learning is a framework with at least the following system choices
        •  The bulk learning set (of target conditions) => base models
        •  Classification algorithms (guideline: probabilistic classifiers + well-calibrated)
        •  Stacking architecture (multiple tiers => levels of abstractions)
        •  Strategy for combining individual (local) disease models to a global model
    –  Advantage: Can use a small annotated sample for model construction and evaluation within the abstract feature space (e.g. level-1 data)
        •  83 clinical cases were labeled in this study (to be discussed more comprehensively)
    –  Challenge: The model involving the interaction between abstract features and ICD-9 does not generalize well into the region of the data where the ICD-9 coding was incorrect

        •  Multiple types of surrogate labels

•  Ongoing and future work
    –  Semi-supervised learning
    –  Active learning
    –  Complex decision boundary?
    –  Other surrogate labels

[Figure: stacking diagram with per-disease local units (local2(i)) for diseases (i-1), (i), (i+1) and a global unit (global2)]
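The summary's guideline of well-calibrated probabilistic classifiers can be checked with standard tools. The sketch below computes a Brier score and a simple reliability table; the probabilities and labels are hypothetical, and these two diagnostics are a common choice rather than something the slides prescribe.

```python
def brier_score(probs, labels):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def reliability_table(probs, labels, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Probabilities equal to 1.0 fall into the last bin.
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    return [
        (sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b), len(b))
        for b in bins
        if b
    ]

# Hypothetical base-model outputs and the labels they were scored against.
probs = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3]
labels = [0, 0, 1, 1, 0, 1]

print(brier_score(probs, labels))
for mean_pred, observed, count in reliability_table(probs, labels):
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

Calibration matters in a stacked design because the probabilities from one level become the features of the next: if each bin's observed rate sits close to its mean prediction, those features mean the same thing across base models.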

Thank You

Reference

[1] D. H. Wolpert, Stacked generalization, Neural Networks. 5 (1992) 241–259.
[2] K. M. Ting, I. H. Witten, Issues in stacked generalization, J. Artif. Intell. Res. 10 (1999) 271–289.
[3] J. Jin Chen, C. Cheng Wang, R. Runsheng Wang, Using Stacked Generalization to Combine SVMs in Magnitude and Shape Feature Spaces for Classification of Hyperspectral Data, IEEE Trans. Geosci. Remote Sens. 47 (2009) 2193-2205.
[4] David Baorto, James Cimino, et al. Available: http://med.dmi.columbia.edu. Access date: Oct 20, 2016.

Medical Ontology

•  Snapshot of Medical Entities Dictionary (http://med.dmi.columbia.edu)

Example Features

[Figure: bulk learning workflow. Labels: raw features (microbiology, antibiotic, blood test, urine test); logistic units; Four Example Base Models; Individual Level-1 Local Units; Level-1 Global Unit; Level-1 abstract features; Global level-1 features]

1. Define Feature Groups Using Medical Ontology
    1a. Gather EHR data according to medical concepts
    1b. Use Medical Entities Dictionary to delineate feature scopes
    1c. Apply feature selection within each concept group

2. Compute Base Models

3. Compute Meta Models (via Ensemble Learning)
    3a. Per-disease ensembles: compute local level-1 models
    3b. Cross-disease ensemble: compute a global level-1 model
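Steps 1 and 3b of this workflow can be sketched as plain data handling. Everything named here is an assumption for illustration: the `CONCEPT_OF` lookup is a hypothetical stand-in for a Medical Entities Dictionary query, the feature names are invented, and pooling rows across diseases is only one possible strategy for building the global model, since the slides leave that choice open.

```python
from collections import defaultdict

GROUP_ORDER = ["microbiology", "antibiotic", "blood_test", "urine_test"]

# Hypothetical ontology lookup: raw EHR item -> medical concept (feature group).
CONCEPT_OF = {
    "blood_culture": "microbiology",
    "stool_culture": "microbiology",
    "vancomycin": "antibiotic",
    "wbc_count": "blood_test",
    "urine_ph": "urine_test",
}

def group_features(record):
    """Step 1: split one patient's raw features into per-concept groups."""
    groups = defaultdict(dict)
    for name, value in record.items():
        concept = CONCEPT_OF.get(name)
        if concept is not None:  # items outside the chosen concepts are dropped
            groups[concept][name] = value
    return dict(groups)

def pool_level1(local_rows):
    """Step 3b (one possible strategy): pool per-disease level-1 rows, aligned
    by feature group, into a single table for a global model."""
    pooled = []
    for disease, rows in local_rows.items():
        for probs, label in rows:
            pooled.append(([probs[g] for g in GROUP_ORDER], label, disease))
    return pooled

# Hypothetical patient record; "height_cm" maps to no concept and is dropped.
record = {"blood_culture": 1, "vancomycin": 1, "wbc_count": 13.2, "height_cm": 170}
grouped = group_features(record)

# Hypothetical per-disease level-1 rows: (base-model probabilities, label).
local_rows = {
    "053.9": [
        ({"microbiology": 0.2, "antibiotic": 0.1, "blood_test": 0.7, "urine_test": 0.3}, 1),
        ({"microbiology": 0.1, "antibiotic": 0.2, "blood_test": 0.2, "urine_test": 0.1}, 0),
    ],
    "117.9": [
        ({"microbiology": 0.9, "antibiotic": 0.4, "blood_test": 0.5, "urine_test": 0.2}, 1),
    ],
}
pooled = pool_level1(local_rows)
print(grouped)
print(len(pooled), "pooled rows;", "first:", pooled[0])
```

Because every disease is described by the same four abstract features, rows from different diseases share one schema and can be stacked into a single training table.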
