How Machine Learning Can Automate Data Standardization in Clinical Trials

Fanyi Zhang, Medidata Solutions, Inc.
November 2018 at Phuse, Frankfurt

© 2018 Medidata Solutions, Inc. – Proprietary and Confidential


The Big Data Revolution in Healthcare and Clinical Research?

[Diagram labels: Disparate Data, ?, Solutions; Analyze, Decide, Ideate; 70 - 80% and 20 - 30%.]

Standardizing clinical data is mandated by the FDA

“In 2014, FDA mandated that data collection in clinical trials adhere to Study Data Tabulation Model (SDTM) developed by Clinical Data Interchange Standards Consortium (CDISC)”

[Screenshots*: eCRF Example, Adverse Event; eCRF Example, Demographics 1; eCRF Example, Demographics 2. Labels shown: Demographics, Gender, Age.]

*Screenshots from Medidata Rave

MEDS: One of the Industry’s Largest Give to Get Clinical and Operational Data

>14K studies
>4M trial subjects
57K sponsor/site relationships

99% in study count*
73% in study count*
90% in number of study sites*

*Growth over 3-year period

Standardize Data Across Multiple Clinical Trials
From Rule-Based Engines to Auto-SDTM

100K+ Case Report Forms
30+ Domains, 155+ Variables

[Diagram labels. Domains: Demographics, Adverse Events, Study Exposure. Variables: Age, Age Unit, AE Name, Start Date, Treatment Name, Dose. Also shown: DBs, Analytics.]

MEDS-SDTM Standard
30+ Domains, 155+ Variables

Domain            Variables
Demographics      Age, Age Unit, Gender, Race, Ethnicity
Adverse Event     Term Verbatim, Severity, Outcome, Action Taken, Start Date
Study Exposure    Treatment Name, Start Date, End Date, Dose, Dose Unit

Other domains shown: Lab, Medical History, Biomarker

Human (Expert) in the Loop Machine Learning
Faster in speed, Better in quality

[Workflow diagram: Data Source(s); Human experts formulate definitions; Rule-based labeling + human annotations; Supervised-learning (F1, F2, F3, F4); Model assessment and interpretation; Human experts review and annotate; Learning.]

Step 1: Create Initial Labels
(Data Source(s); Human experts formulate definitions; Rule-based labeling + human annotations)

Example from DM domain

Find Demographic forms: search for form names that contain “demographic”, “DM”, etc.
Find Age fields: search for field names that contain “age”, etc.
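
The two keyword searches on this slide could be sketched roughly as follows. This is only an illustration, not the rule engine used in the talk: the regular expressions, the function names `label_form` and `label_field`, and the sample form and field names are all hypothetical. Only the idea of matching “demographic” / “DM” in form names and “age” in field names comes from the slide.

```python
import re

# Hypothetical patterns. The lookarounds keep "dm" and "age" from matching
# inside longer words such as "Admission" or "Dosage".
FORM_PATTERN = re.compile(r"demographic|(?<![a-z])dm(?![a-z])", re.IGNORECASE)
FIELD_PATTERN = re.compile(r"(?<![a-z])age(?![a-z])", re.IGNORECASE)


def label_form(form_name):
    """Return 'DM' if the form name looks like a Demographics form, else None."""
    return "DM" if FORM_PATTERN.search(form_name) else None


def label_field(form_name, field_name):
    """Return 'AGE' for age-like fields on forms already labeled 'DM'."""
    if label_form(form_name) == "DM" and FIELD_PATTERN.search(field_name):
        return "AGE"
    return None


# Made-up names, for illustration only.
for form in ["Demographics", "DM_SCREENING", "Adverse Events", "Admission"]:
    print(form, "->", label_form(form))
for field in ["Age", "Subject Age", "Dosage"]:
    print(field, "->", label_field("Demographics", field))
```

Rules like these are cheap but noisy. In this sketch a field named “Age Unit” would also be labeled `AGE`, which is the kind of error the later review steps are meant to catch.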

Step 2: Supervised-learning
Bag of Words, Feature transformation, Model selection

Form Classifier in 3-fold Cross Validation

Method*                                 CV Accuracy   Build time (minutes)
Logistic Regression w/ regularization   0.95          7.5
Linear SVM                              0.95          6.6
SGD Classifier                          0.95          7.7
Random Forest                           0.96          7.8
XGBoost Classifier                      0.96          5.1
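
To make the bag-of-words and 3-fold cross-validation idea concrete, here is a small standard-library sketch. It is an assumption-laden stand-in: the classifier is a multinomial naive Bayes, which is not one of the methods compared on this slide, and the six toy form texts and their labels are invented for illustration. The numbers it produces say nothing about the accuracies reported in the talk.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count the tokens."""
    return Counter(text.lower().split())


def train_nb(docs, labels):
    """Collect per-class token counts for multinomial naive Bayes."""
    word_counts = defaultdict(Counter)  # label -> token counts
    class_counts = Counter(labels)      # label -> number of documents
    vocab = set()
    for doc, label in zip(docs, labels):
        bow = bag_of_words(doc)
        word_counts[label].update(bow)
        vocab.update(bow)
    return word_counts, class_counts, vocab


def predict_nb(model, doc):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token, n in bag_of_words(doc).items():
            if token in vocab:  # tokens unseen in training are ignored
                score += n * math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_val_accuracy(docs, labels, k=3):
    """Mean accuracy over k folds; fold i holds out items i, i+k, i+2k, ..."""
    accuracies = []
    for fold in range(k):
        train_idx = [i for i in range(len(docs)) if i % k != fold]
        test_idx = [i for i in range(len(docs)) if i % k == fold]
        model = train_nb([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        hits = sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k


# Invented toy data: field text of six forms and their domain labels.
docs = [
    "demographics age gender race",
    "vital signs blood pressure pulse",
    "demographics age sex ethnicity",
    "vital signs temperature pulse weight",
    "subject demographics age race gender",
    "vital signs blood pressure heart rate",
]
labels = ["DM", "VS", "DM", "VS", "DM", "VS"]

print(cross_val_accuracy(docs, labels, k=3))
```

Swapping in another model only requires replacing `train_nb` and `predict_nb`; the fold logic stays the same, which is what allows a like-for-like comparison across methods.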

[Figure panels: Predicted Probabilities, Keywords Ranking, Raw Text with Highlighted Words. Examples: Exposure Form vs. Not Exposure Form; Exposure Form vs. Vital Signs Form; …]

Step 3: Model assessment and interpretation

Do we need more labeled data?

[Figure: Model Performance (F1-score, axis from 0.3 to 1) versus Sample Count, with curves for Domain 1, Domain 2, and Domain 3.]
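
The F1-score plotted on this slide is the harmonic mean of precision and recall. The sketch below computes it for one domain and adds a rough check for whether a learning curve has flattened. The `has_plateaued` helper, its `tolerance` of 0.01, and all sample values are hypothetical; the slide does not say how the need for more labels was decided.

```python
def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def has_plateaued(curve, tolerance=0.01):
    """curve: (sample_count, f1) pairs sorted by sample_count.

    Hypothetical heuristic: the curve counts as flat when the last
    increase in sample count improved F1 by less than `tolerance`.
    """
    if len(curve) < 2:
        return False
    return curve[-1][1] - curve[-2][1] < tolerance


# Invented labels and predictions, for illustration only.
y_true = ["DM", "DM", "VS", "VS", "DM"]
y_pred = ["DM", "VS", "VS", "DM", "DM"]
print(f1_score(y_true, y_pred, positive="DM"))

# Invented learning-curve points for one domain.
print(has_plateaued([(100, 0.80), (200, 0.90), (400, 0.905)]))
```

A domain whose curve is still rising is a candidate for more labeling effort, while a flat curve suggests that extra labels would add little.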

Step 4: Human experts review and annotate

Both classifier and RBL* agree
Classifier = RBL* = VS**

[Figure: text in the raw data …]

*RBL: Rule-based labels
**VS: vital sign forms

Step 4: Human experts review and annotate

Classifier is correct and RBL* is incorrect
Classifier = LB**, RBL* = OTH**

[Figure: text in the raw data …, showing “hematology”; NOT LB versus LB]

*RBL: Rule-based labels
**LB: lab forms
**OTH: other / unlabeled forms

Step 4: Human experts review and annotate

Classifier is incorrect and RBL* is correct
Classifier = LB**, RBL* = BI**

[Figure: text in the raw data …]

*RBL: Rule-based labels
**LB: lab forms
**BI: biomarker forms
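
The review cases on these slides compare the classifier's label with the rule-based label (RBL). One plausible way to organize that comparison is sketched below. The record layout, the `triage` function, and the form names are hypothetical, and the slides do not describe how the review queue was actually built.

```python
from collections import Counter


def triage(records):
    """Split records by whether the classifier and the RBL agree."""
    agreed, to_review = [], []
    for rec in records:
        if rec["classifier"] == rec["rbl"]:
            agreed.append(rec)
        else:
            to_review.append(rec)
    return agreed, to_review


# Invented records that mirror the three label combinations on the slides.
records = [
    {"form": "form_a", "classifier": "VS", "rbl": "VS"},
    {"form": "form_b", "classifier": "LB", "rbl": "OTH"},
    {"form": "form_c", "classifier": "LB", "rbl": "BI"},
]

agreed, to_review = triage(records)

# Counting disagreements by label pair shows where the rules and the
# model diverge most often, which can help prioritize expert review.
disagreement_pairs = Counter((r["classifier"], r["rbl"]) for r in to_review)

print([r["form"] for r in agreed])
print([r["form"] for r in to_review])
print(disagreement_pairs.most_common())
```

An expert would then decide, for each disagreement, whether the classifier or the rule was right, and the corrected label would go back into the training set.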

Step 5: Learning

Model improved continuously over 4 rounds of iterations

[Figure: Model Performance (F1-score, axis from 0.70 to 1) by the i-th Review Round (r1, r2, r3, r4).]

Thank you.