Interaction-based predictive learning for the genetic etiology of complex diseases

Tian Zheng
Department of Statistics, Data Science Institute, Columbia University

Workshop on Perspectives & Analysis Methods for Personalized Medicine
Institute for Mathematical Sciences, National University of Singapore
July 10-14, 2017


July 13, 2017. Workshop on Perspectives & Analysis Methods for Personalized Medicine, Institute for Mathematical Sciences, National University of Singapore

The Conference on Statistical Learning and Data Science / Nonparametric Statistics
June 4-6, 2018, Columbia University

Keynote speakers:
• Michael I. Jordan
• Liza Levina
• David Madigan

Banquet speaker: Cathy O'Neil
Program chairs: Annie Qu and Cynthia Rudin
Local chair: Tian Zheng


Acknowledgements

Joint work with

Shaw-Hwa Lo, Columbia University

Herman Chernoff, Columbia University

Adeline Lo, Princeton University

Funding from NSF


Complex genetic diseases

“… are caused by multiple genes interacting with each other and with environmental factors to create a gradient of genetic susceptibility to disease.” (Weeks and Lathrop 1995)

Gene‐gene interactions play an important role in common human disorders, in both disease risks and responses to treatments.


Ganapathiraju, Madhavi K., et al. "Schizophrenia interactome with 504 novel protein–protein interactions." npj Schizophrenia 2 (2016): 16012.


Gene x Gene Interactions

“Interaction” is a biological term and a statistical term.
• Biology: two or more genes jointly affect an outcome of interest.
• Statistics: defined under a specific model.

Identify potentially interacting genes
• via association mapping: sets of genetic loci with statistically significant association with the disease outcome.
• via predictive learning: sets of genetic loci that are predictive of the disease outcome.


Significance versus predictivity


Lo, A., Chernoff, H., Zheng, T., & Lo, S. H. (2015). Proceedings of the National Academy of Sciences, 112(45), 13892-13897.


[Figure: two panels, each overlaying a pair of normal densities.
Left panel: N(0, 1) versus N(3, 3^2). Median significance level s_X = 0.0014; prediction rate 1 - e_X = 0.83.
Right panel: N(0, 1) versus N(0, 0.05^2). Median significance level s_Y = 0.5; prediction rate 1 - e_Y = 0.94.]
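The contrast between the two panels can be checked numerically. The sketch below is an illustration, not code from the talk: it assumes equal class priors, reads the two pairs of distributions as N(0, 1) versus N(3, 3^2) and N(0, 1) versus N(0, 0.05^2), and takes "median significance level" to mean the one-sided p-value under N(0, 1) evaluated at the median of the alternative. That last reading is an assumption. Under these assumptions the results land close to the values quoted on the slide: the pair that is easy to declare significant is the harder one to predict.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mean=0.0, sd=1.0):
    """Distribution function of N(mean, sd^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def bayes_rate(null, alt, lo=-40.0, hi=40.0, step=1e-3):
    """Best achievable correct-classification rate with equal priors:
    0.5 * integral of max(f_null, f_alt), approximated by the midpoint rule."""
    n_steps = int(round((hi - lo) / step))
    total = 0.0
    for k in range(n_steps):
        x = lo + (k + 0.5) * step
        total += max(normal_pdf(x, *null), normal_pdf(x, *alt))
    return 0.5 * total * step

def median_one_sided_p(null, alt):
    """One-sided p-value under the null at the alternative's median
    (assumed reading; for a normal alternative the median is its mean)."""
    return 1.0 - normal_cdf(alt[0], *null)

null = (0.0, 1.0)     # N(0, 1), given as (mean, sd)
alt_x = (3.0, 3.0)    # N(3, 3^2)
alt_y = (0.0, 0.05)   # N(0, 0.05^2)

rate_x, rate_y = bayes_rate(null, alt_x), bayes_rate(null, alt_y)
p_x, p_y = median_one_sided_p(null, alt_x), median_one_sided_p(null, alt_y)
print(f"X: median p = {p_x:.4f}, Bayes rate = {rate_x:.2f}")
print(f"Y: median p = {p_y:.4f}, Bayes rate = {rate_y:.2f}")
```

The integration grid and range are arbitrary choices that are fine for these two examples; a narrower alternative would need a finer step.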

[Figure: predictive variable sets (VS) compared with significant VS. Axis ticks: 0.0 to 0.4, and 10^-1 to 10^-7.]

https://github.com/tz33cu/PartitionRetention

[Figure: predictive VS compared with significant VS, extended range. Axis ticks: 0.0 to 0.4, and 10^-1 to 10^-12.]

[Figure: true prediction rate (scale 0.54 to 0.68) as a function of odds ratio (OR) and minor allele frequency (MAF).]

[Figure: three panels showing the prediction rate in the training set (scale 0.64 to 0.74) as a function of OR and MAF.]

[Figure: three panels showing the chi-square test p-value on a -log scale (0 to 35) as a function of OR and MAF.]

[Diagram of nested and overlapping sets, with labels:]
• Significant set by a test using a small sample
• Significant set by a test using a large sample
• Significant set by a test using a huge sample
• Set of variable modules with predictive power above a certain threshold


Prediction‐oriented measure


Predictivity of a variable set

Notation

o $X = (X_1, \dots, X_m)$ is a set of dichotomous variables under evaluation.

o $\Pi_X$ is the partition based on $X$.

o $Y \in \{d, u\}$ is the disease outcome of interest.

o $p_d(x) = P(X = x \mid Y = d)$ is the conditional distribution of $X$ given $Y = d$.

o $p_u(x) = P(X = x \mid Y = u)$ is the conditional distribution of $X$ given $Y = u$.

Assume $P(Y = d) = P(Y = u) = 0.5$.

Bayes rate for predicting $Y$:

$$\theta_c(X) = \frac{1}{2} \sum_{x \in \Pi_X} \max\{p_d(x), p_u(x)\} = \frac{1}{2} + \frac{1}{4} \sum_{x \in \Pi_X} |p_d(x) - p_u(x)|.$$

Maximal potential ability to predict.
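The step from the "max" form of the Bayes rate to the "1/2 + 1/4" form rests on an elementary identity. The derivation below is added for completeness; it writes $p_d(x)$ and $p_u(x)$ for the conditional distributions of the variable set among cases and controls, and $\theta_c$ for the Bayes rate under equal priors (the symbol names are assumptions where the slide text is unreadable).

```latex
\begin{align*}
\max\{a, b\} &= \tfrac{1}{2}(a + b) + \tfrac{1}{2}\,|a - b|, \\
\theta_c = \tfrac{1}{2}\sum_{x} \max\{p_d(x), p_u(x)\}
  &= \tfrac{1}{4}\sum_{x}\bigl(p_d(x) + p_u(x)\bigr)
     + \tfrac{1}{4}\sum_{x}\,|p_d(x) - p_u(x)| \\
  &= \tfrac{1}{2} + \tfrac{1}{4}\sum_{x}\,|p_d(x) - p_u(x)| .
\end{align*}
```

The last line uses $\sum_x p_d(x) = \sum_x p_u(x) = 1$. The rate is therefore 1/2 when the two conditional distributions coincide and 1 when they have disjoint supports.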

Sample estimate?

$n_d$, number of cases; $n_u$, number of controls.

$n_{d,x}$, number of cases with $X = x$.

$n_{u,x}$, number of controls with $X = x$.

$\hat{p}_d(x) = n_{d,x} / n_d$, $\hat{p}_u(x) = n_{u,x} / n_u$.

$$\hat{\theta}_c(X) = \frac{1}{2} + \frac{1}{4} \sum_{x \in \Pi_X} \left| \frac{n_{d,x}}{n_d} - \frac{n_{u,x}}{n_u} \right|.$$

This is the naïve training prediction rate based on $\Pi_X$.
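To see why the plug-in estimate is called naïve, it helps to compute it on a small example. The sketch below is illustrative only: the function name `naive_training_rate`, the factorial design, and the hand-picked labels are all assumptions, not material from the talk. It evaluates one half plus one quarter of the summed absolute differences between case and control cell frequencies, for nested variable sets of growing size.

```python
from collections import Counter
from itertools import product

def naive_training_rate(rows, labels, subset):
    """1/2 + 1/4 * sum over cells |n_{d,x}/n_d - n_{u,x}/n_u| for the partition
    generated by the variables in `subset` (label 1 = case, 0 = control)."""
    n_d = sum(labels)
    n_u = len(labels) - n_d
    cases, controls = Counter(), Counter()
    for row, y in zip(rows, labels):
        cell = tuple(row[j] for j in subset)
        (cases if y == 1 else controls)[cell] += 1
    cells = set(cases) | set(controls)
    gap = sum(abs(cases[c] / n_d - controls[c] / n_u) for c in cells)
    return 0.5 + 0.25 * gap

# Hypothetical data: a full 2^4 factorial design with hand-picked labels
# (8 cases, 8 controls); nothing here comes from the talk.
rows = list(product([-1, 1], repeat=4))
labels = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1]

# Nested variable sets of size k = 0, 1, ..., 4.
rates = [naive_training_rate(rows, labels, range(k)) for k in range(5)]
for k, r in enumerate(rates):
    print(f"k = {k}: naive training rate = {r:.3f}")
```

Refining a partition can never lower this quantity, and once every cell holds a single observation it equals 1 whatever the labels are. That is the overfitting the slide title alludes to.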


A better sample measure for predictivity?


Chernoff, Herman, Shaw-Hwa Lo, and Tian Zheng. “Discovering influential variables: a method of partitions.” The Annals of Applied Statistics (2009): 1335-1369.

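$$I_\Pi = \sum_{j \in \Pi} n_j^2 (\bar{Y}_j - \bar{Y})^2 = \sum_{i=1}^{3^m} (n_{d,i} + n_{u,i})^2 \left( \frac{n_{d,i}}{n_{d,i} + n_{u,i}} - \frac{n_d}{n_d + n_u} \right)^2 = \left( \frac{n_d n_u}{n_d + n_u} \right)^2 \sum_{i=1}^{3^m} \left( \frac{n_{d,i}}{n_d} - \frac{n_{u,i}}{n_u} \right)^2.$$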
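The first and last forms of the I-score above can be checked against each other numerically. The following sketch is an illustration under stated assumptions: the function names and the cell counts are hypothetical, and the outcome is coded 1 for cases and 0 for controls so that the cell mean is the case fraction in the cell.

```python
def i_score_cell_means(cases, controls):
    """First form: sum_j n_j^2 (Ybar_j - Ybar)^2, with Y = 1 for cases, 0 for controls."""
    n_d, n_u = sum(cases), sum(controls)
    y_bar = n_d / (n_d + n_u)
    total = 0.0
    for d, u in zip(cases, controls):
        n_j = d + u
        if n_j > 0:  # empty cells contribute nothing
            total += n_j ** 2 * (d / n_j - y_bar) ** 2
    return total

def i_score_case_control(cases, controls):
    """Last form: (n_d n_u / (n_d + n_u))^2 * sum_i (n_{d,i}/n_d - n_{u,i}/n_u)^2."""
    n_d, n_u = sum(cases), sum(controls)
    scale = (n_d * n_u / (n_d + n_u)) ** 2
    return scale * sum((d / n_d - u / n_u) ** 2 for d, u in zip(cases, controls))

# Hypothetical cell counts for a four-cell partition (not data from the talk).
cases = [12, 3, 2, 13]
controls = [4, 11, 10, 5]

form_1 = i_score_cell_means(cases, controls)
form_3 = i_score_case_control(cases, controls)
print(form_1, form_3)
```

For these counts each cell contributes 16, so both forms give 64 up to floating-point rounding. The agreement follows from $n_j(\bar{Y}_j - \bar{Y}) = \frac{n_d n_u}{n_d + n_u}\left(\frac{n_{d,j}}{n_d} - \frac{n_{u,j}}{n_u}\right)$.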


A lower bound for predictivity

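$$\theta_c(X) = \frac{1}{2} + \frac{1}{4} \sum_{x \in \Pi_X} |p_d(x) - p_u(x)| \;\ge\; \dots$$

[The right-hand side of the bound did not survive extraction. It is expressed through a limit, as the sample size grows, of a quantity based on the I-score; see the reference that follows.]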

Lo, A., Chernoff, H., Zheng, T., & Lo, S. H. (2016). Framework for making better predictions by directly estimating variables’ predictivity. Proceedings of the National Academy of Sciences, 113(50), 14277-14282.
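One elementary bound of this type can be derived directly from the definitions. It is offered as a sketch and is not necessarily the form shown on the slide or proved in Lo et al. (2016). It assumes $n = n_d + n_u$, $n_d / n \to \lambda \in (0, 1)$, the unnormalized I-score $I_\Pi = \sum_j n_j^2 (\bar{Y}_j - \bar{Y})^2$, and convergence in probability of the cell frequencies.

```latex
\begin{align*}
\sum_x |p_d(x) - p_u(x)|
  &\ge \Bigl(\sum_x \bigl(p_d(x) - p_u(x)\bigr)^2\Bigr)^{1/2}, \\
\frac{I_\Pi}{n^2}
  = \Bigl(\frac{n_d\, n_u}{n^2}\Bigr)^2
    \sum_i \Bigl(\frac{n_{d,i}}{n_d} - \frac{n_{u,i}}{n_u}\Bigr)^2
  &\xrightarrow{\;p\;} \lambda^2 (1 - \lambda)^2 \sum_x \bigl(p_d(x) - p_u(x)\bigr)^2, \\
\theta_c = \frac{1}{2} + \frac{1}{4}\sum_x |p_d(x) - p_u(x)|
  &\ge \frac{1}{2} + \frac{1}{4\,\lambda (1 - \lambda)}
       \Bigl(\lim_{n \to \infty} \frac{I_\Pi}{n^2}\Bigr)^{1/2}.
\end{align*}
```

The first line holds because the square of a sum of nonnegative terms is at least the sum of their squares. With balanced sampling, $\lambda = 1/2$, the bound reads $\theta_c \ge 1/2 + \lim \sqrt{I_\Pi}/n$.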


Example 1: Variable sets as partitions

Scenario 1: $X_1$ and $X_2$ are independent with
• $P(X_i = 1) = P(X_i = -1) = 1/2$,
• $Y = X_1 X_2$.

            X1 = 1     X1 = -1
X2 = 1      Y = 1      Y = -1
X2 = -1     Y = -1     Y = 1

Overall mean of Y = 0.

Scenario 2: $X_1$, $X_2$, $X_3$ are independent with
• $P(X_i = 1) = P(X_i = -1) = 1/2$,
• $Y = X_1 X_2$.

The table of Y against (X1, X2) is the same for X3 = 1 and for X3 = -1, so the overall mean of Y is again 0.
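A small enumeration makes the point of the example concrete. The sketch below assumes the scenarios describe independent variables uniform on {-1, 1} with Y = X1 * X2, a reading that matches the simulated outcome `yy = x1*x2 + noise` used later in the talk; the function name `partition_bayes_rate` is hypothetical. It computes the population Bayes rate for several variable sets: single variables carry no information, while the pair {X1, X2} predicts Y perfectly, with or without the irrelevant X3.

```python
from itertools import product

def partition_bayes_rate(outcomes, subset):
    """Bayes rate 1/2 * sum_x max{p_d(x), p_u(x)} for the partition generated by
    `subset`; `outcomes` lists equally likely (x, y) pairs with y in {-1, 1}."""
    n_d = sum(1 for _, y in outcomes if y == 1)
    n_u = len(outcomes) - n_d
    p_d, p_u = {}, {}
    for x, y in outcomes:
        cell = tuple(x[j] for j in subset)
        target, size = (p_d, n_d) if y == 1 else (p_u, n_u)
        target[cell] = target.get(cell, 0.0) + 1.0 / size
    cells = set(p_d) | set(p_u)
    return 0.5 * sum(max(p_d.get(c, 0.0), p_u.get(c, 0.0)) for c in cells)

# Assumed Scenario 2: X1, X2, X3 independent and uniform on {-1, 1}, Y = X1 * X2.
outcomes = [(x, x[0] * x[1]) for x in product([-1, 1], repeat=3)]

for subset in [(0,), (1,), (2,), (0, 2), (0, 1), (0, 1, 2)]:
    names = ", ".join(f"X{j + 1}" for j in subset)
    print(f"{{{names}}}: Bayes rate = {partition_bayes_rate(outcomes, subset):.2f}")
```

Adding X3 to {X1, X2} leaves the population rate unchanged; the cost of the extra variable only shows up in finite samples, where the finer partition has fewer observations per cell.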


A data set of 50 observations

nn.use <- 50                                # sample size, per the slide title
x1 <- 1 - 2 * rbinom(nn.use, 1, 0.5)        # x1, x2, x3: independent, uniform on {-1, 1}
x2 <- 1 - 2 * rbinom(nn.use, 1, 0.5)
x3 <- 1 - 2 * rbinom(nn.use, 1, 0.5)        # x3 does not enter the outcome
yy <- x1 * x2 + rnorm(nn.use, 0, 1)         # interaction signal plus N(0, 1) noise
yy <- 1 * (yy > 0)                          # dichotomize the outcome
# No seed is set, so a rerun will not reproduce the counts shown here.

> ftable(yy, x1, x2, x3)
         x3 -1  1
yy x1 x2
0  -1 -1     1  1
      1      2  8
   1  -1     3  4
      1      3  1
1  -1 -1     8  2
      1      1  0
   1  -1     0  2
      1      7  7

yy
 0  1
22 28


Influence on Y

Cell entries are class-conditional frequencies: the count of Y = 0 observations over 22 and the count of Y = 1 observations over 28.

                    X3 = -1                  X3 = 1
X1 = -1, X2 = -1    Y=0: 1/22, Y=1: 8/28     Y=0: 1/22, Y=1: 2/28
X1 = -1, X2 =  1    Y=0: 2/22, Y=1: 1/28     Y=0: 8/22, Y=1: 0/28
X1 =  1, X2 = -1    Y=0: 3/22, Y=1: 0/28     Y=0: 4/22, Y=1: 2/28
X1 =  1, X2 =  1    Y=0: 3/22, Y=1: 7/28     Y=0: 1/22, Y=1: 7/28


[Figure: four panels, each plotted against the size of the variable module (k = 1 to 10) for nested modules x1; x1, x2; ...; x1, ..., x10.

A. True Bayes rate (0.50 to 0.70), with and without influential Xs. Legend: influential, noise.

B. PR's I-score (0.50 to 0.70). Distributions of estimated prediction using PR's I, reflecting different rates between scenarios (1) and (2), with the largest difference at k = 5.

Chi-square test statistic. Due to the small sample size, the chi-square test does not have power for detecting influential Xs.

Training rate (0.5 to 1.0). Due to the small sample size, the empirical training rate does not reflect the true prediction rate.]


Simulation studies


A six‐gene network


Multiplicative odds ratio

[Figure: prediction rate by variable set, in panels for 250 cases and 250 controls, 500 cases and 500 controls, and 1000 cases and 1000 controls.
One group of panels (scale 0.50 to 0.70) shows the theoretical Bayes rate, the out-of-sample prediction rate, and the lower bound based on the I-score.
A second group of panels (scale 0.4 to 1.0) shows the theoretical Bayes rate, the out-of-sample prediction rate, and the training-set prediction rate.
Variable sets on the horizontal axis: x2 x3; x1 x2 x3; then x1 x2 x3 with x7, x8, ..., x12 added one at a time.]

[Figure: same layout as the previous figure, for variable sets built on x1 x2 x3 x4 x5 x6, with x7, x8, ..., x12 added one at a time.
One group of panels (scale 0.50 to 0.70) shows the theoretical Bayes rate, the out-of-sample prediction rate, and the lower bound based on the I-score.
A second group of panels (scale 0.3 to 1.0) shows the theoretical Bayes rate, the out-of-sample prediction rate, and the training-set prediction rate.]


Application to genetic studies


Backward dropping algorithm (BDA)

Input: a candidate set of k variables.
Output: returned influential set (possibly empty).

• A greedy backward screening based on I-score.
• At each step, the variable whose removal leads to the largest gain in I-score is removed.
• Stops when there is no more gain.
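The three bullets above can be sketched in a few lines. This is a minimal illustration that follows the slide's wording, assuming the unnormalized I-score $\sum_j n_j^2 (\bar{Y}_j - \bar{Y})^2$ and a noise-free toy data set; the function names are hypothetical, and the published algorithm may differ in its normalization, stopping rule, and in how an empty set is returned (no thresholding step is shown here).

```python
from collections import defaultdict
from itertools import product

def i_score(rows, y, subset):
    """I = sum over cells of n_j^2 (Ybar_j - Ybar)^2 for the partition from `subset`."""
    y_bar = sum(y) / len(y)
    cells = defaultdict(list)
    for row, label in zip(rows, y):
        cells[tuple(row[j] for j in subset)].append(label)
    return sum(len(v) ** 2 * (sum(v) / len(v) - y_bar) ** 2 for v in cells.values())

def backward_dropping(rows, y, candidates):
    """Greedy backward screening: repeatedly drop the variable whose removal raises
    the I-score the most; stop when no single removal gives a gain."""
    current = list(candidates)
    best = i_score(rows, y, current)
    while len(current) > 1:
        trials = [(i_score(rows, y, [v for v in current if v != drop]), drop)
                  for drop in current]
        top_score = max(score for score, _ in trials)
        if top_score <= best:  # no more gain: stop
            break
        drop = next(d for score, d in trials if score == top_score)
        current.remove(drop)
        best = top_score
    return current, best

# Hypothetical noise-free toy data: full 2^4 factorial design, Y = 1 when x0 == x1.
rows = list(product([-1, 1], repeat=4))
y = [1 if r[0] == r[1] else 0 for r in rows]

selected, score = backward_dropping(rows, y, [0, 1, 2, 3])
print(selected, score)
```

On this toy design the I-score rises from 4 to 8 to 16 as the two irrelevant variables are dropped, and dropping either remaining variable would send it to 0, so the search stops at the interacting pair.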


Example: Application to rheumatoid arthritis (RA)

Rheumatoid arthritis (RA) is a heterogeneous disease that exhibits a complex genetic component.

We studied 349 controls and 474 cases with genotypes on 5407 SNPs throughout the genome. We used a two-stage screening for this data set.

• First stage: use standard BGTA (backward genotype-trait association) screening and select approximately the top 20% of markers by importance.
• Second stage: further screening to identify important marker clusters.

Significant markers were selected based on FDR estimated using permutations.

We identified 39 loci that showed strong association with RA, about 2/3 of which were found in the RA literature, and constructed an association network among them using association scores.
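The permutation idea mentioned above can be illustrated for a single variable set. The sketch below is not the authors' procedure: it only builds a permutation reference distribution for an I-score by shuffling the outcome labels, and it does not show how such reference values are turned into an FDR estimate across many markers. The function names, the toy data, and the choice of 200 permutations are assumptions.

```python
import random
from collections import defaultdict
from itertools import product

def i_score(rows, y, subset):
    """I = sum over cells of n_j^2 (Ybar_j - Ybar)^2 for the partition from `subset`."""
    y_bar = sum(y) / len(y)
    cells = defaultdict(list)
    for row, label in zip(rows, y):
        cells[tuple(row[j] for j in subset)].append(label)
    return sum(len(v) ** 2 * (sum(v) / len(v) - y_bar) ** 2 for v in cells.values())

def permutation_p_value(rows, y, subset, n_perm=200, seed=0):
    """Share of label permutations whose I-score is at least the observed one,
    with the usual +1 correction so the estimate is never exactly zero."""
    rng = random.Random(seed)
    observed = i_score(rows, y, subset)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if i_score(rows, shuffled, subset) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: full 2^4 factorial design, Y = 1 when x0 == x1.
rows = list(product([-1, 1], repeat=4))
y = [1 if r[0] == r[1] else 0 for r in rows]

p_signal = permutation_p_value(rows, y, (0, 1))  # the interacting pair
p_noise = permutation_p_value(rows, y, (2, 3))   # two irrelevant variables
print(p_signal, p_noise)
```

Shuffling the labels keeps the joint distribution of the markers intact while breaking any link with the outcome, which is why the same scheme applies to variable sets of any size.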


Relation to big data prediction

Feature selection
• Model-free evaluation of joint influence from multiple x variables on Y.

Feature generation
• Jointly selected variable sets suggest interactions among the x variables.
• Interaction terms within each selected VS can be viewed as a feature in a predictor at the construction stage.
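As a concrete reading of the feature-generation bullets, each selected variable set can contribute one interaction term to the design matrix of a downstream predictor. The sketch below uses a product term for variables coded -1/+1, which is only one possible choice of interaction feature; the function name and the selected sets are hypothetical.

```python
from math import prod

def interaction_features(rows, variable_sets):
    """One product term per selected variable set, for variables coded -1/+1."""
    return [[prod(row[j] for j in vs) for vs in variable_sets] for row in rows]

# Hypothetical selected variable sets (indices into each row) and a few observations.
variable_sets = [(0, 1), (2,)]
rows = [(1, 1, -1), (1, -1, 1), (-1, -1, 1), (-1, 1, -1)]

features = interaction_features(rows, variable_sets)
print(features)
```

For genotype data with three levels per marker, an indicator for each cell of the selected partition would be a more general alternative to the product term.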


Predictive learning workflow

raw data → Prediction-oriented variable set generation (VSG) → Predictive modeling (PM) → Outcome prediction


Predictive Gene Set identified for breast cancer (Wang et al 2012)


Wang, H., Lo, S. H., Zheng, T., & Hu, I. (2012). Interaction-based feature selection and classification for high-dimensional biological data. Bioinformatics, 28(21), 2834-2842.

van’t Veer et al (2002) data set


Conclusion

• Significance does not automatically mean high predictivity.
• Predictivity of variable sets can be treated as a parameter of interest.
• We propose a potential sample lower bound for predictivity.
• A better measure of predictivity can lead to more reliable findings, especially for interactions.


Thanks!