
Learning parsimonious ensembles for biomedical prediction problems and DREAM challenges

Ana Stanescu and Gaurav Pandey* Icahn School of Medicine at Mount Sinai, New York, NY

[email protected] http://research.mssm.edu/gpandey/

HETEROGENEOUS ENSEMBLES FOR DREAM CHALLENGES:

•  GOAL: Build heterogeneous ensembles, i.e., ensembles that combine predictions from diverse base predictors/models.

•  The parsimony of such an ensemble (i.e., selecting only a small subset of the available base predictors) can be of even greater value for DREAM challenges due to enhanced interpretability.

•  Ensemble Selection (ES)/Pruning is a potential approach for achieving this, but popular algorithms such as Caruana's ES (CES) are ad hoc (sub-optimal) and non-exhaustive.
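For concreteness, the sketch below shows the greedy, forward selection-with-replacement loop at the heart of CES, as described in the Caruana et al. references; the function name, array layout, and stopping rule are illustrative assumptions rather than the poster's implementation.

```python
import numpy as np

def ces_select(val_preds, y_val, metric, max_iters=50):
    """Minimal sketch of Caruana's ensemble selection (CES).

    val_preds: (n_models, n_samples) base-predictor scores on the validation set.
    metric:    callable(y_true, y_score) -> float, higher is better (e.g. F-max).
    Models are added greedily, with replacement, as long as the chosen metric
    improves; selection with replacement acts as an implicit weighting.
    """
    n_models = val_preds.shape[0]
    selected, best_score = [], -np.inf
    ensemble_sum = np.zeros(val_preds.shape[1])
    for _ in range(max_iters):
        # Score every candidate addition by the resulting averaged prediction.
        scores = [metric(y_val, (ensemble_sum + val_preds[m]) / (len(selected) + 1))
                  for m in range(n_models)]
        m_best = int(np.argmax(scores))
        if scores[m_best] <= best_score:   # stop when no candidate helps (ad hoc)
            break
        best_score = scores[m_best]
        selected.append(m_best)
        ensemble_sum += val_preds[m_best]
    return selected, best_score
```

Note how the loop is greedy and never revisits earlier choices, which is exactly the ad hoc, non-exhaustive behavior that the RL-based approach below is designed to avoid.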

REFERENCES:
•  A. Stanescu and G. Pandey, Learning parsimonious ensembles for unbalanced computational genomics problems. Pacific Symposium on Biocomputing (PSB) 2017 (in press).
•  S. Whalen, O. P. Pandey and G. Pandey, Predicting protein function and other biomedical characteristics. Methods 93(15): 92-102, 2016.
•  R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes, Ensemble selection from libraries of models. ICML 2004.
•  R. Caruana, A. Munson and A. Niculescu-Mizil, Getting the most out of ensemble selection. ICDM 2006.
•  C. J. C. H. Watkins and P. Dayan, Q-learning. Machine Learning 8(3-4), 1992.
•  G. Schweikert, G. Rätsch, C. Widmer, and B. Schölkopf, An empirical analysis of domain adaptation algorithms for genomic sequence analysis. NIPS 2009.

MATERIALS and METHODS:

Target problem: Splice Site Prediction

[Figure: experimental workflow. Each dataset is split into a training set (60%), a validation set (20%), and a test set (20%). Base predictors are trained on the training set; their performance estimates (F-max) are obtained on the validation set as needed by the ensemble selection module ((A) CES or (B) RL using lattices); predictions on the test set are then obtained using the selected ensemble model. This is repeated over 5 folds of cross-validation, and the performance (F-max) of the combined predictions over the 5 folds is assessed; the whole process is repeated K times.]

Visual description of the workflow used in our experiments. K=10 for the PF datasets and K=5 for the SS datasets.
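The sketch below (all names, and the threshold-grid F-max computation, are illustrative assumptions, not taken from the poster) shows one fold of this workflow: base predictors are trained on the 60% training split, the ensemble is selected against F-max on the 20% validation split, and the selected ensemble is scored on the 20% test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def f_max(y_true, y_score, thresholds=np.linspace(0.01, 0.99, 99)):
    """F-max: the maximum F1 score over a grid of classification thresholds."""
    best = 0.0
    for t in thresholds:
        y_pred = (y_score >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        if tp == 0:
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        best = max(best, 2 * prec * rec / (prec + rec))
    return best

def run_one_fold(X, y, base_learners, select_ensemble, seed=0):
    """One fold: 60/20/20 split, train base predictors, select on validation,
    evaluate the selected ensemble (averaged probabilities) on the test set."""
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, train_size=0.6, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=seed)
    models = [clf.fit(X_tr, y_tr) for clf in base_learners]   # need predict_proba
    val_preds = np.array([m.predict_proba(X_val)[:, 1] for m in models])
    test_preds = np.array([m.predict_proba(X_te)[:, 1] for m in models])
    selected, _ = select_ensemble(val_preds, y_val, metric=f_max)  # e.g. CES or RL
    return f_max(y_te, test_preds[selected].mean(axis=0))
```

Repeating this over the 5 cross-validation folds and K repetitions, and combining the test predictions over the folds before computing F-max, reproduces the overall scheme in the figure.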

•  DREAM challenges are a great mechanism for identifying the most effective solution(s) to challenging biomedical problems.
•  Can we improve the predictive ability of DREAM challenges by also considering the contributions of the non-winning submissions/models?

[Figure: example splice-site instance. Each instance is a 141-nucleotide sequence window containing exonic and intronic nucleotides, with the candidate splice site at the 61st position and an associated class label (+).]

Reinforcement Learning (RL)

Q-Learning
•  An algorithm for finding an optimal action-selection policy.
•  Proven to converge to an optimal solution (i.e., to find an optimal action-selection policy) under certain constraints.

RL Strategies for ES
•  RL_greedy: the reward is given by the ensemble's performance.
•  RL_pessimistic: reset to the start state as soon as performance drops.
•  RL_backtrack: go back one position when performance drops.
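A minimal, illustrative Q-learning sketch of ensemble selection over the subset lattice follows: states are sets of already-selected base predictors, an action adds one unused predictor, and the reward is the validation performance of the resulting ensemble. The tabular Q dictionary, the epsilon-greedy exploration, the learning rate, and the way the three strategies are expressed through a `strategy` flag are all assumptions made for illustration, not the poster's actual implementation.

```python
import random

def rl_ensemble_selection(val_preds, y_val, metric, episodes=200, epsilon=0.1,
                          alpha=0.5, gamma=1.0, strategy="greedy"):
    """Sketch of tabular Q-learning over the lattice of predictor subsets.

    strategy: 'greedy'      - just use the ensemble performance as the reward;
              'pessimistic' - reset to START as soon as performance drops;
              'backtrack'   - step back one position when performance drops.
    Returns the best-performing subset seen and its validation performance.
    """
    n_models = val_preds.shape[0]
    perf = lambda s: metric(y_val, val_preds[list(s)].mean(axis=0)) if s else 0.0
    Q = {}                                        # Q[(state, action)] -> value
    best_state, best_perf = frozenset(), 0.0

    for _ in range(episodes):
        state = frozenset()
        for _step in range(3 * n_models):         # cap the episode length
            if len(state) == n_models:
                break
            actions = [a for a in range(n_models) if a not in state]
            if random.random() < epsilon:                          # explore
                a = random.choice(actions)
            else:                                                  # exploit
                a = max(actions, key=lambda act: Q.get((state, act), 0.0))
            nxt = state | {a}
            reward = perf(nxt)
            future = max((Q.get((nxt, b), 0.0)
                          for b in range(n_models) if b not in nxt), default=0.0)
            # Q-learning update (Watkins & Dayan, 1992)
            Q[(state, a)] = Q.get((state, a), 0.0) + alpha * (
                reward + gamma * future - Q.get((state, a), 0.0))
            if reward > best_perf:
                best_state, best_perf = nxt, reward
            dropped = reward < perf(state)
            if dropped and strategy == "pessimistic":
                state = frozenset()                # reset the walk to START
            elif dropped and strategy == "backtrack":
                pass                               # step back: do not keep the move
            else:
                state = nxt
    return sorted(best_state), best_perf
```

In the workflow sketch above, `rl_ensemble_selection` can be passed as `select_ensemble` (e.g., via `functools.partial(rl_ensemble_selection, strategy="backtrack")`).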

RESULTS:

Results on the A. thaliana dataset:

Method          auESC    size_ratio@60  size_ratio@120  size_ratio@180  perf_ratio@60  perf_ratio@120  perf_ratio@180
BP              0.3833   0.0167         0.0083          0.0056          0.8118         0.7912          0.8237
FE              0.4769   1              1               1               1              1               1
CES             0.4549   0.4            0.31            0.24            0.9710         0.9577          0.9379
RL_greedy       0.4725   0.5            0.5             0.51            0.9946         0.9927          0.9945
RL_pessimistic  0.4634   0.48           0.29            0.21            0.9919         0.9649          0.9623
RL_backtrack    0.4721   0.87           0.79            0.75            0.9983         0.9919          0.9985

•  RL is able to keep predictive performance close to that of the full ensemble while using a much smaller number of base predictors.
•  It is more capable of achieving this balance than CES, especially for larger datasets.
•  Neither the downstream performance nor the sizes of the RL-selected ensembles are sensitive to the RL parameters (e.g., the exploitation/exploration probability), showing robustness to parameters compared with other, more ad hoc ES methods.

Acknowledgements: This work was partially supported by NIH grant # R01-GM114434 and an IBM faculty award to GP. We thank the Icahn Institute for Genomics and Multiscale Biology and the Minerva supercomputing team for their financial and technical support. We also thank Om P. Pandey and Gustavo Stolovitzky for their technical advice.

CONCLUSIONS:

•  Our approach can help extract more useful knowledge from DREAM challenges by constructing predictive and parsimonious ensembles of the submissions.
•  It will be applied in the DREAM Respiratory Viral challenge.
•  Implementation available: https://github.com/GauravPandeyLab/lens

Splice site (SS) datasets used in our experiments:

Problem      C. elegans   D. melanogaster   P. pacificus   C. remanei   A. thaliana
#Features    141          141               141            141          141
#Positives   1,598        997               1,596          1,600        1,600
#Negatives   158,150      99,003            156,326        157,542      158,377
Total        159,748      100,000           157,922        159,142      159,977
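The table implies exactly one feature per nucleotide position (141 features per window). The snippet below is purely a hypothetical illustration of such an encoding (an integer code per position); the poster does not state how the sequences were actually featurized.

```python
import numpy as np

# Hypothetical featurization: one integer feature per nucleotide position,
# giving 141 features for a 141-nt window (A=0, C=1, G=2, T=3, other=4).
NUC2INT = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_window(seq):
    assert len(seq) == 141, "splice-site windows are 141 nucleotides long"
    return np.array([NUC2INT.get(base, 4) for base in seq.upper()], dtype=np.int8)

x = encode_window("ACGT" * 35 + "A")   # toy 141-nt window
print(x.shape)                          # (141,)
```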

•  IDEA: A novel ensemble selection approach based on reinforcement learning (RL), which provides a systematic way of exhaustively exploring the many possible combinations of base predictors that can be selected into an ensemble (illustrated by the lattice below).

[Figure: lattice of all possible ensembles over five base predictors, from START (the empty ensemble) through the singletons (1)-(5), pairs, triples, and quadruples, to the full ensemble (1, 2, 3, 4, 5) at FINISH. Example validation performances are shown along one path: 0.565 for (2), 0.619 for (2, 5), 0.648 for (1, 2, 5), 0.654 for (1, 2, 4, 5), and 0.627 for the full ensemble (1, 2, 3, 4, 5), i.e., the best-performing ensemble on this path is a parsimonious subset rather than the full ensemble. The legend distinguishes possible actions, explore moves, exploit moves, and the learned policy.]
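To make the lattice concrete, the toy helper below (purely illustrative; not from the poster's code) enumerates the "possible actions" from a lattice state, i.e., the ensembles reachable by adding one unused base predictor.

```python
def possible_actions(state, n_models=5):
    """Successor ensembles of a lattice state (predictor ids 1..n_models)."""
    return [frozenset(state | {m}) for m in range(1, n_models + 1) if m not in state]

# The lattice over n base predictors has 2**n states (32 for five predictors).
# From the singleton ensemble (2), the reachable pairs match the figure's second level:
print(sorted(tuple(sorted(s)) for s in possible_actions({2})))
# [(1, 2), (2, 3), (2, 4), (2, 5)]
```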

[Figure: heterogeneous ensemble schematic. Data (D1, D2, D3, ..., DN) are used to train a library of diverse base models (MODEL 1, MODEL 2, MODEL 3, ..., MODEL N), whose predictions are combined into an ENSEMBLE.]
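For completeness, the snippet below illustrates the schematic with three different model types; the unweighted average of predicted probabilities is just one possible aggregation rule and is an assumption, since the poster does not specify how base predictions are combined.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def ensemble_proba(models, X):
    """Average the positive-class probabilities of heterogeneous base models."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

# Tiny synthetic example with a heterogeneous model library.
rng = np.random.RandomState(0)
X, y = rng.randn(200, 10), rng.randint(0, 2, 200)
models = [LogisticRegression(max_iter=200).fit(X, y),
          GaussianNB().fit(X, y),
          DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)]
print(ensemble_proba(models, X[:5]).round(3))
```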