April 25th 2018
Automatic Optimization of Predictive Bioactivity Models
Chi Chung Lam, Fabian Steinmetz, Paul Czodrowski
Predictive Models in Production

Multiple models trained for biological targets:
• Random Forests
• Neural Networks
• Gradient Boosted Trees

NNs and GBTs are very sensitive to hyperparameter changes, so automated methods are needed to build models with the right hyperparameters
Millions of unique combinations possible
NN Architectures & Hyperparameters
NN-Architecture
• Layer-Type
• Number of Layers
• Neurons per Layer
• Activation-Functions
Training-Parameters
• Optimizer
• Learning-Rate
• Weight-Decay
• Batch-Size
• Loss-Function
• …
Hyperparameters
Guido Bolick: Automatic Generation of Neural Network Architectures Using a Genetic Algorithm | 27.09.2016
Genetic Algorithm for hyperparameter optimization
Genetic Algorithm Workflow
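The workflow can be sketched as a minimal genetic loop. This is an illustrative toy, not the actual implementation: the search space is cut down to three hyperparameters and `fitness` is a hypothetical stand-in for the inner-loop kappa of a trained NN, while the evolution settings (population 100, drop-worst-50%, 5% mutation rate, 30% crossing-over rate) follow the settings listed later in the deck.

```python
import random

random.seed(0)

# Toy search space; real entities also encode optimizer, loss, layers, etc.
SPACE = {
    "learning_rate": [0.05, 0.1, 0.5, 1.0],
    "batch_pct": list(range(5, 21)),
    "neurons": list(range(32, 513, 32)),
}

def random_entity():
    return {k: random.choice(v) for k, v in SPACE.items()}

def fitness(e):
    # Hypothetical stand-in for the validation kappa of a trained NN:
    # prefers a small learning rate and mid-sized layers.
    return -abs(e["learning_rate"] - 0.1) - abs(e["neurons"] - 256) / 512

def evolve(pop, mutation_rate=0.05, crossover_rate=0.30):
    pop = sorted(pop, key=fitness, reverse=True)
    survivors = pop[: len(pop) // 2]                 # drop worst 50 %
    children = []
    while len(survivors) + len(children) < len(pop):
        a, b = random.sample(survivors, 2)
        child = dict(a)
        for k in SPACE:
            if random.random() < crossover_rate:
                child[k] = b[k]                      # crossing-over
            if random.random() < mutation_rate:
                child[k] = random.choice(SPACE[k])   # mutation
        children.append(child)
    return survivors + children

population = [random_entity() for _ in range(100)]
for _ in range(10):
    population = evolve(population)
best = max(population, key=fitness)
```

Because the best half of each generation survives unchanged, the best fitness never decreases across generations.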
Comparing Global Models
Model         Description
RF            Random Forest with fixed hyperparams
Leiden DNN    DNN with fixed hyperparams
GA DNN        DNN with GA-optimized hyperparams
Random DNN    DNN with random-search-optimized hyperparams
Feature-Wise  Baseline model that takes the best fingerprint bit as its prediction
XGBoost       Gradient Boosted Trees with fixed hyperparams
Feature-Wise Baseline

Assume that each fingerprint bit is itself a prediction, and select the best bit

              Bit 0   Bit 1   Bit 2   Bit 3   Activity
Sample 1        1       0       0       1        0
Sample 2        1       0       0       0        0
Sample 3        1       1       1       1        1
Sample 4        1       1       1       0        1
Sample 5        0       0       1       1        0
Kappa score    0.41    1.00    0.67   -0.17
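The selection step can be sketched with an unweighted Cohen's kappa. The per-bit values then come out slightly different from the slide's, which may use another kappa variant, but Bit 1 still scores a perfect 1.00 and is selected:

```python
def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the
    observed agreement and p_e the agreement expected by chance."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_e = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# The 5 samples from the slide: 4 fingerprint bits and the activity label.
bits = [
    [1, 1, 1, 1, 0],  # Bit 0
    [0, 0, 1, 1, 0],  # Bit 1
    [0, 0, 1, 1, 1],  # Bit 2
    [1, 0, 1, 0, 1],  # Bit 3
]
activity = [0, 0, 1, 1, 0]

scores = [cohen_kappa(activity, b) for b in bits]
best_bit = max(range(len(bits)), key=lambda i: scores[i])
# Bit 1 reproduces the activity column exactly, so its kappa is 1.0.
```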
Global Model Performance

(Bar chart: kappa scores for the targets CACO, CLINT_H, CLINT_M, CLINT_R, HERG, and SOL, comparing RF, Leiden DNN, GA DNN, Random DNN, Feature-Wise, XGBoost, and XGBoost Random.)
GA vs Random Search Comparison
The mean kappa score increases as the GA evolves
However, a good solution is found too easily (already present in the initial 100 architectures)
A random search over the same search space finds a similar or better solution
Fingerprints hash a molecule's substructures into a fixed-length bit vector
A small fingerprint size causes bit "collisions"
A large fingerprint size leaves many redundant bits
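The collision effect can be illustrated by folding substructure identifiers into bit vectors of two sizes. The fragment names here are invented stand-ins; real fingerprints hash actual substructure environments:

```python
import hashlib

def fold_into_bits(substructures, n_bits):
    """Hash each substructure identifier into a bit position (mod n_bits)
    and record which positions received more than one substructure."""
    fp = [0] * n_bits
    positions = {}
    for s in substructures:
        pos = int(hashlib.md5(s.encode()).hexdigest(), 16) % n_bits
        fp[pos] = 1
        positions.setdefault(pos, []).append(s)
    collisions = {p: subs for p, subs in positions.items() if len(subs) > 1}
    return fp, collisions

# Hypothetical substructure identifiers standing in for molecular fragments.
subs = [f"frag_{i}" for i in range(300)]
fp_small, coll_small = fold_into_bits(subs, 64)    # small FP: many collisions
fp_large, coll_large = fold_into_bits(subs, 4096)  # large FP: few collisions,
                                                   # but most bits stay unused
```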
Fingerprint Filtering: CLINT_R

FP size                              1024    4096
Avg substructures per bit           79.84   20.64
Bits removed by 0.01 var filter         3    2388
Substr/bit after 0.01 var filter    80.00   21.86
True size after 0.01 var filter      1021    1708
Feature selection of fingerprint bits by variance
Control: an unfiltered FP of the same length as the filtered FP
Problem: the threshold variance is chosen arbitrarily
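A minimal sketch of the variance filter on a toy fingerprint matrix; the data and the 0.01 threshold are illustrative:

```python
import numpy as np

def variance_filter(X, threshold):
    """Keep only fingerprint bits whose variance exceeds the threshold."""
    variances = X.var(axis=0)   # per-bit variance across samples
    mask = variances > threshold
    return X[:, mask], mask

# Toy fingerprint matrix: 6 samples x 5 bits.
X = np.array([
    [1, 0, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 0, 1],
])
X_filtered, kept = variance_filter(X, threshold=0.01)
# Bits 0 and 4 are constant (variance 0), so they are dropped; bit 1 is
# nearly constant but its variance (5/36) still clears the 0.01 threshold.
```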
Fingerprint Filtering: CLINT_R

(Bar charts: mean kappa score of DNN, RF, and XGB on CLINT_R for 1024-bit and 4096-bit fingerprints, comparing 0.01-variance filtering, 0.0-variance filtering, their same-length unfiltered controls, and the unfiltered fingerprint.)
Finding the optimal variance: CLINT_R

(Bar chart: mean kappa score on CLINT_R for 0.01-variance, 0.0-variance, optimal-variance, and unfiltered fingerprints.)
Finding the optimal variance: HERG

(Bar chart: mean kappa score on HERG for 0.01-variance, 0.0-variance, optimal-variance, and unfiltered fingerprints.)
Fingerprint Filtering: Problems
The variance of a bit depends strongly on the sample size
Use a threshold relative to the sample size instead of an absolute value
Can we combine this filtering with the "feature-wise baseline" analysis?
Drop bits that correlate poorly with the dependent variable?
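One way to make the threshold sample-size-relative, as suggested above, is to anchor it to a minimum occurrence count; `min_count` here is a hypothetical knob, not a value from the slides:

```python
def relative_variance_threshold(n_samples, min_count=5):
    """Variance threshold equivalent to requiring a binary bit to be set in
    at least `min_count` of `n_samples` samples.

    A binary bit with positive rate p has variance p * (1 - p), so setting
    p = min_count / n_samples yields the matching variance cutoff.
    """
    p = min_count / n_samples
    return p * (1 - p)

# The cutoff shrinks automatically as the dataset grows:
t_small = relative_variance_threshold(800)    # ~0.0062 for 800 samples
t_large = relative_variance_threshold(16000)  # ~0.0003 for 16000 samples
```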
Nested Cluster Validation
On-line Updating of Models

The final models are used in production and served to chemists, etc.
Retraining occurs every three months; during these three months the models are "outdated"
Retraining more frequently is impractical time-wise
XGB and DNNs allow "on-line" updating: fit new data in an additional training step on the existing model
This can happen in near real time
Full retraining is only necessary when performance starts declining
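The on-line pattern can be sketched generically. The production models are XGB and DNNs, but the idea (keep the existing weights and take a few extra gradient steps on the new batch) is easiest to show with a tiny logistic-regression model on made-up data:

```python
import numpy as np

class OnlineModel:
    """Minimal on-line-updatable classifier: logistic regression trained by
    gradient descent. New data is fitted in an additional training step on
    the existing weights instead of retraining from scratch."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))

    def update(self, X, y, epochs=20):
        for _ in range(epochs):
            err = self.predict_proba(X) - y          # gradient of the log-loss
            self.w -= self.lr * (X.T @ err) / len(y)
            self.b -= self.lr * err.mean()

rng = np.random.default_rng(0)
X0 = rng.normal(size=(200, 4))
y0 = (X0[:, 0] > 0).astype(float)      # toy endpoint: sign of feature 0
model = OnlineModel(n_features=4)
model.update(X0, y0)                   # initial (full) training

X_new = rng.normal(size=(20, 4))       # new measurements arriving later
y_new = (X_new[:, 0] > 0).astype(float)
model.update(X_new, y_new, epochs=5)   # near-real-time on-line refresh
```

XGBoost supports the same pattern by continuing training from an existing booster, and Keras-style DNNs by calling fit again on the trained model.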
Our in house environments: CREAM and MOCCA
CREAM (Classification REgression At Merck)
- Python environment and modelling tool
- Used for the majority of predictive models
- Holds versatile features, such as
- Multiple machine learning algorithms
- Different validation methods
- Interface to MOCCA
MOCCA is the Merck Online Computational Chemistry Analyzer, our
web-based in-house prediction tool
Global vs. local models

Global models
• Large dataset
• Large applicability domain (AD)
• Endpoints, such as
  • Physico-chemical properties
  • Pharmacokinetics
  • Toxicity
  • General selectivity

Local models
• Smaller dataset
• Smaller applicability domain
• Endpoints, such as
  • Activity
  • Selectivity
  • Toxicity, pharmacokinetics

Generally, global models are preferable due to greater in-house modelling experience and larger AD, but we are happy to support projects with local models if needed.
Acknowledgement
• Chi Chung Lam
• Wolf-Guido Bolick (Andreas Dominik)
• Fabian Steinmetz
• Kristina Preuer, Günter Klambauer (Sepp Hochreiter)
• Friedrich Rippmann
• Marcel Baltruschat
• Cornelius Kohl
• Samo Turk
• Jan Fiedler
• Christian Röder
back-up
Datasets

SET       Train   Test   Classes
CACO       9637    523      3
CLINT_H   16264    797      3
CLINT_M   18313    981      3
CLINT_R   15910    760      3
HERG       6894    288      2
SOL       19615    667      3
Millions of unique combinations possible
NN Architectures & Hyperparameters
NN-Architecture
• Layer-Type
• Number of Layers
• Neurons per Layer
• Activation-Functions
Training-Parameters
• Optimizer
• Learning-Rate
• Weight-Decay
• Batch-Size
• Loss-Function
• …
Hyperparameters
Optimization of Hyperparameters

Expert: hyperparameters derived from literature & experience

Lucky people: hyperparameter search within promising parameter areas

Everyone:
• Random search (Bergstra et al. 2012)
• Grid search (Larochelle et al. 2007)
• Probability-based algorithms (Brochu et al. 2010, Bergstra et al. 2011)
• Directed random search (e.g. genetic algorithms)
What is a Genetic Algorithm?
Validation Strategies
• Use as much data as possible for training
• Get a realistic glimpse of the performance
• 5-fold cross-validation
  • Every compound represented in 4/5 models
  • Hyperparameter optimization to increase performance on the validation sets
  • Resulting performance trustworthy?!
• 5-fold nested cross-validation: 25 models
  • Every compound represented in 16/25 models
  • Increased computational requirements
  • 5x hyperparameter optimizations to increase performance on the validation sets
  • Final performances evaluated on the corresponding outer-loop test sets
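The 16/25 count can be verified by simulating the splits: any compound sits in the outer training data in 4 of 5 outer folds and, within each of those, in the inner training data in 4 of 5 inner folds, so 4 x 4 = 16 of the 25 models see it during training:

```python
import numpy as np

def nested_cv_counts(n_samples, k=5, seed=0):
    """Simulate k-fold nested CV and count, per compound, how many of the
    k*k inner models include it in their training data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    outer_folds = np.array_split(idx, k)
    counts = np.zeros(n_samples, dtype=int)
    for o in range(k):
        # Outer training data: everything outside the o-th outer test fold.
        outer_train = np.concatenate(
            [f for j, f in enumerate(outer_folds) if j != o])
        inner_folds = np.array_split(outer_train, k)
        for i in range(k):
            # Inner training data: outer training data minus the i-th
            # inner validation fold.
            inner_train = np.concatenate(
                [f for j, f in enumerate(inner_folds) if j != i])
            counts[inner_train] += 1
    return counts

counts = nested_cv_counts(100)
# Every compound lands in the training data of exactly 16 of the 25 models.
```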
Training of a NN
1. Get a job (a hyperparameter set) from the jobserver
2. Repeat for all training/test sets:
  2.1 Build a NN from the hyperparameters
  2.2 Train the NN on a training set
      • A balanced-batch generator maintains the same active/inactive ratio within each batch
      • Early stopping when the mean validation loss of a sliding window (15 epochs) does not improve for 100 epochs
  2.3 Evaluate the best state (center of the best window) on the validation set, metric: Cohen's Kappa
      (Kappa relates the agreement of labels vs. prediction to the agreement of 2 random observers)
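A sketch of the sliding-window early stopping described above (window of 15 epochs, patience of 100 epochs, best state taken from the center of the best window); the loss curve here is synthetic:

```python
def early_stop_epoch(val_losses, window=15, patience=100):
    """Stop when the mean validation loss of a sliding window has not
    improved for `patience` epochs; return the center epoch of the best
    window and that window's mean loss."""
    best_mean, best_center, last_improvement = float("inf"), None, 0
    for end in range(window, len(val_losses) + 1):
        mean = sum(val_losses[end - window:end]) / window
        if mean < best_mean:
            best_mean = mean
            best_center = end - window // 2 - 1   # center epoch of the window
            last_improvement = end
        elif end - last_improvement >= patience:
            break                                 # patience exhausted
    return best_center, best_mean

# Synthetic loss curve: keeps improving, with alternating noise on top.
losses = [1.0 / (1 + e) + (0.01 if e % 2 else 0.0) for e in range(400)]
center, mean_loss = early_stop_epoch(losses)
```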
So many parameters..

Genetic Algorithm
• Population size: 100
• Workers: 10
• Fingerprint size: 1024
• SMARTS patterns: 826
• Evolution strategy: drop worst 50%

Mutation Settings
• Default:
  • Mutation rate: 5%
  • Mutation strength: 1
  • Crossing-over rate: 30%
• Increased:
  • Mutation rate: 10%
  • Mutation strength: 2
  • Crossing-over rate: 30%

Training
• Optimizer: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam
• Loss functions: mae, mse, msle
• Learning rate: 0.05, 0.1, 0.5, 1.0
• Weight decay: 0.0, 1E-7, 5E-7
• Momentum: 0.0, 0.1, …, 0.9
• Nesterov: 0, 1
• Batch size: 5%, 6%, …, 20%

Architecture
• Layers: 1-4
• Layer types: Dense, Dropout
• Neurons: 32, 64, …, 512
• Dropout ratio: 5%, 10%, …, 90%
• Activation functions: linear, sigmoid, hard-sigmoid, softmax, relu, tanh
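Multiplying out the listed options backs up the earlier "millions of unique combinations" claim. The architecture factor below is only a lower bound (a single dense layer); deeper mixed Dense/Dropout stacks multiply it much further:

```python
from math import prod

# Number of unique training-parameter combinations listed above.
training_options = {
    "optimizer": 7, "loss": 3, "learning_rate": 4, "weight_decay": 3,
    "momentum": 10, "nesterov": 2, "batch_size": 16,  # 5%..20% in 1% steps
}
n_training = prod(training_options.values())  # 80,640 combinations

# Even the smallest architecture choice (one dense layer: 16 neuron counts
# x 6 activation functions = 96 variants) pushes the total past 7 million.
n_total_lower_bound = n_training * 16 * 6
```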
Datasets

Dataset     hERG         Micronucleus-Test
Compounds   6999         798
Actives     3205 (46%)   263 (33%)
Inactives   3794 (54%)   535 (67%)

Binary classification: inactive = 0, active = 1
Found NN-Hyperparameters
Improvement of NNs while running the GA
The initial population starts with inner-kappa values of ~0.6 in all splits
The GA is able to improve the performance of the best entities even further (red line)
Mutations can lead to badly performing entities (blue line) up to the last generation
Novelty of Architectures
The proportion of new entities in the population decreases over the runtime of the GA
A higher mutation rate (red line) increases the space searched by the GA
Influence of Hyperparameters
Example label "1_activation (344)": first hidden layer, the activation function of this layer, and the number of contributing pairs
Contributing pairs differ only in the shown parameter
Boxplots are based on the absolute difference of the inner-kappa values of each contributing pair
User-Interface
Conclusion
Implemented an algorithm to create a consensus model using 5-fold nested cross-validation
Each compound is represented in 16 of 25 NNs
The calculation needs 8-14 hours (e.g. overnight) on a GTX cluster
The GA improves the already high kappa values of the NNs even further
Kappa values of the final NN models are mostly above 0.5 ("moderate" according to Landis et al. 1977)
Further steps:
• Possibility to use chemical descriptors and multiple fingerprints
• Option to create multi-class models (more classes than just 0 and 1) and regression models
• (Polishing up and writing a paper)
Implementation of the GA