Know More Before You Score: An Analysis of Structure-Based Virtual Screening Protocols
Structure-Based Virtual Screening (SBVS) is a proven technique for lead discovery
Still many areas for improvement
  Efforts generally focussed on scoring function
  Often with little consideration of the assumptions underpinning SBVS
Here we consider a number of these processes in detail from the perspective of our primary SBVS tool (DOCK)
  Ligand conformational search protocols
  Varying site point definitions
  Alteration of DOCK variables that directly affect sampling
Determine their impact on hit enrichment and search speed
Analyze implications for future research

Ligand Flexibility Studies - Strategy
SBVS is CPU intensive
Conformational searching of ligand clearly important
Sampling limited to allow search completion in reasonable time frame
Test required to compare different conformational sampling methods
  Ability to reproduce bioactive conformation tested
145 ligands from a 1995 analysis of PDB complexes (Gschwend, UCSF, unpublished)
30-compound subset chosen for analysis - selection based on visual and numerical inspection of diversity in ligand flexibility and functionality
Relatively small sample of molecules used, many peptidic in nature
  Peptidic moieties are among the better parameterized systems, so this is in some ways a best-case scenario

Ligand Flexibility Studies - Procedure
Multiple sampling techniques chosen:
  Catalyst-best / Catalyst-fast / Confort / Omega / DOCK
  Variety of sampling levels
Starting from Concord structure, conformers generated and superimposed onto PDB ligand conformation
Conformation with lowest heavy-atom RMS to the PDB ligand conformation used as quality measure
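
The quality measure described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: it assumes each conformer has already been superimposed onto the PDB ligand conformation and that heavy atoms are listed in the same order. The function names and the three-atom coordinates are hypothetical.

```python
import math

def heavy_atom_rmsd(coords_a, coords_b):
    """RMS deviation between two equally ordered lists of (x, y, z) heavy-atom coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

def best_conformer(reference, conformers):
    """Return (index, rmsd) of the conformer closest to the reference conformation."""
    scored = [(heavy_atom_rmsd(reference, c), i) for i, c in enumerate(conformers)]
    rmsd, index = min(scored)
    return index, rmsd

# Hypothetical three-atom ligand; conformers are assumed to be pre-superimposed.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
conformers = [
    [(0.0, 0.0, 0.0), (1.5, 1.0, 0.0), (3.0, 2.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.3, 0.0)],
]
best_index, best_rmsd = best_conformer(reference, conformers)
print(best_index, round(best_rmsd, 3))
```

Averaging the best RMS over all test ligands would give one number per search type, which is the kind of quantity the following result slides report.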

Ligand Flexibility Studies - Search Settings Employed
DOCK - conformation_cutoff_factor=3/5/10, clash_overlap=0.7 times vdW radius for clash overlap, with customized rules for bond increment settings
Confort - Rough (0.10 kcal) convergence, diverse conformer selection, boat ring search on - sampling at 5/10 confs per single bond + 500 max
Catalyst Best/Fast - Default settings - sampling at 5/10 confs per single bond + 100 max
Omega - Defaults + RMS_CUTOFF=1.0, GP_ENERGY_WINDOW=5.0, sampling at 100 max
In addition, Concord-generated and Sybyl-minimized ligand X-ray structures also analyzed as “controls”

Ligand Flexibility Results - Overall Performance (RMS / Rank)
[Figure: average internal rank (axis 0 to 14) and average RMS deviation (axis 0 to 1.80) by search type. Data labels shown: 0.76, 0.81, 0.88, 0.92, 0.87, 0.97, 0.96, 0.99, 0.99, 1.00, 1.03, 1.13, 1.76]

Ligand Flexibility Results - Performance vs Flexibility
[Figure: average RMS deviation (axis 0 to 2.5) grouped by ligand flexibility: 3 to 5 single bonds (15), 6 to 8 single bonds (7), 9 to 14 single bonds (8)]

Ligand Flexibility Results - The Pain Gain Ratio
Does extra noise introduced to scoring functions outweigh this improvement?
Is it worth the extra CPU?
[Figure: average conformations / molecule (axis 0 to 100) and average RMS deviation (axis 0 to 1.80) by search type: CONF 500, BEST 100, FAST 100, CONF 10, DOCK 10, FAST 5, DOCK 5, DOCK 3. Data labels shown: 425; 0.81, 0.87, 0.88, 0.92, 0.96, 0.97, 1.03, 1.125]

Ligand Flexibility Results - Visual Analysis
Even at lower RMS, deviation in hydrogen positions an issue
As RMS rises (0.9) we begin to see more significant deviations in heavy atom positions - large enough to possibly prove troublesome to standard force fields
[Figure panels: RMS=0.65, RMS=0.90]

Ligand Flexibility Results - Visual Analysis
As RMS rises further, hydrogen bond mapping begins to partially break down
Significant deviation begins to be seen although general shape complementarity is still reasonable
DOCKing tricky, pharmacophore searches possible with loose tolerances, although site point vector definitions (DISCO / Catalyst) a no-no
[Figure panels: RMS=2.19, RMS=1.55]

Ligand Flexibility - Conclusions
At current sampling levels used in virtual screening
  Rough search techniques perform comparably to more exhaustive methods
  DOCK performs quite well, and Fast does slightly better than comparable Best run
Results highlight the need for “forgiving” scoring functions and pharmacophore constraint tolerances (especially for flexible molecules)
  Generating function directly from crystal structure data may not be optimum
  Use the conformation closest to the biologically relevant structure with chosen sampling technique
May be better to ignore more flexible molecules when possible (more than ~8 bonds)
Analysis of more extensive data set might provide basis for determining if optimum sampling settings exist (Best/Omega/Confort)
  Coarseness of poling values for example

Structure-Based Search Protocols - An Analysis of DOCK
Working within current DOCK paradigm, which search protocols provide the optimum search criteria?
  Site point definitions
  Alteration of sampling variables
  Different scoring grids
Comparisons illustrated for 5 test systems with diverse active data sets
Analysis based on ranking within a list that includes ~10000 “noise” compounds
  “Random” selection within bounds of size and flexibility distribution seen in in-house database

Structure-Based Search Protocols - DOCK variables
Contains many variables that affect performance
  Ligand sampling within the site being the primary variant
nodes                        3/4
distance_tolerance           0.5/1.0
distance_minimum             3.0
bump_filter                  4
conformation_cutoff_factor   5
clash_overlap                0.7
maximum_orientations         500/5000
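
To make the size of this search space concrete, the settings above can be expanded into individual protocols. This is only a sketch: the slide does not say that every combination was run, so the full cross product, the dictionary layout and the helper name `enumerate_protocols` are assumptions. The variable names and values are those listed on the slide.

```python
from itertools import product

# Values from the slide; variables with two entries are the ones being varied.
dock_settings = {
    "nodes": [3, 4],
    "distance_tolerance": [0.5, 1.0],
    "distance_minimum": [3.0],
    "bump_filter": [4],
    "conformation_cutoff_factor": [5],
    "clash_overlap": [0.7],
    "maximum_orientations": [500, 5000],
}

def enumerate_protocols(settings):
    """Return one dict per combination of the listed variable values."""
    names = list(settings)
    value_lists = [settings[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

protocols = enumerate_protocols(dock_settings)
for protocol in protocols:
    # One "name value" pair per variable, printed on a single line per protocol.
    print("  ".join(f"{name} {value}" for name, value in protocol.items()))
```

With three two-valued variables this yields eight candidate protocols, which is small enough to benchmark each one against the same set of known actives.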

Structure-Based Search Protocols - DOCK and pharmacophoric constraints
It is possible to assign fairly sophisticated pharmacophoric (henceforth also known as chemical) definitions
name acid
# deprotonated carboxyl
definition O.co2 ( C )
# tetrazole
definition N.pl3 ( H ) ( N.2 ( N.2 ( N.2 ( C.2 ) ) ) )
definition N.pl3 ( H ) ( N.2 ( N.2 ( C.2 ( N.2 ) ) ) )
definition N.2 ( N.2 ( N.2 ( C.2 ( N.pl3 ( H ) ) ) ) )
definition N.2 ( N.2 ( C.2 ( N.pl3 ( H ) ( N.2 ) ) ) )
definition N.2 ( C.2 ( N.2 ( N.pl3 ( H ) ( N.2 ) ) ) )
definition N.2 ( N.2 ( C.2 ( N.2 ( N.pl3 ( H ) ) ) ) )
definition N.2 ( N.pl3 ( H ) ( N.2 ( N.2 ( C.2 ) ) ) )
# acyl sulphonamide
definition N.am ( S ( 2 O.2 ) ) ( C.2 ( O.2 ) )
definition O.2 ( C.2 ( N.am ( H ) ( S ( 2 O.2 ) ) ) )
definition O.2 ( S ( O.2 ) ( N.am ( H ) ( C.2 ( O.2 ) ) ) )
Current types:
heavy atom
donor
acceptor
hydrophobe
aromatic
aromatic_hydrophobic
acid
base
donor_and_acceptor
special (e.g. metal chelator)
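
The nested definitions above read as a central atom type followed by parenthesised bonded atoms, each of which may carry its own bonded atoms and an optional repeat count (as in `2 O.2`). The sketch below parses that shape into a tree and rejects unbalanced parentheses. It is an illustration based only on the lines shown on the slide, not DOCK's own parser, and it does not attempt any other construct the real definition format may support. The function name and dictionary keys are hypothetical.

```python
def parse_definition(line):
    """Parse 'definition ROOT ( bonded ... ) ...' into a nested dict."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "definition":
        raise ValueError("expected 'definition <atom type> ...'")
    pos = 1

    def parse_atom():
        nonlocal pos
        count = 1
        if pos < len(tokens) and tokens[pos].isdigit():
            count = int(tokens[pos])      # optional repeat count, e.g. '2 O.2'
            pos += 1
        if pos >= len(tokens) or tokens[pos] in ("(", ")"):
            raise ValueError("atom type expected")
        atom = {"type": tokens[pos], "count": count, "bonded": []}
        pos += 1
        while pos < len(tokens) and tokens[pos] == "(":
            pos += 1                      # consume '('
            atom["bonded"].append(parse_atom())
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1                      # consume ')'
        return atom

    root = parse_atom()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return root

acyl_sulphonamide = parse_definition("definition N.am ( S ( 2 O.2 ) ) ( C.2 ( O.2 ) )")
print(acyl_sulphonamide["type"], [b["type"] for b in acyl_sulphonamide["bonded"]])
```

Running every definition line through a check like this before a screen is a cheap way to catch a dropped parenthesis, which would otherwise silently change which ligand atoms receive a chemical label.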

Structure-Based Search Protocols - Site Points Used in Kinase Search
Region 1 (+ 4): acceptor / donor
Region 2: hydrophobic + 2 donors
Region 3: hydrophobic / any heavy atom

Structure-Based Search Protocols - Test Sets and Site Points Used
Sphgen used to generate site points for “generic” DOCK searches
Pharmacophore points derived from a mixture of non-data set bound ligands and in-house programs that process GRID maps and Connolly surfaces (plus plenty of human intervention)
Active data sets broken down into chemotypes to prevent the problem of common analogue bias - an under-appreciated issue in all validations

Target | Active chemotype definitions | Pharmacophore points / critical regions
2 serine proteases | P1 substituent / P1-P4 linker substituent | P1 (base / hydrophobe) + P4 (hydrophobe) pockets
2 fatty acid binding proteins | Core linking acid moiety to remaining substituents | Acid binding pocket
Kinase | Moiety mimicking adenine / main core of molecules | Adenine binding pocket (donor/acceptor) [+ rear hydrophobic pocket]
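
Counting chemotypes rather than compounds is what guards against analogue bias: a hit list full of close analogues scores well on compounds but poorly on chemotypes. A minimal sketch of that bookkeeping follows. It assumes the ranked list is available as chemotype labels, with `None` marking a noise compound; the function name and the toy ranking are hypothetical.

```python
def chemotypes_hit(ranked, top_n):
    """Count actives and distinct chemotypes among the first top_n ranked compounds.

    ranked: chemotype labels in rank order, None for noise compounds.
    """
    hits = [label for label in ranked[:top_n] if label is not None]
    return len(hits), len(set(hits))

# Hypothetical ranking: chemotype A is over-represented by analogues.
ranked = ["A", None, "A", "A", None, "B", None, None, "C", "A"]
compounds, chemotypes = chemotypes_hit(ranked, 6)
print(compounds, chemotypes)
```

In this toy list four actives appear in the top six, but they represent only two chemotypes, which is the distinction the chemotype breakdown is meant to expose.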

Results - Fatty acid binding protein 1
No. of hits after 7 chemotypes located by at least one search (500 compounds processed from 28 actives / 8 chemotypes)
Missing chemotype a citrazinate - not covered in chemical definitions - easy to fix - another advantage over electrostatics
[Figure: compounds hit (axis 0 to 20) and chemotypes hit (axis 0 to 8) by search type]

Results - Overall: Compounds Processed for 50% Chemotype Coverage for All Systems
[Figure: compounds processed (axis 0 to 2000) by search type; series: best hit rate, mean hit rate, worst hit rate]
Rigid conformer screens perform quite well in generic search mode
  One system contains predominantly rigid chemotypes, two others require a predominantly extended conformation for binding
On addition of critical and chemical constraints, inability of rigid search to adapt to more exacting requirements severely compromises results
Generic searches with addition of conformational flexibility show little improvement relative to rigid search
  Signal-to-noise issues
Addition of critical region constraint alone worsens results
  500 orientations per conformer too few for search - leads to premature termination of docking analysis for many ligands
Adding chemical in addition to critical constraints provides best balance for sampling parameters
  Still requires reasonable tolerances and a forgiving scoring function for optimum results
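
The measure used on this slide, compounds processed before half the chemotypes have been found, can be sketched as below. The input format (chemotype labels in rank order, `None` for noise), the rounding up of the 50% target, and the function name are assumptions made for illustration.

```python
import math

def compounds_for_coverage(ranked, total_chemotypes, fraction=0.5):
    """Return how many ranked compounds must be processed to find the given
    fraction of chemotypes, or None if that coverage is never reached."""
    needed = math.ceil(fraction * total_chemotypes)
    seen = set()
    for processed, label in enumerate(ranked, start=1):
        if label is not None:
            seen.add(label)
            if len(seen) >= needed:
                return processed
    return None

# Hypothetical ranking with four known chemotypes (A to D).
ranked = [None, "A", "A", None, "B", None, "C", "D"]
print(compounds_for_coverage(ranked, total_chemotypes=4))
```

A lower value means the protocol surfaces diverse actives earlier, so best, mean and worst values across test systems can be compared directly between search types.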

Results - Sample Hit Rate Comparisons
Kinase sites tend to be highly mobile
  Forgiving DOCK scoring function more appropriate
Fatty acid active site deep and fairly rigid
  Prometheus at least comparable performance to DOCK even with more simplistic constraints
[Figure: Kinase - chemotypes hit (axis 0 to 12) vs compounds processed (100 to 800); series: Prometheus unconstrained, Prometheus constrained, DOCK constrained, DOCK unconstrained]
[Figure: Fatty acid binding protein 1 - chemotypes hit (axis 0 to 8) vs compounds processed; series: Prometheus constrained, DOCK constrained]

Results - Sample Hit Rate Comparisons
Illustrates how addition of constraints can allow performance of simplistic scoring functions to surpass those deemed more sophisticated
[Figure: Serine protease 1 - compounds hit (axis 0 to 18) vs compounds processed (100 to 500); series: DOCK constrained, ICM]

Results - Sample Hit Rate Comparisons
Removing highly flexible molecules from the search reduces the noise at the top of the hit list
  In a database of 250000, the top 100 becomes top 2500
  Could be crucial when only small data sets can be assayed
  Smaller molecules generally make better leads
[Figure: Average Hit Rates Using Different Flexibility Constraints - chemotypes hit (axis 0 to 8) vs compounds processed (100 to 500); series: Max 15 bonds, Max 8 bonds]
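
The flexibility constraint compared here amounts to a pre-filter on the screening database. A minimal sketch follows, assuming each molecule record already carries a count of rotatable single bonds. The record layout, field name and molecule identifiers are hypothetical; the cut-offs of 8 and 15 bonds are the two values compared on the slide.

```python
def filter_by_flexibility(molecules, max_bonds=8):
    """Keep only molecules with at most max_bonds rotatable single bonds."""
    return [mol for mol in molecules if mol["single_bonds"] <= max_bonds]

# Hypothetical database records.
database = [
    {"id": "mol-1", "single_bonds": 4},
    {"id": "mol-2", "single_bonds": 9},
    {"id": "mol-3", "single_bonds": 8},
    {"id": "mol-4", "single_bonds": 15},
]
strict = filter_by_flexibility(database, max_bonds=8)
loose = filter_by_flexibility(database, max_bonds=15)
print([mol["id"] for mol in strict], len(loose))
```

Applying the filter before docking also saves CPU, since the most flexible molecules are the ones that generate the most conformers.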

Conclusions - The hypothesis hypothesis
Sampling choices have a profound effect on SBVS results
For maximum impact with current methodology, scoring functions should either
  Be designed/utilized with these limitations in mind
    Forgiving / targeted at less flexible molecules
  Improve results by such a high degree that additional sampling (and CPU) is warranted
In the meantime, utility of pharmacophoric hypotheses {critical region(s) with pharmacophoric constraints} is clear
  Better results faster
  Less sensitivity to model coarseness
  Allows constraints exploiting known structural biology
Key to optimum use is balancing constraints and tolerances to ensure sufficient sampling
  Benchmarking with known ligands is one way to do this

Acknowledgements
Thank you to my BMS CADD colleagues