When Command Line Tools meet KNIME:
Using the best of the two worlds to support
drug discovery teams
Man-Ling Lee 250th ACS National Meeting 08/17/2015 Workflow Tools & Data Pipelining in Drug Discovery Symposium
What this talk is about
Disclaimer: This talk is not about application user interfaces!
Motivation: It is about Genentech’s approach to building an extensible and sustainable infrastructure to support drug discovery teams.
Three Example Applications
Project Vortex Sessions
✓ Easy-to-use command line tools enable comp chemists to take ownership of the project database
✓ KNIME/command line tool integration to facilitate the maintenance of the Unix pipes
DMPK Model Validation
✓ Easy-to-understand predicted probabilities for prioritizing compounds in a “local” context
✓ KNIME/command line tool integration to enable comp chemists to build high-performance end-user applications
Fragment Hit Clustering
✓ Easy-to-visualize AAPSim/SE clustering supports project teams in hit follow-up processes
✓ Use command line tools to build new command line tools
Genentech’s Way of Accessing Project Data
Comp chemists provide Vortex sessions to their project teams
Command Line Tools @ Genentech
Autocorrelator package: A. Gobbi, M. Lardy, M. Lee, http://code.google.com/p/autocorrelator/
Aestel package: M. Lee, A. Gobbi, http://sourceforge.net/projects/aestel/
Genentech package: A. Gobbi, J. Feng, M. Lee, E. Kochetkova, B. Sellers, K. Clark
Data Manipulation: sdfTagTool.csh, sdf2Tab.csh, sdf2Xls.csh, sdfAggregator.csh, sdfBinning.csh, sdfDataPivoter.csh, sdfFromImage.py, sdfSdfMerger.csh, sdfSplicer.csh, sdfSliceByRe.pl, sdfSorter.csh, sdfTabMerger.csh, sdfMolSeparator.csh, sdfGrapheme.csh, tab2sdf.csh, tabTabMerger.pl
Chemical Structure: sdfMDLSSSMatcher.csh, sdfEnumerator.csh, sdfTransformer.csh, sdfRGroupExtractor.pl, sdfRingSystemExtraction.csh, sdfSmarsGrep.csh, sdfStructureTagger.csh, sdfSubRMSD.csh, sdfNormalizer.csh, sdfAlign.csh / sdf2DAlign.csh
Properties: sdf2Omega.py, sdfFastRocs.pl, sdfCalcProps.py, sdfRGroupCalcProps.pl, sdfCNS_MPO.grvy, sdfGiniCalculator.csh, sdfSelectivityCalculator.csh, sdfLE.grvy, sdfTorsionScanner.csh, sdfConformerSampler.csh
Diversity: sdfCFP.csh, sdfFingerprinter.csh, sdfCluster.pl, sdfFPCluster.pl, sdfFPNNFinder.csh, sdfMCSSNNFinder.csh, sdfFPSphereExclusion.csh, sdfMCSSSphereExclusion.csh
Database: AEREAExporter.csh, dataLoader.csh, sdfExport.pl, sdfSdfExporter.csh, tabExport.pl
Modeling: sdfRRandomForestCreator, sdfRSVMCreator, sdfRModelPredictor
Other: sdfGroovy.csh, sdfMultiplexer.pl, sdfIMatch.csh, sdfMosaicFromSMDI.pl
AEREA UI for Configuring Exports from Database
Specify compounds of interest
+ Display data of interest
Components Commercial Compounds
AEREA UI for Configuring Exports from Database
Table report with compounds and data of interest
[Figure: table report screenshot with callout labels A to K]
Hewitt et al., JCIM, 2005
AEREA Command Line for Automating Data Retrieval
AEREAExporter.csh -uName manle \
    -treeName smdiRes \
    -hitlistLevel "Base Compound" \
    -qName "demo_searchQuery" \
    -rName "demo_reportTemplate" \
    -out .sdf
• AEREA enables users to retrieve data from the database
• No need to know about SQL and the data model
• The data export function can be called from the command line
A Unix Pipe Example
( AEREAExporter.csh -out .sdf -hitlistLevel "Base Compound" -uName manle \
      -qName "project cmpds" -rName "report" -treeName smdiRes \
; AEREAExporter.csh -out .sdf -hitlistLevel "Base Compound" -uName albertgo \
      -qName "tested in assay" -rName "report" -treeName smdiRes \
) | sdfTagTool.csh -in .sdf -out .sdf -rmRepeatTag 'G-Number=1' \
  | sdfTagTool.csh -in .sdf -out .sdf -rename rename_list.txt \
  | sdfGiniCalculator.csh -in .sdf -out .sdf -idField "G-Number" -conc 1 \
  | sdfStructureTagger.csh -in .sdf -out .sdf -smarts projectCore_SMARTS.tab \
      -sets projectCores -tag_info "firstTag" \
  | sdfSelectivityCalculator.csh -in .sdf -out .sdf \
      -denominator "Target-1 Ki" -numerator "Target-2 Ki" \
      -outputMode separate -selectivity "Target-1/Target-2" \
  | sdfGroovy.csh -in .sdf -out .sdf -f calcLigandEfficiency.grvy \
  | sdfTabMerger.csh -sdf .sdf -out .sdf \
      -tab $userDump/PK_data.tab -mergeMode multiRecordKeepTemplate \
      -mergeTag "G-Number" -mergeCol "G-Number" -quiet \
  | sdfTagTool.csh -in .sdf -out .sdf -reorder reorder_list.txt \
  > projectVortex_SAR.sdf
“Mis”-using KNIME to debug UNIX pipes
KNIME’s Dynamic Node Generation Framework …
… and the specification of Command Line Tools …
XML file with definition of the command line programs
• Generate one node per <command> element
• Deduce ports from <ports> element
• Nodes are initialized during startup of KNIME Desktop
<commands>
  <config>
    <exchangeDir local='\\resfiles….' remote='/gnet/…'/>
    <ssh remoteHost='rescomp2' timeout='1000' initFileTemplate='~cdduser/bin/knimerc.$mode'/>
  </config>
  <command name='AEREAExporter.csh'>
    <IO out="-out .sdf"/>
    <default>-hitlistLevel 'Base Compound' -uName XXXX -rName 'XVortex Project Subst' ….</default>
    <ports out='sdf'/>
  </command>
  <command name='sdfGrep.pl'>
    <IO in="" out=""/>
    <default>-i GNum</default>
    <ports in='sdf' out='sdf'/>
  </command>
  …
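To make the "one node per <command> element" idea concrete, here is a minimal Python sketch that reads a cut-down copy of such a specification and derives a node description from each <command>. It is not the KNIME framework code (which the slides do not show). The `SPEC` text omits the elided values, and the function name `load_node_specs`, the dictionary keys, and the rule that maps ports to a node category are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down specification modeled on the slide; the <config>
# element and all elided values are left out.
SPEC = """
<commands>
  <command name='AEREAExporter.csh'>
    <IO out="-out .sdf"/>
    <default>-hitlistLevel 'Base Compound'</default>
    <ports out='sdf'/>
  </command>
  <command name='sdfGrep.pl'>
    <IO in="" out=""/>
    <default>-i GNum</default>
    <ports in='sdf' out='sdf'/>
  </command>
</commands>
"""


def load_node_specs(xml_text):
    """Return one dict per <command> element, with ports deduced from <ports>."""
    specs = []
    for cmd in ET.fromstring(xml_text).findall("command"):
        ports = cmd.find("ports")
        has_in = ports is not None and "in" in ports.attrib
        has_out = ports is not None and "out" in ports.attrib
        # Assumed mapping from ports to a node category (illustration only).
        if has_in and has_out:
            category = "Processor"
        elif has_out:
            category = "Reader"
        elif has_in:
            category = "Writer"
        else:
            category = "Utility"
        specs.append({
            "name": cmd.get("name"),
            "default_args": (cmd.findtext("default") or "").strip(),
            "has_input": has_in,
            "has_output": has_out,
            "category": category,
        })
    return specs


node_specs = load_node_specs(SPEC)
for spec in node_specs:
    print(spec["name"], spec["category"], spec["default_args"])
```

In this sketch a command with only an output port is treated as a data source and a command with both ports as a filter in the pipe; the real framework may classify nodes differently.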
… for generating Command Line nodes
Display the “compiled” Unix command up to the current node
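One way to picture the "compiled" command is that every node receives the command text of its upstream node and appends its own command with a pipe. The Python sketch below illustrates that idea for a linear chain. The helper names `append_node` and `compile_up_to` are hypothetical, and the command strings are copied from the pipe example earlier in the talk.

```python
def append_node(upstream_text, node_command):
    """Return the command text that a node would pass downstream."""
    if not upstream_text:
        return node_command
    # Continue the upstream command on a new line, joined by a pipe.
    return upstream_text + " \\\n  | " + node_command


def compile_up_to(node_commands, index):
    """Compile the Unix pipe for nodes 0..index (inclusive) of a linear chain."""
    text = ""
    for command in node_commands[: index + 1]:
        text = append_node(text, command)
    return text


workflow = [
    'AEREAExporter.csh -out .sdf -hitlistLevel "Base Compound" -uName manle '
    '-qName "project cmpds" -rName "report" -treeName smdiRes',
    "sdfTagTool.csh -in .sdf -out .sdf -rename rename_list.txt",
    "sdfGroovy.csh -in .sdf -out .sdf -f calcLigandEfficiency.grvy",
]

# Show the pipe as seen from the second node; the third node is not included.
compiled = compile_up_to(workflow, 1)
print(compiled)
```

Because each node only needs the upstream text, a pipe can be inspected or run at any intermediate node, which is what makes this arrangement useful for debugging.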
Three Example Applications
Project Vortex Sessions
✓ Task-oriented command line tools enable comp chemists to take ownership of project databases
✓ KNIME/command line tool integration to facilitate the maintenance of the Unix pipes
DMPK Model Validation
✓ Easy-to-understand predicted probabilities for prioritizing compounds in a “local” context
✓ KNIME/command line tool integration to enable comp chemists to build high-performance end-user applications
Fragment Hit Clustering
✓ Easy-to-visualize AAPSim/SE clustering supports project teams in hit follow-up processes
✓ Use command line tools to build new command line tools
Assessing the risks, increasing the success rate
[Figure: regression output converted to a probability; labels: predicted value, predicted probability, classification threshold]
Cumulative distribution: given the mean and SD, we can describe a normal distribution function
From regression to classification: create a regression model, then compute the probability of the predicted value being below/above the set threshold
Aliagas et al., JCAMD, 2015
When prioritizing compounds, chemists are interested in picking the “winners”
Predicted probability is easy to use and widely applicable
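The regression-to-classification step can be illustrated with the normal cumulative distribution function. The Python sketch below assumes that the regression model supplies a predicted value and a standard deviation for each compound. How that SD is estimated is not stated on the slide, and the function name `prob_above` is hypothetical.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def prob_above(predicted, sd, threshold):
    """Probability that the true value exceeds the threshold, assuming it is
    normally distributed around the predicted value with the given SD."""
    return 1.0 - normal_cdf(threshold, predicted, sd)


# A prediction exactly on the threshold is a coin flip.
p_on_threshold = prob_above(predicted=95.0, sd=2.0, threshold=95.0)
# A prediction two SDs above the threshold is very likely above it.
p_two_sd = prob_above(predicted=97.0, sd=1.0, threshold=95.0)
print(round(p_on_threshold, 3), round(p_two_sd, 3))
```

The probability of being below the threshold is simply `1 - prob_above(...)`, so the same calculation serves both directions of the classification.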
“Adjusting” Global Models For Local Use
[Figure label: cPPB_prob_bin]
cPPB_prob is the predicted probability of the compound being > 95% bound to plasma protein
Compound prioritization
✓ Chemistry teams decide which probability cut-off to use
✓ The cut-off could be different for different series of compounds
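As an illustration of series-specific cut-offs, the Python sketch below flags compounds whose cPPB_prob reaches the cut-off chosen for their series. The compound IDs, series names, probabilities, and cut-off values are all made up, and whether a team treats a flagged compound as desirable or undesirable is its own decision.

```python
# Made-up example data: each compound carries its series and predicted probability.
compounds = [
    {"id": "cmpd-1", "series": "A", "cPPB_prob": 0.82},
    {"id": "cmpd-2", "series": "A", "cPPB_prob": 0.55},
    {"id": "cmpd-3", "series": "B", "cPPB_prob": 0.55},
]

# Hypothetical cut-offs chosen by the chemistry team, one per series.
cutoffs = {"A": 0.7, "B": 0.5}


def flag_by_series_cutoff(compounds, cutoffs):
    """Return IDs of compounds whose probability reaches their series cut-off."""
    return [
        c["id"]
        for c in compounds
        if c["cPPB_prob"] >= cutoffs[c["series"]]
    ]


flagged = flag_by_series_cutoff(compounds, cutoffs)
print(flagged)
```

Note that cmpd-2 and cmpd-3 have the same predicted probability but are treated differently because their series use different cut-offs.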
PPB Model Validation Application
Collaboration with I. Aliagas, A. Gobbi
Users can evaluate the quality of the models or determine which probability cut-off to use
[Screenshot: selection lists with “none” and Project 1, 2, 3, 4, 6, 7, 8; Cmpd ID fields; output folder DMPK_Validation/123445678_345678/]
PPB Model Validation - KNIME Workflow
[Figure: KNIME workflow with sections labeled GUI, GUI, Compute]
Node Template: GenerateValidationData
Performance boost: running the computation in parallel on a Linux cluster
[Figure labels: Predict, Calculate Properties]
Access DMPK Model Validation From Vortex
Automatically generate the validation files on a daily basis for
• download from the Vortex menu
• sharing with partners if requested
Collaboration with I. Aliagas, F. Broccatelli
Clustering fragment hits and visualizing results
The results from Atom-Atom-Path Similarity / Sphere Exclusion clustering can be effectively visualized in scatterplots
Gobbi et al., J. Cheminformatics, 2015
Watch out for structure relationships!
[Figures on three slides; Gobbi et al., J. Cheminformatics, 2015]
Atom-Atom-Path Similarity: An MCSS-based similarity
[Figure: two molecules with numbered atoms, 1 to 25 and 1 to 26]
• Compare the atom paths of all pairs of atoms
• Align atom pairs according to the atomic similarities
• Calculate the molecular similarity from the atomic similarities of the aligned atoms
Gobbi et al., J. Cheminformatics, 2015
Sphere Exclusion-based Clustering
Hudson et al., QSAR, 1996
Lee et al., JCISCS, 2003
• Sort the compounds by ligand efficiency (or any other properties of interest)
• Select the diverse subset (the cluster seeds)
• Assign excluded molecules to the most similar cluster seed
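The three steps above can be sketched in a few lines of Python. This is a generic sphere exclusion sketch, not the Genentech implementation. It uses a Tanimoto coefficient on made-up feature sets as a stand-in for Atom-Atom-Path Similarity, and the compound names, ligand efficiency values, and the similarity threshold of 0.4 are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets (stand-in similarity measure)."""
    return len(a & b) / len(a | b)


def sphere_exclusion(names, score, similarity, threshold):
    """Cluster names by sphere exclusion.

    1. Sort by score, best first.
    2. Pick seeds: a compound becomes a seed if it is less similar than
       `threshold` to every seed selected so far.
    3. Assign each excluded compound to its most similar seed.
    """
    ranked = sorted(names, key=score, reverse=True)
    seeds = []
    for name in ranked:
        if all(similarity(name, seed) < threshold for seed in seeds):
            seeds.append(name)
    clusters = {seed: [seed] for seed in seeds}
    for name in ranked:
        if name not in clusters:
            best_seed = max(seeds, key=lambda seed: similarity(name, seed))
            clusters[best_seed].append(name)
    return clusters


# Made-up fragment hits: feature sets and ligand efficiencies.
features = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},
    "c": {7, 8, 9},
    "d": {7, 8, 10},
    "e": {1, 2, 3, 4, 5},
}
ligand_efficiency = {"a": 0.45, "b": 0.30, "c": 0.40, "d": 0.35, "e": 0.25}

clusters = sphere_exclusion(
    names=list(features),
    score=lambda n: ligand_efficiency[n],
    similarity=lambda x, y: tanimoto(features[x], features[y]),
    threshold=0.4,
)
print(clusters)
```

Sorting by ligand efficiency first means that the most efficient member of each neighborhood becomes its cluster seed, which is how the property bias enters the clustering.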
Sphere Exclusion-based Clustering – Behind the Scenes
Three Example Applications
Project Vortex Sessions
✓ Task-oriented command line tools enable comp chemists to take ownership of project databases
✓ KNIME/command line tool integration to facilitate the maintenance of the Unix pipes
DMPK Model Validation
✓ Easy-to-understand predicted probabilities for prioritizing compounds in a “local” context
✓ KNIME/command line tool integration to enable comp chemists to build high-performance end-user applications
Fragment Hit Clustering
✓ Properties-biased AAPSim/SE clustering for supporting project teams in hit follow-up processes
✓ Use command line tools to build new command line tools
Program to create UNIX pipes, a pipe dream?
sdfTagTool.csh -copy TITLE=___sdfCalcProps_saved_title___ -addCounter -counterTag ___sdfCalcProps_counter___ \
    -title ___sdfCalcProps_counter___ -in .smi -out .sdf \
  | tee /tmp/sdfCalcProps.$$.1437358230757.orig.sdf \
  | sdfFilter.csh -in .sdf -out .sdf \
  | filter -in .sdf -out .sdf -dots false -filter /dev/null -pkanorm false -prefix /tmp/filter.$USER.$$ \
  | sdfTagTool.csh -in .sdf -out .sdf -keep ___sdfCalcProps_counter___ > /tmp/sdfCalcProps.$$.1437358230757.filtered.sdf;
echo NCCO \
  | babel -in .smi -out .sdf >> /tmp/sdfCalcProps.$$.1437358230757.filtered.sdf;
cat /tmp/sdfCalcProps.$$.1437358230757.filtered.sdf \
  | blabber.py \
  | OEProps.csh -in .sdf -out .sdf H_polar Charge \
  | sdf2Tab.csh -in .sdf -tags "___sdfCalcProps_counter___|Charge|H_polar" \
  | sdfTabMerger.csh -tab - -sdf /tmp/sdfCalcProps.$$.1437358230757.filtered.sdf \
      -mergeTag ___sdfCalcProps_counter___ -mergeCol ___sdfCalcProps_counter___ -out .sdf \
  | cMR.py \
  | MoKa1.py --cLogD_74 --cLogD_all --cLogP \
  | sdfALogP.csh -in .sdf -out .sdf -outputCounts \
  | sdfNormalizer.csh -in .sdf -out .sdf -shortMessage \
  | sdfGroovy.csh -in .sdf -out .sdf \
      -c '$>c_pKa_MB_model = $c_pKa_MB.length()==0 ? 0 : $c_pKa_MB;$>c_pKa_MA_model = $c_pKa_MA.length()==0 ? 14 : $c_pKa_MA;' \
  | sdfTopologicalIndexer.csh -in .sdf -out .sdf Zagreb Wiener JStar JYStar JXStar JY JX \
  | sdfCFP.csh -type functional -in .sdf -out .sdf -format counts -nbits 256 -level 2 \
  | OEProps.csh -in .sdf -out .sdf Solubility_Index CNS_MPO HeteroAliphaticRings CarboAliphaticRings AliphaticRings HeteroAromaticRings \
      CarboAromaticRings AromaticFraction RO5 RotBonds Rings Heavy_Atoms N+O AromaticRings TPSA MW NH+OH \
  | sdfSdfExport.csh -in .sdf -out .sdf -queryTags CTISMILES -sqlFile $AESTEL_DIR/config/exporter/models/models.xml -sqlName mLogD \
  | sdfGroovy.csh -in .sdf -out .sdf -c '$>LogD74_TypeInModel="C"; $>LogD74_in_Model=tVal($mol,"cLogD7.4");' \
  | sdfRModelPredictor.pl -in .sdf -out .sdf \
      -modelLocation ModelData/DMPK/PPB/201412/models/PPB_H_SVM_201412 -modelName PPB_H_SVM_201412 \
  | sdf2Tab.csh -in .sdf -tags "___sdfCalcProps_counter___|cPPB_H|cPPB_H_err|cPPB_H_fu|cPPB_H_pct|cPPB_H_prob|cPPB_H_version" \
  | sdfTabMerger.csh -sdf /tmp/sdfCalcProps.$$.1437358230757.orig.sdf -tab - \
      -mergeTag ___sdfCalcProps_counter___ -mergeCol ___sdfCalcProps_counter___ -out .sdf \
  | sdfTagTool.csh -in .sdf -title ___sdfCalcProps_saved_title___ -out .sdf \
  | sdfTagTool.csh -in .sdf -remove "___sdfCalcProps_counter___|___sdfCalcProps_saved_title___" -out .sdf
echo CC | sdfCalcProps.csh -in .smi -out .sdf cPPB_H
generates the two UNIX commands above
Calculator dependencies: cPPB Example
[Figure: dependency graph for cPPB_H with nodes pChem, counts, ALogPCounts, c_pKa, Solubility_Index, JX …, FFP2_256, cLogD7.4, AromaticRings]
Dependencies are encoded in XML files
Start with the execution of independent calculators
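Starting with the independent calculators and then running whatever has become unblocked is a topological ordering of the dependency graph. The Python sketch below groups calculators into execution batches. The calculator names come from the slide, but the edges in `deps` are hypothetical because the slide text does not list them, and the function name `execution_batches` is an assumption.

```python
def execution_batches(deps):
    """Group calculators into batches that can run one after another.

    `deps` maps a calculator name to the set of calculators it needs.
    Batch 1 holds the independent calculators; each later batch holds the
    calculators whose needs are covered by the earlier batches.
    """
    remaining = {name: set(needs) for name, needs in deps.items()}
    # Calculators that only appear as a dependency have no needs of their own.
    for needs in deps.values():
        for name in needs:
            remaining.setdefault(name, set())

    batches = []
    while remaining:
        ready = sorted(name for name, needs in remaining.items() if not needs)
        if not ready:
            raise ValueError("cyclic dependency between calculators")
        batches.append(ready)
        for name in ready:
            del remaining[name]
        for needs in remaining.values():
            needs.difference_update(ready)
    return batches


# Hypothetical edges, for illustration only.
deps = {
    "cPPB_H": {"cLogD7.4", "c_pKa", "ALogPCounts", "FFP2_256"},
    "cLogD7.4": {"c_pKa"},
}

batches = execution_batches(deps)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Flattening the batches in order yields one valid sequence for a single Unix pipe, while calculators inside the same batch do not depend on each other.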
The growing Calculator Warehouse
AliphaticRings AromaticFraction AromaticRings
CarboAliphaticRings CarboAromaticFraction CarboAromaticRings
Charge cMR
H_polar Heavy_Atoms
HeteroAliphaticRings HeteroAromaticRings
MW N+O
NH+OH RotBonds
AFP1_128 AFP1_256
AFP2 AFP2_128 AFP2_256 AFP3_128 AFP3_256 FFP1_128 FFP1_256 FFP2_128 FFP2_256 FFP3_128 FFP3_256
MACCSKeys
TopoIndexes EStateCount EStateSum
cBrainPerm_score cCYP_Inh
cHep cLM
cMDCK-MDRI_AB cMDCK-MDRI_efflux
cMDCK_AB cMDCK_efflux
cPPB cTDI
pChem cKinSol
Solubility_Index ALogP
ALogPCounts cLogP c_pKa
cgLogD cLogD7.4 cLogD_all mLogD7.4
CNS_MPO RO5
HTSCysReactivity HTSExclusion HTSNeverSee
cIC50atLE0.3
Acknowledgements
Computer Drug Design Group: Ignacio Aliagas, Jeff Blaney, Fabio Broccatelli, Huifen Chen, Kevin Clark, JW Feng, Alberto Gobbi, Chinchih Lu, Ben Sellers, members of the CDD groups
Roche Pharma IT: Simran Hansrai, Elena Kochetkova, Slaton Lipscomb, Barry Pon, Hubert Pun
KNIME.com: Michael Berthold, Thomas Gabriel, Bernd Wiswedel
Project Vortex Sessions – Behind the Scenes
[Diagram labels: Cron job (internal); Scripts for Vortex file generation; Files for Vortex; filesToCopy; unknownFiles (internal); File shares; Cron job (file share); Copy Instruction: pulls files to the specified file shares; Internal use; External use]
S. Hansrai, H. Pun, B. Pon
… equals Genentech Command Line Nodes
Collaboration with A. Gobbi, T. Gabriel, B. Wiswedel
“Command Line” node categories:
• Reader
• Processor
• Writer
• Utility
“Command Line” ports handle:
• Data in SD file format
• Unix command text
Straightforward Command Line Input …
Copy and paste the text from the Node Description window
Accessing the PPB Model Validation Application
KNIME WebPortal
Application Dashboard