Big Data in Chemistry - Aboutbigchem.eu/sites/default/files/School1_Tetko.pdf · Data Types...
Transcript of Big Data in Chemistry - Aboutbigchem.eu/sites/default/files/School1_Tetko.pdf · Data Types...
Big Data in Chemistry
Institute of Structural Biology (STB), Helmholtz Zentrum München (HMGU)
Dr.IgorV.Tetko
Neuherberg,18October2016
Outline
• Sources• ExampleofBigData• Dataqualityandcomplexity• Annota?onoflargevirtualsets• Deeplearning• Securedatasharing• Outlook
Big Data Sources
DowereallyhaveBigDatainchemistry?Whatkindoflargedatadowehave?
Big Data definition
Bigdataisatermfordatasetsthataresolargeorcomplexthattradi?onaldataprocessingapplica?ons
areinadequate(Wikipedia)
Large Chemical Database
100,000
1,000,000
10,000,000
100,000,000
1,000,000,000
Compounds
Exp.Facts
?
Data Types
Database Main data types
ChEMBL v. 211 Data mined from literature and PubChemHTS assays
BindingDB2 Experimental protein-small molecule interaction data
PubChem3 Bioactivity data from HTS assays
Reaxys4 Literature mined property, activity and reaction data
SciFinder (CAS)5 Experimental properties, 13C and 1H NMR spectra, reaction data
GOSTAR6 Target-linked data from patents and articles
AZ IBIS7 AZ in-house SAR data points
OCHEM8 Mainly ADMET data collected from literature
1)PapadatosG,etal.JComputAidedMolDes2015;29(9)885-96.2)GilsonMK,etal.NucleicAcidsRes2016;44(D1):D1045-53.3)KimS,etal.NucleicAcidsRes2016;44(D1):D1202-13.4)h^p://www.elsevier.com/solu?ons/reaxys5)h^p://www.cas.org/products/scifinder6)h^p://www.gostardb.com7)MuresanSetal.DrugDiscovToday2011;16(23-24):1019-30.8)SushkoI,etal..JComputAidedMolDes2011;25(6):533-54.
Big Data sizesBigdataisatermfordatasetsthataresolargeorcomplexthattradi?onaldataprocessingapplica?onsareinadequate(Wikipedia)
CCBY-SA3.0,h^ps://commons.wikimedia.org/w/index.php?curid=29452425
1exabyte:1018
Large Chemical Database
100,0001,000,000
10,000,000100,000,000
1,000,000,00010,000,000,000100,000,000,000
1,000,000,000,00010,000,000,000,000
100,000,000,000,0001,000,000,000,000,000
10,000,000,000,000,000100,000,000,000,000,000
1,000,000,000,000,000,000
Compounds
Exp.Facts
Big Data are relative to a field
• Methodstoanalyzesuchdatadonotexist• Wemaynotsufficienttechnicalresources(speed,memory)to
usetheexis?ngmethods• Wemaynothaveknowledgetousetheexis?ngmethodsThustheBigDatacanappeardueto:Physicalchallenges(hardware)Knowledgechallenges(informa?cs,sogware)
Example of Big Data
Whichdataarereallybigones?
What data sizes are “big” ones?
“Generalmel?ngpointpredic?onbasedonadiversecompounddatasetandar?ficialneuralnetworks”Karthikeyanetal.J.Chem.Inf.Model.2005,45(3),681-90.N=4173
à Largedataset~50kà Bigdataset~250k
Melting Point Datasets
• Bergström277• Bradley2886• OCHEM22404• Enamine21883
data
Bergström
Bradley
OCHEM
Enamine
TetkoetalJ.Chem.Inf.Model.2014,22;54(12):3320-9.
275k Melting Point Datasets
• Bergström277• Bradley2886• OCHEM22404• Enamine21883• PATENTS228079
data
Bergström Bradley OCHEM Enamine Patents
TetkoetalJ.Chemoinforma2cs,2016,8,2.
COMBINED:OCHEM+Enamine+Bradley+Bergström
Extraction of MP information from patents
• [0835] To a solution of 2-amino-4,6-dimethoxybenzamide (0.195 g, 0.99 mmol) and 5-(2-(tert-butyldimethylsilyloxy)ethoxy)-6-phenylpicolinaldehyde (0.355 g, 0.99 mmol) in N,N-dimethyl acetamide (10 ml), was added NaHSO3 (0.264 g, 1.49 mmol) and p-toluenesulfonic acid monohydrate (0.038 g, 0.198 mmol). The reaction mixture was heated at 120° C. for 16 h. After that time the reaction was cooled to rt and the solvent was removed under reduced pressure. The reaction mixture was then diluted with water (150 mL) and neutralized with NaHCO3. The precipitated solids were collected by filtration, washed with water and dried to give 2-(5-(2-(tert-butyldimethylsilyloxy)ethoxy)-6-phenylpyridin-2-yl)-5,7-dimethoxyquinazolin-4(3H)-one (0.500 g, 94%) as an off-white solid: 1H NMR (400 MHz, DMSO-d6) δ 11.08 (s, 1H), 8.35 (d, J=8.98 Hz, 1H), 8.21 (d, J=2.34 Hz, 2H), 7.82 (d, J=8.59 Hz, 1H), 7.44-7.52 (m, 3H), 6.81 (d, J=2.34 Hz, 1H), 6.58 (d, J=2.34 Hz, 1H), 4.24-4.32 (m, 2H), 3.94-4.00 (m, 2H), 3.92 (s, 3H), 3.86 (s, 3H), 0.85 (s, 9H), 0.08 (s, 6H); ESI MS m/z 534 [M+H]+.
• http://www.google.com/patents/US20140140956
Extracting of melting points from patents
Workflow
NextMoveLtd,UK
Extraction of MP information from patents
Modeling of MP data
Package name
Type of descriptors
Number of descriptors
Matrix size, billions
Non zero values, millions
Sparseness
Functional Groups integer 595 0.18 3.1 33
QNPR integer 1502 0.45 6.3 49
MolPrint binary 688634 205 8.1 7200
Estate count float 631 0.19 10 14
Inductive float 54 0.02 11 1
ECFP4 binary 1024 0.31 12 25
Isida integer 5886 1.75 18 37
ChemAxon float 498 0.15 23 1.5
GSFrag integer 1138 0.34 24 5.7
CDK float 239 0.07 27 2
Adriana float 200 0.06 32 1.3
Mera, Mersy float 571 0.17 61 1.1
Dragon float 1647 0.49 183 1.5
Large à Big
• NeuralNetworkswastooslow(ensembletraining!)àSVMwasused
• Supportofparallelcalcula?ons(48core)• Supportofgridanalysis(>1000CPUs)
• Storageoffulldatamatrix->sparsedatamatrix
Prediction errors for Bergström drug like compounds using models developed with different training sets
0
10
20
30
40
50
60
277(Bergström) 4k(Karthikeyan) 50k(Literature) 275k(Patents)
°C
trainingsetsize
Prediction of Huuskonen set using ALOGPS logP and MP based on 50k measurements
logS=0.5–0.01(MP-25)–logKow
Prediction of Huuskonen set using ALOGPS logP and MP based on 230k measurements
logS=0.5–0.01(MP-25)–logKow
Big Data Quality and Complexity
Whyisitveryimportant?Howdomainspecificanalysiscouldhelp?
Suscep?bilityofCPM-basedHTStoscreeningcompound-basedinterference.(A)Assayschema?cfortheCPM-basedHTSusedinthisstudy.TheassaymeasurestheHATac?vityoftheR^109–Vps75complex,whichcatalyzesthetransferofanacetylmoietyfromacetyl-CoAtospecificlysineresiduesontheAsf1–dH3–H4substratecomplextoproduceacetylatedhistoneresiduesandcoenzymeA(CoA).Addi?onofthethiol-scavengingprobeCPMleadstoahighlyfluorescentadductbyreac?ngwiththeCoAbyproduct,whichisusedtoquan?fyHATac?vityviafluorescenceintensitymeasurement.(B)Representa?veassayinterferencechemotypesiden?fiedduringpost-HTStriage.
DahlinetalJ.Med.Chem.2015,58,2091-2113.
99.8%FHs!
Promiscuous compounds filters
Promiscuous compounds filters
Pan Assay INterference compoundS (PAINS) Filters
• colorquenching• singletoxygenquenching• auto-fluorescence• covalentbinding• inherently“s?cking”
compounds• disrupttheinterac?on
betweenthetagoftheproteinandbindingsiteofthedetec?onsystem
BaellandHolloway,J.Med.Chem.,2010,53:2719-40.
~500filtersbasedonN=93212compounds
Structural & Toxic Alerts at http://ochem.eu • Screening of compounds against published toxicity alerts,
groups, frequent hitters • Filter alerts by endpoints or publications • Create or upload custom SMARTS rules
Sushkoetal,J.Chem.Inf.Model,2012,52(8):2310-6.
>500func?onalgroups>2.3kalertsintotal
Identification of AlphaScreen-HIS Frequent Hitters
For Peer Review
210x297mm (300 x 300 DPI)
Page 31 of 43
http://mc.manuscriptcentral.com/jbsc
Journal of Biomolecular Screening
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
TrueHitsTM:StreptavidinDonorbeadBio?nylatedAcceptorbeads
SchorppetalJ.Biomol.Screen.2014,9,715-726.
Mode Of Action of AlphaScreen-HIS Frequent Hitters
SchorppetalJ.Biomol.Screen.2014,9,715-726.
Bio Assays Ontology relationships
Abeyruwan,U.etal“EvolvingBioAssayOntology(BAO):Modulariza?on,Integra?onandApplica?ons,”JournalofBiomedicalSeman?cs,vol.5,no.1:S5,2014.
Annotation of large chemical spaces
BigData,whichhavebeenalwaysinchemistry.
Virtual chemical spaces
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
1,000,000,000
10,000,000,000
100,000,000,000
1,000,000,000,000
Synthesizable~1024andtotalspaceis~1060
Virtual chemical spaces
1.00E+031.00E+041.00E+051.00E+061.00E+071.00E+081.00E+091.00E+101.00E+111.00E+121.00E+131.00E+141.00E+151.00E+161.00E+171.00E+18
GDB*N–allpossiblechemicalswith≤Natoms
Virtual chemical spaces
1.00E+03
1.00E+05
1.00E+07
1.00E+09
1.00E+11
1.00E+13
1.00E+15
1.00E+17
1.00E+19
1.00E+21
1.00E+23
Synthesizable~1024
Virtual chemical spaces
1.00E+031.00E+061.00E+091.00E+121.00E+151.00E+181.00E+211.00E+241.00E+271.00E+301.00E+331.00E+361.00E+391.00E+421.00E+451.00E+481.00E+511.00E+541.00E+571.00E+60
Synthesizable~1024andtotal“drug–like”spaceis~1060
Annotation of compounds
• ALOGPS2.1*(predic?onoflogPandwatersolubilityofchemicalcompounds)
• ~100,000moleculesperminute
• Annota?onofGDB-17willtake~3yearsofcalcula?onsusingonecore
• ~10minutesonLeibnizSupercompu?ngCentrewith241,000cores
*Tetko,I.V.J.Chem.Inf.Comput.Sci.2001,41,1407-1421.
We can’t predict unpredictable!
N
O O
newmeasurement
N O
O OHN O
O
newseriestopredict
New machine learning approaches
WhichmethodscanhelpuswithBigData?
CourtesyofProf.J.Bajorath
Multi-task learning
0
0.1
0.2
0.3
0.4
0.5
0.6
Mea
n A
bsol
ute
Erro
rone several
Human data
FatBrainLiverKidneyMuscle
Problem:• predic?onof?ssue-airpar??oncoefficients• smalldatasets30-100molecules(human&ratdata)Results:simultaneouspredic?onofseveralproper?esincreasedtheaccuracyofmodels
Varnek,A.etalJ.Chem.Inf.Model.2009,49,133-44.
Renaissance of neural networks
Deeplearning– Massiveneuralnetworkswiththousandsofneuronsandlayers– Newlearningmethods(dropouttechnique)
Examplesoftheuseofdeeplearningtechnology:– Recogni?onofChinesecharacterswithhumanaccuracy– VictoryinGo-tournament– Diagnos?csofbreastcancer
Baskin,I.I.;Winkler,D.;Tetko,I.V.Arenaissanceofneuralnetworksindrugdiscovery.Expertopinionondrugdiscovery2016,11(8):785-95.
h^p://adsabs.harvard.edu/abs/2015arXiv150202072R
259datasets• 128PubChem• 17MUV• 102DUD-E• 12Tox21
Total~40Mdatapointsfor1.6Mcompounds
Descriptors:ECFP4RDKit
Multitask Networks Learning Results
• Massivelymul?tasknetworksobtainpredic?veaccuraciessignificantlybe^erthansingle-taskmethods.
• Thepredic?vepowerofmul?tasknetworksimprovesasaddi?onaltasksanddataareadded.
• Thetotalamountofdataandthetotalnumberoftasksbothcontributesignificantlytomul?taskimprovement.
• Mul?tasknetworksaffordlimitedtransferabilitytotasksnotinthetrainingset.
h^p://adsabs.harvard.edu/abs/2015arXiv150202072R
Multitask benefit from increasing tasks and data independently.
h^p://adsabs.harvard.edu/abs/2015arXiv150202072R
Secure Information Sharing
Howcanweshareinforma?onbutnotdata?Howcanweenablecoopera?onbetweenindustries?
Secure Sharing of information
• CINF/COMPworkshopwasorganizedduringACSin2005byProf.Oprea• Variousstructurerepresenta?on(descriptors)wereproposed• Severalmethodsforsecuresharingwereintroduced
• Butinthetheore?callimit*– SMILESrepresenta?onofmolecules:CCC,CNCCC,c1ccccc1– Zippingofstructuresrequires<1bitperatom– Structurewith32atomsrequires<32bits– Anydescriptorortheircombina?onwith>32bitscouldbeusedtodecodea
molecule(intheory)
*Tetko,I.V.;Abagyan,R.;Oprea,T.I.J.Comput.Aided.Mol.Des.2005,19,749-764.
Currently used technologies
“Honestbroker”– Receivesdescriptors(orstructures)– Developmodelsanddonotrevealtheunderlyingdata
Sharingrela?onshipsbetweenstructures– MatchedMolecularPairs(changesinpropertyduetochangeofgroups)
Sharingdevelopedmodels– Structuralalerts– Computa?onalpredic?onmodels
Sharingreliablepredic?ons(surrogatedata)*
*Tetko,I.V.;Abagyan,R.;Oprea,T.I.J.Comput.Aided.Mol.Des.2005,19,749-764.
Multi-party secure computation
Secure summation
Conclusions
ExpectaOonsü Improvedpredic?onofproper?es,and
ac?vi?esü Improvedpoly-pharmacologyü Searchofnewchemistry(topdown
explora?onanddenovodesign)ü Predic?onofinvivoenpoints
Challengesü Newmachinelearningapproaches(deep
learning)ü Integra?onofdiversedataandapriory
knowledge(ontology,pathways,invitro,invivo,simula?onresults,differenterrors,etc.)
ü Applicabilitydomainü Securedatasharingü Datavisualiza?onü Denovodesign
Further reading
• Tetko,I.V.;Engkvist,O.;Koch,U.;Reymond,J.L.;Chen,H.,BIGCHEM:ChallengesandOpportuni?esforBigDataAnalysisinChemistry.MolInform2016,35(11-12):615-621(OpenAccess).
• Tetko,I.V.;Engkvist,O.;Chen,H.Does'BigData'existinmedicinalchemistry,andifso,howcanitbeharnessed?FutureMedChem.20168(15):1801-1806(OpenAccess).
Acknowledgements
Dr.Y.SushkoDr.S.NovotarskyiMr.R.KörnerMrs.E.SalminaDr.K.Hadian(TOX,HMGU)Dr.A.Williams(USA)Dr.D.Lowe(UK)
Dr.O.Engkvist(AZ)Dr.H.Chen(AZ)Dr.U.Koch(LDC)Prof.J.-L.Reymond(UniBern)Dr.I.Baskin(MSU)Dr.D.Winkler(CSIRO)AndBIGCHEMpartners