Big Data in Chemistry - Aboutbigchem.eu/sites/default/files/School1_Tetko.pdf · Data Types...

Big Data in Chemistry

Institute of Structural Biology (STB), Helmholtz Zentrum München (HMGU)

Dr.IgorV.Tetko

Neuherberg,18October2016

Outline

•  Sources•  ExampleofBigData•  Dataqualityandcomplexity•  Annota?onoflargevirtualsets•  Deeplearning•  Securedatasharing•  Outlook

Big Data Sources

DowereallyhaveBigDatainchemistry?Whatkindoflargedatadowehave?

Big Data definition

Bigdataisatermfordatasetsthataresolargeorcomplexthattradi?onaldataprocessingapplica?ons

areinadequate(Wikipedia)

Large Chemical Database

100,000

1,000,000

10,000,000

100,000,000

1,000,000,000

Compounds

Exp.Facts

?

Data Types

Database Main data types

ChEMBL v. 211 Data mined from literature and PubChemHTS assays

BindingDB2 Experimental protein-small molecule interaction data

PubChem3 Bioactivity data from HTS assays

Reaxys4 Literature mined property, activity and reaction data

SciFinder (CAS)5 Experimental properties, 13C and 1H NMR spectra, reaction data

GOSTAR6 Target-linked data from patents and articles

AZ IBIS7 AZ in-house SAR data points

OCHEM8 Mainly ADMET data collected from literature

1)PapadatosG,etal.JComputAidedMolDes2015;29(9)885-96.2)GilsonMK,etal.NucleicAcidsRes2016;44(D1):D1045-53.3)KimS,etal.NucleicAcidsRes2016;44(D1):D1202-13.4)h^p://www.elsevier.com/solu?ons/reaxys5)h^p://www.cas.org/products/scifinder6)h^p://www.gostardb.com7)MuresanSetal.DrugDiscovToday2011;16(23-24):1019-30.8)SushkoI,etal..JComputAidedMolDes2011;25(6):533-54.

Big Data sizesBigdataisatermfordatasetsthataresolargeorcomplexthattradi?onaldataprocessingapplica?onsareinadequate(Wikipedia)

CCBY-SA3.0,h^ps://commons.wikimedia.org/w/index.php?curid=29452425

1exabyte:1018

Large Chemical Database

100,0001,000,000

10,000,000100,000,000

1,000,000,00010,000,000,000100,000,000,000

1,000,000,000,00010,000,000,000,000

100,000,000,000,0001,000,000,000,000,000

10,000,000,000,000,000100,000,000,000,000,000

1,000,000,000,000,000,000

Compounds

Exp.Facts

Big Data are relative to a field

•  Methodstoanalyzesuchdatadonotexist•  Wemaynotsufficienttechnicalresources(speed,memory)to

usetheexis?ngmethods•  Wemaynothaveknowledgetousetheexis?ngmethodsThustheBigDatacanappeardueto:Physicalchallenges(hardware)Knowledgechallenges(informa?cs,sogware)

Example of Big Data

Whichdataarereallybigones?

What data sizes are “big” ones?

“Generalmel?ngpointpredic?onbasedonadiversecompounddatasetandar?ficialneuralnetworks”Karthikeyanetal.J.Chem.Inf.Model.2005,45(3),681-90.N=4173

à  Largedataset~50kà  Bigdataset~250k

Melting Point Datasets

•  Bergström277•  Bradley2886•  OCHEM22404•  Enamine21883

data

Bergström

Bradley

OCHEM

Enamine

TetkoetalJ.Chem.Inf.Model.2014,22;54(12):3320-9.

275k Melting Point Datasets

•  Bergström277•  Bradley2886•  OCHEM22404•  Enamine21883•  PATENTS228079

data

Bergström Bradley OCHEM Enamine Patents

TetkoetalJ.Chemoinforma2cs,2016,8,2.

COMBINED:OCHEM+Enamine+Bradley+Bergström

Extraction of MP information from patents

•  [0835] To a solution of 2-amino-4,6-dimethoxybenzamide (0.195 g, 0.99 mmol) and 5-(2-(tert-butyldimethylsilyloxy)ethoxy)-6-phenylpicolinaldehyde (0.355 g, 0.99 mmol) in N,N-dimethyl acetamide (10 ml), was added NaHSO3 (0.264 g, 1.49 mmol) and p-toluenesulfonic acid monohydrate (0.038 g, 0.198 mmol). The reaction mixture was heated at 120° C. for 16 h. After that time the reaction was cooled to rt and the solvent was removed under reduced pressure. The reaction mixture was then diluted with water (150 mL) and neutralized with NaHCO3. The precipitated solids were collected by filtration, washed with water and dried to give 2-(5-(2-(tert-butyldimethylsilyloxy)ethoxy)-6-phenylpyridin-2-yl)-5,7-dimethoxyquinazolin-4(3H)-one (0.500 g, 94%) as an off-white solid: 1H NMR (400 MHz, DMSO-d6) δ 11.08 (s, 1H), 8.35 (d, J=8.98 Hz, 1H), 8.21 (d, J=2.34 Hz, 2H), 7.82 (d, J=8.59 Hz, 1H), 7.44-7.52 (m, 3H), 6.81 (d, J=2.34 Hz, 1H), 6.58 (d, J=2.34 Hz, 1H), 4.24-4.32 (m, 2H), 3.94-4.00 (m, 2H), 3.92 (s, 3H), 3.86 (s, 3H), 0.85 (s, 9H), 0.08 (s, 6H); ESI MS m/z 534 [M+H]+.

•  http://www.google.com/patents/US20140140956

Extracting of melting points from patents

Workflow

NextMoveLtd,UK

Extraction of MP information from patents

Modeling of MP data

Package name

Type of descriptors

Number of descriptors

Matrix size, billions

Non zero values, millions

Sparseness

Functional Groups integer 595 0.18 3.1 33

QNPR integer 1502 0.45 6.3 49

MolPrint binary 688634 205 8.1 7200

Estate count float 631 0.19 10 14

Inductive float 54 0.02 11 1

ECFP4 binary 1024 0.31 12 25

Isida integer 5886 1.75 18 37

ChemAxon float 498 0.15 23 1.5

GSFrag integer 1138 0.34 24 5.7

CDK float 239 0.07 27 2

Adriana float 200 0.06 32 1.3

Mera, Mersy float 571 0.17 61 1.1

Dragon float 1647 0.49 183 1.5

Large à Big

•  NeuralNetworkswastooslow(ensembletraining!)àSVMwasused

•  Supportofparallelcalcula?ons(48core)•  Supportofgridanalysis(>1000CPUs)

•  Storageoffulldatamatrix->sparsedatamatrix

Prediction errors for Bergström drug like compounds using models developed with different training sets

0

10

20

30

40

50

60

277(Bergström) 4k(Karthikeyan) 50k(Literature) 275k(Patents)

°C

trainingsetsize

Prediction of Huuskonen set using ALOGPS logP and MP based on 50k measurements

logS=0.5–0.01(MP-25)–logKow

Prediction of Huuskonen set using ALOGPS logP and MP based on 230k measurements

logS=0.5–0.01(MP-25)–logKow

Big Data Quality and Complexity

Whyisitveryimportant?Howdomainspecificanalysiscouldhelp?

Suscep?bilityofCPM-basedHTStoscreeningcompound-basedinterference.(A)Assayschema?cfortheCPM-basedHTSusedinthisstudy.TheassaymeasurestheHATac?vityoftheR^109–Vps75complex,whichcatalyzesthetransferofanacetylmoietyfromacetyl-CoAtospecificlysineresiduesontheAsf1–dH3–H4substratecomplextoproduceacetylatedhistoneresiduesandcoenzymeA(CoA).Addi?onofthethiol-scavengingprobeCPMleadstoahighlyfluorescentadductbyreac?ngwiththeCoAbyproduct,whichisusedtoquan?fyHATac?vityviafluorescenceintensitymeasurement.(B)Representa?veassayinterferencechemotypesiden?fiedduringpost-HTStriage.

DahlinetalJ.Med.Chem.2015,58,2091-2113.

99.8%FHs!

Promiscuous compounds filters

Pan Assay INterference compoundS (PAINS) Filters

•  colorquenching•  singletoxygenquenching•  auto-fluorescence•  covalentbinding•  inherently“s?cking”

compounds•  disrupttheinterac?on

betweenthetagoftheproteinandbindingsiteofthedetec?onsystem

BaellandHolloway,J.Med.Chem.,2010,53:2719-40.

~500filtersbasedonN=93212compounds

Structural & Toxic Alerts at http://ochem.eu •  Screening of compounds against published toxicity alerts,

groups, frequent hitters •  Filter alerts by endpoints or publications •  Create or upload custom SMARTS rules

Sushkoetal,J.Chem.Inf.Model,2012,52(8):2310-6.

>500func?onalgroups>2.3kalertsintotal

Identification of AlphaScreen-HIS Frequent Hitters

For Peer Review

210x297mm (300 x 300 DPI)

Page 31 of 43

http://mc.manuscriptcentral.com/jbsc

Journal of Biomolecular Screening

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960

TrueHitsTM:StreptavidinDonorbeadBio?nylatedAcceptorbeads

SchorppetalJ.Biomol.Screen.2014,9,715-726.

Mode Of Action of AlphaScreen-HIS Frequent Hitters

SchorppetalJ.Biomol.Screen.2014,9,715-726.

Bio Assays Ontology relationships

Abeyruwan,U.etal“EvolvingBioAssayOntology(BAO):Modulariza?on,Integra?onandApplica?ons,”JournalofBiomedicalSeman?cs,vol.5,no.1:S5,2014.

Annotation of large chemical spaces

BigData,whichhavebeenalwaysinchemistry.

Virtual chemical spaces

1,000

10,000

100,000

1,000,000

10,000,000

100,000,000

1,000,000,000

10,000,000,000

100,000,000,000

1,000,000,000,000

Synthesizable~1024andtotalspaceis~1060


1.00E+031.00E+041.00E+051.00E+061.00E+071.00E+081.00E+091.00E+101.00E+111.00E+121.00E+131.00E+141.00E+151.00E+161.00E+171.00E+18

GDB*N–allpossiblechemicalswith≤Natoms


1.00E+03

1.00E+05

1.00E+07

1.00E+09

1.00E+11

1.00E+13

1.00E+15

1.00E+17

1.00E+19

1.00E+21

1.00E+23

Synthesizable~1024


1.00E+031.00E+061.00E+091.00E+121.00E+151.00E+181.00E+211.00E+241.00E+271.00E+301.00E+331.00E+361.00E+391.00E+421.00E+451.00E+481.00E+511.00E+541.00E+571.00E+60

Synthesizable~1024andtotal“drug–like”spaceis~1060

Annotation of compounds

•  ALOGPS2.1*(predic?onoflogPandwatersolubilityofchemicalcompounds)

•  ~100,000moleculesperminute

•  Annota?onofGDB-17willtake~3yearsofcalcula?onsusingonecore

•  ~10minutesonLeibnizSupercompu?ngCentrewith241,000cores

*Tetko,I.V.J.Chem.Inf.Comput.Sci.2001,41,1407-1421.

We can’t predict unpredictable!

N

O O

newmeasurement

N O

O OHN O

O

newseriestopredict

New machine learning approaches

WhichmethodscanhelpuswithBigData?

CourtesyofProf.J.Bajorath

Multi-task learning

0

0.1

0.2

0.3

0.4

0.5

0.6

Mea

n A

bsol

ute

Erro

rone several

Human data

FatBrainLiverKidneyMuscle

Problem:• predic?onof?ssue-airpar??oncoefficients• smalldatasets30-100molecules(human&ratdata)Results:simultaneouspredic?onofseveralproper?esincreasedtheaccuracyofmodels

Varnek,A.etalJ.Chem.Inf.Model.2009,49,133-44.

Renaissance of neural networks

Deeplearning–  Massiveneuralnetworkswiththousandsofneuronsandlayers–  Newlearningmethods(dropouttechnique)

Examplesoftheuseofdeeplearningtechnology:–  Recogni?onofChinesecharacterswithhumanaccuracy–  VictoryinGo-tournament–  Diagnos?csofbreastcancer

Baskin,I.I.;Winkler,D.;Tetko,I.V.Arenaissanceofneuralnetworksindrugdiscovery.Expertopinionondrugdiscovery2016,11(8):785-95.

h^p://adsabs.harvard.edu/abs/2015arXiv150202072R

259datasets•  128PubChem•  17MUV•  102DUD-E•  12Tox21

Total~40Mdatapointsfor1.6Mcompounds

Descriptors:ECFP4RDKit

Multitask Networks Learning Results

•  Massivelymul?tasknetworksobtainpredic?veaccuraciessignificantlybe^erthansingle-taskmethods.

•  Thepredic?vepowerofmul?tasknetworksimprovesasaddi?onaltasksanddataareadded.

•  Thetotalamountofdataandthetotalnumberoftasksbothcontributesignificantlytomul?taskimprovement.

•  Mul?tasknetworksaffordlimitedtransferabilitytotasksnotinthetrainingset.


Multitask benefit from increasing tasks and data independently.


Secure Information Sharing

Howcanweshareinforma?onbutnotdata?Howcanweenablecoopera?onbetweenindustries?

Secure Sharing of information

•  CINF/COMPworkshopwasorganizedduringACSin2005byProf.Oprea•  Variousstructurerepresenta?on(descriptors)wereproposed•  Severalmethodsforsecuresharingwereintroduced

•  Butinthetheore?callimit*–  SMILESrepresenta?onofmolecules:CCC,CNCCC,c1ccccc1–  Zippingofstructuresrequires<1bitperatom–  Structurewith32atomsrequires<32bits–  Anydescriptorortheircombina?onwith>32bitscouldbeusedtodecodea

molecule(intheory)

*Tetko,I.V.;Abagyan,R.;Oprea,T.I.J.Comput.Aided.Mol.Des.2005,19,749-764.

Currently used technologies

“Honestbroker”–  Receivesdescriptors(orstructures)–  Developmodelsanddonotrevealtheunderlyingdata

Sharingrela?onshipsbetweenstructures–  MatchedMolecularPairs(changesinpropertyduetochangeofgroups)

Sharingdevelopedmodels–  Structuralalerts–  Computa?onalpredic?onmodels

Sharingreliablepredic?ons(surrogatedata)*

*Tetko,I.V.;Abagyan,R.;Oprea,T.I.J.Comput.Aided.Mol.Des.2005,19,749-764.

Multi-party secure computation

Secure summation

Conclusions

ExpectaOonsü  Improvedpredic?onofproper?es,and

ac?vi?esü  Improvedpoly-pharmacologyü  Searchofnewchemistry(topdown

explora?onanddenovodesign)ü  Predic?onofinvivoenpoints

Challengesü  Newmachinelearningapproaches(deep

learning)ü  Integra?onofdiversedataandapriory

knowledge(ontology,pathways,invitro,invivo,simula?onresults,differenterrors,etc.)

ü  Applicabilitydomainü  Securedatasharingü  Datavisualiza?onü  Denovodesign

Further reading

•  Tetko,I.V.;Engkvist,O.;Koch,U.;Reymond,J.L.;Chen,H.,BIGCHEM:ChallengesandOpportuni?esforBigDataAnalysisinChemistry.MolInform2016,35(11-12):615-621(OpenAccess).

•  Tetko,I.V.;Engkvist,O.;Chen,H.Does'BigData'existinmedicinalchemistry,andifso,howcanitbeharnessed?FutureMedChem.20168(15):1801-1806(OpenAccess).

Acknowledgements

Dr.Y.SushkoDr.S.NovotarskyiMr.R.KörnerMrs.E.SalminaDr.K.Hadian(TOX,HMGU)Dr.A.Williams(USA)Dr.D.Lowe(UK)

Dr.O.Engkvist(AZ)Dr.H.Chen(AZ)Dr.U.Koch(LDC)Prof.J.-L.Reymond(UniBern)Dr.I.Baskin(MSU)Dr.D.Winkler(CSIRO)AndBIGCHEMpartners

Big Data in Chemistry - Aboutbigchem.eu/sites/default/files/School1_Tetko.pdf · Data Types...

Documents

Transcript of Big Data in Chemistry - Aboutbigchem.eu/sites/default/files/School1_Tetko.pdf · Data Types...