
Joint meeting of the Molecular Libraries Screening Centers Network (MLSCN) and the Exploratory Centers for Cheminformatics Research (ECCR): Talk II

July 18, 2006

Geoffrey Fox

Computer Science, Informatics, Physics

Pervasive Technology Laboratories

Indiana University, Bloomington IN [email protected]

http://www.infomall.org

http://www.chembiogrid.org


Chemical Informatics and Cyberinfrastructure Collaboratory

Collaboration between School of Informatics (Cheminformatics, Bioinformatics, Computer Science), departments of Biology and Chemistry at Indiana University Bloomington and Indianapolis (IUPUI)

Thrusts are Education, use of Cyberinfrastructure for Cheminformatics and Computational Chemistry and Tool research

NSF has an Office of Cyberinfrastructure running (roughly) TeraGrid (100 TF distributed supercomputers) and eScience

eScience describes “modern Science as a team sport” with distributed Computers, Databases, Instruments, Sensors and People (>100 such projects worldwide)

eScience builds applications as Grids using large scale managed Web services

Training People for your Centers! Cheminformatics Education at IU

Linked to bioinformatics in Indiana University’s School of Informatics
• http://www.informatics.indiana.edu

School of Informatics degree programs
• BS, MS, PhD

Programs offered at both the Indianapolis (IUPUI) and Bloomington (IUB) campuses
• Bioinformatics MS and track on PhD
• Chemical Informatics MS and track on PhD
• Informatics Undergraduates can choose a chemistry cognate

PhD in Informatics started in August 2005 and offers tracks in
• bioinformatics; chemical informatics; health informatics; human-computer interaction design; social and organizational informatics; more to come!

Formal Cheminformatics Courses

I571 Chemical Information Technology (3 cr.)
• Distance Ed section had 10 students in Fall 2005, from California to Connecticut

I572 Computational Chemistry and Molecular Modeling (3 cr.)

I573 Programming Techniques for Chemical and Life Science Informatics (3 cr.)

I553 Independent Study in Chemical Informatics (3 cr.)

Above courses required for the new Graduate Certificate Program in Chemical Informatics

I533 Seminar in Chemical Informatics
• Spring 2006 Topic: Molecular Informatics, the Data Grid, and an Introduction to eScience
• http://www.indiana.edu/~cheminfo/I533/533home.html

I647 Seminar in Chemical Informatics
• Fall 2006 Topic: Bridging Bioinformatics and Chemical Informatics
• http://www.indiana.edu/~cheminfo/I647/647home.html

Related Courses

L519 Bioinformatics: Theory and Application (3 cr.) (at IUPUI: CSCI 548)

L529 Bioinformatics in Molecular Biology and Genetics: Practical Applications (4 cr.) (not offered at IUPUI)

I619 Structural Bioinformatics (3 cr.)

I617 Informatics in Life Sciences and Chemistry (3 cr.) (for non-majors)

B649 Topics in Systems: Service Architectures and Science (3 cr.)

I590 Topics in Informatics: Scientific Applications of XML (IUPUI)

Other Educational Activities

Graduate Certificate Program in Chemical Informatics (4 courses by Distance Education)
• Required courses: I571, I572, I573, I553
• Enrollees pay in-state graduate fees regardless of location

Special section of I571 will be taught as CIC CourseShare offering with Michigan, Fall 2006
• University of Michigan School of Pharmaceutical Engineering ChE531 Introduction to Chemoinformatics

Experiments with teleconferencing as a distance education tool (Raindance, Macromedia Breeze)

Mesa Analytics Cheminformatics Virtual Classroom
• http://www.chemvc.com:8020/
• Workshop, July 30, 2006 at the Biennial Conference on Chemical Education

Some Grid Concepts I

Services are “just” (distributed) programs sending and receiving messages with well defined syntax

Interfaces (input-output) must be open; innards can be open source (allowing you to modify) or proprietary
• Services can be any language from Fortran, Shell scripts, C, C#, C++, Java, Python, Perl – your choice!!
• Web Services supported by all vendors (IBM, Microsoft …)

Service overhead will be just a few milliseconds (more now) which is < typical network transit time
• Any program that is distributed can be a Web service
• Any program taking execution time ≥ 20ms can be an efficient Web service
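The 20 ms rule of thumb can be checked with simple arithmetic. The sketch below assumes a fixed per-call overhead of 2 ms (an illustrative stand-in for "a few milliseconds") and defines efficiency as execution time divided by execution time plus overhead. Both the function name and that definition are assumptions made for illustration, not something the slides specify.

```python
def service_efficiency(exec_ms: float, overhead_ms: float = 2.0) -> float:
    """Fraction of wall time spent on useful work for one service call.

    overhead_ms is an assumed, illustrative per-call service overhead.
    """
    return exec_ms / (exec_ms + overhead_ms)


# Compare a very short job, a 20 ms job, and a one-second job.
for exec_ms in (1.0, 20.0, 1000.0):
    print(f"{exec_ms:7.1f} ms job -> efficiency {service_efficiency(exec_ms):.3f}")
```

Under these assumptions a 20 ms job loses under a tenth of its wall time to overhead, while a 1 ms job spends most of its time on overhead, which is the intuition behind the threshold.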

Some Grid Concepts II

Systems are built from contributions from many different groups – you do not need one “vendor” for all components as Web services allow interoperability between components
• One reason DoD likes Grids (called Net-Centric computing)

Grids are distributed in services and data allowing anybody to store their data and to produce “their” view
• Some think that University Library of future will curate/store data of their faculty

“2 level programming model”: Classic programming of services and services are composed using workflow consistent with industry standards (BPEL)

Grid of Grids: (System of Systems) Realistically Grid-like systems will be built using multiple technologies and “standards” – use wrapping to integrate Pipeline Pilot, CBIS, Chembank, ChemBench, ChemModLab, PowerMV etc. with OGSA (Open Grid Service Architecture from OGF) systems into a single Grid
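The "2 level programming model" can be illustrated with a toy sketch: level one is ordinary code inside each service, level two is a workflow that only names the services and their order. Everything below is hypothetical. The three functions stand in for remote Web services, and `run_workflow` stands in for a workflow engine such as a BPEL runtime. Nothing here calls a real Grid service.

```python
from functools import reduce
from typing import Any, Callable, Sequence

Service = Callable[[Any], Any]


# Level 1: "services", written with classic programming.
def drop_missing(values):
    """Hypothetical service: discard missing measurements."""
    return [v for v in values if v is not None]


def keep_actives(values, cutoff=50.0):
    """Hypothetical service: keep values at or above an assumed activity cutoff."""
    return [v for v in values if v >= cutoff]


def summarize(values):
    """Hypothetical service: report how many values remain and the largest one."""
    return {"n": len(values), "max": max(values) if values else None}


# Level 2: the workflow is just an ordered list of services.
def run_workflow(steps: Sequence[Service], data: Any) -> Any:
    """Feed the output of each step into the next one."""
    return reduce(lambda acc, step: step(acc), steps, data)


workflow = [drop_missing, keep_actives, summarize]
result = run_workflow(workflow, [12.0, None, 87.5, 55.0])
print(result)
```

The point of the split is that the workflow list can be rearranged or extended without touching the code inside any service.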

Grid Capabilities for Science (Cheminformatics)

Open technologies for any large scale distributed system that is adopted by industry, many sciences and many countries (including UK, EU, USA, Asia)
• Security, Reliability, Management and state standards
• Many bioinformatics grids including BIRN, caBIG, MyGrid
• Also computational chemistry and related (materials) grids

Service and messaging specifications

User interfaces via portals and portlets virtualizing to desktops, email, PDA’s etc.
• ~20 TeraGrid Science Gateways including RENCI Bio portal
• OGCE Portal technology effort led by Indiana

Uniform approach to access distributed (super)computers supporting single (large) jobs and spawning lots of related jobs

Data and meta-data architecture supporting real-time and archives as well as federation
• Links to Semantic web and annotation

Grid (Web service) workflow with standards and several successful instantiations (such as Taverna and MyLead)

Taverna

Taverna is a typical Grid workflow system developed in the UK for bioinformatics in the MyGrid project

Perhaps no better than well known tools like Pipeline Pilot, but it links to “all Grid services”

Taverna is being made more robust and extended by the UK eScience program

[Figure residue: nucleotide sequence listing (positions 12181 to 12421) from the slide image]

Grid Workflow Datamining in Earth Science

Work with Scripps Institute

Grid services controlled by workflow process real time data from ~70 GPS Sensors in Southern California

[Figure: workflow diagram. Labels: Streaming Data Support; Transformations; Data Checking; Hidden Markov Datamining (JPL); Display (GIS); NASA; GPS; Earthquake]

Grid Workflow Data Assimilation in Earth Science

Grid services triggered by abnormal events and controlled by workflow process real time data from radar and high resolution simulations for tornado forecasts

Use a Portlet-based user portal to access and control services and workflow

CICC Prototype Web Services

Basic cheminformatics
• Molecular weights
• Molecular formulae
• Tanimoto similarity
• 2D Structure diagrams
• Molecular descriptors
• 3D structures
• InChi generation/search
• CMLRSS

Application based services
• Compare (NIH)
• Toxicity predictions (ToxTree)
• Literature extraction (OSCAR3)
• Clustering (BCI Toolkit)
• Docking, filtering, ... (OpenEye)
• Varuna simulation
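As an illustration of one of the basic services listed, here is a minimal sketch of Tanimoto similarity on binary fingerprints, each represented as the set of bit positions that are switched on. The fingerprints are made-up values, and treating two empty fingerprints as similarity 0 is a convention chosen here, not something the slides specify.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient |A & B| / |A | B| for sets of 'on' bit positions."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union


# Made-up fingerprints: 3 shared bits out of 6 distinct bits overall.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8, 9, 12}
print(tanimoto(fp1, fp2))
```

A production service would take fingerprints from a toolkit such as the CDK rather than hand-written sets.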

Next steps?
• Define WSDL interfaces to enable global production of compatible Web services; refine CML
• Ready to try “Prototype Production”
• Develop more training material
• Refine/go into production with key services including both tools, workflows and TeraGrid style simulations in capacity and capability modes
• In-house algorithm work for new services in clustering, diversity analysis, QSAR methodologies

Key Ideas
• Add value to PubChem with additional distributed services and databases
• Wrapping existing code in web services is not difficult
• Provide “core” (CDK) services and exemplars of typical tools
• Provide access to key databases via a web service interface
• Provide access to major Compute Grids
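To make the "wrapping existing code is not difficult" point concrete, here is a minimal sketch that wraps a small molecular-weight function as an HTTP service using Python's standard WSGI interface. This is a stand-in: the CICC services were described with WSDL and SOAP, whereas this example uses plain HTTP with a JSON reply. The `formula` query parameter, the reply layout, and the four-element mass table are all assumptions made for illustration.

```python
import json
import re
from urllib.parse import parse_qs

# Illustrative average atomic masses for a few elements only.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(formula: str) -> float:
    """The 'existing code': weight of a simple formula such as C3H8 (no brackets)."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


def app(environ, start_response):
    """WSGI wrapper: reads ?formula=... and answers with a JSON message."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    formula = query.get("formula", [""])[0]
    try:
        body = {"formula": formula, "weight": molecular_weight(formula)}
        status = "200 OK"
    except KeyError:
        # Raised when the formula uses an element missing from ATOMIC_MASS.
        body = {"error": "unsupported element in formula"}
        status = "400 Bad Request"
    payload = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]
```

To try it locally, the standard library can serve the wrapper with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. The wrapped function itself is unchanged, which is the design point: the service layer only translates messages.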

Web Service Locations

Indiana University
• Clustering
• VOTables
• OSCAR3
• Toxicity classification
• Database services

Penn State University: CDK based services
• Fingerprints
• Similarity calculations
• 2D structure diagrams
• Molecular descriptors

Cambridge University
• InChi generation / search
• CMLRSS
• OpenBabel

InfoChem
• SPRESI database

SDSC
• Typical TeraGrid Site

NIH
• PubChem …..
• Compare …..

Workflows Using Chemical Literature

All of PubMed “just” takes about a day to run through OSCAR3 on 2048 node Big Red

[Figure: workflow diagram. Labels: Bulk download of Pubmed abstracts; OSCAR3 program; OSCAR3 Service; Extract chemical structures; Searchable (structure/similarity) Grid database; Find similar molecules; Local DTP database; PubChem; PDBBind; Find similar documents]

SMILES   NAME      Pubmed ID
CCC      propane   1425356
CC       ethane    3546453
.....    .....     .....

Clustering of documents linked to clustering of chemicals

Large Scale Calculations on “All of PubChem/Med”

TeraGrid: 100 Teraflop now to 1000 Teraflop next year

IU 2048 node Big Red supercomputer: 20 Teraflop today

The CDK can currently calculate approx. 107 Descriptors
• Whole of PubChem (6M compounds) – 276 hours, 1 CPU
• On IU's Big Red, 2048 CPU's, 20 TF: < 7 minutes
• Even increasing the descriptor count by 5 times gives us < 35 minutes of compute time on Big Red

OSCAR3 takes a few seconds per abstract to text-mine all compounds in it
• All of PubMed would take < a day on Big Red
• Cleanup and Iteration would take some time

Can pre-calculate properties of smaller compounds using CDK (logP, BCUT, CPSA, …) and programs like GAMESS
• 100,000 compounds take < a week each on a single CPU and would be a practical computation over next year
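The descriptor timing estimate can be reproduced as a back-of-the-envelope calculation. The sketch below assumes perfect linear scaling and identical per-CPU speed, which is an idealization. Under those assumptions the result is about 8 minutes, and about 40 minutes for five times the descriptors. That is the same order as the "< 7" and "< 35" minute figures quoted, and the gap presumably reflects Big Red processors being somewhat faster than the single reference CPU, which is a guess rather than something the slides state.

```python
SINGLE_CPU_HOURS = 276   # quoted: whole of PubChem on 1 CPU
BIG_RED_CPUS = 2048      # quoted: IU Big Red CPU count


def ideal_minutes(cpu_hours: float, n_cpus: int, work_factor: float = 1.0) -> float:
    """Wall time in minutes assuming perfect linear scaling (an idealization)."""
    return cpu_hours * work_factor * 60.0 / n_cpus


base = ideal_minutes(SINGLE_CPU_HOURS, BIG_RED_CPUS)
five_x = ideal_minutes(SINGLE_CPU_HOURS, BIG_RED_CPUS, work_factor=5.0)
print(f"1x descriptors: {base:.1f} min, 5x descriptors: {five_x:.1f} min")
```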

Prototype CICC Project: Controlling the TGF pathway

Collaboration between Baik & Zhang at IU

[Figure: project diagram. Labels: PDB (1IAS, inactive TGF); PubChem; in-house molecules in Varuna QM Database; VARUNA; Simulations; AutoGeFF (Web Service to generate custom force fields); TeraGrid Supercomputers; “Flocks”; Experiments in the Zhang Lab; Active TGF with inhibitor; Conceptual understanding of TGF inhibition]

Questions:

- What molecular feature controls inhibitor binding?

- How do mutations impact binding?

Can afford few ms overhead!

Simulating the Structure and Reactivity of Cu-Aβ Complex

One of the speculations about the pathogenesis of Alzheimer’s Disease involves complexes of Cu-ions and β-Amyloid plaques.

We will test the hypothesis that a Cu-Aβ complex can catalytically activate dioxygen to give hydrogen peroxide in a molecular modeling study:

Unfortunately, the structure of the Cu-Aβ complex is currently not known.
(a) We will carry out a combined QM/QM-MM/MD study to propose a structure
(b) We will evaluate the plausibility of the catalysis by constructing a reaction profile
(c) We will use PubChem in combination with our in-house Quantum Chemical Database to identify small molecules and molecule classes that may inhibit the catalysis.

This study serves as a Prototype application where
• an unknown protein fragment structure is computed a priori and saved in an in-house database.
• the diversity of small molecule targets is derived from clustering PubChem and other federated databases, including an in-house structural database

Other Chemical Projects Utilizing the Cyberinfrastructure

Mechanistic studies on
• how the anticancer drug cisplatin interacts with DNA to kill cancer cells.
• how xanthine oxidase catalyzes the oxidation of xanthine.
• how the bacterial enzyme methane monooxygenase catalyzes methane oxidation.
• electrocyclization of small organic molecules that are relevant for natural product synthesis.
• stereoselective carbocyclizations that are catalyzed by Rhodium complexes.
• how molecular probes for in vivo detection of Zinc, Mercury and Lead can be designed in a rational fashion.

Utilizing a molecular modeling database
• By registering and saving the structures, charge distributions and molecular orbitals of computer simulations, we can conduct a new kind of similarity search and recognize trends.
• The molecular modeling database will allow for curating the structural information of other databases, such as PubChem, by providing more detailed simulated information.

MLSCN Post-HTS Biology Decision Support

Percent Inhibition or IC50 data is retrieved from HTS

Question: Was this screen successful?

Question: What should the active/inactive cutoffs be?

Question: What can we learn about the target protein or cell line from this screen?

Compounds submitted to PubChem

Workflows encoding plate & control well statistics, distribution analysis, etc

Workflows encoding distribution analysis of screening results

Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc

Grids can link data analysis (e.g. image processing developed in existing Grids), traditional Chem-informatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis

A Grid of Grids linking collections of services at
• PubChem
• ECCR centers
• MLSCN centers

[Diagram labels: CHEMINFORMATICS; PROCESS GRIDS]
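One common way to put a number on "Was this screen successful?" is the Z'-factor computed from control wells. The slides do not name this statistic, so the sketch below is an illustrative example of a plate-level control-well statistic rather than the method the CICC workflows actually encode. The control readings are made-up values.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls) -> float:
    """Z'-factor = 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    spread = 3.0 * (stdev(positive_controls) + stdev(negative_controls))
    window = abs(mean(positive_controls) - mean(negative_controls))
    return 1.0 - spread / window


# Made-up percent-inhibition readings from control wells on one plate.
pos = [95.0, 98.0, 97.0, 96.0]
neg = [4.0, 6.0, 5.0, 5.0]
print(f"Z' = {z_prime(pos, neg):.3f}")
```

Values close to 1 indicate well-separated controls with little scatter. A workflow step could flag plates whose value falls below a chosen threshold before any active/inactive cutoff is applied.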

MLSCN Data - How services and workflows are used

MLSCN submits HTS data to Pubchem and/or sends directly to workflow for real-time feedback

Data is stored in Pubchem

PubChem interfaces to workflows via SOAP

Workflows perform different kinds of analysis on the MLSCN data, including SAR, clustering, literature searching, protein searching, toxicity testing, etc…

End-user applications and interfaces utilize the information streams from the workflows for human interaction with the data and analysis

Example HTS workflow: finding cell-protein relationships

A protein implicated in tumor growth with known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex)

The screening data from a cellular HTS assay is similarity searched for compounds with similar 2D structures to the ligand.

Similar structures to the ligand can be browsed using client portlets.

Similar structures are filtered for druggability, are converted to 3D, and are automatically passed to the OpenEye FRED docking program for docking into the target protein.

Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the JMOL applet.

Docking results and activity patterns are fed into R services for building of activity models and correlations
• Least Squares Regression
• Random Forests
• Neural Nets

Next steps in workflows

Expansion of HTS Workflows
• Inclusion of ToxTree for toxicity flagging
• Prediction of protein binding through PDB ligand similarity search
• Inclusion of literature text mining (OSCAR)
• Using PubChem data instead of tumor cell dataset

More workflows
• Incorporating VARUNA, PubChem, PDBBind and other services
• Workflows from Cambridge collaboration

Making workflows available in other systems
• Taverna SCUFL <-> BPEL conversion
• Use of workflows in other execution environments (starting with myLEAD supporting triggering)

Methods Development at the CICC

Tagging methods for web-based annotation exploiting del.icio.us and Connotea

Development of QSAR model interpretability and applicability methods

RNN-Profiles for exploration of chemical spaces

VisualiSAR - SAR through visual analysis
• See http://www.daylight.com/meetings/mug99/Wild/Mug99.html

Visual Similarity Matrices for High Volume Datasets
• See http://www.osl.iu.edu/~chemuell/new/bioinformatics.php

Fast, accurate clustering using parallel Divisive K-means

Mapping of Natural Language queries to use cases and workflows

Advanced data mining models for drug discovery information

MPI Parallel Divkmeans clustering of PubChem

AVIDD Linux cluster, 5,273,852 structures (Pubchem compound collection, Nov 2005)

[Figure: Runtime (seconds) versus Number of processors, with series Minsize 1, Minsize 100 and Minsize 1000]

min_size  ncpus  wall_mins  walltime
1         20     676        11:16:06
1         40     444        7:24:24
1         60     379        6:18:41
1         80     353        5:53:00
100       20     462        7:41:58
100       40     356        5:56:01
100       40     356        5:55:47
100       60     339        5:38:44
100       80     337        5:36:53
1000      20     513        8:32:39
1000      40     376        6:16:25
1000      60     346        5:46:22
1000      80     346        5:45:40
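The timing table can be turned into relative speedup and parallel efficiency figures. The sketch below uses the 20-CPU run as the baseline, since no single-CPU time is reported. The baseline choice and the helper names are assumptions for illustration. The two 40-CPU runs for min_size 100 share the same rounded wall_mins value, so they appear once.

```python
# (min_size, ncpus) -> wall-clock minutes, taken from the timing table.
WALL_MINS = {
    (1, 20): 676, (1, 40): 444, (1, 60): 379, (1, 80): 353,
    (100, 20): 462, (100, 40): 356, (100, 60): 339, (100, 80): 337,
    (1000, 20): 513, (1000, 40): 376, (1000, 60): 346, (1000, 80): 346,
}
BASE_CPUS = 20  # smallest reported run, used here as the reference point


def relative_speedup(min_size: int, ncpus: int) -> float:
    """How many times faster the run is than the 20-CPU run."""
    return WALL_MINS[(min_size, BASE_CPUS)] / WALL_MINS[(min_size, ncpus)]


def relative_efficiency(min_size: int, ncpus: int) -> float:
    """Speedup divided by the factor by which the CPU count grew."""
    return relative_speedup(min_size, ncpus) / (ncpus / BASE_CPUS)


for min_size in (1, 100, 1000):
    s = relative_speedup(min_size, 80)
    e = relative_efficiency(min_size, 80)
    print(f"min_size={min_size:4d}: {s:.2f}x speedup on 4x CPUs, efficiency {e:.2f}")
```

Read this way, going from 20 to 80 processors gives less than a 2x speedup for every min_size setting, with most of the gain already achieved at 40 processors.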

Exploring Chemical Spaces

The problem
• Thousands of compounds
• 10's to 100's of descriptors

Requirements
• In my chemistry space what are the outliers? Which compounds are in the dense regions of space?
• I don't want to / can't do descriptor selection
• I don't want to squash things into a lower dimensional space
• I want a simple way to view all this

Our approach (Guha, R. et al., J. Chem. Inf. Model., 2006): Use the R-NN profile technique

R-NN Profiles & Exploring Chemical Space

4337 molecules

<MW> = 240, 5 descriptors

2 known outliers

Molecules at the top are in sparse regions

Molecules at the bottom are in dense regions

Drill down into specific regions (GGobi, VOPlot ...), annotate with activity, ...

Simple & intuitive, can be very fast

R-NN Profiles & HPC

R-NN profiles require a pairwise distance matrix
• Can be sped up with approximate NN methods
• R-NN profiles can be trivially parallelized

1000 x 100 data matrix => 1000 x 1000 distance matrix -> 2.1 sec (P4 1 GHz laptop)

Evaluating R-NN profiles for 1000 compounds -> 43 sec
• The current parameters allow a 100x speedup if we use 100 CPU's
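A minimal sketch of the idea behind an R-NN curve: for each molecule, count how many other molecules lie within radius R, for a series of radii expressed as fractions of the largest pairwise distance. The use of Euclidean distance, the ten-step radius grid, and the function names are assumptions for illustration. The published method should be consulted for the exact definition. Each row depends only on one molecule's distances, which is why the computation parallelizes trivially.

```python
import math


def pairwise_distances(points):
    """Full symmetric Euclidean distance matrix (the O(n^2) step)."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = math.dist(points[i], points[j])
    return dist


def rnn_curves(points, n_radii=10):
    """For each point, neighbour counts at R = 10%, 20%, ... of the max distance."""
    dist = pairwise_distances(points)
    d_max = max(max(row) for row in dist)
    radii = [d_max * (k / n_radii) for k in range(1, n_radii + 1)]
    return [
        [sum(1 for j, d in enumerate(row) if j != i and d <= r) for r in radii]
        for i, row in enumerate(dist)
    ]


# Four points in a tight cluster plus one far-away point (made-up descriptors).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
curves = rnn_curves(points)
for point, curve in zip(points, curves):
    print(point, curve)
```

Points in dense regions pick up their neighbours at small radii, while an outlier's curve stays at zero until R approaches the maximum distance.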

Measuring Model Applicability

We have many ways to build multiple models

We perform validation

But can we use a stored model for a new molecule(s)?
• Trivially, yes
• But does it really make sense to do this?
• Depends on similarities to the training set
• Also depends on a global chemistry space

We can provide a component that attempts to answer this question for arbitrary model types

Guha, R. et al., J. Chem. Inf. Model., 2005, 45, 65-73

Measuring Model Applicability

Our initial approach
• Considers regression models
• Considers similarity to the TSET

We and P&G are working on more robust methods that try to take into account a global chemistry space

Alternate methods can be easily included in workflows

[Figure: flowchart. Labels: Stored OLS/CNN/SVM/... Model; Training set residuals; Choose cutoff; Auxiliary classification model; New (unseen) molecule; Predict property; Obtain applicability; Decide whether it makes sense to go with this prediction]
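The flow on this slide can be sketched in a few lines: label each training molecule as well or poorly predicted by comparing its residual to a cutoff, then use an auxiliary classifier to guess which label a new molecule would receive. The slides do not say which classifier is used, so the k-nearest-neighbour vote below is a stand-in chosen for brevity. The descriptor vectors, residuals and cutoff are made-up values.

```python
import math


def label_training_set(train_x, residuals, cutoff):
    """Pair each training descriptor vector with True if |residual| <= cutoff."""
    return [(x, abs(r) <= cutoff) for x, r in zip(train_x, residuals)]


def is_applicable(labelled, new_x, k=3):
    """Majority vote among the k nearest training molecules (k-NN stand-in)."""
    nearest = sorted(labelled, key=lambda item: math.dist(item[0], new_x))[:k]
    good_votes = sum(1 for _, well_predicted in nearest if well_predicted)
    return good_votes * 2 > len(nearest)


# Made-up 2-descriptor training set with residuals from a stored model.
train_x = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]
residuals = [0.10, -0.20, 0.15, 1.40, -1.10, 0.90]
labelled = label_training_set(train_x, residuals, cutoff=0.5)

print(is_applicable(labelled, (0.1, 0.1)))  # near well-predicted molecules
print(is_applicable(labelled, (3.0, 3.1)))  # near poorly predicted molecules
```

Because the check only needs training descriptors and residuals, it works for any stored model type, which matches the aim of handling arbitrary models.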

More detailed Slides not used

[Figure: workflow client screenshot. Labels: Load Workflow; Run Workflow; Current Process; Result Output; Result Output URL]

Preliminary Results

Shown is a fully equilibrated structure (highest population in a 10 ns MD @ room temp) of the Cu-Aβ structure.

The Cu-(peptide) bonds require special attention, as standard force-fields do not allow simulations of this type.

We use a new tool AutoGeFF (to be implemented as a Web service) that recognizes bonds for which no force field parameters exist. AutoGeFF can generate appropriate force fields by automatically carrying out a QM calculation on a small model system and fitting new forces to computed vibrational frequencies. We use a guided Monte Carlo approach to iteratively derive these forces. Currently, AutoGeFF recognizes most of the transition metals. Future work will include organic moieties (such as drug candidates).

Example HTS workflow: organization & flagging

A biological screen is selected. The activity results for all the compounds are extracted from the database (currently using DTP Tumor Cell Line database)

The compounds are clustered on chemical structure similarity, to group similar compounds together

The compounds along with property and cluster information are converted to VOTABLES format and displayed in VOPLOT

OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs


Example of workflow output - LogP vs GI50

Plotting XLogP against GI50 can help identify highly active compounds with good logP profiles (1 - 4 range)
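The same selection that the plot supports visually can be expressed as a simple filter. In the sketch below the compound records, the micromolar GI50 unit, and the 1.0 activity threshold are all hypothetical. Only the 1 to 4 logP window comes from the slide.

```python
from typing import List, NamedTuple


class Compound(NamedTuple):
    name: str
    xlogp: float
    gi50_um: float  # hypothetical unit: micromolar, lower means more active


def promising(compounds: List[Compound], max_gi50_um: float = 1.0,
              logp_range=(1.0, 4.0)) -> List[Compound]:
    """Keep compounds inside the logP window that are also highly active."""
    low, high = logp_range
    return [c for c in compounds
            if low <= c.xlogp <= high and c.gi50_um <= max_gi50_um]


# Made-up records standing in for workflow output.
data = [
    Compound("cmpd-A", 2.3, 0.4),   # good logP, active
    Compound("cmpd-B", 5.6, 0.2),   # active but too lipophilic
    Compound("cmpd-C", 3.1, 25.0),  # good logP, weak activity
    Compound("cmpd-D", 1.2, 0.9),   # good logP, active
]
print([c.name for c in promising(data)])
```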


Example of workflow output - Cluster # vs GI50

Plotting Cluster against GI50 can help identify groups of highly active, structurally similar compounds, and also clusters which might yield good QSAR information

Example workflow output - docked complexes

Example output of most similar compounds to PDB 1Y4 complex ligands docked into the target protein using OpenEye FRED

NSC_ID 685478, Docking score -29.74

NSC_ID 685477, Docking score -35.51
