Joint meeting of the Molecular Libraries Screening Centers Network (MLSCN) and the Exploratory Centers
for Cheminformatics Research (ECCR): Talk II
July 18, 2006
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington, IN
http://www.infomall.org
http://www.chembiogrid.org
Chemical Informatics and Cyberinfrastructure Collaboratory
Collaboration between School of Informatics (Cheminformatics, Bioinformatics, Computer Science), departments of Biology and Chemistry at Indiana University Bloomington and Indianapolis (IUPUI)
Thrusts are Education, use of Cyberinfrastructure for Cheminformatics and Computational Chemistry and Tool research
NSF has an Office of Cyberinfrastructure running (roughly) TeraGrid (100 TF distributed supercomputers) and eScience
eScience describes “modern Science as a team sport” with distributed Computers, Databases, Instruments, Sensors and People (>100 such projects worldwide)
eScience builds applications as Grids using large scale managed Web services
Training People for your Centers! Cheminformatics Education at IU
Linked to bioinformatics in Indiana University’s School of Informatics
• http://www.informatics.indiana.edu
School of Informatics degree programs
• BS, MS, PhD
Programs offered at both the Indianapolis (IUPUI) and Bloomington (IUB) campuses
• Bioinformatics MS and track on PhD
• Chemical Informatics MS and track on PhD
• Informatics Undergraduates can choose a chemistry cognate
PhD in Informatics started in August 2005 and offers tracks in
• bioinformatics; chemical informatics; health informatics; human-computer interaction design; social and organizational informatics; more to come!
Formal Cheminformatics Courses
I571 Chemical Information Technology (3 cr.)
• Distance Ed section had 10 students in Fall 2005, from California to Connecticut
I572 Computational Chemistry and Molecular Modeling (3 cr.)
I573 Programming Techniques for Chemical and Life Science Informatics (3 cr.)
I553 Independent Study in Chemical Informatics (3 cr.)
Above courses required for the new Graduate Certificate Program in Chemical Informatics
I533 Seminar in Chemical Informatics
• Spring 2006 Topic: Molecular Informatics, the Data Grid, and an Introduction to eScience
• http://www.indiana.edu/~cheminfo/I533/533home.html
I647 Seminar in Chemical Informatics
• Fall 2006 Topic: Bridging Bioinformatics and Chemical Informatics
• http://www.indiana.edu/~cheminfo/I647/647home.html
Related Courses
L519 Bioinformatics: Theory and Application (3 cr.) (at IUPUI: CSCI 548)
L529 Bioinformatics in Molecular Biology and Genetics: Practical Applications (4 cr.) (not offered at IUPUI)
I619 Structural Bioinformatics (3 cr.)
I617 Informatics in Life Sciences and Chemistry (3 cr.) (for non-majors)
B649 Topics in Systems: Service Architectures and Science (3 cr.)
I590 Topics in Informatics: Scientific Applications of XML (IUPUI)
Other Educational Activities
Graduate Certificate Program in Chemical Informatics (4 courses by Distance Education)
• Required courses: I571, I572, I573, I553
• Enrollees pay in-state graduate fees regardless of location
Special section of I571 will be taught as CIC CourseShare offering with Michigan, Fall 2006
• University of Michigan School of Pharmaceutical Engineering ChE531 Introduction to Chemoinformatics
Experiments with teleconferencing as a distance education tool (Raindance, Macromedia Breeze)
Mesa Analytics Cheminformatics Virtual Classroom
• http://www.chemvc.com:8020/
• Workshop, July 30, 2006 at the Biennial Conference on Chemical Education
Some Grid Concepts I
Services are “just” (distributed) programs sending and receiving messages with well defined syntax
Interfaces (input-output) must be open; innards can be open source (allowing you to modify) or proprietary
• Services can be any language from Fortran, Shell scripts, C, C#, C++, Java, Python, Perl – your choice!!
• Web Services supported by all vendors (IBM, Microsoft …)
Service overhead will be just a few milliseconds (more now) which is < typical network transit time
• Any program that is distributed can be a Web service
• Any program taking execution time ≥ 20ms can be an efficient Web service
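A rough worked example of the ≥ 20 ms rule of thumb: if one call does `exec_ms` of useful work and pays `overhead_ms` of service overhead, the useful fraction of wall time is exec / (exec + overhead). The 2 ms overhead below is an assumed stand-in for "a few milliseconds", not a measured value, and the function name is illustrative only.

```python
def service_efficiency(exec_ms: float, overhead_ms: float) -> float:
    """Fraction of wall time spent on useful work for one service call."""
    return exec_ms / (exec_ms + overhead_ms)


# Assumed overhead (hypothetical value standing in for "a few milliseconds").
ASSUMED_OVERHEAD_MS = 2.0

for exec_ms in (1.0, 20.0, 1000.0):
    eff = service_efficiency(exec_ms, ASSUMED_OVERHEAD_MS)
    print(f"{exec_ms:7.1f} ms of work -> {eff:.1%} useful")
```

Under this assumption a 20 ms job spends about nine tenths of its time on real work, while a 1 ms job is dominated by overhead.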
Some Grid Concepts II
Systems are built from contributions from many different groups – you do not need one “vendor” for all components as Web services allow interoperability between components
• One reason DoD likes Grids (called Net-Centric computing)
Grids are distributed in services and data allowing anybody to store their data and to produce “their” view
• Some think that University Library of future will curate/store data of their faculty
“2 level programming model”: Classic programming of services and services are composed using workflow consistent with industry standards (BPEL)
Grid of Grids: (System of Systems) Realistically Grid-like systems will be built using multiple technologies and “standards” – use wrapping to integrate Pipeline Pilot, CBIS, Chembank, ChemBench, ChemModLab, PowerMV etc. with OGSA (Open Grid Service Architecture from OGF) systems into a single Grid
Grid Capabilities for Science (Cheminformatics)
Open technologies for any large scale distributed system that is adopted by industry, many sciences and many countries (including UK, EU, USA, Asia)
• Security, Reliability, Management and state standards
• Many bioinformatics grids including BIRN, caBIG, MyGrid
• Also computational chemistry and related (materials) grids
Service and messaging specifications
User interfaces via portals and portlets virtualizing to desktops, email, PDA’s etc.
• ~20 TeraGrid Science Gateways including RENCI Bio portal
• OGCE Portal technology effort led by Indiana
Uniform approach to access distributed (super)computers supporting single (large) jobs and spawning lots of related jobs
Data and meta-data architecture supporting real-time and archives as well as federation
• Links to Semantic web and annotation
Grid (Web service) workflow with standards and several successful instantiations (such as Taverna and MyLead)
Taverna
Taverna is a typical Grid workflow system, developed in the UK for bioinformatics in the MyGrid project
It may be no better than well-known tools like Pipeline Pilot, but it links to “all Grid services”
Taverna is being robustified and extended by the UK eScience program
Grid Workflow Datamining in Earth Science
Work with Scripps Institute
Grid services controlled by workflow process real time data from ~70 GPS Sensors in Southern California
[Diagram labels: Streaming Data Support; Transformations; Data Checking; Hidden Markov Datamining (JPL); Display (GIS); NASA GPS; Earthquake]
Grid Workflow Data Assimilation in Earth Science
Grid services triggered by abnormal events and controlled by workflow process real time data from radar and high resolution simulations for tornado forecasts
Use a Portlet-based user portal to access and control services and workflow
CICC Prototype Web Services
Basic cheminformatics
• Molecular weights
• Molecular formulae
• Tanimoto similarity
• 2D Structure diagrams
• Molecular descriptors
• 3D structures
• InChI generation/search
• CMLRSS
Application based services
• Compare (NIH)
• Toxicity predictions (ToxTree)
• Literature extraction (OSCAR3)
• Clustering (BCI Toolkit)
• Docking, filtering, ... (OpenEye)
• Varuna simulation
Next steps?
• Define WSDL interfaces to enable global production of compatible Web services; refine CML
• Ready to try “Prototype Production”
• Develop more training material
• Refine/go into production with key services including both tools, workflows and TeraGrid style simulations in capacity and capability modes
• In-house algorithm work for new services in clustering, diversity analysis, QSAR methodologies
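For readers unfamiliar with the Tanimoto similarity listed among the basic cheminformatics services, here is a minimal sketch of the coefficient itself. It treats a fingerprint as the set of its "on" bit positions; the fingerprints shown are made-up values, and this is not the CICC service code.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty fingerprints match
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


# Hypothetical fingerprints, given as sets of on-bit positions.
fp_query = {1, 4, 7, 9}
fp_candidate = {1, 4, 8, 9, 12}
print(tanimoto(fp_query, fp_candidate))
```

A value of 1.0 means identical bit patterns and 0.0 means no bits in common.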
Key Ideas
Add value to PubChem with additional distributed services and databases
Wrapping existing code in web services is not difficult
Provide “core” (CDK) services and exemplars of typical tools
Provide access to key databases via a web service interface
Provide access to major Compute Grids
Web Service Locations
Indiana University
• Clustering
• VOTables
• OSCAR3
• Toxicity classification
• Database services
Penn State University: CDK based services
• Fingerprints
• Similarity calculations
• 2D structure diagrams
• Molecular descriptors
Cambridge University
• InChI generation / search
• CMLRSS
• OpenBabel
InfoChem
• SPRESI database
SDSC
• Typical TeraGrid Site
NIH
• PubChem …..
• Compare …..
Workflows Using Chemical Literature
All of PubMed “just” takes about a day to run through OSCAR3 on 2048 node Big Red
[Workflow diagram labels: Bulk download of PubMed abstracts; Extract chemical structures (OSCAR3 program, OSCAR3 Service); Find similar molecules; Find similar documents; Searchable (structure/similarity) Grid database; Local DTP database; PubChem; PDBBind]

SMILES   NAME      PubMed ID
CCC      propane   1425356
CC       ethane    3546453
.....    .....     .....

Clustering of documents linked to clustering of chemicals
Large Scale Calculations on “All of PubChem/Med”
TeraGrid: 100 Teraflop now to 1000 Teraflop next year
IU 2048 node Big Red supercomputer: 20 Teraflop today
The CDK can currently calculate approx. 107 Descriptors
• Whole of PubChem (6M compounds) – 276 hours, 1 CPU
• On IU's Big Red, 2048 CPU's, 20 TF: < 7 minutes
• Even increasing the descriptor count by 5 times gives us < 35 minutes of compute time on Big Red
OSCAR3 takes a few seconds per abstract to text-mine all compounds in it
• All of PubMed would take < a day on Big Red
• Cleanup and Iteration would take some time
Can pre-calculate properties of smaller compounds using CDK (logP, BCUT, CPSA, …) and programs like GAMESS
• 100,000 compounds take < a week each on a single CPU and would be a practical computation over next year
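A back-of-the-envelope check of the descriptor timing, as a sketch only. It assumes perfect linear speedup and that each Big Red CPU is as fast as the single benchmark CPU; neither assumption is stated on the slide. The function name and `work_factor` parameter are illustrative.

```python
SINGLE_CPU_HOURS = 276  # from the slide: whole of PubChem on 1 CPU
BIG_RED_CPUS = 2048     # from the slide


def ideal_minutes(single_cpu_hours: float, ncpus: int, work_factor: float = 1.0) -> float:
    """Wall time in minutes under ideal linear scaling (an idealization)."""
    return single_cpu_hours * work_factor * 60.0 / ncpus


print(f"{ideal_minutes(SINGLE_CPU_HOURS, BIG_RED_CPUS):.1f} min")
print(f"{ideal_minutes(SINGLE_CPU_HOURS, BIG_RED_CPUS, work_factor=5.0):.1f} min")
```

Under these assumptions the estimate comes to roughly 8 minutes, and roughly 40 minutes for five times the descriptors. That is the same order as the quoted figures; the remaining gap may reflect faster Big Red processors, which the slide does not specify.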
Prototype CICC Project: Controlling the TGF pathway
Collaboration between Baik & Zhang at IU
[Diagram labels: TeraGrid Supercomputers; “Flocks”; PDB 1IAS (Inactive TGF); VARUNA; Experiments in the Zhang Lab; Active TGF With inhibitor; PubChem; in-house Molecules in Varuna QM Database; Simulations; AutoGeFF; Conceptual Understanding of TGF Inhibition]
Questions:
- What molecular feature controls inhibitor binding?
- How do mutations impact binding?
Web Service to generate custom force fields
Can afford few ms overhead!
Simulating the Structure and Reactivity of Cu-Aβ Complex
One of the speculations about the pathogenesis of Alzheimer’s Disease involves complexes of Cu ions and β-Amyloid plaques.
We will test the hypothesis that a Cu-Aβ complex can catalytically activate dioxygen to give hydrogen peroxide in a molecular modeling study.
Unfortunately, the structure of the Cu-Aβ complex is currently not known.
(a) We will carry out a combined QM/QM-MM/MD study to propose a structure
(b) We will evaluate the plausibility of the catalysis by constructing a reaction profile
(c) We will use PubChem in combination with our in-house Quantum Chemical Database to identify small molecules and molecule classes that may inhibit the catalysis.
This study serves as a prototype application where an unknown protein fragment structure is computed a priori and saved in an in-house database. The diversity of small molecule targets is derived from clustering PubChem and other federated databases, including an in-house structural database.
Other Chemical Projects Utilizing the Cyberinfrastructure
Mechanistic studies on
• how the anticancer drug cisplatin interacts with DNA to kill cancer cells.
• how xanthine oxidase catalyzes the oxidation of xanthine.
• how the bacterial enzyme methane monooxygenase catalyzes methane oxidation.
• electrocyclization of small organic molecules that are relevant for natural product synthesis.
• stereoselective carbocyclizations that are catalyzed by Rhodium complexes.
• how molecular probes for in vivo detection of Zinc, Mercury and Lead can be designed in a rational fashion.
Utilizing a molecular modeling database
• By registering and saving the structures, charge distributions and molecular orbitals of computer simulations, we can conduct a new kind of similarity search and recognize trends.
• The molecular modeling database will allow for curating the structural information of other databases, such as PubChem, by providing more detailed simulated information.
MLSCN Post-HTS Biology Decision Support
Percent Inhibition or IC50 data is retrieved from HTS
Compounds submitted to PubChem
Question: Was this screen successful?
Question: What should the active/inactive cutoffs be?
Question: What can we learn about the target protein or cell line from this screen?
Workflows encoding plate & control well statistics, distribution analysis, etc
Workflows encoding distribution analysis of screening results
Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc
[Diagram labels: CHEMINFORMATICS; PROCESS; GRIDS]
Grids can link data analysis (e.g. image processing developed in existing Grids), traditional cheminformatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis
A Grid of Grids linking collections of services at
• PubChem
• ECCR centers
• MLSCN centers
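One way the cutoff question could be approached from control-well statistics is sketched below. The percent-inhibition normalization and the "mean plus three standard deviations of the negative controls" cutoff are common conventions assumed here for illustration; they are not taken from the slides. All plate readings and compound names are hypothetical.

```python
from statistics import mean, stdev


def percent_inhibition(signal: float, neg_mean: float, pos_mean: float) -> float:
    """0% at the uninhibited (negative) control mean, 100% at the inhibited (positive) control mean."""
    return 100.0 * (neg_mean - signal) / (neg_mean - pos_mean)


# Hypothetical raw readings from one plate.
neg_controls = [980.0, 1010.0, 1000.0, 1010.0]  # no inhibitor
pos_controls = [95.0, 105.0, 100.0, 100.0]      # fully inhibited
samples = {"cmpd_A": 990.0, "cmpd_B": 1005.0, "cmpd_C": 400.0, "cmpd_D": 985.0}

neg_mean, pos_mean = mean(neg_controls), mean(pos_controls)

# Assumed cutoff rule: mean + 3 SD of the negative controls, in % inhibition units.
neg_inhibition = [percent_inhibition(s, neg_mean, pos_mean) for s in neg_controls]
cutoff = mean(neg_inhibition) + 3.0 * stdev(neg_inhibition)

actives = [name for name, s in samples.items()
           if percent_inhibition(s, neg_mean, pos_mean) > cutoff]
print(f"cutoff = {cutoff:.2f}% inhibition; actives = {actives}")
```

A workflow step of this kind would run per plate, so that drift between plates is absorbed by each plate's own controls.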
MLSCN Data - How services and workflows are used
MLSCN submits HTS data to PubChem and/or sends directly to workflow for real-time feedback
Data is stored in PubChem
Workflows perform different kinds of analysis on the MLSCN data, including SAR, clustering, literature searching, protein searching, toxicity testing, etc…
End-user applications and interfaces utilize the information streams from the workflows for human interaction with the data and analysis
PubChem interfaces to workflows via SOAP
Example HTS workflow: finding cell-protein relationships
A protein implicated in tumor growth with known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex)
The screening data from a cellular HTS assay is similarity searched for compounds with similar 2D structures to the ligand.
Similar structures to the ligand can be browsed using client portlets.
Similar structures are filtered for drugability, are converted to 3D, and are automatically passed to the OpenEye FRED docking program for docking into the target protein.
Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the JMOL applet.
Docking results and activity patterns are fed into R services for building of activity models and correlations
• Least Squares Regression
• Random Forests
• Neural Nets
Next steps in workflows
Expansion of HTS Workflows
• Inclusion of ToxTree for toxicity flagging
• Prediction of protein binding through PDB ligand similarity search
• Inclusion of literature text mining (OSCAR)
• Using PubChem data instead of tumor cell dataset
More workflows
• Incorporating VARUNA, PubChem, PDBBind and other services
• Workflows from Cambridge collaboration
Making workflows available in other systems
• Taverna SCUFL <-> BPEL conversion
• Use of workflows in other execution environments (starting with myLEAD supporting triggering)
Methods Development at the CICC
Tagging methods for web-based annotation exploiting del.icio.us and Connotea
Development of QSAR model interpretability and applicability methods
RNN-Profiles for exploration of chemical spaces
VisualiSAR - SAR through visual analysis
• See http://www.daylight.com/meetings/mug99/Wild/Mug99.html
Visual Similarity Matrices for High Volume Datasets
• See http://www.osl.iu.edu/~chemuell/new/bioinformatics.php
Fast, accurate clustering using parallel Divisive K-means
Mapping of Natural Language queries to use cases and workflows
Advanced data mining models for drug discovery information
MPI Parallel Divkmeans clustering of PubChem
AVIDD Linux cluster, 5,273,852 structures (Pubchem compound collection, Nov 2005)
[Figure: Runtime (seconds) vs. Number of processors, with series for Minsize 1, Minsize 100 and Minsize 1000]

min_size  ncpus  wall_mins  walltime
1         20     676        11:16:06
1         40     444        7:24:24
1         60     379        6:18:41
1         80     353        5:53:00
100       20     462        7:41:58
100       40     356        5:56:01
100       40     356        5:55:47
100       60     339        5:38:44
100       80     337        5:36:53
1000      20     513        8:32:39
1000      40     376        6:16:25
1000      60     346        5:46:22
1000      80     346        5:45:40
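To read the scaling out of the timings listed above, the following sketch computes speedup and parallel efficiency relative to the 20-CPU run for each `min_size`. The wall-clock minutes are copied from the slide; taking the 20-CPU run as the baseline is a choice made here, since no single-CPU timing is given.

```python
# (min_size, ncpus) -> wall-clock minutes, copied from the slide's timings.
WALL_MINS = {
    (1, 20): 676, (1, 40): 444, (1, 60): 379, (1, 80): 353,
    (100, 20): 462, (100, 40): 356, (100, 60): 339, (100, 80): 337,
    (1000, 20): 513, (1000, 40): 376, (1000, 60): 346, (1000, 80): 346,
}
BASE_CPUS = 20  # baseline chosen for this sketch


def relative_speedup(min_size: int, ncpus: int) -> float:
    """Speedup of the ncpus run over the 20-CPU run for the same min_size."""
    return WALL_MINS[(min_size, BASE_CPUS)] / WALL_MINS[(min_size, ncpus)]


def relative_efficiency(min_size: int, ncpus: int) -> float:
    """Speedup divided by the increase in CPU count (1.0 would be ideal scaling)."""
    return relative_speedup(min_size, ncpus) * BASE_CPUS / ncpus


for min_size in (1, 100, 1000):
    for ncpus in (40, 60, 80):
        print(f"min_size={min_size:4d} ncpus={ncpus}: "
              f"speedup {relative_speedup(min_size, ncpus):.2f}x, "
              f"efficiency {relative_efficiency(min_size, ncpus):.0%}")
```

Going from 20 to 80 CPUs gives less than a 2x speedup in every case, so the run is far from linear scaling at these processor counts.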
Exploring Chemical Spaces
The problem
• Thousands of compounds
• 10's to 100's of descriptors
Requirements
• In my chemistry space what are the outliers? Which compounds are in the dense regions of space?
• I don't want to / can't do descriptor selection
• I don't want to squash things into a lower dimensional space
• I want a simple way to view all this
Our approach (Guha, R. et al., J. Chem. Inf. Model., 2006): Use the R-NN profile technique
R-NN Profiles & Exploring Chemical Space
4337 molecules
<MW> = 240, 5 descriptors
2 known outliers
Molecules at the top are in sparse regions
Molecules at the bottom are in dense regions
Drill down into specific regions (GGobi, VOPlot ...), annotate with activity, ...
Simple & intuitive, can be very fast
R-NN Profiles & HPC
R-NN profiles require a pairwise distance matrix
• Can be sped up with approximate NN methods
R-NN profiles can be trivially parallelized
• 1000 x 100 data matrix => 1000 x 1000 distance matrix -> 2.1 sec (P4 1 GHz laptop)
• Evaluating R-NN profiles for 1000 compounds -> 43 sec
• The current parameters allow a 100x speedup if we use 100 CPU's
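A minimal sketch of an R-NN profile, based on our reading of the technique: for each molecule, count how many other molecules lie within radius R, for R stepped from a small fraction up to 100% of the maximum pairwise distance. The function name, the Euclidean metric and the toy 2-descriptor data are assumptions for illustration, not the authors' implementation.

```python
import math


def rnn_curves(points, n_radii=100):
    """Return one neighbour-count curve per point, over n_radii increasing radii."""
    n = len(points)
    # Full pairwise distance matrix (Euclidean distance assumed for this sketch).
    dist = [[math.dist(p, q) for q in points] for p in points]
    d_max = max(max(row) for row in dist)
    curves = []
    for i in range(n):  # each point's curve is independent, hence trivially parallel
        curve = []
        for step in range(1, n_radii + 1):
            radius = d_max * (step / n_radii)
            curve.append(sum(1 for j in range(n) if j != i and dist[i][j] <= radius))
        curves.append(curve)
    return curves


# Hypothetical data: four clustered points and one outlier.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 0.0)]
for point, curve in zip(data, rnn_curves(data, n_radii=10)):
    print(point, curve)
```

Points in dense regions pick up neighbours at small radii, while an outlier's curve stays at zero until R is large. Because each curve depends only on one row of the distance matrix, the outer loop can be split across CPUs with no communication.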
Measuring Model Applicability
We have many ways to build multiple models
We perform validation
But can we use a stored model for a new molecule(s)?
• Trivially, yes
But does it really make sense to do this?
• Depends on similarities to the training set
• Also depends on a global chemistry space
We can provide a component that attempts to answer this question for arbitrary model types
Guha, R. et al., J. Chem. Inf. Model., 2005, 45, 65-73
Measuring Model Applicability
Our initial approach
• Considers regression models
• Considers similarity to the training set (TSET)
We and P&G are working on more robust methods that try to take into account a global chemistry space
Alternate methods can be easily included in workflows
[Diagram labels: Stored OLS/CNN/SVM/... Model; Training set residuals; Choose cutoff; Auxiliary classification model; New (unseen) molecule; Predict property; Obtain applicability; Decide whether it makes sense to go with this prediction]
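The flow named in the diagram labels can be sketched as follows: training-set residuals are thresholded into well-predicted and poorly-predicted classes, and an auxiliary classifier then assigns a new molecule to one of them. The 1-nearest-neighbour classifier, the cutoff value and all data below are assumptions made for illustration; the slides do not say which classifier or cutoff was used.

```python
import math


def label_training_set(residuals, cutoff):
    """True where the stored model's absolute residual is within the chosen cutoff."""
    return [abs(r) <= cutoff for r in residuals]


def is_applicable(new_descriptors, train_descriptors, train_labels):
    """Auxiliary classifier (1-nearest-neighbour, an assumed stand-in):
    inherit the well/poorly-predicted label of the closest training molecule."""
    nearest = min(range(len(train_descriptors)),
                  key=lambda i: math.dist(new_descriptors, train_descriptors[i]))
    return train_labels[nearest]


# Hypothetical training-set descriptors and residuals from a stored model.
train_descriptors = [(0.1, 1.0), (0.2, 1.1), (3.0, 4.0), (3.2, 4.1)]
train_residuals = [0.05, -0.10, 1.20, -0.90]
labels = label_training_set(train_residuals, cutoff=0.5)  # cutoff is an assumed value

print(is_applicable((0.15, 1.05), train_descriptors, labels))  # near well-predicted region
print(is_applicable((3.1, 4.0), train_descriptors, labels))    # near poorly-predicted region
```

The answer is advisory: it flags whether the stored model's prediction for the new molecule is likely to be trustworthy, and any classifier could be swapped in for the nearest-neighbour step.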
More detailed Slides not used
[Screenshot labels: LoadWorkflow; RunWorkflow; CurrentProcess; Result Output; ResultOutputURL]
Preliminary Results
Shown is a fully equilibrated structure (highest population in a 10 ns MD @ room temp) of the Cu-Aβ structure.
The Cu-(peptide) bonds require special attention, as standard force-fields do not allow simulations of this type.
We use a new tool AutoGeFF (to be implemented as a Web service) that recognizes bonds for which no force field parameters exist. AutoGeFF can generate appropriate force fields by automatically carrying out a QM calculation on a small model system and fitting new forces to computed vibrational frequencies. We use a guided Monte Carlo approach to iteratively derive these forces. Currently, AutoGeFF recognizes most of the transition metals. Future work will include organic moieties (such as drug candidates).
Example HTS workflow: organization & flagging
A biological screen is selected. The activity results for all the compounds are extracted from the database (currently using DTP Tumor Cell Line database)
The compounds are clustered on chemical structure similarity, to group similar compounds together
The compounds along with property and cluster information are converted to VOTABLES format and displayed in VOPLOT
OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs
Example of workflow output - LogP vs GI50
Plotting XLogP against GI50 can help identify highly active compounds with good logP profiles (1 - 4 range)
Example of workflow output - Cluster # vs GI50
Plotting Cluster against GI50 can help identify groups of highly active, structurally similar compounds, and also clusters which might yield good QSAR information
Example workflow output - docked complexes

NSC_ID   Docking score
685478   -29.74
685477   -35.51
719175   -30.78
725806   -32.15

Example output of most similar compounds to PDB 1Y4 complex ligands docked into the target protein using OpenEye FRED