http://www.chembiogrid.org
Gary Wiggins for
Geoffrey Fox
April 30, 2007
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University Bloomington IN [email protected]
http://www.infomall.org
Indiana University Focus
Creating a comprehensive, easily accessible infrastructure for cheminformatics tools and data sources
Becoming a central hub of cheminformatics education
CICC Web Service Infrastructure
Cheminformatics services
Statistics services
Database services
Grid services
Portal services
Web Services Vision
Web services provide a neutral approach to exposing functionality
They can be located anywhere:
• On your desktop
• Intranet
• Internet
Literally anything can be made into a web service:
• Libraries
• Standalone programs
• Commercial code
• Open-source code
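As a rough illustration of the point above, the sketch below wraps an ordinary Python function as a small HTTP service using only the standard library. The slides do not say which protocol or toolkit the CICC services use, so plain HTTP with JSON is an assumption here, and `word_count`, `SERVICES`, `dispatch`, and `serve` are hypothetical names invented for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def word_count(text: str) -> int:
    """Stand-in for any existing library function or standalone program."""
    return len(text.split())


# Registry mapping a service name to the callable it exposes (hypothetical).
SERVICES = {"wordcount": word_count}


def dispatch(path: str) -> tuple[int, dict]:
    """Route '/<service>?text=...' to a registered callable.

    Returns an (HTTP status, JSON-serializable payload) pair.
    """
    parsed = urlparse(path)
    name = parsed.path.strip("/")
    params = parse_qs(parsed.query)
    if name not in SERVICES:
        return 404, {"error": f"unknown service: {name}"}
    if "text" not in params:
        return 400, {"error": "missing 'text' parameter"}
    result = SERVICES[name](params["text"][0])
    return 200, {"service": name, "result": result}


class ServiceHandler(BaseHTTPRequestHandler):
    """Thin HTTP layer: all routing logic lives in dispatch()."""

    def do_GET(self):
        status, payload = dispatch(self.path)
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Serve on localhost only, until interrupted."""
    HTTPServer(("127.0.0.1", port), ServiceHandler).serve_forever()
```

Calling `serve()` and requesting `/wordcount?text=a+b+c` on the chosen port would return a small JSON document. Keeping the routing in `dispatch` separate from the HTTP handler means the wrapped code stays unaware of how it is exposed.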
Modes of Access
Web Pages
Workflow Tools
• Taverna, Pipeline Pilot, Xbaya, etc.
GUIs
• Chimera
RSS Feeds
• Feeds include 2D/3D structures in CML
• Viewable in Bioclipse, Jmol as well as Sage etc.
• Two feeds currently available:
  SynSearch – get structures based on full or partial chemical names
  DockSearch – get best N structures for a target
Where Does Our Functionality Come From?
Indiana University: VOTables, NCI DTP predictions, database services
Cambridge University: InChI generation / search, OSCAR
OpenEye: Docking
DigitalChemistry: BCI fingerprints, DivKMeans
CDK: Cheminformatics
Univ. of Michigan: PkCell
R Foundation: R package
NIH: PubChem, PubMed
gNova Consulting
European Chemicals Bureau: ToxTree toxicity predictions
Methods Development at the CICC
Tagging methods for web-based annotation exploiting del.icio.us and Connotea
Development of QSAR model interpretability and applicability methods
RNN-Profiles for exploration of chemical spaces
VisualiSAR - SAR through visual analysis
• http://www.daylight.com/meetings/mug99/Wild/Mug99.html
Visual Similarity Matrices for High Volume Datasets
• http://www.osl.iu.edu/~chemuell/new/bioinformatics.php
Fast, accurate clustering using parallel Divisive K-means
Mapping of Natural Language queries to use cases and workflows
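For readers unfamiliar with the divisive K-means item in the list above, here is a minimal serial sketch of the general idea: start with one cluster and repeatedly bisect the cluster with the largest within-cluster error. This is a generic textbook-style bisecting k-means, not the CICC or Digital Chemistry implementation, and it omits the parallelism the slide mentions. The function names and the deterministic seeding rule are assumptions made for this example.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points)


def two_means(points, iterations=20):
    """Split one cluster in two with plain 2-means (Lloyd's algorithm).

    Seeds are deterministic: the point farthest from the centroid, then the
    point farthest from that seed. Returns (points, []) if no split exists.
    """
    c = centroid(points)
    a = max(points, key=lambda p: math.dist(p, c))
    b = max(points, key=lambda p: math.dist(p, a))
    left, right = list(points), []
    for _ in range(iterations):
        new_left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
        new_right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
        if not new_left or not new_right:
            break
        left, right = new_left, new_right
        new_a, new_b = centroid(left), centroid(right)
        if (new_a, new_b) == (a, b):  # assignments have converged
            break
        a, b = new_a, new_b
    return left, right


def divisive_kmeans(points, k):
    """Bisect the cluster with the largest SSE until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)
        left, right = two_means(worst)
        if not right:  # remaining clusters cannot be split further
            break
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters
```

Because each bisection touches only one cluster, independent clusters could in principle be split concurrently, which is one plausible route to a parallel version.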
Algorithm Development
Goals
• Focus on interpretability and applicability
• Devise novel approaches to clustering problems
• Investigate the utility of low dimensional representations for a variety of problems
Examples
• Ensemble feature selection (JCIM, in press)
• Cluster counting with R-NN curves (in revision)
Chemical Data Mining
Working on screening data with Scripps, FL
• Random forests (modeling & feature selection)
• Naïve Bayes (modeling)
• Identifying features indicative of toxicity
• Domain applicability
NCI DTP cell line activity predictions
• Random forest models for 60 cell lines
All available as
• downloadable R models
• web services (supply SMILES, get prediction) with web page clients
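The "supply SMILES, get prediction" pattern above can be sketched from the client side. The slides do not give the request format, so the endpoint `http://example.org/dtp/predict` and the `smiles` query parameter are hypothetical placeholders. The one non-obvious detail the sketch shows is that SMILES strings contain characters such as `#`, `=`, `(` and `+` that must be percent-encoded before being placed in a URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import urlopen

# Hypothetical endpoint: the real service address and parameters are not
# given in the slides.
BASE_URL = "http://example.org/dtp/predict"


def build_prediction_url(smiles: str, base_url: str = BASE_URL) -> str:
    """Return a GET URL with the SMILES string safely percent-encoded."""
    return f"{base_url}?{urlencode({'smiles': smiles})}"


def fetch_prediction(smiles: str, base_url: str = BASE_URL) -> str:
    """Send the request and return the raw response text (not called here)."""
    with urlopen(build_prediction_url(smiles, base_url), timeout=30) as resp:
        return resp.read().decode("utf-8")


# Round trip: the server side recovers the original SMILES unchanged.
aspirin = "CC(=O)Oc1ccccc1C(=O)O"
url = build_prediction_url(aspirin)
recovered = parse_qs(urlsplit(url).query)["smiles"][0]
```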
Computational Infrastructure
R, CDK, and PubChem
Goals
• Access cheminformatics from within R
• Access PubChem data from within R
The rcdk package makes CDK cheminformatics functionality available within R
rpubchem provides access to PubChem compound data and bioassay data
• Searchable via assay ID, keywords
J. Stat. Soft, 2007, 18(6)
Example: R Statistics applied to PubChem data
By exposing the R statistical package and the Chemistry Development Kit (CDK) toolkit as web services and integrating them with PubChem, we can quickly and easily perform statistical analysis and virtual screening of PubChem assay data.
Predictive models for particular screens are exposed as web services, and can be used either as simple web tools or integrated into other applications.
The example below uses DTP Tumor Cell Line screens: a predictive model using Random Forests in R predicts the probability of activity across multiple cell lines (available at http://www.chembiogrid/cheminfo/ncidtp/dtp).
Databases
Our databases aim to add value to PubChem or link into PubChem
3D structures (MMFF94)
• Searchable by CID, SMARTS, 3D similarity
Docked ligands (FRED)
• 960,000 drug-like compounds into 7 targets
• Will eventually cover ~2000 targets
Example: PubDock
Database of 960K PubChem structures (the most drug-like) docked into proteins taken from the PDB
Available as a web service, so structures can be accessed in your own programs, or using workflow tools like Pipeline Pilot
Several interfaces developed, including one based on Chimera (below) which integrates the database with the PDB to allow browsing of compounds in different targets, or different compounds in the same target
How do we use all of this?
Percent Inhibition or IC50 data is retrieved from HTS
Question: Was this screen successful?
Question: What should the active/inactive cutoffs be?
Question: What can we learn about the target protein or cell line from this screen?
Compounds submitted to PubChem
Workflows encoding distribution analysis of screening results
Grids can link data analysis (e.g. image processing developed in existing Grids), traditional cheminformatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis
A Grid of Grids linking collections of services at PubChem, ECCR centers, MLSCN centers
Workflows encoding plate & control well statistics, distribution analysis, etc.
Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc.
CHEMINFORMATICS PROCESS GRIDS
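The questions above about screen success and active/inactive cutoffs are often approached with simple plate statistics. The sketch below is one common convention, not necessarily what the CICC workflows encode: the Z'-factor computed from control wells as a screen-quality measure, and a cutoff of mean plus three standard deviations of the sample wells. The function names, the three-SD default, and the example numbers are assumptions made for illustration.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z'-factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|.

    Values near 1 indicate well-separated controls; values above roughly
    0.5 are conventionally taken to indicate a usable assay.
    """
    spread = 3 * (stdev(positive_controls) + stdev(negative_controls))
    window = abs(mean(positive_controls) - mean(negative_controls))
    return 1 - spread / window


def activity_cutoff(sample_values, n_sd=3.0):
    """Cutoff = mean + n_sd * SD of the sample wells (e.g. % inhibition)."""
    return mean(sample_values) + n_sd * stdev(sample_values)


def classify(sample_values, cutoff):
    """Label each well 'active' if it reaches the cutoff, else 'inactive'."""
    return ["active" if v >= cutoff else "inactive" for v in sample_values]


# Made-up control wells, in percent inhibition.
quality = z_prime([98, 100, 102], [-2, 0, 2])
```

A mean-plus-SD cutoff is sensitive to strong outliers, which inflate the standard deviation. Robust alternatives based on the median are a frequent substitute.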
Example HTS workflow: Finding cell-protein relationships
A protein implicated in tumor growth with a known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex).
The screening data from a cellular HTS assay is similarity searched for compounds with 2D structures similar to the ligand.
Similar structures to the ligand can be browsed using client portlets.
Similar structures are filtered for druggability, converted to 3D, and automatically passed to the OpenEye FRED docking program for docking into the target protein.
Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet.
Docking results and activity patterns are fed into R services for building activity models and correlations.
Least Squares Regression
Random Forests
Neural Nets
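The 2D similarity search step in the workflow above is commonly implemented by comparing structural fingerprints with the Tanimoto coefficient. The slides do not state which fingerprints or threshold are used, so the sketch below is only an assumption-based illustration: fingerprints are modeled as sets of "on" bit positions, and the compound names, bit values, and 0.7 threshold are invented.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def similarity_search(query_fp: set, library: dict, threshold: float = 0.7):
    """Return (name, score) pairs at or above threshold, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Invented fingerprints: each set holds the positions of the "on" bits.
ligand = {1, 2, 3, 4}
screening_library = {
    "cmpd_A": {1, 2, 3, 4},
    "cmpd_B": {1, 2, 3},
    "cmpd_C": {7, 8},
}
hits = similarity_search(ligand, screening_library)
```

In a real pipeline the fingerprints would come from a cheminformatics toolkit such as the CDK rather than being written by hand.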
Varuna environment for molecular modeling (Baik, IU)
[Diagram labels: Researcher; Simulation Service (FORTRAN code, scripts); QM Database; QM/MM Database; Reaction DB; DB Service (queries, clustering, curation, etc.); ChemBioGrid; PubChem, PDB, NCI, etc.; chemical concepts, experiments, papers, etc.; Condor "flocks"; TeraGrid supercomputers]
Cheminformatics Education at IU
School of Informatics degree programs: BS, MS, PhD
• Cheminformatics MS and track on PhD in Informatics
• Informatics undergraduates can choose a chemistry cognate (minor in chemistry)
Also Bioinformatics MS and Bioinformatics and Complex Systems tracks on PhD in Informatics
Good employer interest but modest student understanding of value of Cheminformatics degree
3 core graduate courses in Cheminformatics plus seminars and independent study courses
Significant interest in distance education versions of courses, promising for the Graduate Certificate in Chemical Informatics
http://www.informatics.indiana.edu
Spreading cheminformatics education with distance education
Partnered with the University of Michigan to offer our introductory graduate cheminformatics course at IU and Michigan as a CIC CourseShare
• UM pharmacy, chemistry and engineering students can be trained in cheminformatics for course credit at UM
Individual students in academia, government, and small and large life science companies have taken the class remotely from all over the country for credit towards the graduate certificate
Uses mixture of web conferencing (Breeze), videoconferencing, and online resources for maximum flexibility
• Most recent course wiki is available at http://cheminfo.informatics.indiana.edu/djwild/I571_2006_wiki
Giving a class remotely to UM students with video and web conferencing
CICC Infrastructure Vision
Drug Discovery and other academic chemistry and pharmacology research will be aided by powerful modern information technology.
ChemBioGrid is set up as a distributed cyberinfrastructure in the eScience model.
ChemBioGrid will provide user interfaces (portals) to distributed databases, results of high throughput screening instruments, results of computational chemical simulations and other analyses.
ChemBioGrid will provide services to manipulate this data and combine it in workflows; it will have convenient ways to submit and manage multiple jobs.
ChemBioGrid will include access to PubChem, PubMed, PubMed Central, the Internet and its derivatives like Microsoft Academic Live and Google Scholar.
The services include open-source software like CDK, commercial code from vendors such as Digital Chemistry, OpenEye, and Google, and any user contributed programs.
ChemBioGrid will define open interfaces to use for a particular type of service allowing plug and play choices between different implementations.
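The "plug and play" idea in the last point can be illustrated with a small interface sketch. Nothing here reflects actual ChemBioGrid interface definitions: `ActivityPredictor`, the two toy implementations, and `screen` are hypothetical names, and the scoring rules are deliberately meaningless placeholders rather than chemistry.

```python
from typing import Iterable, Protocol


class ActivityPredictor(Protocol):
    """Hypothetical open interface: anything with this method plugs in."""

    def predict(self, smiles: str) -> float:
        """Return a score in [0, 1] for one structure."""
        ...


class ConstantPredictor:
    """Toy implementation that returns the same score for every input."""

    def __init__(self, value: float) -> None:
        self.value = value

    def predict(self, smiles: str) -> float:
        return self.value


class LengthPredictor:
    """Toy implementation: score grows with string length (not chemistry)."""

    def predict(self, smiles: str) -> float:
        return min(len(smiles) / 20, 1.0)


def screen(predictor: ActivityPredictor, smiles_list: Iterable[str],
           threshold: float = 0.5) -> list[str]:
    """Workflow step that depends only on the interface, not on a vendor."""
    return [s for s in smiles_list if predictor.predict(s) >= threshold]


compounds = ["CCO", "CC(=O)Oc1ccccc1C(=O)O"]
kept_by_constant = screen(ConstantPredictor(0.9), compounds)
kept_by_length = screen(LengthPredictor(), compounds)
```

Because `screen` is written against the interface, swapping one implementation for another requires no change to the workflow code, which is the property the slide describes.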
CICC Senior Personnel
Geoffrey C. Fox, Mu-Hyun (Mookie) Baik, Dennis B. Gannon, Kevin E. Gilbert, Rajarshi Guha, Marlon Pierce, Beth A. Plale, Gary D. Wiggins, David J. Wild, Yuqing (Melanie) Wu
Peter T. Cherbas, Mehmet M. Dalkilic, Charles H. Davis, A. Keith Dunker, Kelsey M. Forsythe, John C. Huffman, Malika Mahoui, Daniel J. Mindiola, Santiago D. Schnell, William Scott, Craig A. Stewart, David R. Williams
From Biology, Chemistry, Computer Science, Informatics at IU Bloomington and IUPUI (Indianapolis)