Indiana University School of David Wild, Geoffrey Fox, Bioinformatics retreat, February 2007. Page 1...

David Wild, Geoffrey Fox, Bioinformatics retreat, February 2007. Page 1 Indiana University School of Chemoinformatics David Wild, [email protected] Bioinformatics Retreat, Feb 2nd, 2007


Page 1: Indiana University School of David Wild, Geoffrey Fox, Bioinformatics retreat, February 2007. Page 1 Chemoinformatics David Wild, djwild@indiana.edu Bioinformatics.

David Wild, Geoffrey Fox, Bioinformatics retreat, February 2007. Page 1 Indiana University School of

Chemoinformatics

David Wild, djwild@indiana.edu
Bioinformatics Retreat, Feb 2nd, 2007


Current state of chemoinformatics research

• What works and what doesn’t
– Fingerprints, clustering and diversity
– QSAR - predictive and descriptive methods, virtual screening
– 3D similarity, pharmacophores & docking
– Visualization, organization and navigation of chemical datasets
• Current buzz areas in chemoinformatics
• How can we use our internal strengths to do something new, important and impressive?


What works and what doesn’t

• 2D structure and similarity searching well established
– Lots of papers comparing fingerprints for similarity
– Some recent evidence Scitegic ECFPs better for recall of actives
• Clustering well established but definite room for improvement
– Traditional methods: Ward’s, K-means, Jarvis-Patrick
– Recently single pass similarity cutoff methods used for very fast organization - >0.85 for similar activity, >0.55 for QSAR
– Data mining methods - ROCK, Chameleon, Cure, etc. unexplored
– Diversity hot -> cold -> smart
• QSAR - poor relation of academic work to industry usefulness
– Lots of papers: “this method works best on this dataset”
– Random forests appear practically to work rather well
– Interpretability vs predictive ability
– Predictive methods for LogP, pKa, solubility, etc. work reasonably
– Virtual screening virtually useless unless tied in with HTS screening process. However, it is useful for exploring around leads.
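The single pass similarity cutoff idea mentioned on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used by any particular package: fingerprints are represented as sets of "on" bit positions, similarity is the Tanimoto coefficient, and each compound joins the first cluster whose leader exceeds the cutoff. The function names and the toy fingerprints are invented for the example.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient between two sets of 'on' bit positions."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def single_pass_cluster(fingerprints, cutoff=0.85):
    """One pass over the data: join the first cluster whose leader is more
    than `cutoff` similar, otherwise start a new cluster as its leader."""
    leaders = []   # one representative fingerprint per cluster
    clusters = []  # lists of indices into `fingerprints`
    for idx, fp in enumerate(fingerprints):
        for c, leader in enumerate(leaders):
            if tanimoto(fp, leader) > cutoff:
                clusters[c].append(idx)
                break
        else:
            leaders.append(fp)
            clusters.append([idx])
    return clusters


# Toy fingerprints, invented purely for illustration (not real compounds).
toy = [
    frozenset(range(0, 20)),     # reference
    frozenset(range(0, 19)),     # Tanimoto 19/20 = 0.95 with the reference
    frozenset(range(100, 120)),  # no bits in common with the reference
    frozenset(range(0, 12)),     # Tanimoto 12/20 = 0.60 with the reference
]

print(single_pass_cluster(toy, cutoff=0.85))  # tight groups
print(single_pass_cluster(toy, cutoff=0.55))  # looser groups
```

The result depends on input order, which is the price paid for touching each compound only once.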


What works and what doesn’t

• Mostly, 3D methods haven’t worked out yet
– Similarity & QSAR - Almost every paper: 2D better for recall and precision but 3D methods give “interesting ideas”. Useful for “lead hopping”
– Pharmacophore searching not widely used
– Docking - very useful for visual inspection, poor correlation of scoring functions with binding
• Visualization, organization and navigation of datasets
– Still not clear how to work with datasets > few hundred compounds

– Dot plots, spreadsheet-based methods work minimally

– Need for UI design and research


The current buzz in chemoinformatics

• Decorporatization and commoditization of data and software
– MLSCN, PubChem, open source, small companies
– Crisis for the software companies, nice for academia
– Pharma companies in the brown stuff without a paddle
• Integration with other “ics”
– Data mining chemical/genomic information
– Linking compounds -> proteins -> pathways, etc (e.g. KEGG)
• Fuzzy boundaries, integration with science and informatics
– Microsoft 2020 vision for science
• Integration of text and structure searching
• Semantic web, services and mashups will probably have a BIG impact: exporting best of breed… what happens to the rest?


Suggested collaboration areas

• Chem/bio/complex systems mashups using web services in each of the areas: nice, confined projects for students once you have the infrastructure

• Chem and complex can work together on integrating text and structure-based searching, indexing and crawling (e.g. networks of web services and databases), and intelligent agents

• Data mining of chemogenomic information

• Integration of advanced chemoinformatics methods with systems biology and pathway mapping tools

• Performing research to establish best practices for areas of chemoinformatics

• Tackling algorithmic problems for which there is currently no good solution - docking and scoring


Cyberinfrastructure

Geoffrey Fox
Computer Science, Informatics and Physics


Cyberinfrastructure

Supports distributed science – data, people, computers

Exploits Internet technology (Web2.0) adding (via Grid technology) management, security, supercomputers etc.

It has two aspects: parallel – low latency (microseconds) between nodes and distributed – highish latency (milliseconds) between nodes

Parallel needed to get high performance on individual 3D simulations, data analysis etc.; must decompose problem

Distributed aspect integrates already distinct components

Cyberinfrastructure is in general a distributed collection of parallel systems

Cyberinfrastructure is made of services (usually Web services) that are “just” programs or data sources packaged for distributed access
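The remark that parallel work "must decompose problem" can be illustrated with a toy example. The sketch below is an assumption-laden illustration rather than anything from the talk: it partitions an array into contiguous chunks, computes a partial result per chunk, and combines the partials. The helper names are hypothetical, and Python threads are used only to show the pattern; a real low-latency parallel code would typically use MPI or similar on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split(n_items, n_parts):
    """Partition range(n_items) into n_parts contiguous (start, stop) ranges
    whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_parts)
    bounds, start = [], 0
    for part in range(n_parts):
        stop = start + base + (1 if part < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def partial_sum_of_squares(data, start, stop):
    """Work done independently on one chunk of the decomposed problem."""
    return sum(x * x for x in data[start:stop])


def decomposed_sum_of_squares(data, n_workers=4):
    """Decompose, compute partial results concurrently, then combine."""
    chunks = split(len(data), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(
            lambda bounds: partial_sum_of_squares(data, *bounds), chunks
        )
        return sum(partials)


print(split(10, 4))
print(decomposed_sum_of_squares(list(range(10))))
```

The only communication here is the final combine step; problems such as 3D simulations also need to exchange boundary data between chunks at every step, which is where microsecond latency matters.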


TeraGrid: Integrating NSF Cyberinfrastructure

TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, and the National Center for Atmospheric Research.

Today 100 Teraflop; tomorrow a petaflop; Indiana 20 teraflop today.

[Map of TeraGrid sites and partners: SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC, NCAR, Caltech, USC-ISI, Utah, Iowa, Cornell, Buffalo, UNC-RENCI, Wisc]


Cyberinfrastructure at IU

Interpreted broadly (Web presences), there are many activities at IU

Interpreted narrowly as the “programmable web” or “using Grid technologies” there are large projects in atmospheric, earthquake, ice-sheet sciences, network systems, particle physics, Crystallography and Cheminformatics

• IU has an international reputation in both parallel and distributed Cyberinfrastructure including education, research and resources

• IU has #31 Supercomputer in world and is part of two major National activities TeraGrid and Open Science Grid

There are several well known Bioinformatics Grids such as BIRN (mainly images) and caBIG (cancer databases) from NIH and MyGrid from UK (EBI)

Could be opportunities to link Biology and Informatics/CS in Cyberinfrastructure projects


Cyberinfrastructure motivated by Web 2.0

Capture the power of interactive Web/Grid sites enabling people to create, collaborate and build on each other’s work

Programmableweb.com: 363 Web 2.0 API’s

Need Similar Life Science Portal for Tools and Data


Web services, workflows, portals and ontologies

• Web Services allow us to quickly develop and deploy new tools, interfaces that cross disciplines and are broadly accessible
– Can use simple HTTP and ignore Web Service complications
• Workflows (called mashups in Web 2.0) allow us to string together collections of web services to do computation that is tailored to the science (as a one-off or for re-use).
– Develop core capabilities as services and use in many different ways as in 770 Google map mashups
• API’s/Languages/Data structures/Ontologies (WSDL AJAX JSON at low level) allow us to describe workflows and services in discoverable, standard ways, such that reasoning tools can piece them together to match queries
• Portals enable composable reusable user interfaces
• Distributed posting of services and easily available composition tools enable “everybody” to contribute
– Interesting implications for “broader participation”
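The idea of stringing services together into a workflow can be sketched without any network at all. In the illustration below, which is a toy under stated assumptions and not the infrastructure described in the talk, each "service" is a local function that accepts and returns a JSON string, standing in for a remote call over HTTP. The service names, the lookup table and the crude carbon counter are all hypothetical.

```python
import json


def smiles_lookup_service(request_json: str) -> str:
    """Stand-in for a remote service mapping a compound name to SMILES."""
    table = {"ethanol": "CCO", "methane": "C", "benzene": "c1ccccc1"}
    request = json.loads(request_json)
    request["smiles"] = table[request["name"].lower()]
    return json.dumps(request)


def carbon_count_service(request_json: str) -> str:
    """Stand-in for a descriptor service. Crude on purpose: it counts the
    characters 'C' and 'c', so it would miscount SMILES containing Cl."""
    request = json.loads(request_json)
    request["carbons"] = sum(ch in "Cc" for ch in request["smiles"])
    return json.dumps(request)


def run_workflow(services, payload: dict) -> dict:
    """Pass a JSON message through each service in order."""
    message = json.dumps(payload)
    for service in services:
        message = service(message)
    return json.loads(message)


result = run_workflow(
    [smiles_lookup_service, carbon_count_service], {"name": "Benzene"}
)
print(result)
```

Because every step speaks the same message format, the same services can be recombined in a different order or with different partners, which is the property that makes mashups cheap to build.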


Model and Data Sharing

Cyberinfrastructure requires agreed sharing standards (data structures, API’s, protocols, ontologies, languages) as intrinsically internationally distributed

There are agreed data structures for taking Sequence -> Protein -> Folding -> Interaction transparently, e.g. BLAST

Nothing at the level where genomics and proteomics is important: cells and tissues.

Partial answers: CellML, FieldML, SBML which do not link to relevant standards outside Biology

Need to connect models at these levels. Need Standard ontologies/data structures for cell behaviors to allow connections and validation

Need to connect Models like SBW (Systems Biology Workbench)/BioSpice ->Cell-level models (Compucell) ->Tissue level models (Physiome)

Model builders at these scales not CS-sophisticated. Models NOT interoperable and don’t use useful general ideas

Glazier organizing activity in this area with H. Sauro (U. Washington), W. Li (UCSD-SDSC), Hunter (U. Auckland) and NIH

• Link to Open Grid Forum standard setting and community activities


http://www.chembiogrid.org

Database enabled quantum chemistry computations

Services to link PubChem, Supercomputers, results of high throughput Screening centers

Education; IU has unique Cheminformatics degrees

Portals


Chemical Informatics web service infrastructure

• Database Services
– Local NIH DTP Human Tumor Cell Line set
– Local PubChem mirror
– Derived properties database
– Pub3D, PubDock
– Synonym service
– VARUNA quantum chemistry database
• Statistics (based on R)
– Regression, Neural Nets, Random Forest
– LDA
– K-means clustering
– Plotting
– T-test and distribution sampling
• Computation Services
– OpenEye FRED, OMEGA, FILTER, …
– Cambridge OSCAR3
– BCI fingerprint generation, Ward’s, Divisive K-means clustering
– Tox Tree
– Similarity & fingerprint calculations (CDK)

– Descriptor calculation (CDK)

– 2D structure diagrams (CDK)

– 2D->3D File format conversions


Workflows - Taverna (taverna.sourceforge.net)


PubDock - Chimera-based interface


Kemo - A ChatBot for PubChem

• Uses ALICE chatbot www.alicebot.org

• AIML used to define knowledge base, e.g. reaction to common phrases like FIND ME, WHAT IS THE LOGP OF, etc

• Can iteratively improve knowledge base

• Accesses PubChem through web service interface
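The way a phrase-based knowledge base maps input such as FIND ME or WHAT IS THE LOGP OF onto an action can be sketched as follows. This is not AIML and not Kemo's actual code: it is a small Python stand-in using regular expressions, with hypothetical pattern and field names, and it stops at producing a structured request that a PubChem web service client could then act on.

```python
import re

# Hypothetical pattern table, loosely in the spirit of AIML categories:
# (pattern over normalized input, action name, property name or None)
PATTERNS = [
    (re.compile(r"^WHAT IS THE LOGP OF (?P<compound>.+)$"), "property", "logp"),
    (re.compile(r"^FIND ME (?P<compound>.+)$"), "search", None),
]


def interpret(text: str) -> dict:
    """Normalize the input (uppercase, drop punctuation, collapse spaces)
    and return the first matching request, or an 'unknown' request."""
    cleaned = " ".join(re.sub(r"[^A-Z0-9 \-]", " ", text.upper()).split())
    for pattern, action, prop in PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return {
                "action": action,
                "property": prop,
                "compound": match.group("compound"),
            }
    return {"action": "unknown", "property": None, "compound": None}


print(interpret("What is the logP of aspirin?"))
print(interpret("find me caffeine"))
print(interpret("hello there"))
```

Improving the knowledge base iteratively then amounts to appending new entries to the pattern table as unmatched phrases are observed.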


Workflow in Xbaya - a meteorology tool!

http://www.extreme.indiana.edu/xgws/xbaya/


Indexing the world’s chemical information AND computational functionality

• Crawl and index web pages, journal articles, etc. for
– Structures (InChIs, SMILES)
– Images (converted using Clide or ChemReader)
– Names (converted using OSCAR3 or similar package)
– Other information (IR spectra, reactions, etc…)
• Technology still immature, but improving quickly
• Problem with access to journal articles: we will assume open access in the future!
• Expose computational functionality as web services, contextualize in an OWL-S ontology (semantics), and publish in a UDDI

• Now we know what information we have, and what we can do with it

• Develop bots and intelligent agents to automatically do useful things
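As a small illustration of the structure-indexing step, the sketch below pulls InChI strings out of already-fetched page text. It is a rough heuristic under stated assumptions: it relies on the `InChI=1/` or `InChI=1S/` prefix, trims sentence punctuation, and performs no chemical validation. The function name and sample text are invented for the example, and SMILES detection, which has no such distinctive prefix, is not attempted.

```python
import re

# An InChI starts with "InChI=1/" or "InChI=1S/" and contains no whitespace.
INCHI_RE = re.compile(r"InChI=1S?/[^\s\"'<>]+")


def extract_inchis(text: str) -> list:
    """Return the distinct InChI-like strings in `text`, in order of
    first appearance."""
    found = []
    for match in INCHI_RE.finditer(text):
        candidate = match.group(0).rstrip(".,;")
        # Drop a closing parenthesis that belongs to the surrounding
        # sentence rather than to the InChI itself.
        while candidate.endswith(")") and candidate.count(")") > candidate.count("("):
            candidate = candidate[:-1].rstrip(".,;")
        if candidate not in found:
            found.append(candidate)
    return found


page_text = (
    "Ethanol (InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3) and water, "
    "InChI=1S/H2O/h1H2. Older pages may use InChI=1/CH4/h1H4."
)
for inchi in extract_inchis(page_text):
    print(inchi)
```

A crawler would run something like this over each fetched document and store the hits alongside the source URL; structures embedded as images or names need the conversion tools listed on the slide.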