Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening


Page 1: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

David Wild – ECCR Meeting, October 2005. Page 1 Indiana University School of

Chemical Informatics & Cyberinfrastructure Collaboratory

HTS Data Analysis & Virtual Screening

David J. Wild

Visiting Assistant Professor
Indiana University School of Informatics

http://www.informatics.indiana.edu/djwild/

Page 2: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Content

• Web services framework for HTS data analysis
– Long-term approach
• Priorities for web service development
– Rapid dataset organization using cluster analysis
– Interface tools for navigation and analysis
– Virtual screening

Page 3: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Thoughts relating to PubChem HTS analysis (and more widely applicable)

• Existing approaches do not scale up well
• Scientists’ questions are probably not going to be conceptually complex, but finding the answers can currently be very time consuming and/or complex (for a human)
– “who else is working on this chemical structure I just made (or similar ones)?”
– “are there any compounds in PubChem (or elsewhere) that might bind to the active site of this protein I just resolved?”
– “do any compounds related to this one exhibit toxic side effects?”
• We need to figure out just what the questions are! (Contextual Inquiry, Use cases)
• Answers are often “stale” after a short period of time – questions need to be re-answered as new information is generated
• Almost all available systems are passive, and follow the (web) browsing model

Page 4: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

• Interaction Layer
– Purpose: Software for information access and storage by humans, including email, browsing tools and “push” tools
– Tools: Web browsers, email clients, RSS aggregators, JMol, JME
• Aggregation Layer
– Purpose: Software, intelligent agents and data schemas customized for particular domains, applications and users
– Tools: BPEL, Microsoft Smart Client
• Interface Layer
– Purpose: Common interfaces to the data layer – may be several for different kinds of information
– Tools: Apache web services, SOAP wrappers, WSDL, UDDI, XML, Microsoft .NET
• Data Layer
– Purpose: Comprehensive data provision including storage, calculation, semantics and meta-data, probably in multiple systems
– Tools: MySQL, PostgreSQL, gNova Cartridge chemoinformatics calculation programs; data from NCI, ZINC

Wild, D.J., Strategies for Using Information Effectively in Early-stage Drug Discovery, in Ekins, S. (ed), Computer Applications in Pharmaceutical Research and Development, submitted July 2005

Page 5: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

(Four-layer table repeated from Page 4, with overlay labels added: web services; databases & tools; intelligent agents; human interfaces)

Page 6: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

[Diagram; recoverable labels grouped below]

Components: Online database (e.g. PubChem), Local database, 3D Docking Tool, 2D-3D converter, 3D visualizer

Connections: UDDI, WSDL, SOAP

Service types: atomic services, aggregate services

New Structure Service
– Search online databases for recent structures
– Search local databases for recent structures
– Merge Results

Request from Human Interface
– “find me all the structures that fit the enclosed protein for the next three months”

AGENT / SMART CLIENT
– Parse request
– Select appropriate use cases and/or web service(s)
– Schedule as necessary

USE-CASE SCRIPT
– Invoke New Structure Service
– Convert structures to 3D
– Dock results & protein file
– Extract any hits
– Return links for visualization
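
The use-case script in this diagram can be sketched as a function that chains the services. This is a minimal illustration in Python, not the project's implementation: every function name and data shape here is hypothetical, the services are passed in as plain callables so that local stubs can stand in for real SOAP endpoints, and "lower docking score is better" is an assumption.

```python
from typing import Callable, Iterable

# Hypothetical service signatures; real services would sit behind SOAP/WSDL.
Search = Callable[[], list[str]]      # returns structure identifiers
To3D = Callable[[str], str]           # 2D structure -> 3D structure
Dock = Callable[[str, str], float]    # (3D structure, protein) -> score


def new_structure_service(searches: Iterable[Search]) -> list[str]:
    """Aggregate service: query every source and merge results without duplicates."""
    merged: list[str] = []
    seen: set[str] = set()
    for search in searches:
        for structure in search():
            if structure not in seen:
                seen.add(structure)
                merged.append(structure)
    return merged


def ligand_use_case(searches: Iterable[Search], to_3d: To3D, dock: Dock,
                    protein: str, cutoff: float) -> list[tuple[str, float]]:
    """Use-case script: search, convert to 3D, dock, keep hits, best first."""
    hits = []
    for structure in new_structure_service(searches):
        score = dock(to_3d(structure), protein)
        if score <= cutoff:           # assumption: lower score means better fit
            hits.append((structure, score))
    return sorted(hits, key=lambda hit: hit[1])
```

An agent or smart client would call `ligand_use_case` on a schedule (for example, daily for three months) and return visualization links for whatever hits come back.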

Page 7: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Priorities for web service development

• Rapid dataset search and organization
– Search of PubChem (SOAP interface already exists)
– Search of local gNova / PostgreSQL database
– Clustering using BCI (Digital Chemistry) Divisive K-Means
– BCI Markush searching
• Interface tools for navigation and analysis
– Integration with Spotfire
– ChemTK (or other spreadsheet-metaphor product)
– Develop entirely new interface tools (usability studies)
• Virtual Screening
– Molecular docking with OpenEye FRED
– Property calculation with Molinspiration / ChemAxon
– PDB Search (EMBL)
– Activity prediction modules (Molinspiration / RP / SVMs etc)

Page 8: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Visualization & interface level tools

• No matter how clever the smarts underneath, the overriding factor in usefulness will be the quality of scientists’ interaction with the system
• Contextual Design, Interaction Design (Cooper) and Usability Studies have proven effective in designing the right interfaces for the right people in chemical informatics, and deserve investigation for future use in this project
• Possibility of multiple interfaces for different people groups (Cooper’s “primary personas”)
• Don’t assume the browser interface – email / NLP?
• Start with the basics
– 2D chemical structure drawing (input)
– Visualization of large numbers of chemical structures in 2D
– 3D chemical structure visualization
• Planning on evaluation of NLP, email, RSS, etc. as well as browser-based interfaces

Page 9: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Visualization methods for datasets & clusters

• Partitions
– Spreadsheets
– Enhanced Spreadsheets
– 2D or 3D plots
• Hierarchies
– Dendrograms
– Tree Maps
– Hyperbolic Maps

Page 10: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Supplemental Slides

Page 13: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Use Case #1: Are there any good ligands for my target?

• A chemist is working on a project involving a particular protein target, and wants to know:
– Any newly published compounds which might fit the protein receptor site
– Any published 3D structures of the protein or of protein-ligand complexes
– Any interactions of compounds with other proteins
– Any information published on the protein target

Page 14: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Use Case #1: Are there any good ligands for my target?

• A chemist is working on a project involving a particular protein target, and wants to know:
– Any newly published compounds which might fit the protein receptor site → gNova / PostgreSQL, PubChem search, FRED Docking
– Any published 3D structures of the protein or of protein-ligand complexes → PDB search
– Any interactions of compounds with other proteins → gNova / PostgreSQL, PubChem search
– Any information published on the protein target → Journal text search

Page 15: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Use Case #2: Who else is working on these structures?

• A chemist is working on a chemical series for a particular project and wants to know:
– If anyone publishes anything using the same or related compounds
– Any new compounds added to the corporate collection which are similar or related
– If any patents are submitted that might overlap the compounds he is working on
– Any pharmacological or toxicological results for those or related compounds
– The results for any other projects for which those compounds were screened

Page 16: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Use Case #2: Who else is working on these structures?

• A chemist is working on a chemical series for a particular project and wants to know:
– If anyone publishes anything using the same or related compounds → PubChem search
– Any new compounds added to the corporate collection which are similar or related → gNova CHORD / PostgreSQL
– If any patents are submitted that might overlap the compounds he is working on → BCI Markush handling software
– Any pharmacological or toxicological results for those or related compounds → gNova CHORD / PostgreSQL, MiToolkit
– The results for any other projects for which those compounds were screened → gNova CHORD / PostgreSQL, PubChem search

Page 17: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Use Case – PubChem: Which of these hits should I follow up?

• An MLI HTS experiment has produced 10,000 possible hits out of a screening set of 2m compounds. A chemist at another laboratory wants to know if there are any interesting active series she might want to pursue, based on:
– Structure-activity relationships
– Chemical and pharmacokinetic properties
– Compound history
– Patentability
– Toxicity
– Synthetic feasibility

Page 18: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Use Case – PubChem: Which of these hits should I follow up?

• An HTS experiment has produced 10,000 possible hits out of a screening set of 2m compounds. A chemist on the project wants to know what the most promising series of compounds for follow-up are, based on:
– Series selection → BCI cluster analysis
– Structure-activity relationships → lots of methods
– Chemical and pharmacokinetic properties → MiTools, ChemAxon
– Compound history → gNova / PostgreSQL / PubChem search
– Patentability → BCI Markush handling software
– Toxicity
– Synthetic feasibility
– + requires visualization tools!

Page 19: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Cluster Analysis and Chemical Informatics

• Used for organizing datasets into chemical series, to build predictive models, or to select representative compounds
• Organizational usage has not been as well studied as the other two, but see
– Wild, D.J., Blankley, C.J. Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping using Ward’s Clustering, Journal of Chemical Information and Computer Sciences, 2000, 40, 155-162.
• Essentially helping large datasets become manageable
• Methods used:
– Jarvis-Patrick and variants: O(N²), single partition
– Ward’s method: hierarchical, regarded as best, but at least O(N²)
– K-means: < O(N²), requires set no. of clusters, a little “messy”
– Sphere-exclusion (Butina): fast, simple, similar to JP
– Kohonen network: clusters arranged in 2D grid, ideal for visualization
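
As an illustration of the simplest method in this list, sphere-exclusion clustering can be sketched in a few lines. This is a naive Python sketch, not any vendor's implementation: fingerprints are assumed to be sets of "on" bit positions, similarity is assumed to be the Tanimoto coefficient, and the neighbour table is built by comparing all pairs.

```python
def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def sphere_exclusion(fps: list[frozenset[int]], threshold: float) -> list[list[int]]:
    """Group fingerprint indices into clusters around greedily chosen centroids."""
    n = len(fps)
    # Neighbour lists: every other item at or above the similarity threshold.
    neighbours = [
        [j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= threshold]
        for i in range(n)
    ]
    # Visit candidates with the most neighbours first; ties keep input order.
    order = sorted(range(n), key=lambda i: -len(neighbours[i]))
    assigned: set[int] = set()
    clusters: list[list[int]] = []
    for centroid in order:
        if centroid in assigned:
            continue
        # The centroid claims all of its neighbours that are still unassigned.
        members = [centroid] + [j for j in neighbours[centroid] if j not in assigned]
        assigned.update(members)
        clusters.append(members)
    return clusters
```

The result is a single partition whose granularity depends only on the similarity threshold; each cluster lists its centroid first.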

Page 20: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Limitations of Ward’s method for large datasets (>1m)

• Best algorithms have O(N²) time requirement (RNN)
• Requires random access to fingerprints
– hence substantial memory requirements (O(N))
• Problem of selection of best partition
– can select desired number of clusters
• Easily hit 4GB memory addressing limit on 32 bit machines
– Approximately 2m compounds

Page 21: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Scaling up clustering methods

• Parallelisation
– Clustering algorithms can be adapted for multiple processors
– Some algorithms more appropriate than others for particular architectures
– Ward’s has been parallelized for shared memory machines, but overhead considerable
• New methods and algorithms
– Divisive (“bisecting”) K-means method
– Hierarchical Divisive
– Approx. O(N log N)

Page 22: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Divisive K-means Clustering

• New hierarchical divisive method
– Hierarchy built from top down, instead of bottom up
– Divide complete dataset into two clusters
– Continue dividing until all items are singletons
– Each binary division done using K-means method
– Originally proposed for document clustering
• “Bisecting K-means”
– Steinbach, Karypis and Kumar (Univ. Minnesota) http://www-users.cs.umn.edu/~karypis/publications/Papers/PDF/doccluster.pdf
– Found to be more effective than agglomerative methods
– Forms more uniformly-sized clusters at given level
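
The top-down procedure described on this slide can be sketched as follows. This is a minimal pure-Python sketch under stated assumptions, not the BCI code: points are tuples of floats, distance is squared Euclidean, seeds are two randomly sampled items, centroids are updated in batch mode, and the largest cluster is always divided next. The helper names are hypothetical.

```python
import random

Point = tuple[float, ...]


def _mean(points: list[Point]) -> Point:
    return tuple(sum(col) / len(points) for col in zip(*points))


def _sqdist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points: list[Point], rng: random.Random, max_iter: int = 50):
    """Split points into two groups with batch-mode K-means (K = 2)."""
    centroids = rng.sample(points, 2)          # two items as seeds
    groups = ([], [])
    for _ in range(max_iter):
        new_groups = ([], [])
        for p in points:
            nearer = 0 if _sqdist(p, centroids[0]) <= _sqdist(p, centroids[1]) else 1
            new_groups[nearer].append(p)
        if not new_groups[0] or not new_groups[1]:
            # Degenerate split (e.g. duplicate points): fall back to halving.
            half = len(points) // 2
            return points[:half], points[half:]
        if new_groups == groups:               # no relocation: converged
            break
        groups = new_groups
        centroids = [_mean(groups[0]), _mean(groups[1])]
    return groups


def bisecting_kmeans(points: list[Point], n_clusters: int, seed: int = 0):
    """Divide the largest cluster in two until n_clusters clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        largest = max(clusters, key=len)       # selection criterion: size
        if len(largest) < 2:
            break                              # only singletons remain
        clusters.remove(largest)
        clusters.extend(two_means(largest, rng))
    return clusters
```

Recording each division as it happens would give the full hierarchy; stopping at `n_clusters` returns one partition from it.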

Page 23: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

BCI Divkmeans

• Several options for detailed operation
– Selection of next cluster for division: size, variance, diameter
– affects selection of partitions from hierarchy, not shape of hierarchy
• Options within each K-means division step
– distance measure
– choice of seeds
– batch-mode or continuous update of centroids
– termination criterion
• Have developed parallel version for Linux clusters / grids in conjunction with BCI
• For more information, see Barnard and Engels talks at: http://cisrg.shef.ac.uk/shef2004/conference.htm
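
The three selection criteria named on this slide (size, variance, diameter) could be computed as below. The slide does not define them, so these are common textbook definitions used as assumptions, with Euclidean distance on coordinate tuples; the function names are hypothetical and are not the Divkmeans options themselves.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def size(cluster):
    """Number of members in the cluster."""
    return len(cluster)


def variance(cluster):
    """Mean squared distance of members from the cluster centroid."""
    centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
    return sum(dist(p, centroid) ** 2 for p in cluster) / len(cluster)


def diameter(cluster):
    """Largest pairwise distance between two members (0 for a singleton)."""
    return max((dist(a, b) for a, b in combinations(cluster, 2)), default=0.0)


CRITERIA = {"size": size, "variance": variance, "diameter": diameter}


def next_to_divide(clusters, criterion="size"):
    """Pick the divisible cluster (more than one member) scoring highest."""
    candidates = [c for c in clusters if len(c) > 1]
    return max(candidates, key=CRITERIA[criterion])
```

With a compact four-member cluster and a spread-out two-member cluster, "size" picks the former while "variance" and "diameter" pick the latter, which is why the choice changes the order of divisions.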

Page 24: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Comparative execution times
NCI subsets, 2.2 GHz Intel Celeron processor

[Chart: Execution Time (s), axis 0 to 30000, against Number of Structures in Clustered Set, axis 0 to 120000. Series: Wards, K-means, Divisive K-means, Parallel Divisive Kmeans (4-node). Data labels shown: 7h 27m, 3h 06m, 2h 25m, 44m]

Page 25: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Clustering a 1 million compound dataset on a 2.2 GHz Celeron Desktop Machine

Method | Time * | Memory Usage
K-Means (10,000 clusters) | 3½ days | 95 MB
Divisive K-means | 7 days | 65 MB
Divisive K-means (Parallel, 4 machines incl. 1.7 GHz Pentium M) | 16½ hours | ~ 50 MB

* Time for a single run may vary due to different selection of seeds. Runtimes can be shortened e.g. by using a max. number of iterations or a % relocation cutoff.

Results from AVIDD clusters & Teragrid coming soon….
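
For reference, the speedup implied by the table above can be worked out directly. The snippet only converts the quoted times to hours and divides them; since these are single runs on non-identical machines (see the footnote), the ratio is a rough indication rather than a measured parallel efficiency.

```python
HOURS_PER_DAY = 24

# Wall-clock times from the table above, converted to hours.
times_h = {
    "K-means (10,000 clusters)": 3.5 * HOURS_PER_DAY,
    "Divisive K-means": 7 * HOURS_PER_DAY,
    "Divisive K-means (parallel, 4 machines)": 16.5,
}

serial = times_h["Divisive K-means"]
parallel = times_h["Divisive K-means (parallel, 4 machines)"]
speedup = serial / parallel
print(f"speedup over serial divisive K-means: {speedup:.1f}x")  # prints 10.2x
```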

Page 26: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Divisive Kmeans: Conclusions

• Much faster than Ward’s, speed comparable to K-means, suitable for very large datasets (millions)
– Time requirements approximately O(N log N)
– Current implementation can cluster 1m compounds in under a week on a low-power desktop PC
– Cluster 1m compounds in a few hours with a 4-node parallel Linux cluster
• Better balance of cluster sizes than Wards or Kmeans
• Visual inspection of clusters suggests better assembly of compound series than other methods
• Better clustering of actives together than previously-studied methods
• Memory requirements minimal
• Experiments using AVIDD cluster and Teragrid forthcoming (50+ nodes)

Page 27: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Visualization & interface level tools

• No matter how clever the smarts underneath, the overriding factor in usefulness will be the quality of scientists’ interaction with the system
• Contextual Design, Interaction Design (Cooper) and Usability Studies have proven effective in designing the right interfaces for the right people in chemical informatics [collaboration with HCI?]
• Possibility of multiple interfaces for different people groups (Cooper’s “primary personas”)
• Don’t assume the browser interface – email / NLP?
• Start with the basics
– 2D chemical structure drawing (input)
– Visualization of large numbers of chemical structures in 2D
– 3D chemical structure visualization
• Planning on evaluation of NLP, email, RSS, etc. as well as browser-based interfaces

Page 28: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Usability of 2D structure drawing tools

• Key difference between “sequential” and “random” drawers
• Huge difference in intuitiveness
• Key factor: how badly you can mess things up
• Marvin Sketch ≈ JME > ChemDraw >> ISIS Draw

Page 29: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Visualization methods for datasets & clusters

• Partitions
– Spreadsheets
– Enhanced Spreadsheets
– 2D or 3D plots
• Hierarchies
– Dendrograms
– Tree Maps
– Hyperbolic Maps

Page 32: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

VisualiSAR – with a nod to Edward Tufte. See http://www.daylight.com/meetings/mug99/Wild/Mug99.html

Page 33: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Tree Maps – very Tufte-esque

Page 34: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

External support

• ECCR grant ($500,000)
– 20% Co-PI with Fox for development of web services for HTS data organization and visualization
– May lead to $5m/5 years grant for full center
• Applied for Microsoft Smart Clients for eScience grant ($50,000)
– Including Marlon Pierce in the Community Grids lab
• Peter Murray-Rust group, Cambridge – offering expertise and assistance with web services
• IO-Informatics – provision of Sentient software and consulting
• BCI – clustering, structure enumeration & toolkit, consulting
• OpenEye – a range of calculation tools, FRED docking
• Molinspiration – MiTools Toolkit
• gNova – CHORD chemical database system
• Possible financial support from company in the UK

Page 35: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Technology

• Perl SOAP::Lite
– Will be used for initial web service development
– Doesn’t really implement WSDL & UDDI
• Apache Axis & Tomcat
– Deploy WSDL for web services
• BPEL4WS – Business Process Execution Language
– For aggregation of web services
– http://www-128.ibm.com/developerworks/library/specification/ws-bpel/
• Microsoft .NET & C#
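
To make the SOAP layer concrete, the message a client would send to one of these web services can be sketched with the Python standard library. The slides name Perl SOAP::Lite for the actual services; Python is used here only for illustration. The envelope namespace is the standard SOAP 1.1 one, while the service namespace, operation name and parameter names are hypothetical.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:clustering"                    # hypothetical service namespace


def build_request(operation: str, params: dict[str, str]) -> bytes:
    """Wrap one operation call and its parameters in a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, name).text = value
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


# Hypothetical call: ask a clustering service for 10,000 clusters of a dataset.
message = build_request("clusterDataset", {"datasetId": "nci-subset", "clusters": "10000"})
```

The resulting bytes would be sent as the body of an HTTP POST to the service endpoint; a WSDL file describes which operations and parameters the endpoint accepts.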

Page 36: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Current activities

• Core activities
– Development of use-cases
– Development of initial web services (Perl SOAP::Lite)
– Use of Taverna to prototype use-case scripts
• Basic research on future components
– Organizing large amounts of chemical information for human consumption: development of very fast parallel clustering techniques, to be exposed as web services
– Selection of interface-level tools for basic interaction: chemical structure drawing, display; investigation of email, NLP, RSS, and browser interfaces
– Interface-level tools for visualization, navigation and analysis: cluster and dataset visualization, natural language interfaces

Page 37: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Sentient - an alternative approach to managing heterogeneous data sources

• Collaboration with IO-Informatics (along with Cornell, and UCSD) for the investigation of service-oriented architectures in life sciences research using Sentient software
• Aim to integrate several sources of information relating to Alzheimer’s Disease (brain imaging, morphology, gene expression) so that cross-dataset biomarkers can be identified
• Sentient uses Intelligent Multidimensional Objects (IMOs) to define and query data sources and the tools used to access them
• Still a browsing approach, but with a layer of coherence and “intelligence”
• Hope to expand to include chemistry data
• Can also be used as an interface-level tool

Page 40: Chemical Informatics & Cyberinfrastructure Collaboratory HTS Data Analysis & Virtual Screening

Conclusions so far

• Effective exploitation of large volumes and diverse sources of chemical information is a critical problem to solve, with a potentially huge impact on the drug discovery process
• Most information needs of chemists and drug discovery scientists are conceptually straightforward, but complex (for them) to implement
• All of the technology is now in place to implement many of these information need “use-cases”: the four level model using service-oriented architectures together with smart clients looks like a neat way of doing this
• The aggregation and interface levels offer the most challenges
• In conjunction with grid computing, rapid and effective organization and visualization of large chemical datasets is feasible in a web service environment
• Some pieces are missing:
– Chemical structure search of journals (wait for InChI)
– Automated patent searching
– Effective dataset organization
– Effective interfaces, especially visualization of large numbers of 2D structures (we’re working on it!)