Open PHACTS “Data integration for all”...support of a publicly accessible platform to integrate,...


Page 1

Open PHACTS

“Data integration for all”

Andrew Leach

Page 2

Task, workflow and results

Task: create a focussed set to identify leads against voltage-gated potassium channels

AUREUS search targets: voltage-gated potassium channels

Apply filters (MW, cLogP, Lipinski + remove undesirable target) ⇒ ~1000 molecules

Similarity searches (RG, TP, Daylight), cluster analysis ⇒ ~10000 molecules selected

IonWorks© single shot screening

240 single shot hits progressed into full curve assay

5 full curve actives (in at least one test occasion)

Series for lead optimisation

Stefan Senger, ca. 2004
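
The filtering step of this workflow can be sketched in a few lines. This is only an illustration, assuming the descriptors (molecular weight, cLogP, hydrogen-bond donors and acceptors) have already been computed. The `Molecule` record, the example values and the thresholds are hypothetical; they follow the common rule-of-five cut-offs, not the settings actually used in the 2004 project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Molecule:
    """Hypothetical record holding precomputed descriptors."""
    name: str
    mw: float      # molecular weight in Da
    clogp: float   # calculated logP
    hbd: int       # hydrogen-bond donors
    hba: int       # hydrogen-bond acceptors


def passes_lipinski(m: Molecule, max_violations: int = 1) -> bool:
    """Rule of five: tolerate at most `max_violations` broken rules."""
    violations = sum([m.mw > 500, m.clogp > 5, m.hbd > 5, m.hba > 10])
    return violations <= max_violations


def apply_filters(mols, mw_range=(200.0, 500.0), max_clogp=5.0):
    """Keep molecules inside the MW window, under the cLogP cap and Lipinski-compliant."""
    low, high = mw_range
    return [
        m for m in mols
        if low <= m.mw <= high and m.clogp <= max_clogp and passes_lipinski(m)
    ]


# Made-up candidates, for illustration only.
candidates = [
    Molecule("mol_a", 342.4, 2.1, 2, 5),
    Molecule("mol_b", 712.9, 6.3, 6, 12),
    Molecule("mol_c", 150.2, 0.4, 1, 2),
]
selected = apply_filters(candidates)
print([m.name for m in selected])
```

Removing undesirable targets and the later similarity and clustering steps would be further stages applied to `selected`; they are left out here because the slide gives no detail on them.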

Page 3

We (may) know where the data is, but integrating is a pain, bespoke, and often only for experts

Q: Identify all oxidoreductase inhibitors with an activity <100nM in both mouse and human

Q: The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X.

Q: For a given interaction profile, give me compounds similar to it.

[Diagram labels, data sources: ChEMBL; DrugBank; Gene Ontology; WikiPathways; UniProt; ChemSpider; UMLS; ConceptWiki; ChEBI; etc.; Internal]

Page 4

The Innovative Medicines Initiative

Biggest public-private partnership in area of medicine

Collaboration between European Commission and European Federation of Pharmaceutical Industries and Associations (EFPIA)

Promotion of medical innovation in Europe

Tackle key bottlenecks

Recognises “in kind” contributions

Focus on key problems

– Efficacy, Safety, Education & Training, Knowledge Management

Page 5

Public Domain Drug Discovery Data: Pharma are accessing, processing, storing & re-processing

Why repeat at each company?

[Diagram, repeated for GSK, AZ, Pfizer and Merck. Labels: Literature; PubChem; Genbank; Patents; Databases; Downloads; Data Integration; Data Analysis; Firewalled Databases]

Page 6

Information Tombs

– Built for primary use-case

– Tailored indexes

– Tailored GUIs

– Unique language & metadata

– Poor interoperability/integration

Literature HR Synthesis Portfolio SAR Docs Safety In vivo Etc

Page 7

Pfizer Limited – Coordinator

Universität Wien – Managing entity

Technical University of Denmark

University of Hamburg, Center for Bioinformatics

BioSolveIT GmbH

Consorci Mar Parc de Salut de Barcelona

Leiden University Medical Centre

Royal Society of Chemistry

Vrije Universiteit Amsterdam

Spanish National Cancer Research Centre

University of Manchester

Maastricht University

Aqnowledge

University of Santiago de Compostela

Rheinische Friedrich-Wilhelms-Universität Bonn

AstraZeneca

GlaxoSmithKline

Esteve

Novartis

Merck Serono

H. Lundbeck A/S

Eli Lilly

Netherlands Bioinformatics Centre

Swiss Institute of Bioinformatics

ConnectedDiscovery

EMBL-European Bioinformatics Institute

Janssen

OpenLink

Project Partners

Page 8

A use-case driven approach, focussed on delivery for the real world

Main architecture, technical implementation and primary capabilities driven by a set of prioritised research questions

Based on the main research questions, define prioritised data sources

Develop three Exemplars to demonstrate the capabilities of the Open PHACTS System and to define interfaces and input/output standards

Page 9

Work Streams

Build: Service layer and resource integration

Drive: Development of exemplar work packages & Applications

Sustain: Community engagement and long-term sustainability

[Diagram labels: Assertion & Meta Data Mgmt; Transform / Translate; Integrator; OPS Service Layer; Corpus 1; ‘Consumer’ Firewall; Supplier Firewall; Db 2; Db 3; Db 4; Corpus 5; Std Public Vocabularies; Target Dossier; Compound Dossier; Pharmacological Networks; Business Rules]

Work Stream 1: Open Pharmacological Space (OPS) Service Layer

Standardised software layer to allow public DD resource integration

– Define standards and construct OPS service layer

– Develop interface (API) for data access, integration and analysis

– Develop secure access models

Existing Drug Discovery (DD) Resource Integration

Work Stream 2: Exemplar Drug Discovery Informatics tools

Develop exemplar services to test OPS Service Layer

Target Dossier (Data Integration)

Pharmacological Network Navigator (Data Visualisation)

Compound Dossier (Data Analysis)

Page 10

Platform Explorer

Standards

Apps

API

Page 11

Prioritised research questions

Number  Sum  Nr of 1  Question
------  ---  -------  --------
15      12   9        All oxidoreductase inhibitors active <100nM in both human and mouse
18      14   8        Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?
24      13   8        Given a target find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives.
32      13   8        For a given interaction profile, give me compounds similar to it.
37      13   8        The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X.
38      13   8        Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not).
41      13   8        A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? i.e. return all cmpds active in assays where the resolution is at least at the level of the target family (i.e. PKC) both from structured assay databases and the literature.
44      13   8        Give me all active compounds on a given target with the relevant assay data
46      13   8        Give me the compound(s) which hit most specifically the multiple targets in a given pathway (disease)
59      14   8        Identify all known protein-protein interaction inhibitors

Kamal Azzaoui et al, DDT in press 2013
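
To make the first question concrete, the sketch below finds compounds that are active below 100 nM in both human and mouse. It runs over a small in-memory list rather than the integrated sources. The `Activity` record, its field names and every data value are invented for the example and are not part of the Open PHACTS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """Hypothetical flattened bioactivity record."""
    compound: str
    target_class: str
    species: str
    value_nm: float


def active_in_all_species(records, target_class, species, threshold_nm=100.0):
    """Return compounds with a sub-threshold activity in every listed species."""
    if not species:
        return []
    hits = {s: set() for s in species}
    for r in records:
        if (r.target_class == target_class
                and r.species in hits
                and r.value_nm < threshold_nm):
            hits[r.species].add(r.compound)
    # A compound qualifies only if it appears in the hit set of each species.
    return sorted(set.intersection(*hits.values()))


# Made-up records, for illustration only.
records = [
    Activity("cmpd_1", "oxidoreductase", "human", 12.0),
    Activity("cmpd_1", "oxidoreductase", "mouse", 85.0),
    Activity("cmpd_2", "oxidoreductase", "human", 40.0),
    Activity("cmpd_2", "oxidoreductase", "mouse", 450.0),
    Activity("cmpd_3", "kinase", "human", 5.0),
    Activity("cmpd_3", "kinase", "mouse", 7.0),
]
result = active_in_all_species(records, "oxidoreductase", ["human", "mouse"])
print(result)
```

The hard part in practice is getting records into this uniform shape across sources, which is the integration problem the platform sets out to address.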

Page 12

Pathways

Pharmacological Activities

Biological Processes

Transcripts

Pathological Processes

Diseases

Genes

Proteins

Interactions

Clinical Drug

Applications

Indications

Drugs

Compounds

Chemicals

Page 13

Open PHACTS will be built upon semantic technologies and standards, providing an opportunity to:

• Demonstrate that semantic technologies can perform to the same degree as existing systems

• Provide an open platform to address common drug discovery questions; expose pharma’s use-cases and knowledge

• Create a pre-competitive infrastructure that can be sustained and expanded into new areas; providing the platform for future collaboration

Why Semantic Technologies?

• Rapidly developing technology, powerful algorithms for integration and querying of data

• “schema free”

• Open standards – facilitating sharing public, private, commercial

• A community of developers, leverage work going on elsewhere

Page 14

Key architecture components

User Interfaces & Applications

Linked Data API

Linked Data Cache

Identity Mapping Service

Identity Resolution Service

Domain Specific Services

Data

Page 15

[Architecture diagram, Oct. 2012. Labels:]

– Open PHACTS Explorer; 1st Gen Apps; Partner Apps; App Framework

– Linked Data API (RDF/XML, TTL, JSON)

– Semantic Workflow Engine (LARKC)

– Data Cache (Triple Store)

– Domain Specific Services

– Identity Resolution Service (ConceptWiki)

– Chemistry Normalisation & Q/C; ChemSpider

– Identifier Management Service (BridgeDb+)

– Data Import

– Core Platform

– Example identifiers: P12374; EC2.43.4; CS4532; “Adenosine receptor 2a”

– Source labels: Db, Nanopub and VoID (repeated); Public Content; Commercial; Public Ontologies; User Annotations

Page 16

Building Quality

High quality chemical names and synonyms. Leverage ChemSpider and ConceptWiki curation, Q/C and mapping

ChemSpider Validation and Standardization Platform (CVSP) for flagging chemical representation issues

Basic curation interface for editing concept terms available through ConceptWiki

Data quality issues detected in data sources reported back to depositors for their evaluation

Page 17

Quantitative Data Challenges

STANDARD_TYPE     UNIT_COUNT
----------------  ----------
AC50              7
Activity          421
EC50              39
IC50              46
ID50              42
Ki                23
Log IC50          4
Log Ki            7
Potency           11
log IC50          0

STANDARD_TYPE  STANDARD_UNITS   COUNT(*)
-------------  ---------------  --------
IC50           nM               829448
IC50           ug.mL-1          41000
IC50                            38521
IC50           ug/ml            2038
IC50           ug ml-1          509
IC50           mg kg-1          295
IC50           molar ratio      178
IC50           ug               117
IC50           %                113
IC50           uM well-1        52
IC50           p.p.m.           51
IC50           ppm              36
IC50           uM-1             25
IC50           nM kg-1          25
IC50           milliequivalent  22
IC50           kJ m-2           20

~ 100 units

>5000 types

Implemented using the Quantities, Units, Dimensions and Types ontology (QUDT, http://www.qudt.org/)
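
The unit problem shown in these tables can be illustrated with a small normaliser that folds several spellings of the same unit into one key and converts to nM where possible. This is a sketch under stated assumptions, not the QUDT-based implementation: the synonym table covers only a few of the spellings listed above, the function name is hypothetical, and mass concentrations can only be converted when a molecular weight is supplied.

```python
# Hypothetical synonym table: unit spellings seen in the data -> canonical key.
UNIT_SYNONYMS = {
    "nM": "nM",
    "uM": "uM",
    "ug.mL-1": "ug/mL",
    "ug/ml": "ug/mL",
    "ug ml-1": "ug/mL",
}

# Conversion factors to nM for molar units.
MOLAR_TO_NM = {"nM": 1.0, "uM": 1_000.0}


def to_nanomolar(value, unit, mol_weight=None):
    """Convert a concentration to nM; return None when no conversion is possible."""
    canonical = UNIT_SYNONYMS.get(unit.strip())
    if canonical in MOLAR_TO_NM:
        return value * MOLAR_TO_NM[canonical]
    if canonical == "ug/mL" and mol_weight:
        # ug/mL equals mg/L; mg/L divided by g/mol gives mmol/L; 1 mmol/L = 1e6 nM.
        return value / mol_weight * 1e6
    # Units such as "kJ m-2" or "%" are not concentrations and are left unconverted.
    return None


print(to_nanomolar(2.5, "uM"))
print(to_nanomolar(1.0, "ug/ml", mol_weight=500.0))
print(to_nanomolar(20.0, "kJ m-2"))
```

Returning `None` rather than guessing keeps unconvertible rows visible, so they can be reported back to the data depositor.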

Page 18

Chemistry within Open PHACTS

The challenges associated with handling chemistry data require the support of a publicly accessible platform to integrate, standardise and host the data.

ChemSpider, an online database from the Royal Society of Chemistry, hosts the chemical compound collection underpinning Open PHACTS and is responsible for standardising the chemical compounds and providing both regular updates and ongoing data curation.

To serve the Open PHACTS platform, a structure validation and standardisation platform (CVSP) has been developed to ensure chemical structures are normalised to rules derived from the FDA structure standardisation guidelines and modified based on input from the EFPIA members.

Page 19

The many challenges of chemistry representation…

Page 20

Identities within Open PHACTS

Open PHACTS integrates information from multiple different databases, many of which use unique identifiers. The Identity Mapping Service (IMS) ensures these identifiers are linked and available for use interchangeably throughout the Open PHACTS platform.

To maintain vocabulary heterogeneity and provide interoperability, the ConceptWiki is used. The ConceptWiki is an open access system that accepts essentially unlimited numbers of synonyms, in multiple languages, and then maps all the terms correctly back to one unique concept identifier, alleviating vocabulary problems and identifier differences.

Synonyms: Aspirin; Dispril; 2-Acetoxybenzoic acid; Acetyl salicylic acid; Salicylic acid, acetyl-

Identifiers (diagram labels: Explorer, IMS): ChemSpider ID: 2157; FDA: 16030; ChEBI ID: CHEBI:15365; DrugBank ID: APRD00264
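
The idea of resolving many identifiers and synonyms to one concept can be sketched with a small in-memory map. This is a toy model, not the IMS or ConceptWiki API: the `IdentityMap` class, its method names and the `concept:aspirin` key are hypothetical, while the identifiers and synonyms are the ones shown on this slide.

```python
class IdentityMap:
    """Toy identity map: every identifier or synonym resolves to one concept key."""

    def __init__(self):
        self._concept_of = {}  # (source, identifier) -> concept key
        self._members = {}     # concept key -> set of (source, identifier)

    def link(self, concept, source, identifier):
        """Register a database identifier as belonging to a concept."""
        self._concept_of[(source, identifier)] = concept
        self._members.setdefault(concept, set()).add((source, identifier))

    def add_synonym(self, concept, name):
        """Register a name; names are matched case-insensitively."""
        self._concept_of[("name", name.lower())] = concept

    def resolve(self, source, identifier):
        """Return the concept key for an identifier or name, or None if unknown."""
        key = (source, identifier.lower() if source == "name" else identifier)
        return self._concept_of.get(key)

    def equivalents(self, source, identifier):
        """Return all database identifiers linked to the same concept."""
        concept = self.resolve(source, identifier)
        return sorted(self._members.get(concept, ()))


ims = IdentityMap()
concept = "concept:aspirin"  # hypothetical concept key
ims.link(concept, "ChemSpider", "2157")
ims.link(concept, "ChEBI", "CHEBI:15365")
ims.link(concept, "DrugBank", "APRD00264")
for name in ("Aspirin", "Dispril", "2-Acetoxybenzoic acid", "Acetyl salicylic acid"):
    ims.add_synonym(concept, name)

print(ims.resolve("name", "aspirin"))
print(ims.equivalents("ChEBI", "CHEBI:15365"))
```

Keeping names and database identifiers in separate namespaces avoids accidental clashes between a free-text synonym and a numeric ID.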

Page 21

Why Provenance Matters

Using a community specification known as “VoID” (Vocabulary of Interlinked Datasets)

Record version, author, derivations

Builds trust with users – know what you are querying (and why it might have changed)

Provides mechanism to provide usage statistics back to providers, help them understand the value

Easier to track errors and ensure quality

Actively participating in community provenance programme (W3C)
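
A dataset description recording version, author and derivation might look like the Turtle text produced below. This is a minimal sketch, not the descriptors used by the project: the choice of `dcterms`, `pav` and `prov` terms alongside `void:Dataset` is an assumption, the function name is hypothetical, and all URIs are `example.org` placeholders.

```python
def void_description(dataset_uri, title, version, creator_uri, derived_from_uri):
    """Render a minimal VoID dataset description as Turtle text.

    No escaping is applied to `title` or `version`, so this is for
    illustration only.
    """
    lines = [
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "@prefix pav: <http://purl.org/pav/> .",
        "@prefix prov: <http://www.w3.org/ns/prov#> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f'    dcterms:title "{title}" ;',
        f'    pav:version "{version}" ;',                     # version
        f"    dcterms:creator <{creator_uri}> ;",             # author
        f"    prov:wasDerivedFrom <{derived_from_uri}> .",    # derivation
    ]
    return "\n".join(lines)


# Placeholder values, for illustration only.
turtle = void_description(
    dataset_uri="http://example.org/datasets/compounds-rdf",
    title="Example compound dataset",
    version="13",
    creator_uri="http://example.org/people/curator",
    derived_from_uri="http://example.org/datasets/compounds-source",
)
print(turtle)
```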

Page 22

What does Open PHACTS do?

Open PHACTS draws together multiple sources of publicly-available pharmacological and chemical data, allowing public access to the information via the Open PHACTS Explorer, an intuitive interface.

Currently integrated databases

Database               Number of triples (million)
---------------------  ---------------------------
ACD Labs / ChemSpider  161.34
ChEBI                  0.91
ChEMBL_v13             146.08
ConceptWiki            3.74
DrugBank               0.52
Enzyme                 0.07
Gene Ontology          0.85
SwissProt              156.57
WikiPathways           0.14
TOTAL                  470.21

Page 23

Licensing: 3 “public” databases

Comparative Toxicogenomics Database

OMIM

DrugBank

All are available as “open” RDF you can download right now. But:

Page 24

“CUTTING THE GORDIAN KNOT”

What are the problems with licensing we had to address?

– To make the data and software generated by the project usable and reusable

– Multiplicity of unclear or non-standard licenses on original data sources

• ‘Public’ can mean use but not redistribute, use in commercial environment,

• Legal position on use and reuse extremely unclear

• Different issues than just linking to data

– What is the legal status of integrated collections of the above, and of derived knowledge from such a collection?

– Appropriate software license selection

– Legal clarity for EFPIA and end users

– Approaches for commercial data integration, EFPIA in-house data

AIM: to enable maximum possible dissemination and usability of the integrated data and architecture generated by the project, with approaches that will be applicable in other data integration projects

Page 25

Data Licensing Solution

Chose John Wilbanks as consultant

A framework built around STANDARD well-understood Creative Commons licences – and how they interoperate

Deal with the problems by:

– Interoperable licences

– Appropriate terms

– Declare expectations to users and data publishers

One size won’t fit all requirements

Page 26

Open PHACTS and the scientific community

Consortium: 28 current members

Development partnerships

– Influence on API developments

– Opportunities to demo ideas & use cases to core team

– Need MoU and annexe

Associated partners

– Support, information

– Exchange of ideas, data, technology

– Opportunities to demo at community webinars

– Need MoU

[Diagram labels: Associated partners; Development partnerships; Consortium; MoU + Annexe]

Page 27

Example applications

Advanced analytics

– ChemBioNavigator: Navigating at the interface of chemical and biological data with sorting and plotting options

– TargetDossier: Interconnecting Open PHACTS with multiple target centric services. Exploring target similarity using diverse criteria

– PharmaTrek: Interactive polypharmacology space of experimental annotations

– UTOPIA: Semantic enrichment of scientific PDFs

Predictions

– GARFIELD: Prediction of target pharmacology based on the Similarity Ensemble Approach

– eTOX collector: Automatic extraction of data for building predictive toxicology models in eTOX project

Page 28

ChemBioNavigator

Matthias Rarey et al

PharmaTrek

Jordi Mestres et al

Page 29

Call for expressions of interest

Open PHACTS ENSO proposal

Open PHACTS intends to submit a proposal for IMI ENSO funding.

We are currently drafting our ENSO proposal and invite all EFPIA companies with an interest in Open PHACTS to contact us to discuss opportunities for involvement.

The Open PHACTS Foundation

Open PHACTS has a successor organisation, the Open PHACTS Foundation.

Please register your interest with us for further information on membership and other opportunities to get involved within Open PHACTS.

For more information and/or to register interest email us at

[email protected]

Page 30

Acknowledgements

Stefan Senger

Gerhard Ecker

The Open PHACTS consortium

Page 31

Page 32

Data: Targets; Chemistry; Pharmacology; Literature; Patents

Standards: Ontology/taxonomy; Minimum information guide; Dictionaries; Interchange mapping

Assertions: e.g. Gene-to-Disease; Compound-to-Target; Compound-to-ADR

Application (Knowledge): Fact Visualisation e.g. Target Dossiers; SAR Visualisation

SERVICES

After Barnes et al., Nature Reviews Drug Discovery 2009, doi:10.1038/nrd2944

Page 33

Nanopublications – Capturing scientific information in the Triple Store