Low Hanging Fruit Breakout Discussion #2


Description

Four projects (compound risk dossier, text mining, screening data management, and support for cloud collaboration) were outlined during a breakout discussion led by Paul Bradley and Barry Hardy at the Pistoia Alliance Information Ecosystem Workshop in October 2011.

Transcript of Low Hanging Fruit Breakout Discussion #2

Page 1: Low Hanging Fruit Breakout Discussion #2

Compound Risk Dossier

Objectives

Improved toxicological prediction demands the best integrated view of current and historic data, both proprietary and public domain. The objective of the compound risk dossier (CRD) would be to create a service that is able to gather and integrate risk/safety-related information for a compound (including consideration of similar structures, key moieties, metabolites, toxicology MoA, etc.). The harvested information would then be integrated and presented to the user in the form of a “safety profile”.

Business Case

It is envisaged that the CRD could bring the following business benefits:

The system would enable an efficient “background check” for NCEs based on structural or biological similarity, or possibly shared pharmacology, toxicology MoAs or adverse event effects, i.e. what is known about molecules similar to my candidate?
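As a flavour of what such a similarity-based background check might involve, the minimal sketch below compares a candidate against a small reference set using Morgan fingerprints and Tanimoto similarity. RDKit is used here purely as one convenient open-source toolkit (the notes list CDK, among others, as options); the SMILES strings, reference names and the 0.7 cut-off are invented for illustration.

# Minimal sketch: structural "background check" via Tanimoto similarity.
# RDKit used for illustration; SMILES, names and threshold are hypothetical.
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

candidate = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # hypothetical NCE
reference_library = {
    "reference_compound_1": "CC(=O)Nc1ccc(O)cc1",
    "reference_compound_2": "c1ccccc1",
}

def fingerprint(mol):
    # Morgan (circular) fingerprint, radius 2, 2048 bits
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

cand_fp = fingerprint(candidate)
for name, smiles in reference_library.items():
    ref_fp = fingerprint(Chem.MolFromSmiles(smiles))
    sim = DataStructs.TanimotoSimilarity(cand_fp, ref_fp)
    if sim >= 0.7:  # illustrative cut-off for "similar enough to flag"
        print(f"Flag: candidate resembles {name} (Tanimoto = {sim:.2f})")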

Creation of a safety profile, in which safety categories are normalised and can be grouped according to public ontologies, provides a powerful method of aligning data and enables intelligent analysis.

Pharma companies duplicate effort in aligning internal, vendor and public data; such a CRD service would reduce the time each organisation spends on these common activities to almost zero. At present the work can be costly, time consuming, tedious, and error prone.

Open Standards

Open vocabularies, ontologies, e.g. PubChem, ChemIDplus, WHO INN, OBO, OpenTox, ChEBI, …

Safety data sources: AERS, drug labels, regulatory documents, etc.

Open source methods (QSAR, CDK, Weka, R, OpenTox, …)

Open APIs (e.g., extend and test OpenTox API 1.2, http://www.opentox.org/dev/apis/api-1.2, for data integration into a common RDF resource)
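As a rough illustration of the API-driven integration step, the sketch below pulls an RDF description of a compound from an OpenTox-style REST service and merges it into a local graph. The base URL and compound identifier are placeholders, not real endpoints; the point is only that such resources can be retrieved as RDF and accumulated into a common store.

# Sketch: harvest compound RDF from an OpenTox-style REST service into one graph.
# The base URL and compound id below are placeholders for illustration only.
import requests
from rdflib import Graph

common_graph = Graph()  # the shared RDF resource being assembled

def harvest_compound(base_url, compound_id):
    # Fetch one compound resource as RDF/XML and merge it into the common graph
    url = f"{base_url}/compound/{compound_id}"  # hypothetical resource path
    resp = requests.get(url, headers={"Accept": "application/rdf+xml"}, timeout=30)
    resp.raise_for_status()
    common_graph.parse(data=resp.text, format="xml")

harvest_compound("https://example-opentox-service.org", "12345")
print(f"Common RDF resource now holds {len(common_graph)} triples")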

Implementation

It is suggested that a limited set of public domain data sources is selected in the first instance, to allow a proof of concept within 12 months.

Identify vocabulary and ontology sources for compounds, pathologies, etc. (See Toxicology Ontology Roadmap, Hardy, B. et al., from the OpenTox-EBI Industry Forum workshop, in press.)

Identify data sources from which to harvest risk-related information. Opt for a handful of structured sources rather than free text (NDAs, etc.) in the first instance?

Compound safety data sources, both public and private, are mined for risk-related content which is harmonised and organised using public domain ontologies (and held as an RDF triple store?)

Text mining and other semantic technologies will be necessary at this stage.

This data store can be called on by APIs or provide information that can be consumed by analysis tools, ELNs, etc. (a minimal sketch of such a store follows below).

Decide on quality metrics – on-the-fly profiles vs. curated, pre-canned data, accuracy vs. recall.

Other things to consider include provenance, governance, security, legal, etc.
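The sketch below is one minimal reading of the harmonised safety store: findings are expressed as RDF statements whose subjects and predicates point at public ontology terms, and a safety profile is a query over the store. The project namespace, predicate names and the example finding are invented for illustration; only the ChEBI URI pattern is real.

# Minimal sketch of a harmonised safety triple store: findings as RDF statements
# linked to public ontology terms. Namespace, predicates and the finding are invented.
from rdflib import Graph, Namespace, Literal, URIRef

CRD = Namespace("http://example.org/crd/")            # hypothetical project namespace
CHEBI = Namespace("http://purl.obolibrary.org/obo/CHEBI_")

g = Graph()
compound = URIRef(CHEBI + "15365")                    # a ChEBI compound URI
g.add((compound, CRD.hasSafetyFinding, CRD.finding_001))
g.add((CRD.finding_001, CRD.findingType, Literal("hepatotoxicity")))
g.add((CRD.finding_001, CRD.source, Literal("AERS")))

# Safety-profile query: everything asserted about one compound, with its source.
results = g.query(
    """
    SELECT ?ftype ?source WHERE {
        ?compound <http://example.org/crd/hasSafetyFinding> ?f .
        ?f <http://example.org/crd/findingType> ?ftype ;
           <http://example.org/crd/source> ?source .
    }
    """,
    initBindings={"compound": compound},
)
for ftype, source in results:
    print(f"{ftype} (source: {source})")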

Pistoia Alliance Role

Definition of Use Case

Guidance on best safety-related data sources

Guidance on open standards to use, and their extensions needed

Provide partners willing to integrate public, vendor and proprietary data

Funding of early phase POCs

Page 2: Low Hanging Fruit Breakout Discussion #2

Text Mining / Metadata Markup of Unstructured Text

Objectives

Unstructured text sources, both public and proprietary, are rich in information but several features limit their use in analysis, such as:

No mark-up of key concepts – important terms such as drug and target names are buried within free text with no simple mechanism to surface this information

Linguistic diversity – widespread use of synonyms and ad hoc identifiers makes it difficult to carry out semantic searching of free text sources.

The objective is to carry out text mining and concept tagging of unstructured text to provide a metadata layer over documents. By linking the metadata to public ontologies, a semantically consistent set of tags will be produced, allowing document sources to be queried and clustered according to recognised standards. This resource could then be made available using a cloud model to deliver value and standard search capabilities to Pharma and Academics alike, with appropriate consumption models.

Business Case

The mark-up and mapping of key terms from unstructured text would bring the following benefits:
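To make the idea concrete, the minimal sketch below tags drug and target names in free text using a tiny hand-made dictionary that maps surface forms (including synonyms) to ontology-style identifiers, and emits the annotations as a metadata layer. The dictionary entries and the target identifier are invented; a real implementation would use the text-mining tools and public vocabularies discussed below.

# Sketch: dictionary-based concept tagging of free text, producing a metadata layer.
# The vocabulary and the EXAMPLE_TARGET identifier are invented placeholders.
import re

vocabulary = {
    # surface form (including synonyms) -> ontology-style identifier
    "acetylsalicylic acid": "CHEBI:15365",
    "aspirin": "CHEBI:15365",
    "cyclooxygenase-1": "EXAMPLE_TARGET:0001",
    "COX-1": "EXAMPLE_TARGET:0001",
}

def tag_text(text):
    # Return a sorted list of (start, end, surface form, concept id) annotations
    annotations = []
    for form, concept_id in vocabulary.items():
        for match in re.finditer(re.escape(form), text, flags=re.IGNORECASE):
            annotations.append((match.start(), match.end(), match.group(), concept_id))
    return sorted(annotations)

doc = "Aspirin inhibits COX-1; acetylsalicylic acid is its systematic name."
for start, end, form, concept in tag_text(doc):
    print(f"[{start}-{end}] '{form}' -> {concept}")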

Enhanced search and document retrieval over free text sources

Linking of in-house structured data sources to unstructured information, in-house and in the public domain

Repurpose unstructured text to produce actionable intelligence, for example by creating assertional metadata

Drive towards a common standard for searching, or at least a common “honest broker” for search across different resources.

Open Standards

It is suggested that, in order to achieve a working implementation within a 12 month time frame, a limited set of open standards is applied in the first instance. This could be discussed more widely within the Pistoia Alliance, but the following areas are worthy of consideration:

Limiting by domain, e.g. protein targets, drug terms, gene names, pathology

Limit to a single standard that covers multiple domains, e.g. SNOMED-CT, ICD9CM

Implementation

Select public domain free text source, e.g. Medline

Identify public ontologies and vocabulary sources

Use text mining/concept recognition tools to identify key concepts and map to standards: Autonomy, Metawise (BioWisdom), Helium (Ceiba), etc.

Platform for search/display – Lucene, other open source (a toy concept-index sketch follows below)
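A toy version of the search side, building on annotations like those above: documents are indexed by the concept identifiers found in them rather than by raw strings, so one query retrieves documents regardless of which synonym they use. The in-memory index stands in for a real platform such as Lucene; the document identifiers and concept ids are invented.

# Toy concept index: documents retrievable by ontology identifier rather than by
# surface string, as a stand-in for a Lucene-style platform. Data is invented.
from collections import defaultdict

concept_index = defaultdict(set)  # concept id -> set of document ids

def index_document(doc_id, concept_ids):
    # Record which concepts were tagged in a document
    for concept_id in concept_ids:
        concept_index[concept_id].add(doc_id)

# Two documents using different synonyms still share the same concept id.
index_document("DOC-0001", {"CHEBI:15365", "EXAMPLE_TARGET:0001"})
index_document("DOC-0002", {"CHEBI:15365"})

print(sorted(concept_index["CHEBI:15365"]))  # both documents are retrieved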

Pistoia Alliance Role

Collaborate to define Use Case

Agree on document sources

Agree on open standards to use, extensions needed

Advise on best practice on document mark-up, search, analysis, governance, security, etc.

Funding of early phase POCs to aid the development of the tools and a drive towards standards.

Support for a free/reduced cost academic access mechanism to encourage common methods of tagging and naming in the academic environment.

Page 3: Low Hanging Fruit Breakout Discussion #2

Improved Collaboration: Management of Screening Data

Objectives

To integrate screening data from multiple sources

To create a standard for expression of screening data, to allow easier integration

Business Case

Definition of a standard for reporting compound screening data allows easier integration, with cost and time savings

Facilitates easier sharing of data and collaboration

Open Standards

MIABE, MIAME

ISA-TAB

Define standard for dose response for HTS, HCS; include vocabulary, units; support multiple plate formats, standardised statistical analysis (a minimal record sketch follows below)

Define how to deal with incomplete data sets, null values, etc.
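To give a sense of what such a standard record might capture, the sketch below defines a minimal dose-response data point carrying explicit units, plate format, and a nullable response value for incomplete data. The field names and example values are invented for illustration and are not a proposed schema.

# Sketch of a minimal dose-response record with explicit units, plate format and a
# nullable response value; field names are illustrative, not a proposed schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DoseResponsePoint:
    compound_id: str
    assay_id: str
    plate_format: int          # e.g. 96, 384, 1536 wells
    concentration: float
    concentration_unit: str    # controlled vocabulary, e.g. "uM"
    response: Optional[float]  # None marks a missing or invalid measurement
    response_unit: str         # e.g. "% inhibition"

points = [
    DoseResponsePoint("CMPD-001", "ASSAY-HTS-01", 384, 10.0, "uM", 87.2, "% inhibition"),
    DoseResponsePoint("CMPD-001", "ASSAY-HTS-01", 384, 1.0, "uM", None, "% inhibition"),
]
usable = [p for p in points if p.response is not None]
print(f"{len(usable)} of {len(points)} points usable for curve fitting")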

Implementation

Create the standard, learning from existing standards such as MIAME

Apply the standard in a working project

Iterate and refine

Pistoia Alliance Role

Guidance on definition of the standard

Survey what has already been done in the area

Page 4: Low Hanging Fruit Breakout Discussion #2

Enabling Better Collaboration in the Cloud, Applied to Monitoring of NGS Data

Objectives

To provide scientific, business and legal processes outlining best practices for organisations collaborating in the cloud.

Application of these best practices in a system for monitoring the progress of NGS projects.

Business Case

Time and cost savings in deciding whether a collaborative project should be carried out in the cloud.

Streamline implementation of cloud-based collaborations by providing clear guidelines.

Reduces delays in handovers.

Greater visibility of distributed project statuses across different organisations.

Early visibility, alerting of important events, allowing timely interventions.

Open Standards

Clear APIs and communication standards.

Define web services and service discovery mechanisms.

UDDI (Universal Description, Discovery and Integration).

MIAME?

Implementation

Outline best practice rules for working on the cloud

What is the use case? E.g. an alternative to an internally-hosted system, a method of distributing large queries, etc.

What are the requirements for flexibility, such as how long is the service required for and will capacity requirements change over time? What is the tie-in period?

Need clear APIs and communication standards.

Location – does data need to be held within certain boundaries, e.g. within the EU?

What level of encryption is required?

Create standard format for NGS data, consumable by analysis software, e.g. Spotfire (a minimal status/export sketch follows below).
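As one concrete reading of the monitoring and data-format points above, the sketch below represents NGS project status as simple records, flags stalled projects for alerting, and exports a tab-delimited summary that generic analysis tools (such as Spotfire) can consume. The project names, stages, seven-day threshold and file name are all invented for illustration.

# Sketch: NGS project status records, simple alerting, and a tab-delimited export
# consumable by generic analysis tools. Names, stages and thresholds are invented.
import csv
from dataclasses import dataclass

@dataclass
class NgsProjectStatus:
    project: str
    organisation: str
    stage: str            # e.g. "sequencing", "alignment", "variant calling"
    percent_complete: int
    days_since_update: int

statuses = [
    NgsProjectStatus("PROJ-A", "Partner-1", "sequencing", 40, 2),
    NgsProjectStatus("PROJ-B", "Partner-2", "variant calling", 95, 12),
]

# Alert on stalled projects so handovers are not delayed.
for s in statuses:
    if s.days_since_update > 7:  # illustrative threshold
        print(f"ALERT: {s.project} ({s.organisation}) not updated for {s.days_since_update} days")

# Tab-delimited export for downstream visualisation.
with open("ngs_status.tsv", "w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerow(["project", "organisation", "stage", "percent_complete", "days_since_update"])
    for s in statuses:
        writer.writerow([s.project, s.organisation, s.stage, s.percent_complete, s.days_since_update])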

Pistoia Alliance Role

Signposting best practice in the cloud.

Advise on standard representation of NGS data.