Scaling the walls of discovery: using semantic metadata for integrative problem solving

Post on 11-Jan-2016


Lilly Singapore Centre for Drug Discovery

LSCDD

Greg Tucker-Kellogg, Ph.D.
Chief Technology Officer
Senior Director, Systems Biology
Lilly Singapore Centre for Drug Discovery

Outline

The Challenge of Translational Discovery in Pharmaceutical Research

Integration of Metadata using Semantic Web Technologies
• Why focus on metadata?
• How it helps

Examples

Lilly Singapore Centre for Drug Discovery

Integrative Computational Sciences (tools)

Wet lab biology

Drug Discovery (drug candidates)

Oncology and diabetes research towards tailored therapy to improve patient outcome

Systems Biology (biomarkers)

Experimental Computational

Pharmaceutical R&D spends more to get less

Lost in translation

The limits of my language mean the limits of my world (Ludwig Wittgenstein)

我的语言限制的范围是我的

Translate

I limit the scope of the language I (Ludwig Wittgenstein)

Translate

Translational research in cancer: Connecting the dots of genetic aberrations

Targets Disease Patients

Pathways

Tailored Therapeutics

Improve individual patient outcomes and health outcome predictability through tailoring drug, dose, timing of treatment, and relevant information

The “Web” of heterogeneous data

Cell/Assay Technologies

Integrating Scientific Data Sets

Uncontrollable diversity

Most of the valuable data is from outside our walls

Much of it is poorly structured

Ranging from large (1TB/day) to boutique

Scientist’s View of Integrated Information

Protein
- IHC
- Luminex

Omics

RNAi reagents
- Qiagen siRNA
- BROAD shRNA
- cDNA

Acumen assays

Cellomics assays

High-content bioassays

Biochemical data

Chemical biology

Functional chemogenetics

Target-based chemotype profiling

Mapping and annotation backbone

Interrogators Reporters

Pathway-based chemotype profiling

Strategic

Cross-domain integration

Domain-level integration

Foundational

Color code

Plate Reader

DNA
- CGH
- SNP, Mutation

RNA
- miRNA
- mRNA

Epigenetics
- Methylation
- ChIP-Chip

Platforms

Manual Data Integration

A repeated, tedious process:
• Pull data from internal and public data sets
• Normalize terms and values
• Write and run analysis scripts
• Compile into a single Excel file, detached from the data source (no drill-down)

Often this process can consume days with no guaranteed resolution
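To make the "Normalize terms and values" step concrete, here is a minimal Python sketch of what a scientist typically scripts by hand. It is an illustration only: the synonym table, the record layout, and the helper names `normalize_term` and `merge_sources` are hypothetical and are not taken from the LSCDD pipeline.

```python
# Hypothetical synonym table: maps source-specific spellings to one canonical term.
SYNONYMS = {
    "crc": "colorectal neoplasms",
    "colorectal cancer": "colorectal neoplasms",
    "colon cancer": "colorectal neoplasms",
}


def normalize_term(raw: str) -> str:
    """Lower-case, collapse whitespace, then apply the synonym table."""
    key = " ".join(raw.lower().split())
    return SYNONYMS.get(key, key)


def merge_sources(*sources):
    """Group records from several sources under their normalized disease term."""
    merged = {}
    for source_name, records in sources:
        for record in records:
            term = normalize_term(record["disease"])
            merged.setdefault(term, []).append((source_name, record["sample"]))
    return merged


# Two hypothetical sources that spell the same disease differently.
internal = ("internal", [{"disease": "CRC", "sample": "S1"}])
public = ("public", [{"disease": "Colorectal  Cancer", "sample": "P7"}])
merged = merge_sources(internal, public)
```

Every new source needs its own additions to the synonym table, which is one reason the manual process scales poorly.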

Integration Approaches Considered

• Data Warehouse
  • Difficult to maintain and integrate new data sets
  • Difficult to evolve as data changes
  • Schemas tightly coupled to applications

• Federated queries
  • Query performance issues
  • Where to place the index?
  • Problematic to maintain
  • Translating user search syntax to all sources requires deep knowledge of data layer

• Semantic Integration
  • Relatively unproven in enterprise systems but adaptive to change
  • Relationships between data can be more fully characterized

Standard Semantic Integration Model

Query Generator

Results Presentation

Semantic Normalization

Source

Source

Source

Source

Domain Ontology & Mappings

Data Set Integration

Query Planning

Query Submission

• All data is mapped to the domain ontology in both directions
• If a single system is down, results are incomplete
• Performance is limited by the slowest system in the network
• Massive mapping effort
• Multiple implementations of this approach, including:
  • Biological and Chemical Integrated Information System (BACIIS)
  • Boeing

Can we do better for our purposes?

• Avoid a complex architecture and extended development effort
• Realize benefits in the near term
• Preprocess metadata to improve efficiency
• Characterize the type of questions that the ontology should answer
• Identify stable semantic technologies; do not employ parsers
• Allow semantic and relational databases to work together

What we need

Data Management and Availability
• Capturing and filtering the global and growing avalanche of internal and external scientific data

Data Fusion
• Systems to link, combine and navigate massive and heterogeneous data sets

Information Analysis and Mining
• Algorithms and tools to help scientists seek correlations and find connections between pre-clinical and clinical knowledge to generate and test translational hypotheses

Data Architecture

Expression (Affy, Agilent, Illumina)
aCGH
Screening
Methylation
SNP
Mutation
Tissue Microarray
ChIP-Chip, miRNA
Analysis Results

Domain/Platform Specific Data

Integration Layer

Experimental Metadata Repository

Annotation Services (Genomics mapping + Gene functional info)

Query Visualization

Analysis and Mining

Algorithms Workflow

Experiment Context

Readout

Mapping & Annotation

Derived Results

Genomics mapping

Proteome/GO

Functional Information

34 platforms

Ontology

Common Vocabulary

Centralized Experiment Context

30 million triples

LSCDD Data integration process in use

Query Visualization

Experimental Metadata Repository

Annotation Services (Genomics mapping + Gene Function)

Affy Expression
Agilent Expression
Illumina Expression
aCGH
Screening
RNAi Database
Mutation
SNP
TMA
Analysis Results

LSCDD Semantic Integration Approach

• Use semantic technology on an appropriate problem
• Create an ontology focused on solving LSCDD integration needs
  • Scientists and IT analysts work together to iteratively create a tailored vocabulary
  • Define competency questions to validate the ontology
  • Encourage the ontology to evolve; it is a different animal than RDBMS schemas
  • Create bridges to public and internal ontologies to realize the full capabilities of the vocabulary
• Involve users in verifying the RDBMS-to-ontology mapping to increase confidence in the solution
• SPARQL is hard. Design an intuitive query model or question templates for users to navigate the repository.
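One way to read "question templates" is a fixed query skeleton that the application fills in, so the user never writes SPARQL directly. The Python sketch below is a hedged illustration, not the LSCDD implementation: the `ex:` namespace IRI, the allowed property list, and the function names are assumptions, and the query uses standard SPARQL 1.1 functions that a given triple store may or may not support.

```python
# Hypothetical namespace; the real LSCDD ontology IRIs are not given in the slides.
PREFIX = "PREFIX ex: <http://example.org/lscdd#>"

# Query skeleton: find experiments whose property value contains some text.
TEMPLATE = """{prefix}
SELECT ?experiment WHERE {{
  ?experiment a ex:Experiment ;
              ex:{prop} ?value .
  FILTER (CONTAINS(LCASE(STR(?value)), "{needle}"))
}}"""

# Only properties from this (assumed) list may be placed into the template.
ALLOWED_PROPERTIES = {"hasType", "hasDescription", "hasContact"}


def escape_literal(text: str) -> str:
    """Escape backslashes, double quotes and newlines for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def experiments_where(prop: str, contains: str) -> str:
    """Fill the template for 'which experiments have <prop> containing <text>?'."""
    if prop not in ALLOWED_PROPERTIES:
        raise ValueError(f"unsupported property: {prop}")
    return TEMPLATE.format(
        prefix=PREFIX, prop=prop, needle=escape_literal(contains.lower())
    )


query = experiments_where("hasType", "siRNA screen")
```

Restricting the property to a known list and escaping the user text keeps the generated query well formed regardless of what is typed into the form.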

LSCDD Semantic Integration Approach (Cont)

• Used Agile philosophy throughout: application development, ontology development and mapping effort

• Drive adoption by engaging users to understand their challenges and refine the solution.

• Technologies
  • Protégé Ontology Editor
  • Oracle Semantic Technologies 11g
  • D2R Map (Database to RDF Mapping Language)
  • C# development in Visual Studio 2005

Metadata RDF Repository

• Aggregates experiment metadata from a diverse set of LSCDD relational databases into an Oracle Semantic Technologies repository for LSCDD scientific investigation.

• Scientists at LSCDD now have a single source of experiment information described with a common vocabulary.

• Current data sources include:
  • Expression Data: Affymetrix, Illumina, Agilent
  • aCGH Data
  • RNAi Screening Data
  • Reagent Data
  • Gene Ontology (GO)
  • Medical Subject Headings (MeSH)
  • Many others

Currently ~30 million triples
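The aggregation idea can be sketched in a few lines of Python: rows from differently shaped relational tables are rewritten as subject-predicate-object triples under one vocabulary, after which a single lookup spans all sources. This is a toy illustration under stated assumptions. The rows, column names, and per-source mappings are hypothetical; property names such as `hasCellline` and `hasChipType` are borrowed from the ontology slide, and plain tuples stand in for an RDF store.

```python
# Hypothetical rows as they might come from two relational sources.
expression_rows = [
    {"experiment_id": "abc123", "platform": "Affymetrix", "cell_line": "H460"},
]
screening_rows = [
    {"exp": "def456", "kind": "SiRNA Screen", "cellline": "H460"},
]

# Per-source column-to-property mappings (hypothetical); they play the role
# that a database-to-RDF mapping file would play in a real deployment.
EXPRESSION_MAP = {"platform": "hasChipType", "cell_line": "hasCellline"}
SCREENING_MAP = {"kind": "hasType", "cellline": "hasCellline"}


def rows_to_triples(rows, id_column, column_map):
    """Turn each row into (subject, predicate, object) triples."""
    triples = set()
    for row in rows:
        subject = f"Experiment/{row[id_column]}"
        triples.add((subject, "rdf:type", "Experiment"))
        for column, prop in column_map.items():
            if row.get(column) is not None:
                triples.add((subject, prop, row[column]))
    return triples


# One repository holding metadata from both sources.
repository = rows_to_triples(
    expression_rows, "experiment_id", EXPRESSION_MAP
) | rows_to_triples(screening_rows, "exp", SCREENING_MAP)


def subjects_with(triples, prop, value):
    """Return all subjects that have the given property value."""
    return sorted(s for s, p, o in triples if p == prop and o == value)


h460_experiments = subjects_with(repository, "hasCellline", "H460")
```

Because both sources now use the same property name for cell line, one query finds experiments from either database.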

LSCDD Metadata Ontology

Classes (from the ontology diagram): Experiment, Project, Study, Protocol, Assay, Sample, CellLine, Tissue, Model, DiseaseState, ClinicalData, Treatment, Plate, Plate Well, Chip, Chip Type, Probe, Gene, GeneList, Reagent, DNA Reagent, RNA Reagent, Protein Reagent, ViralBatch, Hardware, Software, GO, MESH

Relations (from the ontology diagram): hasProject, hasStudy, hasProtocol, hasAssay, hasSample, hasSource, hasSourceTissue, hasCellline, hasTissue, hasModel, hasDiseaseState, hasTreatment, hasCompound, hasPlate, hasPlateCompound, hasChip, hasChipType, hasGene, hasReagent, hasGOId, hasMESHId, IsPartOf, subclass

Metadata Repository Application

• Both browse and query views are provided for repository access.

• The Query View allows the user to search the repository by setting constraints on attributes of the entities in the ontology.

• Links to external data sets such as Gene Ontology and MeSH have been defined, so queries may span multiple ontologies.

• The Results View displays details about each of the matches found and allows the user to navigate across entities.

• The application is built as a plugin to the Lilly Science Grid and can leverage Integrated Genomics Portal for Cancer Research (IGPCR) plugins to provide details about genes in hit lists.

Metadata Repository Application

Find all deacetylases involved in Colorectal Neoplasms
- Add filter to Gene Ontology Label attribute
- Add filter to MeSH Description Name attribute
- Run Query…

Results View shows list of Genes

Navigate across data links
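The two filters in this example amount to intersecting two sets of genes: those annotated with a matching Gene Ontology label and those linked to a matching MeSH descriptor. The Python sketch below shows that logic over a toy in-memory triple list. All data are placeholders (the gene names, `GO:TOY*` and `MESH:TOY*` identifiers, and the `label` and `descriptorName` predicates are hypothetical), so the output says nothing about real biology.

```python
# Toy triples; every identifier here is a placeholder, not real annotation data.
TRIPLES = [
    ("GO:TOY1", "label", "deacetylase activity"),
    ("GO:TOY2", "label", "kinase activity"),
    ("MESH:TOY1", "descriptorName", "Colorectal Neoplasms"),
    ("MESH:TOY2", "descriptorName", "Breast Neoplasms"),
    ("GENE_A", "hasGOId", "GO:TOY1"),
    ("GENE_A", "hasMESHId", "MESH:TOY1"),
    ("GENE_B", "hasGOId", "GO:TOY1"),
    ("GENE_B", "hasMESHId", "MESH:TOY2"),
    ("GENE_C", "hasGOId", "GO:TOY2"),
    ("GENE_C", "hasMESHId", "MESH:TOY1"),
]


def find_genes(triples, go_label_contains, mesh_name):
    """Genes whose GO label contains the text AND whose MeSH name matches."""
    go_ids = {
        s for s, p, o in triples
        if p == "label" and go_label_contains.lower() in o.lower()
    }
    mesh_ids = {
        s for s, p, o in triples if p == "descriptorName" and o == mesh_name
    }
    go_genes = {s for s, p, o in triples if p == "hasGOId" and o in go_ids}
    mesh_genes = {s for s, p, o in triples if p == "hasMESHId" and o in mesh_ids}
    # Both filters must hold, so intersect the two gene sets.
    return sorted(go_genes & mesh_genes)


hits = find_genes(TRIPLES, "deacetylase", "Colorectal Neoplasms")
```

Only `GENE_A` satisfies both constraints in this toy data set; `GENE_B` fails the MeSH filter and `GENE_C` fails the GO filter.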

Experiment Data Annotation

While raw experiment results are not suitable for editing, metadata such as experiment descriptions and relations becomes more valuable when users augment and refine it.

Experiment
hasId: abc123
hasContact: Bill Smith
hasType: SiRNA Screen
hasDescription: ____

…

Experiment
hasId: def456
hasContact: Jane Smith
hasType: SiRNA Screen
hasDescription: H460 screen

H460 screen: run 789

hasConflictingResults
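A minimal Python sketch of this kind of annotation follows. It assumes one plausible reading of the slide: a user fills in the empty description of experiment abc123 and records a `hasConflictingResults` relation to experiment def456. The `Experiment` class and the helper names are hypothetical and do not describe the actual application.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """Editable metadata only; raw results are deliberately not modeled here."""
    exp_id: str
    contact: str
    exp_type: str
    description: str = ""
    relations: list = field(default_factory=list)  # (predicate, target id)


def annotate_description(experiment: Experiment, text: str) -> None:
    """Let a user supply or refine the free-text description."""
    experiment.description = text


def relate(source: Experiment, predicate: str, target: Experiment) -> None:
    """Record a user-asserted relation between two experiments."""
    source.relations.append((predicate, target.exp_id))


first = Experiment("abc123", "Bill Smith", "SiRNA Screen")
second = Experiment("def456", "Jane Smith", "SiRNA Screen", "H460 screen")

annotate_description(first, "H460 screen: run 789")
relate(first, "hasConflictingResults", second)
```

Keeping annotations in the metadata layer means the measured data stay untouched while the context around them keeps improving.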

IGPCR: Integrated Genomics Portal for Cancer Research

An integrated view for analysis results

Helps oncology researchers with:
• Drug target identification and prioritization
• Biomarker discovery
• Combination therapy

Backup

Answering scientific questions

Are there any reagents available to conduct functional validation?

Get me all the interactions for methylases that are involved in colorectal cancer. And for all these genes, get the expression and aCGH values for all colon cancer samples.

What is the status of the target of my interest across multiple tumor types?

What are the right model systems to study the perturbation of my gene of interest?

Cancer drug discovery

Integration of high throughput datasets

Chemosensitivity

Tumor Samples

Patient Survival

Cell lines

RNAi

Tissue Microarrays

Expression

CGH / SKY

Public / Private

Mutations

Going Forward

• Integration with additional external sources: NCBI, KEGG, Proteome, PubMed
• Integration with National Cancer Institute Metathesaurus
• Continued integration with new data types generated internally or from collaborators
• Definition and support of additional ontologies

Integrated Augmented Query Results

SNOMED

PubMed

NCI Metathesaurus

Stanford Tissue Microarray

Web Resources

Lilly Data

Labs

Internal Data

Public Data

Collaborators

Analysis Pipelines Visualizers

Acknowledgements

LSCDD, Singapore

IT
• Kevin Gao, Rakhi Bhat, Srinivasulu Kota and Maurice Manning

Systems Biology
• Amit Aggarwal and Mahesh Kumar Guzuva Desikan

ICS
• Pat Hartman

HiSoft Technology – Dalian, China
• Bill Yan, Young Gong, Harold Yin, Steven Cao and Jason Wang

Lilly, Indianapolis USA
• Susie Stephens, Jacob Koehler

Backup Slides

Putting it all together…

Objects Measure

MTS Literature

Binding Coding

Clinical DB

Compounds

Images

Genes

SNPs

Expression

Linkage D

Signature

Fingerprint

Map 1 Map 2

Silos Need to Be Broken Down

Milestones: Target, Hit, Lead, PgS, CS, FHD, FED, PD/RD, FS, FA, FL, GL

Stages: Target To Hit, Hit To Lead, Lead To PgS, Lead Optimization, Pre-Clinical Development, Phase I, Phase 2, Phase 3, Registration, Launch, Global Launch

Project Program Product

Exploratory

Cycle (repeated for each stage of the pipeline):
Data
Transform
Model & Understand
Generate/Test Hypothesis
Analyze & Mine

BACIIS System Architecture

Web Interface
Input user queries and present the query results

BACIIS Knowledge Base
Data Source Schema
Bio-Chemical Ontology

Mediator

Query Generator Module
Convert semantic-based user queries into domain-recognized terms through the ontology

Query Planning and Execution Module

Query Planner
Decompose the user query into subqueries, define the subquery dependencies, and find the query paths

Mapping Engine
Map each subquery to specific data source(s)

Execution Engine
Receive data-source-specific subqueries and invoke the corresponding wrappers to fetch the data from the remote data sources

Result Presentation Module
Receive the individual result sets from the wrappers, integrate them into HTML format, and send result pages to the web interface

Wrapper (one per Web Database; three shown)
Fetch HTML/XML pages from the remote data source, extract result data

Web Database (three shown)

Hybrid Architecture

Knowledge-Space Navigation
Presentation Services
Analytic Services
List Management

User Interface

Federation Entities
Navigational Entities
Presentation Entities
Personalization Entities
Persistence Entities
Analysis Entities

Metadata Repositories

Source Source Source Source Source

Data Access Service Layer
Navigation Service Layer
Data Set Integration Services
Metadata Services Layer

Query Preparation Service
Semantic Normalization Service
Query Submission Service
Streams Management Service

Request Brokers

Semantic Layer
Adaptive Layer
Physical Access Layer

Goals

• Make knowledge emerge from repositories
• Make data more valuable by adding context
• Leverage intellectual assets
• Decision support
• Enhance productivity
• Reduce IT integration efforts