Standards and Ontologies to Enable Discovery Data and Information Integration Robin McEntire...

Standards and Ontologies to Enable Discovery Data and Information Integration Robin McEntire GlaxoSmithKline 19 Nov, 2002


Standards and Ontologies to Enable Discovery Data and Information Integration

Robin McEntire
GlaxoSmithKline

19 Nov, 2002

Q: What non-existing technology do you most wish you had?

A: A technology that would allow you to put in a DNA sequence and then spit out the specific protein function, disease association, known pharmacophores that could be developed into small molecules, and market value of small molecule or protein therapeutic (antibody) drugs generated from that gene.

Martin Leach

CuraGen Director of Bioinformatics

Bioinform 4(26), 10 (6 Nov 2000)

Drug Discovery Process, circa 2002

[Figure: drug discovery process. Recoverable labels: data mining; microarrays; transgenics; cheminformatics; bioinformatics; HT chemistry; chemical diversity; HT screening; SAR; identify 'hit'; optimize 'hit' structure; target identification/validation; target validation; in vivo testing; genotyping.]

Discovery Process IT

[Figure: discovery process IT. Recoverable labels: Sequencing; Candidate targets; Select target; Develop assay; Prepare reagents; Compound design; Synthesis planning; Synthesis; Inventory; Screening; Analyze results; Discovery; Analytical.]

Drug Discovery Today

Bottleneck
• Few novel targets
• Lead explosion in a series
• Too long to screen
• Relating genes to disease

Solution
• Genomics
• Combi-chem
• HTS & uHTS
• Pharmaco-genomics

New Bottleneck
• Data analysis, interpretation, & integration

    ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
    DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
    DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
    DE   ENOLPYRUVYL TRANSFERASE) (EPT).
    GN   MURA OR MURZ.
    OS   BACILLUS SUBTILIS.
    OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
    OC   BACILLUS.
    KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
    FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
    FT   CONFLICT    374    374       S -> A (IN REF. 3).
    SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
         MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
         GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
         RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
         IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
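A flat-file record like the one above is easy for people to read, but every consumer has to write its own parser. The following Python sketch is illustrative only and is not part of the original talk. It assumes the usual layout of such records: a two-letter line code, data starting at column 6, and sequence lines starting with blanks. The function name `parse_flat_record` and the abbreviated `RECORD` text are hypothetical.

```python
from collections import defaultdict

# Abbreviated copy of the record shown in the slide; only the first
# sequence line is kept so the example stays short.
RECORD = """\
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
"""


def parse_flat_record(text):
    """Group lines by their two-letter code and join the sequence blocks."""
    fields = defaultdict(list)
    residues = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue  # skip blank lines and the end-of-entry marker
        code, data = line[:2], line[5:].strip()
        if code == "  ":
            # Sequence lines carry no line code; drop the spacing between blocks.
            residues.append(data.replace(" ", ""))
        else:
            fields[code].append(data)
    parsed = {code: " ".join(parts) for code, parts in fields.items()}
    parsed["sequence"] = "".join(residues)
    return parsed


entry = parse_flat_record(RECORD)
keywords = [k.strip() for k in entry["KW"].rstrip(".").split(";")]
print(entry["ID"].split()[0], keywords)
```

Even this small parser embeds knowledge of column positions and line codes, which is the kind of format-level coupling that makes data representation a brittle medium of exchange.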

Integration of discovery information

What technologies can help?

• Integration - to assist the transformation of data to information and to knowledge

• Text Mining - to expose the information/knowledge locked in text documents (internal and external)

• Grid computing

• Open source and public domain initiatives

• . . .

Two fundamental problems for information integration

• Heterogeneous software systems
– hardware platforms
– operating systems
– network protocols
– programming languages & application formats

• Heterogeneous data semantics
– naming conflicts
– measurement conflicts
– representation conflicts
– computational conflicts
– granularity conflicts
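To make the semantic conflicts concrete, here is a minimal Python sketch, not taken from the talk. It assumes two hypothetical sources that report the same measurement with different field names (naming conflict), different units (measurement conflict), and different record shapes (representation conflict). All names, values, and the synonym table are invented for illustration; only the MurA/MurZ synonym pair comes from the record shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Potency:
    """Common business object that both sources are mapped onto."""
    compound: str
    target: str      # canonical target name
    ic50_nm: float   # always stored in nanomolar


# Hypothetical lookup tables agreed between the two groups.
TARGET_SYNONYMS = {"mura": "MURA_BACSU", "murz": "MURA_BACSU"}
TO_NANOMOLAR = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}


def canonical_target(name: str) -> str:
    # Resolves the naming conflict; an unknown name raises KeyError.
    return TARGET_SYNONYMS[name.strip().lower()]


def from_source_a(row: dict) -> Potency:
    # Source A (assumed): dictionaries with an explicit unit field.
    return Potency(
        compound=row["compound_id"],
        target=canonical_target(row["target"]),
        ic50_nm=float(row["ic50"]) * TO_NANOMOLAR[row["units"]],
    )


def from_source_b(line: str) -> Potency:
    # Source B (assumed): tab-delimited text, values implicitly micromolar.
    compound, gene, micromolar = line.split("\t")
    return Potency(compound, canonical_target(gene),
                   float(micromolar) * TO_NANOMOLAR["uM"])


a = from_source_a({"compound_id": "C-001", "target": "MurA",
                   "ic50": 250, "units": "nM"})
b = from_source_b("C-001\tmurZ\t0.25")
print(a == b)
```

Once both records are mapped to the shared object they compare equal, even though neither source changed its own format.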

Solutions

• Convert all software to a single language, OS, and hardware platform
  This works until the next merger

• Require all information providers to use a single consistent vocabulary
  This works until the next scientific advance

Alternatively ...

• Focus on interoperability

• Collaboratively develop standards to support software interoperability

• Collaboratively develop tools and shareable ontologies

Use the Tom Sawyer approach!

How to cope?

• Don’t rely on particular hardware platforms
  Your system will outlive hardware

• Don’t rely on one operating system
  There will always be many, perhaps from one vendor

• Don’t rely on a single programming language
  They come and go faster than hardware

• Do follow the first principle of good design
  Define small, well-documented interfaces between modules
  Define common terminologies and common business objects
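As one possible reading of "small, well-documented interfaces" and "common business objects", the Python sketch below defines a one-method interface and a shared record type, then plugs two interchangeable back ends into it. The class and function names (`SequenceRecord`, `SequenceSource`, and so on) and the pipe-delimited line format are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class SequenceRecord:
    """Common business object shared by every module."""
    accession: str
    organism: str
    residues: str


class SequenceSource(Protocol):
    """The whole interface: one documented method."""

    def fetch(self, accession: str) -> SequenceRecord:
        """Return the record for an accession or raise KeyError."""
        ...


class InMemorySource:
    def __init__(self, records):
        self._records = {r.accession: r for r in records}

    def fetch(self, accession: str) -> SequenceRecord:
        return self._records[accession]


class DelimitedTextSource:
    """Reads 'accession|organism|residues' lines (an assumed format)."""

    def __init__(self, text: str):
        self._lines = text.splitlines()

    def fetch(self, accession: str) -> SequenceRecord:
        for line in self._lines:
            acc, organism, residues = line.split("|")
            if acc == accession:
                return SequenceRecord(acc, organism, residues)
        raise KeyError(accession)


def sequence_length(source: SequenceSource, accession: str) -> int:
    # Client code depends only on the interface, never on the back end.
    return len(source.fetch(accession).residues)


record = SequenceRecord("MURA_BACSU", "Bacillus subtilis", "MEKLNIAGGD")
memory = InMemorySource([record])
text = DelimitedTextSource("MURA_BACSU|Bacillus subtilis|MEKLNIAGGD")
print(sequence_length(memory, "MURA_BACSU"), sequence_length(text, "MURA_BACSU"))
```

Swapping the storage technology means writing a new class with the same `fetch` method; `sequence_length` and any other client stay untouched.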

Coping -- Software Architecture

• The real issue isn’t how many tiers you have, it’s understanding how to organize a distributed application
– what are the components?
– where do they live?
– how do they talk?

• Most applications tend to follow a common structural pattern: presentation, “business model” (analysis), and data storage

Two-tier systems

“Business model” is embedded in presentation (“fat client”) or data storage (stored procedures, triggers)

Back end physical storage, legacy applications, etc.

Data representation is medium of exchange (brittle, low-level)

Flat file, ASN.1, XML, ...

Three-tier systems

Local objects on desktop manage presentation, act as clients to middle tier

Middle layer provides abstract model of business process and information, encapsulates back end

Back end physical storage, legacy applications, etc.

Distributed object technology is the established technology of choice for the middle tier

Focus on modeling business behavior

• Business logic/process is a first-class citizen
– business logic focuses on behavior, not data
– insulates client from data representation
– encapsulates (hides) implementation, legacy systems

• “Middle” layer should embody an abstract model of business process
– its development is a long-term, core investment
– this is where component technology is headed
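A rough Python sketch of the three-tier idea follows. It is an illustration under stated assumptions, not the architecture described in the talk: an in-memory SQLite database stands in for the back end, a plain class stands in for the distributed middle tier, and the table layout, class name `ScreeningService`, and the 50% inhibition threshold are all invented.

```python
import sqlite3


def make_backend():
    """Back end: physical storage (here an in-memory SQLite database)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE result (compound TEXT, inhibition REAL)")
    return db


class ScreeningService:
    """Middle tier: models business behavior and hides the storage layout."""

    def __init__(self, connection):
        self._db = connection

    def record_result(self, compound, percent_inhibition):
        self._db.execute(
            "INSERT INTO result (compound, inhibition) VALUES (?, ?)",
            (compound, percent_inhibition),
        )

    def hits(self, threshold=50.0):
        """Business rule: a 'hit' is any compound at or above the threshold."""
        rows = self._db.execute(
            "SELECT compound FROM result WHERE inhibition >= ? "
            "ORDER BY inhibition DESC",
            (threshold,),
        )
        return [compound for (compound,) in rows]


# Presentation tier: talks to the service only, never to SQL or tables.
service = ScreeningService(make_backend())
for compound, value in [("C-001", 82.0), ("C-002", 12.5), ("C-003", 55.0)]:
    service.record_result(compound, value)
print(service.hits())
```

The client asks a business question ("which compounds are hits?") and never sees table or column names, so the back end can be replaced without touching presentation code.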

Component Interfaces are needed - but are not the whole story

Integration of life sciences information across scientific disciplines and business areas is essential, however ...

• Terminology is inconsistent – information searches are usually incomplete and inaccurate

• Definitions and descriptions of objects across a business area differ among data sources – integrating multiple sources is labor-intensive, expensive, and time-consuming

• Make common, shareable ontologies a part of the component marketplace

Text Mining

Text Mining - Challenges and Possibilities

• Information overload. There’s too much.

• Free text is a large category: most bio-information is only in text
– Medline indexes about 600K entries/year.
– Pharmas make heavy use of full-text ejournals
– The USPTO has over 2 million full-text patents online

• Business needs to
– find documents/information
– screen and sort inputs
– discover relationships and mine information

Text Mining

• We would like
– Better retrieval
– Help with handling the documents we have
– Help finding specific pieces of information without having to read each document

• What might help?
– Statistical techniques
– Natural language processing techniques
– Knowledge domain based techniques

• Controlled vocabularies and ontologies are key
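One simple way a controlled vocabulary can improve retrieval is query expansion: a search for the preferred term also matches its known synonyms. The Python sketch below is a minimal illustration. The synonyms for MurA are taken from the protein record shown earlier in the deck; the three documents and the function names are made up for the example.

```python
import re

# Controlled vocabulary: preferred term -> accepted synonyms.
VOCABULARY = {
    "MurA": ["MurA", "MurZ", "enoylpyruvate transferase", "EPT"],
}

# Made-up document snippets for demonstration only.
DOCS = {
    "doc1": "MurZ expression was measured in B. subtilis.",
    "doc2": "Enoylpyruvate transferase inhibitors were screened.",
    "doc3": "Microarray analysis of stress response genes.",
}


def expand(term, use_vocabulary=True):
    """Return the term plus its synonyms when the vocabulary knows it."""
    if use_vocabulary:
        return VOCABULARY.get(term, [term])
    return [term]


def search(documents, term, use_vocabulary=True):
    """Return ids of documents that mention any expansion as a whole word."""
    patterns = [
        re.compile(r"\b" + re.escape(synonym) + r"\b", re.IGNORECASE)
        for synonym in expand(term, use_vocabulary)
    ]
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(p.search(text) for p in patterns)
    ]


print(search(DOCS, "MurA", use_vocabulary=False))
print(search(DOCS, "MurA"))
```

Without the vocabulary the literal query finds nothing; with it, both documents that discuss the same enzyme under other names are retrieved.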

Grid Computing

Grid Computing

• Still being defined to some extent. A good working definition for a large part of The Grid is “A heterogeneous, location-transparent pool of network accessible computation, data and application resources within a secure, managed common namespace.”

• Unifies compute, data and application resources
– Allows use of resources regardless of location
– Allows aggregation of discrete resources

• Analogous to the electric power grid. Resource available to the user can come from anywhere

The Grid

• More than technology for high performance computing -- it’s a different way of looking at computing and network-accessible resources

• There is an explosion in the complexity, diversity and distribution of hardware, software and information

• Mergers, acquisitions, joint ventures, and partnerships in all industries are creating the need for distributed and virtual organizations

• Consortial efforts to build consensus and standards (Global Grid Forum, GGF)

• Controlled vocabularies and ontologies are key

Build Shareable Ontologies

• Express formalized ontologies in a common language (or a small number of languages), facilitating representation and exchange of ontological knowledge

• Establish consortia and community-based initiatives to build common ontologies to establish shared understandings within the industry

• Do the experiment -- insert ontologies into the component, text mining and grid computing space!
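As a small illustration of what expressing an ontology in a common language buys, the Python sketch below stores a tiny is_a hierarchy, round-trips it through JSON as a stand-in exchange format, and answers a subsumption question on the received copy. The terms are a simplified, illustrative hierarchy loosely in the style of the Gene Ontology, not an excerpt from it, and JSON is an assumption made here; the talk does not name a specific ontology language.

```python
import json

# Child term -> list of direct parents (is_a relationships).
IS_A = {
    "UDP-N-acetylglucosamine 1-carboxyvinyltransferase activity":
        ["transferase activity"],
    "transferase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, is_a):
    """Collect every term reachable by following is_a links upward."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def is_kind_of(term, candidate, is_a):
    """True if term equals candidate or candidate is one of its ancestors."""
    return term == candidate or candidate in ancestors(term, is_a)


# Exchange: serialize, ship, and load on the other side.
exchanged = json.loads(json.dumps(IS_A))

specific = "UDP-N-acetylglucosamine 1-carboxyvinyltransferase activity"
print(is_kind_of(specific, "catalytic activity", exchanged))
```

Because both parties agree on the representation, the receiving side can reason over the hierarchy (for example, retrieving all transferases when asked for catalytic activities) without any knowledge of how the sender built it.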

Role of External Alliances and Collaborations in the Enterprise Architecture

External Alliances and Collaborations

• Two essentials:
– The job is too big for any one organisation
– Standard components, infrastructure and ontologies promote best-of-breed

• External alliances can play a vital role in defining & developing suitable services & standards

Engagement with alliances

• Shopper / Victim
  No alliance engagement: shop for (or simply accept) vendor-supported standards

• Watcher
  Semi-passive acceptance: evaluate & select from alliance (& other) products

• Navigator
  Active participant: influences software & component development to suit enterprise strategic needs

Standards selection criteria

• Robustness

• Architectural fit

• Availability of implementations

• Stability

• Continuing development

• Level of adoption / acceptance

• Size & vigor of user community

• Cost of adoption / migration

Infrastructure standards (examples)

• Data Interchange Services (e.g., PDF, HTML, ISO/IEC 10918 [JPEG], XML)
• Data Management Services (ISO 9075:1992 [SQL], SQL CLI)
• Graphics & Imaging Services (GIF, TIFF, GKS, CGM)
• International Operation Services (ISO/IEC 10646-1 Universal Multiple-Octet Coded Character Set)
• Location & Directory Services (IETF RFC 1738 [URL], RFC 2251 [LDAP])
• Network Services (IETF RFC 821 SMTP, X.400, IETF RFC 793 TCP)
• Object-Oriented Provision of Services (CORBA, X/Open G302)
• Operating System Services (IEEE Std 1003 [POSIX])
• Security Services (ISO/IEC 7498-2, SSL, IETF RFC 2222 SASL)
• Software Engineering Services (ISO/IEC DIS 14882 [C++], Java JDK, VM)
• System & Network Management Services (SNMP)
• User Interface Services (X Window system)

Source: Standards Information Base (The Open Group) www.opengroup.org/sib2/

Information standards examples

Standard | Source | Status
CML (Chemical Markup Language) | P. Murray-Rust | Parsers and other tools available
SMILES | Daylight | Commercial software packages
MOL2 | Tripos | Commercial software packages
MIF (Molecular Information File) | Allen, et al. | Published
Molfile, Rgfile, Rxnfile, SDfiles, RDfiles | MDL Info. Sys. | Published. Commercial software packages
CXF-10 | CAS | Published
mmCIF | IUCr | Parsers, file manipulation routines, viewers
ASN.1 Bioseq | NCBI | Published. Parsers and other tools available
AGAVE | DoubleTwist | Published
BSML | LabBook | Published. Viewers available
Microarray and Gene Expression (MAGE)-OM, -ML | OMG LSR | Adoption vote in progress, implementations available
Gene Ontology | GO Consortium | Browsers and parsers available

Component/Service/Ontology Selection Criteria

• Fitness to purpose
• Architectural fit
• Platform requirements
• Availability
– Open source
– Vendor supported
• Flexibility, configurability
• Staff training
• Longevity, stability
• Total cost of use (licensing terms)

Standardized components & services

Service / Component | Source | Scope
Biomolecular Sequence Analysis | OMG LSR (EBI, NetGenics) | Generalized analysis engine for sequences, alignments
Open Genomic Maps | OMG LSR (GSK/EBI) | Repository server for genomic maps
Macromolecular Structure | OMG LSR (SDSC) | Repository server for protein and nucleic acid structures
Bibliographic Query Service | OMG LSR (EBI) | Repository server for bibliographic information
BioJava, v. 1.10 | BioJava Project | Open-source Java classes for manipulating and analyzing sequences
BioPerl, v. 0.7.2 | BioPerl Project | Open-source Perl classes for manipulating and analyzing sequences
DAS (Distributed Annotation System), v. 1.01 | BioDAS Project | Web-based system for exchanging genomic annotations

Sources of standards

• Vendors

• Information Providers

• Academic Research Projects

• Standards Organizations

• Industry Consortia

• Home-grown

Component & standards development alliances & consortia

• ISO, ANSI, IEEE, IETF, OASIS, W3C

• Health Level Seven (HL7)

• Life Sciences Research DTF (OMG LSR)

• Open Bioinformatics Foundation: Biopython, BioJava, BioCORBA, Bioperl, BioDAS, BioMOBY, BioSOAP

• Microarray Gene Expression Database Group (MGED)

• Clinical Data Interchange Standards Consortium (CDISC)

• Interoperable Informatics Infrastructure Consortium (I3C)

• Global Grid Forum (GGF)

Alliance selection criteria

• Technical scope of alliance mission (roadmap)
• Alliance architectural commitments
• Membership (breadth of industry participation)
• Standards adoption process
• Ability to influence
• Ease of participation (cost, mechanism, openness)
• Track record (i.e., stability, longevity, productivity)
• IP Issues
• Alliance staff support
• Total cost of membership
• Other benefits of membership?

Acknowledgements

• David Benton

• Jim Butler

• Filip Fuma

• Scott Harker

• Paula Matuszek

• Richard Moore