BioMOBY
Interoperability today, Integration Tomorrow
Mark Wilkinson, iCAPTURE Centre, UBC, Vancouver, Canada
Presentation to the Australian Centre for Plant Functional Genomics, Institute of Molecular Biosciences, University of Queensland, Brisbane, Australia, February 29th, 2005.
There has to be a better way…
…and along came Web Services
• WWW forms defined in machine-readable terms together with a “yellow pages”
• Define inputs and outputs of services as “primitives” in a document called an “XML Schema”
– Integer, Date/Time, String
• These primitives don’t help the situation much…
– A bioinformatics service that consumes a “string” might be expecting a FASTA sequence, or a keyword…??
– Web Service registries merely catalogue the chaos!
• Bioinformatics has many different ‘strings’!
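The ambiguity of a bare “string” input can be sketched in a few lines of Python. This is only an illustration: `run_service` is a hypothetical function, not part of any real web-service toolkit, and the sample inputs are made up.

```python
def run_service(query: str) -> str:
    """Hypothetical service whose interface only declares a 'string' input."""
    # The primitive type alone cannot say what the string is meant to hold,
    # so the service has to guess from the content.
    if query.startswith(">"):
        return "treated as FASTA"
    return "treated as keyword"


fasta = ">gi|163483 illustrative header\nATGGCGTACGTT"
keyword = "abscisic acid"

# Both inputs satisfy the declared type, so a registry that records only
# "input: string" cannot tell a sequence service from a keyword service.
same_declared_type = type(fasta) is type(keyword)
```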
Who is MOBY’s audience?
• Information is distributed
– Beyond Flybase, MIPS, EnsEMBL and TAIR
– MOST data never makes it off the scientist’s hard drive
– This data should be added to the global scientific archive
• Biologists, by and large, are willing and able, but…
– The Web was embraced enthusiastically by biologists
– In fact, most wet labs run a website!
– Unfortunately, this only adds to the chaos…
The interoperability solution must be simple enough for a Biologist, with a little bit of computer knowledge, to implement on their own
The MOBY-S Plan
• Define data-types commonly used in bioinformatics
• Organize these into an Ontology
• Ontologically define web service inputs and outputs
• Register the inputs and outputs in a “yellow pages”
• Machines can find an appropriate service
• Machines can execute that service unattended
Overview of MOBY-S Transactions
[Figure: MOBY Central and MOBY hosts & services. Labels: Gene names; Sequence; Alignment; Express.; Protein; Alleles…; Align; Phylogeny; Primers]
What makes MOBY go?
• My disappointment with archetypal web services not being (easily) able to distinguish between a FASTA sequence and a keyword led me to spend a lot of time thinking about data-types.
• This consideration became the core focus of MOBY-S
• Rich data-typing turns out to be largely sufficient!
• Constraints on MOBY-S are much more severe than on the archetypal computer-science solution
– our target audience is not high-level programmers
– Defining data types with XML schema is a non-starter: IT WILL NEVER HAPPEN!
MOBY-S in detail
• MOBY-S Data typing system: Semantic Type
• MOBY-S Data typing system: Syntactic Type
• The MOBY-S Service Ontology
• The MOBY Central Registry
Define: Semantic
• For a piece of data, its “semantics” are
– its intention
– its meaning
– its raison d’être
– its context
– its relationship to other data
MOBY-S Semantic Typing: Namespaces
• Any identifiable piece of data is an “entity”
• Identifiers fall into particular “Namespaces”
– NCBI has gi numbers (gi Namespace)
– GO Terms have accession numbers (GO Namespace)
• Namespaces indicate data’s semantic type.
– GO:0003476 is a Gene Ontology Term
– gi|163483 is a GenBank record
• However, we cannot tell if it is protein, RNA, or DNA sequence
• Namespace + ID precisely specifies a data “entity”
• The Namespace is assumed to be sufficiently descriptive of the data’s semantic type that a service provider can define their interface in terms of Namespaces
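The Namespace + ID idea can be sketched as a small Python type. The `Entity` class, the parsing helpers and the `ACCEPTED_NAMESPACES` set are illustrative assumptions, not part of MOBY-S itself; only the two identifiers come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A data entity identified by Namespace + ID (illustrative type)."""
    namespace: str
    id: str


def parse_go(term: str) -> Entity:
    # "GO:0003476" splits into namespace "GO" and id "0003476"
    namespace, _, identifier = term.partition(":")
    return Entity(namespace, identifier)


def parse_gi(record: str) -> Entity:
    # "gi|163483" splits into namespace "gi" and id "163483"
    namespace, _, identifier = record.partition("|")
    return Entity(namespace, identifier)


# Hypothetical: a provider declares its interface in terms of Namespaces.
ACCEPTED_NAMESPACES = {"GO"}


def accepts(entity: Entity) -> bool:
    """True if the entity's Namespace is one the service says it consumes."""
    return entity.namespace in ACCEPTED_NAMESPACES
```

Because the Namespace carries the semantic type, the check in `accepts` needs nothing beyond the identifier itself.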
MOBY-S in detail
• MOBY-S Data typing system: Semantic Type
• MOBY-S Data typing system: Syntactic Type
• The MOBY-S Service Ontology
• The MOBY Central Registry
Define: Syntax
• For a piece of data, its “syntax” is
– its representation
– its form
– its structure
– its language (of representation)
MOBY-S Syntactic Typing: The Object Ontology
• Syntactic types are defined by a GO-like ontology
– Type (“Class”) name at each node
– Edges define the relationships between Classes
– GO used as a model because of its comprehension & familiarity
• Edges define one of three relationships
– ISA
• Inheritance relationship
• All properties of the parent are present in the child
– HASA
• Container relationship of ‘exactly 1’
– HAS
• Container relationship with ‘1 or more’
Define: Ontology
• A systematic representation of the entities that exist in a domain of discourse, and the relationships between them.
[Figure: example ontology with classes Child, Father, Mother, Male and Female, linked by hasParent, hasGender and partnerOf relationships]
A portion of the MOBY-S Object Ontology
…community-built!
The Object Ontology: A small slice
[Figure: nodes Object, Integer, String, VirtualSequence, Generic Sequence, NucleotideSequence, DNASequence, AminoAcidSequence, text/plain, text/html, text/base64 and base64_gif, joined by ISA and HAS-A edges]
What’s an “Object”?
• The smallest unit of information that can be passed by MOBY-S
• Consists simply of
– Namespace
– ID
• Thus an Object is nothing more than a “reference” to a data entity
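As a rough sketch, a bare Object can be built with Python's standard `xml.etree.ElementTree` module. The `make_object` helper is a hypothetical name; the element and attribute names follow the `<Object namespace=… id=…/>` form used elsewhere in these slides.

```python
import xml.etree.ElementTree as ET


def make_object(namespace: str, identifier: str) -> ET.Element:
    """Build a bare Object element: a reference made of Namespace + ID."""
    return ET.Element("Object", {"namespace": namespace, "id": identifier})


ref = make_object("TAIR_Allele", "ufo-1")

# Serialise to text; the element carries no content, only the reference.
xml_text = ET.tostring(ref, encoding="unicode")
```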
ISA relationship - inheritance
• Classes become more specialized as you move along the ISA relationship hierarchy
– DNA_Sequence – ISA
– Nucleotide_Sequence – ISA
– Generic_Sequence – ISA
– Virtual_Sequence – ISA
– Object
• Classes do not become more complex as a result of ISA relationships alone
HASA & HAS relationships
• HASA and HAS relationships make Classes more complex by embedding Classes within Classes
• Virtual_Sequence ISA Object
• Virtual_Sequence HASA Length (Integer)
• Generic_Sequence ISA Virtual_Sequence
• Generic_Sequence HASA Sequence (String)
• Annotated_GIF ISA Image (base_64_GIF)
• Annotated_GIF HAS Description (String)
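The ISA and HASA relationships listed on these slides can be sketched as two Python dictionaries. The edges are taken from the slide text; the dictionary layout and the `lineage` and `members` helpers are assumptions made for illustration only.

```python
# ISA edges: child class -> parent class
ISA = {
    "Virtual_Sequence": "Object",
    "Generic_Sequence": "Virtual_Sequence",
    "Nucleotide_Sequence": "Generic_Sequence",
    "DNA_Sequence": "Nucleotide_Sequence",
}

# HASA edges (container of exactly 1): class -> {member name: member class}
HASA = {
    "Virtual_Sequence": {"Length": "Integer"},
    "Generic_Sequence": {"Sequence": "String"},
}


def lineage(cls):
    """Follow ISA edges from a class up to the root of the ontology."""
    chain = [cls]
    while chain[-1] in ISA:
        chain.append(ISA[chain[-1]])
    return chain


def members(cls):
    """Collect the HASA members a class inherits along its ISA lineage."""
    collected = {}
    for ancestor in reversed(lineage(cls)):
        collected.update(HASA.get(ancestor, {}))
    return collected
```

Walking the lineage shows the point of the slide: ISA alone adds no structure, while each HASA edge met on the way up contributes one embedded member.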
The Object Ontology: A small slice
[Figure repeated: nodes Object, Integer, String, VirtualSequence, Generic Sequence, NucleotideSequence, DNASequence, AminoAcidSequence, text/plain, text/html, text/base64 and base64_gif, joined by ISA and HAS-A edges]
Legacy file formats
<NCBI_Blast_Report namespace='NCBI_gi' id='115325'>
TBLASTN 2.0.4 [Feb-24-1998]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= gi|1401126 (504 letters)

Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
336,723 sequences; 677,679,054 total letters

Searching...done

Sequences producing significant alignments:    Score (bits)    E Value
gb|U49928|HSU49928 Homo sapiens TAK1 binding protein (TAB1) mRNA...    1009    0.0
emb|Z36985|PTPP2CMR P.tetraurelia mRNA for protein phosphatase t...    58    4e-07
emb|X77116|ATMRABI1 A.thaliana mRNA for ABI1 protein    53    1e-05
gb|U12856|ATU12856 Arabidopsis thaliana Col-0 abscisic acid inse...    53    1e-05
</NCBI_Blast_Report>
• Inheriting from “String” allows us to define ontological classes that represent legacy data types (e.g. the 20 existing sequence formats!)
• NCBI_Blast_Report ISA text-formatted ISA String
Binaries – pictures, movies
<base64_encoded_jpeg namespace='TAIR_image' id='3343532'>
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCCAv4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNVMIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCCAv4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNVBAgTDFdlc3Rlcm4gQ2FwZTESMBAGA1UEBxMJQ2FwZSBUb3duMQ8wDQYDVQQKEwZUaGF3dGUxHTAbBgNVBAsTFENlcnRpZmljYXRlIFNlcnZpY2VzMSgwJgYDVQQDEx9QZXJzb25hbCBGcmVlbWFpbCBSU0EgMjAwMC44LjMwMB4XDTAyMDkxNTIxMDkwMVoXDTAzMDkxNTIxMDkwMVowQjEfMB0GA1UEAxMWVGhhd3RlIEZyZWVtYWlsIE1lbWJlcjEfMB0GCSqGSIb3DQEJARYQamprM0Bt
</base64_encoded_jpeg>
• We base64 encode binaries, and then define data classes that inherit from String
• base64_encoded_jpeg ISA text/base64 ISA text/plain ISA String
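A minimal sketch of the base64 step, using Python's standard `base64` and `xml.etree.ElementTree` modules. The image bytes are a placeholder, not a real JPEG; the element name and attributes mirror the example on this slide.

```python
import base64
import xml.etree.ElementTree as ET

# Placeholder bytes standing in for real JPEG data (illustrative only).
image_bytes = b"\xff\xd8\xff\xe0 not a real image"

element = ET.Element(
    "base64_encoded_jpeg", {"namespace": "TAIR_image", "id": "3343532"}
)
# base64 turns arbitrary bytes into plain ASCII text that is safe inside XML,
# which is why the class can inherit from String.
element.text = base64.b64encode(image_bytes).decode("ascii")

# A consumer reverses the encoding to recover the original bytes.
recovered = base64.b64decode(element.text)
```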
Extending legacy data types
• With legacy data-types defined, we can extend them as we see fit
• annotated_jpeg ISA base64_encoded_jpeg
• annotated_jpeg HASA 2D_Coordinate_set
• annotated_jpeg HASA Description
<annotated_jpeg namespace='TAIR_Image' id='3343532'>
  <2D_Coordinate_set namespace='' id='' articleName="pixelCoordinates">
    <Integer namespace='' id='' articleName="x_coordinate">3554</Integer>
    <Integer namespace='' id='' articleName="y_coordinate">663</Integer>
  </2D_Coordinate_set>
  <String namespace='' id='' articleName="Description">This is the phenotype of a ufo-1 mutant under long daylength, 16’C</String>
  MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCCAv4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
</annotated_jpeg>
The same object…
<annotated_jpeg namespace='TAIR_Image' id='3343532'>
  <2D_Coordinate_set namespace='' id='' articleName="pixelCoordinates">
    <Integer namespace='' id='' articleName="x_coordinate">3554</Integer>
    <Integer namespace='' id='' articleName="y_coordinate">663</Integer>
  </2D_Coordinate_set>
  <String namespace='' id='' articleName="Description">This is the phenotype of a ufo-1 mutant under long daylength, 16’C</String>
  MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCCAv4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
</annotated_jpeg>
annotated_jpeg ISA base64_encoded_jpeg HASA 2D_Coordinate_set HASA Description
The same object…
<annotated_jpeg namespace='TAIR_Image' id='3343532'>
  <2D_Coordinate_set namespace='' id='' articleName="pixelCoordinates">
    <Integer namespace='' id='' articleName="x_coordinate">3554</Integer>
    <Integer namespace='' id='' articleName="y_coordinate">663</Integer>
  </2D_Coordinate_set>
  <String namespace='' id='' articleName="Description">This is the phenotype of a ufo-1 mutant under long daylength, 16’C</String>
  MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCCAv4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
</annotated_jpeg>
annotated_jpeg ISA base64_encoded_jpeg HASA 2D_Coordinate_set HASA Description
<CrossReference>
  <Object namespace="TAIR_Allele" id="ufo-1"/>
</CrossReference>
<CrossReference>
  <Object namespace='TAIR_Tissue' id='122'/>
</CrossReference>
The Object Ontology: Defines an XML Schema!
• Object Ontology terms have semantically rich names, but this is for human intuition only
– DNA Sequence
– Annotated_GIF
• Object Ontology does not define the meaning
– NO SEMANTICS
– (at least, to the machine…)
• It does define the XML Schema of their representation
– SYNTAX
• An interesting discussion ensues from this
– Does MOBY-S rely on human-readable semantics?
– Does it matter?
The Object Ontology: Defines an XML Schema!
• The position of an ontology node precisely defines the syntax by which that node will be represented
• End-users can define new data-types without having to write XML Schema!
– This was an important aim of the project
• A machine can “understand” the structure of any incoming message by querying its ontological type!
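One way to picture a machine checking an incoming message against its ontological type is sketched below. Everything here is an assumption for illustration: `ONTOLOGY_MEMBERS` is a hard-coded local stand-in for what would really be a query to the ontology, the sample message and its sequence are invented, and `matches_ontology` is a hypothetical helper rather than any MOBY-S API. The member names follow the HASA relationships given earlier for Generic_Sequence.

```python
import xml.etree.ElementTree as ET

# Hypothetical local copy of ontology facts: class -> {articleName: class}
ONTOLOGY_MEMBERS = {
    "Generic_Sequence": {"Length": "Integer", "Sequence": "String"},
}

# Invented sample message in the style of the XML shown in these slides.
message = """
<Generic_Sequence namespace="NCBI_gi" id="163483">
  <Integer namespace="" id="" articleName="Length">12</Integer>
  <String namespace="" id="" articleName="Sequence">ATGGCGTACGTT</String>
</Generic_Sequence>
"""


def matches_ontology(xml_text):
    """Check that the child elements are exactly the declared members."""
    root = ET.fromstring(xml_text.strip())
    expected = ONTOLOGY_MEMBERS.get(root.tag)
    if expected is None:
        return False  # unknown class: the ontology says nothing about it
    found = {child.get("articleName"): child.tag for child in root}
    return found == expected
```

No XML Schema document is involved: the expected structure comes entirely from the class's position in the ontology.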
MOBY-S in detail
• MOBY-S Data typing system: Semantic Type
• MOBY-S Data typing system: Syntactic Type
• The MOBY-S Service Ontology
• The MOBY Central Registry
The Service Ontology
• A simple ISA hierarchy
• Primitive types include:
– Analysis
– Parsing
– Registration
– Retrieval
– Resolution
– Conversion
A slice of the Service Ontology
[Figure: nodes Service, Analysis, Alignment, Blast, NCBI_Blast, WU_Blast, Parsing and Parse_NCBI_Blast in an ISA hierarchy]
MOBY-S in detail
• MOBY-S Data typing system: Semantic Type
• MOBY-S Data typing system: Syntactic Type
• The MOBY-S Service Ontology
• The MOBY Central Registry
MOBY Central: The yellow pages
• A registry for MOBY-compliant services
• Services register:
– “Service Signature” - a triple of [input, service_type, output]
– A human readable description of the service
– The URL to the service interface
• Provides two types of interfaces:
– Register/Deregister
– Search/Retrieve
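The registry idea can be sketched as a toy in-memory class. This is not the MOBY Central API: the class and method names, the placeholder URL and the example signature are all assumptions chosen to mirror the bullets above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSignature:
    """The [input, service_type, output] triple described on the slide."""
    input_type: str
    service_type: str
    output_type: str


class Registry:
    """Toy in-memory stand-in for a yellow-pages registry."""

    def __init__(self):
        self._services = {}  # url -> (signature, description)

    def register(self, url, signature, description):
        self._services[url] = (signature, description)

    def deregister(self, url):
        self._services.pop(url, None)

    def search(self, input_type=None, service_type=None, output_type=None):
        """Return URLs whose signature matches every criterion given."""
        hits = []
        for url, (sig, _description) in self._services.items():
            if input_type is not None and sig.input_type != input_type:
                continue
            if service_type is not None and sig.service_type != service_type:
                continue
            if output_type is not None and sig.output_type != output_type:
                continue
            hits.append(url)
        return hits


registry = Registry()
registry.register(
    "http://example.org/moby/blast",  # placeholder URL
    ServiceSignature("DNA_Sequence", "Blast", "NCBI_Blast_Report"),
    "Runs a Blast search on a DNA sequence",
)
blast_services = registry.search(input_type="DNA_Sequence", service_type="Blast")
```

A client holding a DNA_Sequence only needs the search call to discover which services can consume it.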
A Simple MOBY-S Web Browser
• It isn’t a particularly powerful program
• It does not display the full “power” of the MOBY-S system
• However, it reveals some interesting “behaviors” that have never been observed before… ever!
• Biologists tend to find this interface “useless!”
• Computer scientists think it’s “Neat!!”
Semantic Web “on the fly”!
• This simple browser behaves very much like a semantic web browser
– No explicit coordination
– Dynamic discovery
– Automatic retrieval and execution
• This is happening without semantics
– Syntax only! (well… almost…)
• This is nice!
– syntactic solutions are easy to build
– semantic solutions are very Very VERY hard!
Conclusions from this Simple Browser Behavior
• Perhaps service interoperability is not a significantly semantic problem?!?
• Service discovery is definitely a semantic problem
• Data integration is still a problem, and we’ve just made that problem worse!
Ugh…. Frustrating!!
• The simple browser is too frustrating
– design once, run once
– Analysis of only one data-element at a time
– No way to extract the data at the end of the analysis
– No provision information is saved
• myGrid (UK) is working on similar problems
• myGrid has built MOBY-S support into one of their new tools
TAVERNA
A fantastic client program that can now talk to MOBY Central and execute MOBY Services
Taverna was written by Tom Oinn, with MOBY-S input by Martin Senger, as part of the myGrid project
MOBY-S: On reflection
• Two years into the project
• >140 services registered and growing
• ~20 independent service providers (not part of the BioMOBY project)
• Codebase not yet developed beyond a working prototype
• myGrid is making great progress, and has 25X more funding than we have!
• It is now time to step back and take a critical look at what we achieved, where we failed, and where to go from here
What MOBY got RIGHT
• Open source, community driven
1. Involving the model organism community right from the start has made an enormous impact on the early acceptance and adoption of MOBY
2. Rapid feedback on success/failure
– we had “real” users right from the prototype stage!
3. The community has been very forgiving of “hiccups” because they are included in the development process
What MOBY got RIGHT
• Data typing
1. Does not attempt to re-structure legacy data-types
– passed verbatim in a lightweight XML wrapper.
– There are TONS of parsers out there
– Entire software projects are built around extracting information from these legacy formats.
2. Ontology dictates data structure/sub-structure
– XML can be parsed, with the “meaning” of each sub-structure encountered being defined by the ontology
– Thus MOBY data is more “self-describing” than XML even with an XML schema
What MOBY got RIGHT
• Data typing
3. Provides a foundation for future data-type definitions
– New data-types can be defined by end-users
– New data-types can be defined in a structured, machine-readable way, rather than by new ad hoc flat-file format.
– Unsophisticated data providers have an “environment” that structures their thinking about the data they are providing.
– XML schema creation is unnecessary
– REMEMBER WHO OUR TARGET AUDIENCE IS!!
4. Object ontology simplifies creation of visualization tools in an environment where the number/nature of data types is changing daily.
What MOBY got RIGHT
• Data typing
5. Provides a standard way of annotating the data object, and/or any of its sub-structures
– Annotations are kept separate from the data itself (versus e.g. hypertext)
– Multiple annotations per data component
– Mechanism for indicating the semantic relationship between the annotation and the data being annotated
6. Separation of the semantic data-type from its syntax
– The same data “entity” can be instantiated in a wide variety of ways
What MOBY got RIGHT
• Data typing
7. Despite all of this potential richness, the data can be remarkably simple!!!
– Often a single XML tag is all that is required
– REMEMBER WHO OUR TARGET AUDIENCE IS!!
What MOBY got RIGHT
• Messaging structure
1. Having a predictable messaging layer dramatically simplifies the interoperability problem
– Yes, I know, this goes against the most fundamental rules of the “open world” Web!
– REMEMBER WHO OUR TARGET AUDIENCE IS!!
2. Provides a standardized structure into which provision information can be added
3. Dictates what constitutes an “error”
– “I don’t know” is NOT an error in MOBY
What MOBY got WRONG
• Service typing
Conclusions from this Simple Browser Behavior
• Perhaps service interoperability is not a significantly semantic problem?!?
• Service discovery is definitely a semantic problem
• Data integration is still a problem, and we’ve just made that problem worse!
The problem with MOBY-S
“Chickens go in; pies come out!”
“What sort o’ pies?”
“Apple!”
What MOBY got WRONG
• Service typing - semantics!
1. Describing bioinformatics services is HARD!
2. The MOBY plan was to simply describe them “the way a biologist speaks”
– “I’m going to Blast this sequence” → Service type Blast
– “I need to retrieve this sequence” → Service type Retrieve
3. This doesn’t really work, since services can be arbitrarily complex.
What MOBY got WRONG
• Service typing
– MOBY Service ontology suffers from single-parenting… it’s just too simple!
• An “NCBI Blast Report Parsing” service is a unique node in the ontology.
• Better to have a service described as the intersection of a variety of orthogonal concepts:
• “A Blast Report Parser is a Parser that operates on a Blast Report, and there are NCBI and WU Blast Report types”
• The TAMBIS project (same research team as myGrid) is a perfect example of how this can and should be done.
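Describing a service as an intersection of orthogonal concepts might look roughly like the sketch below. The facet names (`operation`, `operates_on`, `variant`), the `find` helper and the `Parse_WU_Blast` entry are hypothetical; the slides name the concepts but prescribe no such layout, and this is not how TAMBIS or myGrid implement it.

```python
# Each service is described by independent facets instead of one tree node.
SERVICES = {
    "Parse_NCBI_Blast": {
        "operation": "Parser", "operates_on": "Blast Report", "variant": "NCBI",
    },
    "Parse_WU_Blast": {  # hypothetical entry added for illustration
        "operation": "Parser", "operates_on": "Blast Report", "variant": "WU",
    },
    "NCBI_Blast": {
        "operation": "Analysis", "operates_on": "Sequence", "variant": "NCBI",
    },
}


def find(**facets):
    """Return service names whose description matches all given facets."""
    return sorted(
        name
        for name, description in SERVICES.items()
        if all(description.get(key) == value for key, value in facets.items())
    )
```

With a single-parent hierarchy a client has to know the exact node name; with facets it can ask for every Parser, or everything tied to NCBI, by combining criteria.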
What MOBY got WRONG
• Service typing - the future
– MOBY needs a truly “semantic” service ontology
– myGrid has one
– myGrid will replace our service discovery process
• i.e. the end of MOBY Central
– They have enough funding to ensure that the code is robust and well-designed
– Can we make service description simple enough for biologists, even with the rich myGrid ontologies?
– REMEMBER WHO OUR TARGET AUDIENCE IS!!
Usage of MOBY Central 2004
[Chart: MOBY Central API calls per month, Jan to Jul 2004; y-axis “API Calls”, 0 to 400,000]
Early Adopters
The PlaNet Consortium
PlaNet Consortium Members
• Institute for Bioinformatics (IBI) / MIPS, Neuherberg
• Flanders Interuniversity Institute for Biotechnology (VIB), Gent
• Genoplante-Info, Evry
• Nottingham Arabidopsis Stock Centre (NASC), Nottingham
• John-Innes-Centre, Norwich
• Plant Research International (PRI), Wageningen
• Centro Nacional de Biotecnología, Madrid (CNB)
• …and others…
Early Adopters
CGIAR Generation Challenge Program
GCP Consortium Members
Unexpected phenomenon
• These consortia have set up their own instances of the MOBY Central registry
– This was not how I had expected that MOBY would be used!
– Could be due to the lack of a descriptive service ontology
– Could be sociological
– Could be security (MOBY Central API is open)
– Probably a bit of each…
• This is a critical observation when it comes to architectural decisions vis-à-vis registry setup
– Deployment of “boutique” registries must be TRIVIAL!
– This will be an important consideration in our collaboration with myGrid…
Hey, those are all plant databases!
• For some reason, MOBY has been more rapidly adopted by the plant community than by other communities
• Could be personal (I’m a botanist)
• Could be the “founder effect”
• Could be ethical
…But, hearts are also important!
(Murray and Lopez, The global burden of disease : a comprehensive assessment of mortality and disability from diseases, injuries, and risk factors in 1990 and projected to 2020, 1996)
CVD-Related Deaths for 2001(By WHO Region, Deaths in Thousands)
(Source: World Health Organization, The World Health Report 2002: Reducing Risks and Promoting Healthy Life, 2002)
Sharing the wealth
Mark Wilkinson & Bruce McManus
iCAPTURE Centre for Cardiovascular and Pulmonary Research
UBC, Vancouver, British Columbia, Canada
Toward Optimal Knowledge Delivery in the Cardiovascular Sciences
“Sometimes what your listeners hear is more interesting than what you’ve actually said.”
~ Don Moyer, Harvard Business Review
(I am once again talking about vaporware….)
“In 25 years, [information] will double every three months. What will that do for learning requirements?”
~Doug Engelbart
“Information is not knowledge.”
~Albert Einstein
“Science is organized knowledge.”
~Herbert Spencer
“Where is all the knowledge we lost with information?”
~T. S. Eliot
(Source: Clarke and Rollo, Education and Training, 2001)
Problems of the post-genomic era
• Too much information!
• Too little knowledge!
• Once you have data, how do you:
– Share it
– Manage it
– Use it
– Package it
– Translate it
– Apply it
– Turn it into knowledge!
"If HP knew what HP knows, we'd be three
times more profitable."
~Lew Platt, Non-executive Chairman of The Boeing Company, former CEO of Hewlett-Packard Company
BioMOBY and myGrid are not the solution either!!
• Deal with data (aggregation) not knowledge (organization)
• We have to take the next step
• Move from a data-centric architecture to a knowledge-centric architecture
Occam’s Razor
“Pluralitas non est ponenda sine necessitate.”
“Plurality should not be posited without necessity.”
“Why posit from simplicity when the full complexity could be available?”
Nosology: (Gr noso “disease” + -logy)
a classification or list of diseases
Ontology: (Gr “things which exist” + -logy)
An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them.
Capturing and encoding knowledge is hard!
• Requires extensive collaboration between biomedical domain experts, and knowledge management experts (ontologists)
• At least the tools and standards are now becoming more stable
• The NCI has blazed a trail for us
CardioSHARE
Cardiovascular Semantic Health And Research Environment
Wilkinson & McManus
Proposal to Genome Canada, 2004
CardioSHARE architecture: Increasingly complex ontological layers organize data into richer concepts, even hypotheses
[Figure labels: Database 1, Database 2, Database 3; BioMOBY & Semantic Web “agents”; Blood Pressure; Hypertension; Ischemia; Hypothesis]
Friends and Participants
Bruce McManus – iCAPTURE Centre, UBC
Carole Goble, Phillip Lord – myGrid @ U Manchester
Martin Senger – myGrid @ EBI
Lincoln Stein – CSHL
Damian Gessler, Andrew Farmer, Gary Schiltz – NCGR
Bill Crosby, Matthew Links, Luke McCarthy – U of S
Heiko Schoof, Rebecca Ernst – MIPS
Lukas Mueller – formerly at TAIR
Midori Harris – GO Consortium
Mike Niemi – IBM
Fiona Cunningham, Shuly Avraham – CSHL
Ken Stuebe – SDSC
Funding and equipment donations from:
Genome Canada/Genome Prairie, Canada
National Science Foundation (NSF), USA
Canadian Bioinformatics Resource, NRC, Halifax
Open-Bio Foundation
IBM