Director, Image Bioinformatics Research Laboratory, Oxford e-Science Centre
Department of Zoology, University of Oxford, Oxford OX1 3PS, UK
e-mail: david.shotton@zoo.ox.ac.uk
David Shotton
Bio-Ontologies Meeting
Glasgow
30/07/04
© David Shotton 2004
Using ontologies to provide semantic richness in biological image databases
(Sub-title: In Praise of Good Colleagues)
Acknowledgements
Chris Catton, BioImage Development Manager: ImageStore Ontology and SABO developer
Simon Sparks, BioImage Software Engineer: OWLBase query engine developer
John Pybus, BioImage Systems Manager
European Commission funding of the ORIEL Project - IST-2001-32688
Chris Wilson, SABO research project
Ruth Dalton, SABO research project
Chris Holland, ImageBLAST research project
Outline of my presentation
Expert knowledge and tacit knowledge
The Semantic Web and ontologies
Ontologies in biology
The BioImage Database: its purpose, structure and ontology usage
Enabling ‘smart queries’ by importing external ontologies into BioImage
ImageBLAST: hypersearches across distributed biological databases
Concluding remarks and cautionary tales
This is a fairly straightforward article, but nowhere in it are you told that:
Caenorhabditis elegans is a nematode worm, one of the handful of model organisms for which the complete genome has been sequenced
or that
A transcription factor binds to nuclear DNA to control the readout of genetic information from a particular gene
These facts are so basic to the paper that they are assumed
Expert knowledge and tacit knowledge
Mutual understanding within any field of knowledge is based on a shared conceptualisation developed by scholars over the years
This shared conceptualisation is often implicit through scholars’ choice of vocabulary and theories when speaking or writing
Furthermore, in order to communicate at the highest level (as in the Nature paper), scholars must assume that those listening to or reading their words are part of this community and share the conceptualization
Much of what is communicated in a paper or an academic lecture is first a reinforcement and then an extension of the shared tacit knowledge.
It is this assumed tacit knowledge, every bit as much as the technical jargon, that makes scientific literature so impenetrable to non-specialists
My next few slides are designed to make explicit some of the key points relating to ontologies, for the benefit of those for whom this may be new
Electronic communication of complex knowledge
In human society, much of our knowledge is implicit or tacit - we know more than we think we know!
However, today, as more and more knowledge is held on-line, more and more communication needs to be machine-to-machine (M2M), from one computer to another
To accomplish such communication successfully, and to permit semantic reasoning over distributed information resources
such tacit knowledge must be made explicit, and the meaning of information must be specified unambiguously
This is difficult, and demands scrupulous attention to detail
The next slide illustrates what I mean . . .
What is this?
This is not a panda
This is not a photograph of a panda
This is not even a projected digital image of a photograph of a panda
This is a caption for a projected digital image of a photograph of a panda
In biology, meanings may be complex
In normal conversation, “daughter” means a female human child conceived by sexual intercourse between mother and father, and then born after a gestation of nine months within the mother’s uterus
In non-mammalian animal species, development is usually from eggs
But sex is not always required: female aphids can give birth to daughters by parthenogenesis, without the need for fertilization of the eggs by male sperm
And in the field of cell biology, the word “daughter” has an entirely separate meaning: two genetically identical “daughter cells” are produced every time a single cell divides
Biological ontologies have thus to understand the context in which the word “daughter” is used, in order to apply the correct meaning
What is the Semantic Web, and how can it help?
The concept of the Semantic Web was first clearly articulated in 2001 in an eponymous Scientific American article by Tim Berners-Lee, Jim Hendler and Ora Lassila
While the World Wide Web permits access to data in human-readable form, the Semantic Web provides access to information structured in a formal logical manner, such that computers can reason over it, extracting meaning
It involves three technologies, each resting hierarchically on the previous one:
The use of XML as a markup language more expressive than HTML
RDF triples, which permit one to make simple logical statements (subject-verb-object), written in XML in a form that a computer can understand
The use of ontologies – formal representations of a particular domain of knowledge (e.g. the GO ontology about genes and gene products) – written in a high-level ontology language such as OWL (W3C’s Web Ontology Language), which is itself expressed as a set of RDF statements
RDF triples
An RDF triple might state that a mouse is_a mammal, informing the computer that an entity ‘mouse’ is included in the more general category of ‘mammal’
This has the advantage that mouse inherits all class properties previously defined for mammal, such as the possession of four legs and fur
By using several RDF triples referring to the same subject, multiple attributes can be defined:
Subject (Entity): Mouse (class) / This mouse (instance)
Property (Attribute): is_a / has_location / has_identifier
Object (Value): Mammal / Oxford / 667
In RDF, the statement “This mouse is located in Oxford” is simply:
<rdf:RDF>
  <rdf:Description rdf:about="Mouse">
    <Location>Oxford</Location>
  </rdf:Description>
</rdf:RDF>
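The inheritance behaviour described above can be sketched in a few lines of Python. This is a toy illustration using plain tuples as triples, not the BioImage code (which uses the Jena toolkit); the entity and property names are taken from the slide.

```python
# A minimal sketch of RDF-style (subject, predicate, object) triples,
# showing how an instance inherits class properties through is_a links.

TRIPLES = [
    ("Mouse", "is_a", "Mammal"),
    ("Mammal", "has_property", "four legs"),
    ("Mammal", "has_property", "fur"),
    ("this_mouse", "instance_of", "Mouse"),
    ("this_mouse", "has_location", "Oxford"),
    ("this_mouse", "has_identifier", "667"),
]

def classes_of(entity):
    """Walk instance_of / is_a links to collect every class an entity belongs to."""
    found, frontier = set(), {entity}
    while frontier:
        nxt = set()
        for s, p, o in TRIPLES:
            if s in frontier and p in ("is_a", "instance_of"):
                nxt.add(o)
        frontier = nxt - found
        found |= frontier
    return found

def properties_of(entity):
    """An entity inherits has_property values from all of its classes."""
    cls = classes_of(entity)
    return {o for s, p, o in TRIPLES if p == "has_property" and s in cls}

print(properties_of("this_mouse"))   # inherits 'four legs' and 'fur' via Mouse -> Mammal
```

The point of the sketch is that nothing is asserted directly about this_mouse having fur: the computer derives it by following the is_a chain.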
What type of animal is shown in this image?
German taxonomists claimed it was a bear
British taxonomists claimed it was a racoon
US taxonomists weren’t quite sure
Ailuropoda melanoleuca
Today, the balance of opinion is “bear”
So what is an ontology? “An ontology is a formal explicit specification of a shared conceptualisation”
The role of an ontology is to facilitate the understanding, sharing, re-use and integration of knowledge through the construction of an explicit domain model
A panda is only a bear because we all now say it is!
We understand taxonomic hierarchies:
Mouse is_a Rodent is_a Mammal is_a Vertebrate is_a Animal
In an ontology, one can express more complex relationships about a mouse, other than just its taxonomy
(Diagram) A partial ontology of ‘mouse’:
A Group of Mus musculus organisms is_a Colony; Mouse is member_of that group, with has_species_name Mus musculus and has_ID 667
Leg proper_part_of Mouse (has_cardinality: 4; has_position: front / rear; has_handedness: left / right; has_length: number), used_for Locomotion
Running is_a Locomotion, the mouse’s has_mode_of_locomotion, with hypothesised_function Escape
Fur proper_part_of Mouse (default_colour: white; has_length: number unit; has_density: number per unit area)
How do you build an ontology?
You need to define all the terms within a domain of knowledge, and specify the relationships they have to one another
The structure of these relationships is a Directed Acyclic Graph, in which child terms can have more than one parent
The relationships of a child term to its two (or more) parent terms can be different, as shown in the previous example:
mouse is_a rodent – type relationship
mouse member_of colony – collective relationship
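The two-parent example above can be sketched as a small directed acyclic graph. This is illustrative Python with invented edge data, not ontology-toolkit code:

```python
# Sketch of an ontology DAG: a term can have more than one parent,
# reached through different relationship types (is_a vs member_of).

EDGES = {
    # child: [(relationship, parent), ...]
    "mouse":  [("is_a", "rodent"), ("member_of", "colony")],
    "rodent": [("is_a", "mammal")],
    "mammal": [("is_a", "vertebrate")],
}

def ancestors(term):
    """Collect every (relationship, parent) pair reachable from a term."""
    result, seen, stack = [], set(), [term]
    while stack:
        t = stack.pop()
        for rel, parent in EDGES.get(t, []):
            result.append((rel, parent))
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return result

print(ancestors("mouse"))
```

Because the graph is acyclic rather than a tree, ‘mouse’ legitimately reaches both ‘colony’ (a collective relationship) and ‘vertebrate’ (a type relationship).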
The thinking crow problem
To properly annotate videos of Betty, we need to be able to structure not only people’s interpretations of the world, but also Betty’s view of what is going on!
Biological ontologies
There is good ontological coverage of the genes and gene products of model organisms in the form of the Gene Ontology (http://www.geneontology.org)
But until very recently little work had been done at the other end of the biological spectrum, in the field of animal behaviour
However, my department is full of people undertaking whole animal biology
To be able to include their images and videos within the BioImage Database, we decided to develop a draft standard animal behaviour ontology, SABO
SABO is an upper level ontology designed to cover all of animal behaviour, built around Niko Tinbergen’s four questions: “How does it work? How did it develop? How is it used? and How did it evolve?”
Because interpretations of behavioural events can be very subjective, we have been careful to separate fact from hypothesis in the design of SABO, with emphasis on the authority for any claims
Fact and hypothesis in SABO
For example, a courtship event
Courtship behaviour in ducks
Male mallard ducks attract their mates using a “grunt-whistle”, which Konrad Lorenz hypothesised in 1941 was derived from body shaking
Using the SABO ontology, this can be recorded in the following RDF triples:
Grunt-Whistle (a type of courtship behaviour)generates hypothesis
Hypothesis About Evolutionary Origin (an ontology class)
Hypothesis About Evolutionary Origin hypothesised evolutionary origin
Body Shaking (a type of behaviour)
Hypothesis About Evolutionary Originhas author
“Lorenz, Konrad” (instance data)
Hypothesis About Evolutionary Origin has date
“1941” (instance data)
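As a hedged sketch (illustrative instance identifiers and property names, not the exact SABO vocabulary), the triples above might be held and queried like this:

```python
# Sketch of SABO-style fact/hypothesis separation held as triples.
# Note that the hypothesis is itself an entity, so it can carry its own
# authority (author) and date, keeping fact distinct from interpretation.

hypothesis = "hyp_001"   # an instance of Hypothesis About Evolutionary Origin

triples = [
    ("Grunt-Whistle", "is_a", "CourtshipBehaviour"),            # fact
    ("Grunt-Whistle", "generates_hypothesis", hypothesis),
    (hypothesis, "is_a", "HypothesisAboutEvolutionaryOrigin"),
    (hypothesis, "hypothesised_evolutionary_origin", "BodyShaking"),
    (hypothesis, "has_author", "Lorenz, Konrad"),               # authority for the claim
    (hypothesis, "has_date", "1941"),
]

def claims_about(subject):
    """Return all property/value pairs asserted about one subject."""
    return {p: o for s, p, o in triples if s == subject}

print(claims_about(hypothesis)["has_author"])   # the claim carries its own provenance
```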
The Ethodata Ontology
SABO was used as one of the two starting points for a recent Animal Behaviour Metadata Workshop held at Cornell University, at which leading international ethologists worked together to create an Animal Behavior Metadata Standard
Our introduction of formal ontologies to this community was greatly helped by the fact that Chris Wilson, who had worked with us on SABO, had recently started a Ph.D. at Cornell with Jack Bradbury, the workshop organiser
The Workshop output is a human-readable hierarchy of defined ethological terms, the draft Animal Behavior Metadata Standard (ethodata.comm.nsdl.org)
The Workshop has commissioned us to develop this hierarchy into a fully-fledged computable ontology of animal behaviour, for the benefit of the whole ethological community
Based on the draft Animal Behavior Metadata Standard and on SABO, and written in OWL, this has the new agreed name of the Ethodata Ontology
We have already made a start on this work, and will use it to enter structured ethological image metadata into the BioImage Database
A view of the BioImage home page structure (www.bioimage.org)
Note the hierarchical browse categories and the alternative Browse / Search arrangement
The BioImage Database Project
The value of digital image information depends upon how easily it can be located, searched for relevance, and retrieved
Detailed descriptive metadata about the images are essential, and without them, digital image repositories become little more than meaningless and costly data graveyards
The BioImage Database aims to provide a searchable database of high-quality multidimensional research images of biological specimens, both ‘raw’ and processed, with detailed supporting metadata concerning:
the biological specimen itself
the experimental procedure
details of image formation and subsequent digital processing
the people, institutions and funding agencies involved
the curation and provenance of the image and its metadata
to provide rich and accurate search results to queries over our data
and to integrate such multi-dimensional digital image data with other life science resources by providing links to literature and ‘factual’ databases
The organisation within BioImage
The basic unit of organisation within the BioImage Database is the BioImage Study, roughly equivalent to a scientific publication
A BioImage Study will contain one or more Image Sets, each corresponding to a particular scientific experiment or investigation
Each Image Set will contain one or more Images on a common theme
Such an Image may be of any form or dimensionality: a 2D image, a 3D image, a video, or a 4D (x, y, z, time) image set
Users may browse or search the BioImage Database by Study, by Image Set or by Image
For each representation, a thumbnail representative image and core metadata of the results (title, authors, description, LSID) are initially presented, and deeper metadata is available by clicking the title
Browses and searches may then be progressively refined
The basic BioImage metadata model
(Diagram) The subject or specimen (cell or organism) undergoes Preparation by a Researcher, under experimental study conditions or manipulations; Image capture by a photographer or microscopist (camera or microscope, illumination, focus, etc.) then yields Image sets of multidimensional images, including videos
So people are related to objects and conditions / equipment through events
The structure of the BioImage Database
(Diagram) Browser interfaces and Java applets connect through an Apache Web server (alongside the VideoWorks Web server) to the BioImage Server, which runs Tomcat with a Struts Controller, a Model (Javabeans) and a View (XSL, JSP and SiteMesh); logic-layer servlets (administration, query and submission) and SOAP interfaces sit above the OWLBase query engine, the BioImage metadata store (PostgreSQL) and a local image file store; external OBO and NCBI servers supply ontologies and taxonomies; internal processes communicate via HTTP and SOAP protocols
Things to note about the architecture: external
User submission, searching and browsing activities are all mediated by the ImageStore Ontology
Submission forms are generated dynamically from the ontology, to suit the type of submission
Thus, for instance, people submitting light microscopy images are not asked for the accelerating voltage of their electron microscope
There is complete separation of content from presentation
Presentation to users is via HTML, while SOAP is used to communicate with Web Service clients
The Struts controller orchestrates data transfer between the system and the user
This permits simple customization of the appearance of the data
Multilingual capabilities enabled by Struts
This shows the Access Control Interface
The same HTML page is being viewed in both cases, using alternate resource bundles, achieved simply by re-setting the default language of the user’s browser
Things to note about the architecture: internal
Data are exchanged within the system in XML format, using the BioImage schema
There is no hard-coded ‘business logic’ - structures and semantics are generated at run time
The ImageStore Ontology is the central data model
This single point of control greatly simplifies database maintenance, since changes are automatically and dynamically propagated throughout the system
The entire BioImage database structure can be automatically regenerated from the ImageStore Ontology whenever this is required (for example in a new form after updating the ImageStore Ontology), using metadata from a previous XML dump
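The regeneration idea can be sketched as follows. The class and property names here are invented for illustration, and the real ImageStore Ontology and its DDL generation are considerably richer:

```python
# Sketch of the 'single point of control' idea: when the ontology is the
# data model, the relational schema can be regenerated from it on demand,
# e.g. after an ontology update or when migrating to a new DBMS.

ONTOLOGY = {
    # class name: {datatype property: SQL type}  (invented examples)
    "Study":  {"title": "text", "description": "text"},
    "Image":  {"uri": "text", "width": "integer", "height": "integer"},
    "Person": {"family_name": "text", "given_name": "text"},
}

def generate_ddl(ontology):
    """Emit one CREATE TABLE per ontology class, one column per property."""
    stmts = []
    for cls, props in ontology.items():
        cols = ", ".join(f"{name} {sqltype}" for name, sqltype in props.items())
        stmts.append(f"CREATE TABLE {cls.lower()} (id SERIAL PRIMARY KEY, {cols});")
    return stmts

for s in generate_ddl(ONTOLOGY):
    print(s)
```

Instance metadata from a previous XML dump would then be reloaded into the regenerated tables, which is what makes the DBMS migration straightforward.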
This allows easy migration to a new DBMS, e.g. from PostgreSQL to Oracle
OWLBase is used to reference the ontology and to mediate data transfers
OWLBase thus provides an abstraction layer for submissions and queries
The ImageStore Ontology
The ImageStore Ontology was constructed using the Jena toolkit (www.hpl.hp.com/semweb) and our own open source Ontology Organiser, an ontology constraint propagator and datatype manager
ImageStore: uses a subset of the class model of the Advanced Authoring
Format (sourceforge.net/projects/aaf and www.aafassociation.org) to describe media objects
uses a subset of MPEG-7 to describe multimedia content, and has its own data model to describe scientific experiments
It is currently written in DAML+OIL
We are in the process of upgrading BioImage to use Jena 2, which will permit us to convert the ImageStore Ontology into OWL
What is required of an image ontology?
Such a generic image ontology as the ImageStore Ontology must describe all aspects of the images themselves:
their acquisition (including details of who took the original micrograph, where, when, under what conditions, for what purpose, etc.)
the media object itself (source and derivation, image type, dynamic range, resolution, format, codec, etc.)
the denotation of the referent (a description of exactly what is recorded by the image, e.g. the nature, age and pre-treatment of the subject), and
the connotation of the referent (i.e. the interpretation, meaning, purpose or significance imparted to the image by a human, its relevance to its creator and others, and its semantic relationship to other images).
In addition to these ancillary metadata about the image, there is yet a further need to record semantic content metadata related directly to the information content of the images or videos themselves
These semantic content metadata carry very high information value, since they relate directly to spatial (or spatio-temporal) features that are of most immediate relevance to human understanding of media content, namely “Where, when and why is what happening to whom?”
Image description – separating fact from hypothesis
BioImage Study title: Xklp1: a Xenopus kinesin-like protein essential for spindle organisation and chromosome positioning
Denotation (raw fact):
Immunofluorescence localization of Xklp1 in XL177 cells
Vernos et al., 1995
Connotation (interpretation):
Xklp1 is involved in chromosome localization during mitosis in embryonic Xenopus cells, since it is positioned at the metaphase plate
Representing fact and hypothesis within ImageStore
(Diagram) A fragment of the ImageStore OWL model: the class Segment is related by property restrictions to Denotation and Connotation, both subclasses of ContentDescription, and to Event; object properties such as states, tool, participant and location, and datatype properties such as weather, habitat, cameraMotionType (xsd:Mpeg7:cameraMotionType) and RegionOfInterest (xsd:Mpeg7:SpatialMask), describe FormOfExpression and EventContentDescription; statements are reified as collections of rdf:Statements, and the model spans the real world, the media world and the narrative world
The BioImage advanced search interface
The Advanced Search Interface permits Boolean searches, search restrictions, and re-use of previous searches in combination with new terms
Automated SQL query generation
Stage one: user inputs a query “Find images of bears”
Stage two: the ontology reasons over the request
Stage three: OWLBase converts the request to SQL
Stage four: metadata is retrieved from the database
Stage five: metadata is returned to OWLBase as XML
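Stages two and three can be sketched like this. The subclass table and the image/species schema are assumptions for illustration, not the real BioImage schema or the OWLBase code:

```python
# Sketch of ontology-driven SQL generation: the ontology expands the user's
# query term to its subclasses, then the expanded list becomes an SQL query.

SUBCLASSES = {
    # invented fragment of a taxonomy: term -> direct subclasses
    "bear": ["brown bear", "polar bear", "giant panda"],
}

def expand(term):
    """Stage two: reason over the ontology to collect the term and all subclasses."""
    terms = [term]
    for t in SUBCLASSES.get(term, []):
        terms.extend(expand(t))
    return terms

def to_sql(term):
    """Stage three: convert the expanded request to SQL (illustrative schema)."""
    placeholders = ", ".join(f"'{t}'" for t in expand(term))
    return f"SELECT * FROM image WHERE species IN ({placeholders})"

print(to_sql("bear"))
```

A query for “bears” thus retrieves giant panda images too, because the ontology (not the user) supplies the subclass knowledge.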
In summary:
Queries are made by our ontology-driven database query engine, OWLBase
OWLBase passes a query via the ImageStore ontology to the underlying PostgreSQL metadata relational database
The database returns metadata of studies matching the search term: authors title description network locator (URI) for the representative thumbnail image IDs of all the component datasets and images
These XML data are then used to populate the HTML Study Results Web page that is displayed to the user
Many of these items link to deeper metadata
If the user now clicks on one of the nodes linking to deeper metadata, a new OWLBase query is initiated that returns information about that component
Search result, showing Studies
What’s so special?
For each query, OWLBase builds in memory an RDF ‘knowledge graph’ representing the structure of the components of each of the matching studies
As the user clicks on nodes linking to deeper metadata, each new OWLBase query return is used to extend the RDF graph of the resource
In this way, the in-memory representation of the relevant metadata is built up dynamically and incrementally, as required
At present, this would not seem to provide much additional functionality over and above a conventional relational database SQL query system
However, the fact that the searches use the ImageStore Ontology and build up an OWLBase RDF graph opens the possibility of three novel advances:
Use of external third-party ontologies
Smart queries within the BioImage Database, and
Hypersearches across distributed resources
‘People’ metadata within BioImage
People have attributes: first and last names, dates of birth, addresses, phone numbers, etc.
People have various affiliations:
current membership of an institution, e.g. a university
former membership of another institution, e.g. undertook the research while a postdoc there
simultaneous membership of a third organisation, e.g. an international research project partnership
People have grants: “The work in this BioImage Study was funded by BBSRC”
People may have different roles within a BioImage Study:
this person planned the study – Principal investigator
that person prepared the specimen – Technician
a third person undertook the electron microscopy – Postdoc
together they wrote the Nature paper – Authors
Use of external ontologies
Because all BioImage queries are passed through the ImageStore ontology, and because ImageStore can be extended using external third-party ontologies, we have the possibility of using such external ontologies to enhance BioImage searches
In its simplest form, this can just be used to simplify metadata submission
For example, an organisation such as a pharmaceutical company might choose to use an instance of the BioImage Database System internally, behind its own firewall, for the organization of its own confidential research images
If that company already had an ontology-controlled database of all its employees’ details, there would be no need to re-enter those metadata for each image these people wished to record – all that would be required would be to link the BioImage Database System to the employee records ontology
But external ontologies can do much more for us . . .
Using external biological ontologies within BioImage
Biological content can be described using external ontologies – currently the GO ontology (www.geneontology.org) for genes and gene products, and the NCBI taxonomy (www.ncbi.nlm.nih.gov/Taxonomy) to identify species; soon others will also be used, e.g. the Ethodata Ontology
We have already implemented the display of an interactive taxonomic hierarchy that permits the user to browse by narrowing or broadening the scope of the results displayed after a query, by clicking at different points in the taxonomy
Thus the images of specimens derived from all rodents can be refined to show only those from mice, or broadened to show all mammalian images
Similar modification of other parameters is also possible: for instance, from confocal fluorescence images to real-time confocal images, or to all fluorescence images (these relationships being structured within the ImageStore Ontology)
At present we can use third party ontologies only if we pre-import them
We wish now to extend this functionality by creating dynamic access to external ontologies that are published in XML on the Web, thus ensuring that we always access the most recent version
Smart queries within the BioImage Database
We propose next to use external ontologies to provide the ability to undertake semantically rich searches of the BioImage Database that can handle:
synonyms (‘mouse’ and ‘Mus musculus’)
hierarchies (‘rodent’ and ‘mammal’)
exclusions (not a computer mouse), and
related terms (‘laboratory animal’ and ‘model species’)
rather than being limited to conventional ‘Google-like’ searching by means of exact keyword matching, the results of which are rather unpredictable!
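A minimal sketch of such synonym- and exclusion-aware matching follows. The mappings here are invented; in practice they would be drawn from the external ontologies at query time:

```python
# Sketch of ontology-backed 'smart query' matching: expand the query with
# synonyms, and reject documents that match an excluded sense of the term.

SYNONYMS   = {"mouse": {"mouse", "Mus musculus"}}
EXCLUSIONS = {"mouse": {"computer mouse"}}

def smart_match(query, document_terms):
    """True if the document matches the query's sense, via any synonym."""
    wanted   = SYNONYMS.get(query, {query})
    unwanted = EXCLUSIONS.get(query, set())
    terms = set(document_terms)
    return bool(terms & wanted) and not (terms & unwanted)

print(smart_match("mouse", ["Mus musculus", "hippocampus"]))   # True: synonym match
print(smart_match("mouse", ["computer mouse", "USB"]))         # False: excluded sense
```

Exact keyword matching would fail on the first document and wrongly accept the second; the ontology supplies the knowledge that fixes both.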
We do not yet know how this Semantic Web approach to database querying will scale with increasing database size, and we will need to undertake comparative research after implementing it
Hypersearches of distributed information sources
At present, the BioImage Database gives users the straightforward capability of linking out from a BioImage study, dataset or image via standard Web hyperlinks to relevant material elsewhere on the Web
For example, the Advanced Search Interface enables users to enter BioImage queries of the type: “Retrieve all images of Drosophila testes showing expression of the gene always early (aly)”, and then enable users to link out from these BioImage studies both to the gene sequences and to literature publications of relevance
What we cannot do at present, however, is to send complex queries across a set of databases, of the type: “Retrieve images of whole Drosophila, Xenopus and mouse embryos showing the comparative neural expression of the most anterior of their Hox genes at different developmental stages, and show me these gene sequences aligned to maximise homology”
We wish to investigate how to undertake complex integrated ‘hypersearches’ simultaneously over the BioImage Database and relevant ontology-enabled and Web Services-enabled sequence, structural and literature databases
How to implement hypersearches
The conventional way to search across disparate databases would be to map their schemas onto some common system, and then use that to distribute a query across them in a manner that each database can understand.
Our approach is somewhat different, and relies on the fact that OWLBase dynamically builds up an RDF representation of the information space of interest, and that external ontologies can be integrated with ImageStore
Specifically, we plan to import relevant sub-graphs from published external ontologies (i.e. class data rather than instance data) dynamically into the RDF graph being built up within OWLBase during each query
We will then use this extended graph to structure the hypersearches, by providing ‘internal’ knowledge about the structure of external databases
OWLBase will thus act as more than just a query engine: it will build dynamic graphs of relationships between resources within BioImage and resources outside it, and then run queries over that bigger graph
ImageBLAST
The ability to mount semantically rich queries over a variety of database resources opens the possibility of developing new bioinformatics search tools
Our first proposal for this, initially envisioned by our collaborator Michael Ashburner at the ORIEL Varenna conference last September, is ImageBLAST
By analogy with the BLAST tool for identifying homologous genes, Michael’s vision was for a tool in which a researcher could enter a nucleotide sequence and have returned images of the normal and mutant expression patterns of the protein encoded by that sequence, from all the model organism image databases, together with detailed metadata describing all that is known about that gene and its protein
Recently, my student Chris Holland and I have been designing some possible user interfaces for ImageBLAST
I will show them to you in fairly swift succession, to give you a glimpse of the vision we have in mind
The ImageBLAST home page
The ImageBLAST hypersearch interface
Gene name disambiguation
‘SAP1’ is a synonym for three separate gene products: beta 4 defensin (DEFB4, aka HBD-2), EKT4 (aka ETS-domain protein), and proposin (aka GLBA). Such homonym / synonym ambiguities are common
We will use the system developed by our ORIEL partner Martijn Schuemie of the Erasmus University in Rotterdam for gene name disambiguation, in combination with the ‘conceptual fingerprinting’ software of our industrial partner Collexis BV of Rotterdam
Conceptual fingerprinting involves weighting terms in a piece of text on the basis of their frequency and proximity. Terms are defined using the MeSH system and the UMLS biomedical thesaurus
Comparing numerical conceptual fingerprints permits rapid matching of related texts, and enables resolution of gene name ambiguity on the basis of the context of its usage
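A toy sketch of frequency-based fingerprinting and numerical comparison is given below. The proximity weighting used by Collexis is omitted, and the vocabulary is invented rather than drawn from MeSH or the UMLS:

```python
# Toy 'conceptual fingerprint': weight thesaurus terms by their frequency
# in a text, then compare fingerprints numerically by cosine similarity.
from collections import Counter
import math

VOCAB = {"gene", "expression", "defensin", "protein", "kinase"}  # stand-in thesaurus

def fingerprint(text):
    """Count occurrences of thesaurus terms in a text."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    return Counter(w for w in words if w in VOCAB)

def similarity(a, b):
    """Cosine similarity between two sparse fingerprints."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

f1 = fingerprint("DEFB4 is a defensin gene; its expression rises on infection.")
f2 = fingerprint("Expression of the defensin gene family in epithelia.")
print(round(similarity(f1, f2), 2))
```

Two texts about the same concept score highly even when their surface wording differs, which is what lets fingerprint comparison resolve a gene name's sense from its context.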
Summary results on ‘adh’ in Drosophila
DNA results on ‘adh’ in Drosophila
Product results on ‘adh’ in Drosophila
Structure of Drosophila adh
Pathway results on ‘adh’ in Drosophila
Example of a specific pathway
Phenotype results on ‘adh’ in Drosophila
One phenotype study on ‘adh’ in Drosophila
Will ImageBLAST work?
To work, ImageBLAST will clearly require intimate linkage between the ImageStore Ontology, the Gene Ontology, and the forthcoming Cell Ontology
It will also require integration with the Bio-MOBY Web Services for sequence bioinformatics (biomoby.org) developed by our Canadian colleague Mark Wilkinson
At present, our vision seems far from risk free
However, the pace of Semantic Web developments in which we have participated over the last two years has been truly astonishing
This gives reason to hope that, within a further two years, new developments in information space representation, and new methods for ontology integration and automated data extraction, will substantially aid us in attaining our goal
Such image bioinformatics tools, if indeed we succeed in developing them, will enormously facilitate knowledge mining within biological images, and will enable hitherto impossible types of on-line research to be undertaken
Populating the BioImage Database
But first the images must be made available in an ontology-driven database!
The BioImage Database will receive images on a regular basis from three main sources:
Journals: Three major scientific publications have already agreed to provide the BioImage Database with biological images on a regular basis:
The EMBO Journal
EMBO Reports
The Journal of Microscopy
Research projects and specialist databases: e.g. the Drosophila Testis Gene Expression Database
Laboratory image collections: The Open Microscopy Environment
If you have collections of high quality research images that you wish to publish, please let me know or contact us via www.bioimage.org
Final words of caution
A cautionary tale
We recently wrote to a colleague requesting a copy of a beautiful confocal image that he had collected some years ago
His reply typifies the wasteful fate of an unfortunately large proportion of biological research images:
“Concerning the image data you requested - this is a tough one. The image was recorded about ten years ago, and I never managed to write a paper about the work so it was never published. The original data (if they still exist) must be on some magneto-optical disk in one of many boxes in my flat - quite hopeless to find at short notice. All I can promise is that I’ll look into this once I am back from my travels – but that will take a few months. Whether anyone still has hardware capable of reading the disc is quite another matter! Sorry about this.”
It is perhaps the best possible argument for the routine publication of images arising from publicly funded research in databases such as the BioImage Database, which can provide a safe repository for them and free access to them for the community, and for the funding of such databases from the public purse
Ontologies are supposed to fit together neatly
- like irregular four-sided Penrose tiles
The blue shape represents our Ethodata Ontology – just one among many in the information landscape
. . . creating a harmonious whole
“Penroses” by Ruth McDowell
. . . but what if they don’t?
“Weeping woman” by Pablo Picasso
It is hoped that ontologies from different fields can be made ‘orthogonal’ to one another - non-overlapping and yet with no gaps between them
However, at present this is just an optimistic hope
As yet, there is insufficient ontological coverage of the universe of knowledge to know whether this particular vision of the Semantic Web can be realised
The data deluge and the paradigm trap
The volume of data generated in the Life Sciences is now estimated to be doubling every month
A single active cell biology lab may generate 10 to 100 Gbytes of multidimensional image data a month
Soon the only way to handle the data will be through the presuppositional ‘lens’ of an ontology – people will never have time to look at the raw data
Does that matter? After all, the ontology is a specification of the accepted paradigm established by the respected leading academics of the day
In other words, an ontology fossilizes the prejudices of the old farts
Could this perhaps, maybe, just possibly, lead to a blinkered view of the world?
Might this hamper the process of discovery and inhibit the overthrow of incorrect hypotheses?
- what if Newton had written the ontology for physics? BEWARE!
End
Additional slides of relevance
Entity-Attribute-Value storage
Entity-Attribute-Value databases have recently found favour among healthcare professionals as a way of recording patient data
Like patient data, image descriptive data may be sparse – an image represents a small subset of the objects in the real world, just as a patient will have only a small subset of all possible diseases and treatments
Whereas in conventional relational database models, each description is stored in a specific column, the EAV approach uses row modelling - each description generates a row consisting of:
an entity (e.g. this_rose)
an attribute (a property of the entity, e.g. has_colour)
and a corresponding value of the attribute (e.g. red)
These EAV triples are easily encoded in RDF
For the BioImage Database, we use conventional relational tables for those items upon which searches are frequently made – author, title, species, etc. - and have adopted the EAV approach for those metadata items that are not
Patient records for blood parameters

First name | Last name | Disease       | White cell count | Cholesterol | Ethyl alcohol | Prostate-specific antigen | (lots more columns . . .)
Mary       | Smith     | Alcoholism    |                  |             | 0.3 mg/dl     |                           | (lots of blank values . . .)
John       | Smith     | Cancer        |                  |             |               | 40 ng/ml                  |
Ken        | Jones     | Heart disease |                  | 340 mg/dl   |               |                           |
Barry      | Brown     | AIDS          | 630 cells/µl     |             |               |                           |

A conventional relational database table, with lots of blanks
Adding new columns to the table to accommodate new tests is not easy
Person table

First | Last  | ID
Mary  | Smith | 125
John  | Smith | 126
Ken   | Jones | 127
Barry | Brown | 128

Auxiliary table

Resource Name | Resource ID | Property ID
Person        | 125         | 1
Person        | 126         | 2
Person        | 127         | 3
Person        | 128         | 4
Attribute-Value table

ID | Attribute                 | Value
1  | Ethyl alcohol             | 0.3
2  | Prostate-specific antigen | 40
3  | Cholesterol               | 340
4  | White cell count          | 630
Units appropriate to each attribute are defined in the ontology, and so do not need to be specified in the table
EAV tables to record patient details
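The row modelling in the tables above can be sketched in a few lines of code. This is a minimal illustration using an in-memory SQLite database; the table and column names here are illustrative, not the actual BioImage schema:

```python
import sqlite3

# Sparse patient data stored as (entity, attribute, value) rows
# rather than one wide table with many blank columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, first TEXT, last TEXT)")
conn.execute("CREATE TABLE eav (entity INTEGER, attribute TEXT, value REAL)")

conn.executemany("INSERT INTO person VALUES (?, ?, ?)", [
    (125, "Mary", "Smith"), (126, "John", "Smith"),
    (127, "Ken", "Jones"), (128, "Barry", "Brown"),
])
# Only the measurements that were actually taken generate rows.
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", [
    (125, "Ethyl alcohol", 0.3),
    (126, "Prostate-specific antigen", 40),
    (127, "Cholesterol", 340),
    (128, "White cell count", 630),
])

# Retrieve every recorded attribute for one patient.
rows = conn.execute(
    "SELECT p.first, e.attribute, e.value FROM person p "
    "JOIN eav e ON e.entity = p.id WHERE p.last = 'Jones'"
).fetchall()
print(rows)  # → [('Ken', 'Cholesterol', 340.0)]
```

Adding a new blood test requires no schema change: it is simply a new attribute string in the EAV table, and each (entity, attribute, value) row maps directly onto an RDF triple.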
“Is Emily Jane’s father a Yorkshire clergyman?”
BIRTHS, MARRIAGES AND DEATHS
Born to Revd John and Mrs Marjorie Sanders of St Paul’s Vicarage, Tadcaster Road, Leeds: a daughter Emily Jane, at 11:25 a.m. on 25th December 2003, weight 3.6 kg.
Note that the only common element between the question and the press announcement is the child’s name
No conventional electronic query, formulated to interrogate a relational database containing the information within the press announcement, could possibly come up with the correct answer to this question
Why? Because people are able to employ deductive reasoning and extensive linguistic, cultural and geographical knowledge
Use of the correct ontology could help a computer to reach the same conclusion
An example from everyday life . . .
What would that ontology have to ‘know’?
That a daughter is a female child, and that a male parent is a father
That “John” is a man’s name
That “Revd” is an abbreviation for “The Reverend”, the title given to an ordained minister of religion
That a typical employment for a minister of religion in the Anglican Church is to be a vicar, i.e. the minister of a parish church
That Anglican parish churches are named after Christian saints
That a “vicarage” is a house provided for the accommodation of a vicar and his/her family
That since Revd Sanders lives in St Paul’s Vicarage, as well as being an ordained minister of religion, it is highly likely that he is indeed the Anglican vicar of St Paul’s church
That a synonym for “vicar” is “clergyman”
That Leeds is an English city within the county of Yorkshire
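The chain of background knowledge listed above can be sketched as a toy fact base and a few hand-coded rules. All of the predicate names and the rule encoding below are illustrative assumptions for this slide's example; this is not the OWLBase query engine or any real reasoner:

```python
# Facts extractable from the press announcement itself.
facts = {
    ("Emily Jane", "child_of", "John Sanders"),
    ("Emily Jane", "sex", "female"),                    # "a daughter"
    ("John Sanders", "title", "Revd"),
    ("John Sanders", "lives_in", "St Paul's Vicarage"),
    ("St Paul's Vicarage", "located_in", "Leeds"),
}
# Background knowledge the ontology would have to supply.
background = {
    ("Revd", "abbreviates", "The Reverend"),
    ("The Reverend", "title_of", "clergyman"),
    ("vicar", "synonym_of", "clergyman"),
    ("Leeds", "located_in", "Yorkshire"),
}
kb = facts | background

def holds(s, p, o):
    return (s, p, o) in kb

# Rule 1: a female child's male parent is her father
# (the ontology must also know that "John" is a man's name).
is_father = holds("Emily Jane", "child_of", "John Sanders") and \
            holds("Emily Jane", "sex", "female")
# Rule 2: someone titled "Revd" is a clergyman.
is_clergyman = holds("John Sanders", "title", "Revd") and \
               holds("Revd", "abbreviates", "The Reverend") and \
               holds("The Reverend", "title_of", "clergyman")
# Rule 3: location is transitive - Vicarage -> Leeds -> Yorkshire.
in_yorkshire = holds("John Sanders", "lives_in", "St Paul's Vicarage") and \
               holds("St Paul's Vicarage", "located_in", "Leeds") and \
               holds("Leeds", "located_in", "Yorkshire")

answer = is_father and is_clergyman and in_yorkshire
print(answer)  # → True
```

The point of the sketch is how much of the answer lives in the `background` set rather than in the announcement: remove any one of those triples and the query fails.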
Do mountains exist?
Are we at the top of Everest?
Are we on Mount Fuji at all?
Ontologies
Ontologies can describe many different kinds of relationships
[Diagram: an ontology fragment relating Bears via is_a relationships to the Diet classes Omnivore and Herbivore]
However, ontologies can have problems …
We classify pandas as herbivores because 99% of their diet is bamboo
What about the other 1%?
Autopsy of one panda revealed the bones of a bamboo rat in its stomach
In captivity, pandas will eat pork coated with honey
Does this make the panda an omnivore?
Humans make ‘reasonable judgements’ when classifying things
However, machines usually reason over facts that are either true or false, and cannot easily be programmed to make subtle distinctions
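The panda problem can be made concrete with a crisp class hierarchy. This is a minimal sketch with illustrative class names, standing in for the is_a relationships an ontology would assert:

```python
# A crisp is_a hierarchy (illustrative class names).
class Animal: pass
class Herbivore(Animal): pass
class Omnivore(Animal): pass

class Panda(Herbivore):
    # Classified as a herbivore because 99% of the diet is bamboo . . .
    diet = {"bamboo": 0.99, "meat": 0.01}

# A machine reasoner sees only the crisp subclass axiom:
print(issubclass(Panda, Herbivore))  # → True
print(issubclass(Panda, Omnivore))   # → False
# . . . even though the recorded diet facts show the boundary is fuzzy:
print(Panda.diet["meat"] > 0)        # → True
```

The subclass axiom forces a true/false answer, while the human judgement ("herbivorous, to a reasonable approximation") has no place to live in the hierarchy itself.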
Scientific imaging
Images and videos form a vital part of the scientific record, for which words are no substitute
In the post-genomic world, attention is now focused on the functional analyses of gene expression, and on organization and integration within cells
In a month a single active cell biology lab may generate between 10 and 100 Gbytes of multidimensional image data
But at present little of this is published
The problem of image publication
Even when images are published, they are often only processed images, not the original image data
For example one might publish a single section or a projection from a complete 3D confocal image
or a couple of frames from a movie
It would be of great value if more original image data were published
This would both permit re-analysis and secondary meta-research
and would be useful for teaching and learning
Using Protégé to define a class in the ontology
Ontology Organiser
A constraint propagator and datatype manager

Ontology Organiser has capabilities not found in other editors such as OilEd.

First, it reduces the cognitive overload on the user during ontology development while asserting relationships between resources. It:
evaluates constraints placed on relationships
propagates any necessary alterations up through the ontology's hierarchy, thereby maintaining ‘semantic robustness’

Second, it addresses the more technical problem of the lack of support for datatypes in existing ontology editing packages. Ontology Organiser goes some way towards helping the user define, modify and reference custom datatypes in their ontologies.
Ontology Organiser is available from SourceForge. Details can be found at www.bioimage.org/publications.do