Knowing what we’re talking about

Robert Stevens, Bio-health Informatics Group, School of Computer Science, University of Manchester, Oxford Road, Manchester, United Kingdom, M13 9PL, [email protected]


Invited talk at CSIR, Pretoria, 2013

Transcript of Knowing what we’re talking about

Page 1: Knowing what we’re talking about

Knowing what we’re talking about

Robert Stevens

Bio-health Informatics Group

School of Computer Science

University of Manchester

Oxford Road

Manchester

United Kingdom

M13 9PL

[email protected]

Page 2: Knowing what we’re talking about

We have an item of data

• 27

• 27 what?

• What units? With what is 27 associated?

• Even if I told you, would we interpret what I said in the same way?

27

Page 3: Knowing what we’re talking about


27mm

Page 4: Knowing what we’re talking about


tail of 27mm

Page 5: Knowing what we’re talking about

Mouse tail of 27 mm

• … and we can carry on: mouse strain, where it was raised, on what it was fed, times, dates, etc.

• All this data is necessary to interpret my original number

• Even if that metadata exists, we have to agree on the things the numbers describe

mouse tail of 27mm
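The step from "27" to "mouse tail of 27mm" can be sketched as a small data structure. This is only an illustration: the class name `Measurement` and its field names are assumptions made for the example, not part of the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    """A number plus the context needed to interpret it (hypothetical fields)."""
    value: float
    unit: Optional[str] = None      # e.g. "mm"
    quality: Optional[str] = None   # what was measured, e.g. "tail"
    bearer: Optional[str] = None    # what it was measured on, e.g. "mouse"

    def describe(self) -> str:
        # Build the description up in the same order as the slides.
        text = f"{self.value:g}{self.unit or ''}"
        if self.quality:
            text = f"{self.quality} of {text}"
        if self.bearer:
            text = f"{self.bearer} {text}"
        return text


bare = Measurement(27)
full = Measurement(27, unit="mm", quality="tail", bearer="mouse")
print(bare.describe())  # 27
print(full.describe())  # mouse tail of 27mm
```

Even the full record only helps if producer and consumer agree on what "mouse" and "tail" refer to, which is the point of the rest of the talk.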

Page 6: Knowing what we’re talking about

What is knowledge?

Page 7: Knowing what we’re talking about

Heterogeneity is rife

• We agree on units (more or less)…

• We don’t agree on much else when it comes to labels for the entities in our domain

• If we don’t know what we’re talking about….

• It’s difficult to interpret and exchange data and the results from data

Page 8: Knowing what we’re talking about

Categories and Category Labels

GO:0000368

U2-type nuclear mRNA 5' splice site recognition

spliceosomal E complex formation

spliceosomal E complex biosynthesis

spliceosomal CC complex formation

U2-type nuclear mRNA 5'-splice site recognition
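One way to picture the slide above is a lookup from many labels to one category identifier. The sketch below uses only the identifier and labels shown on the slide; the helper names `normalise` and `lookup` and the normalisation rule are assumptions for illustration.

```python
GO_ID = "GO:0000368"

# The labels shown on the slide for this one category.
LABELS = [
    "U2-type nuclear mRNA 5' splice site recognition",
    "spliceosomal E complex formation",
    "spliceosomal E complex biosynthesis",
    "spliceosomal CC complex formation",
    "U2-type nuclear mRNA 5'-splice site recognition",
]


def normalise(label: str) -> str:
    """Lower-case and treat hyphens as spaces so near-identical labels match."""
    return " ".join(label.lower().replace("-", " ").split())


LABEL_TO_ID = {normalise(label): GO_ID for label in LABELS}


def lookup(label: str):
    """Return the category identifier for a label, or None if unknown."""
    return LABEL_TO_ID.get(normalise(label))


print(lookup("Spliceosomal E complex formation"))  # GO:0000368
print(lookup("roast beef"))                        # None
```

The stable identifier is what software should exchange; the labels are for people, and there can be many of them.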

Page 9: Knowing what we’re talking about

The Ogden Triangle

“Roast Beef”

Concept

[Ogden, Richards, 1923]

• Humans require words (or at least symbols) to communicate efficiently. The mapping of words to things is only indirectly possible. We do it by creating concepts that refer to things.

• The relation between symbols and things has been described in the form of the meaning triangle:

Page 10: Knowing what we’re talking about

We need to know what we’re talking about…

• … if we don’t, our data are useless

• If we are to interpret our data, then we need to know what entities they describe

• We need to share data and re-use it

• We need to find data; compare data; analyse data

• We need to know what we know….

Page 11: Knowing what we’re talking about

Manchester Mercury, January 1st 1754

Executed 18

Found Dead 34

Frighted 2

Kill'd by falls and other accidents 55

Kill'd themselves 36

Murdered 3

Overlaid 40

Poisoned 1

Scalded 5

Smothered 1

Stabbed 1

Starved 7

Suffocated 5

Aged 1456

Consumption 3915

Convulsion 5977

Dropsy 794

Fevers 2292

Smallpox 774

Teeth 961

Bit by mad dogs 3

Broken Limbs 5

Bruised 5

Burnt 9

Drowned 86

Excessive Drinking 15

List of diseases & casualties this year

19276 burials

15444 christenings

Deaths by centile

Page 12: Knowing what we’re talking about

A World of Instances

• The world (of information) is made up of things and lots of them

• Instances, individuals, objects, tokens, particulars.

• The Earth is a Planet

• Robert Stevens (NE 67 41 58 A) is a Person

• All the individual Alpha Haemoglobins in my many Instances of Red Blood Cell

• Each cell instance in my Body has copies of some 30,000 Genes

• A Word, language, idea, etc.

• This Table, those Chairs,

• Any Thing with “A”, “The”, “That”, etc. before it….

Page 13: Knowing what we’re talking about

We Put things into Categories

• All these instances hang about making our world

• Putting these things into categories is a fundamental part of human cognition

• Psychologists study this as concept formation

• Similar instances are put into the same category

Page 14: Knowing what we’re talking about

We have Labels for the Categories and their Instances

• We label categories with symbols: Words

• “Lion” is a category of big cat with big teeth

• Gene, Protein, Cell, Person, Hydrolase Activity, etc.

• …and, as we’ve already seen, each category can have many labels and any particular label can refer to more than one category

• Semantic Heterogeneity

• “A lion” is an instance in that category

• Does the category “Lion” exist?

• Lions exist, but the category could just be a human way of talking about lions

• … we like putting things into categories

Page 15: Knowing what we’re talking about

A Controlled Vocabulary

• A specified set of words and phrases for the categories in which we place instances

• Natural language definitions for those words and phrases

• A glossary defines, but doesn’t control

• The UniProt keywords define and control

• Control is placed upon which labels are used to represent the categories (concepts) we’ve used to describe the instances in the world

• …, but there is nothing about how things in these categories are related

Biopolymer

DNA

Enzyme

Nucleic acid

mRNA

Polypeptide

snRNA

tRNA
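A controlled vocabulary in this sense can be sketched as a fixed set of permitted labels plus a check. The terms are those listed on the slide; the function name `check_annotation` is an assumption for illustration.

```python
# The permitted labels, taken from the slide. Note that the set is flat:
# it says nothing about how the categories relate to one another.
CONTROLLED_VOCABULARY = {
    "Biopolymer", "DNA", "Enzyme", "Nucleic acid",
    "mRNA", "Polypeptide", "snRNA", "tRNA",
}


def check_annotation(term: str) -> str:
    """Accept a term only if it is one of the controlled labels."""
    if term not in CONTROLLED_VOCABULARY:
        raise ValueError(f"{term!r} is not in the controlled vocabulary")
    return term


print(check_annotation("mRNA"))  # mRNA
try:
    check_annotation("messenger RNA")
except ValueError as error:
    print(error)
```

This is the "control" part only: a glossary would add definitions, but would not reject "messenger RNA".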

Page 16: Knowing what we’re talking about

We also like to Relate Things Together

• Categories have subcategories

• Instances in one category can be related in some way to instances in another

• Can relate instances to each other in many different ways

• Is-a, part-of, develops-from, etc. axes

• We can use these relationships to classify categories

• Things in category A are part is

• If all instances in category A are also in category B then As are kinds of Bs

[Figure: is-a hierarchy]

Biopolymer

    Nucleic Acid

        DNA

        RNA

            tRNA

            mRNA

            snRNA

    Polypeptide

        Enzyme
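The rule "if all instances in category A are also in category B then As are kinds of Bs" makes is-a transitive, which a few lines of code can exploit. This is a minimal sketch: the parent links are read off the figure (with the small nuclear RNA spelled snRNA, as in the earlier term list), and the function names are assumptions.

```python
# child -> parent links, as read from the hierarchy figure.
PARENT = {
    "Nucleic Acid": "Biopolymer",
    "Polypeptide": "Biopolymer",
    "Enzyme": "Polypeptide",
    "DNA": "Nucleic Acid",
    "RNA": "Nucleic Acid",
    "tRNA": "RNA",
    "mRNA": "RNA",
    "snRNA": "RNA",
}


def ancestors(category: str) -> list:
    """Walk up the is-a links, nearest ancestor first."""
    result = []
    while category in PARENT:
        category = PARENT[category]
        result.append(category)
    return result


def is_a(sub: str, sup: str) -> bool:
    """True if every `sub` is also a `sup` according to the hierarchy."""
    return sub == sup or sup in ancestors(sub)


print(ancestors("mRNA"))           # ['RNA', 'Nucleic Acid', 'Biopolymer']
print(is_a("mRNA", "Biopolymer"))  # True
print(is_a("mRNA", "Polypeptide")) # False
```

A query for "all nucleic acid data" can then also return data labelled mRNA, which a flat vocabulary cannot do.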

Page 17: Knowing what we’re talking about

Categories and sub-categories

biopolymer

polypeptide

Nucleic acid

enzyme

DNA

RNA

Page 18: Knowing what we’re talking about

Describing Category Membership

• We can make conditions that any instance must fulfil in order to be a member of a particular category

• A Phosphatase must have a phosphatase catalytic domain

• A Receptor must have a transmembrane domain

• A codon has three nucleotide residues

• A limb has part that is a joint

• A man has a Y chromosome and an X chromosome

• A woman has only an X chromosome
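Some of the conditions above can be written as simple tests on the parts an instance has. This is a sketch under a strong simplification: an instance is represented only as a list of part names, and the table `CONDITIONS` and function `satisfies` are hypothetical names.

```python
# One condition per category, each a test over the list of parts.
CONDITIONS = {
    "Phosphatase": lambda parts: "phosphatase catalytic domain" in parts,
    "Receptor": lambda parts: "transmembrane domain" in parts,
    "Codon": lambda parts: parts.count("nucleotide residue") == 3,
    "Limb": lambda parts: "joint" in parts,
}


def satisfies(category: str, parts: list) -> bool:
    """True if the parts meet the condition stated for `category`."""
    return CONDITIONS[category](parts)


print(satisfies("Codon", ["nucleotide residue"] * 3))  # True
print(satisfies("Codon", ["nucleotide residue"] * 2))  # False
print(satisfies("Limb", ["joint", "bone"]))            # True
```

Failing a "must have" condition rules an instance out of the category; passing it does not, on its own, rule the instance in.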

Page 19: Knowing what we’re talking about

Relationships

• These conditions are made from a property and a successor relationship

• isPartOf, hasPart

• isDerivedFrom

• DevelopsFrom

• isHomologousTo

• …and many, many more

Page 20: Knowing what we’re talking about

A Structured Controlled Vocabulary

• Not only can we agree on the labels we give categories

• Can also agree on how the instances of categories are related

• And agree on the labels we give the relations

• Structure aids querying and captures knowledge with greater fidelity

[Figure: the is-a hierarchy (Biopolymer; Nucleic Acid, Polypeptide; Enzyme; DNA, RNA; tRNA, mRNA, snRNA) extended with Gene and the relations regionOf, transcribedFrom and translatedFrom]
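A structured controlled vocabulary can be pictured as a list of labelled links that software can query. In the sketch below the relation names come from the slide, but which categories each relation connects is an assumption made for illustration, since the transcript does not preserve the figure's arrows.

```python
# (subject, relation, object). The endpoints of regionOf, transcribedFrom
# and translatedFrom are ASSUMED here; only the relation names are from
# the slide.
TRIPLES = [
    ("Gene", "regionOf", "DNA"),
    ("mRNA", "transcribedFrom", "Gene"),
    ("Polypeptide", "translatedFrom", "mRNA"),
    ("mRNA", "is-a", "RNA"),
    ("RNA", "is-a", "Nucleic Acid"),
]


def query(subject=None, relation=None, obj=None):
    """Return every triple matching the given parts; None matches anything."""
    return [
        t for t in TRIPLES
        if subject in (None, t[0])
        and relation in (None, t[1])
        and obj in (None, t[2])
    ]


print(query(relation="transcribedFrom"))
print(query(subject="mRNA"))
```

Because the relation labels are controlled too, a query such as "everything transcribedFrom a Gene" means the same thing to every user of the vocabulary.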

Page 21: Knowing what we’re talking about

A Stronger Definition

• a set of logical axioms designed to account for the intended meaning of a formal vocabulary used to describe a certain (conceptualisation of) reality (described in an information system) [Guarino 1998]

• “conceptualisation of” inserted by me

• “Logical axioms” means a formal definition of meaning of terms in a formal language

• Formal language: something a computer can reason with

• Use symbols to make inferences

• Symbols represent things and their relationships

• Making inferences about things computationally

Page 22: Knowing what we’re talking about

So what is an ontology?

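[Figure: the ontology spectrum, from least to most formal. After Chris Welty et al.]

• Catalog/ID

• Terms/glossary

• Thesauri

• Informal is-a

• Formal is-a

• Formal instance

• Frames (properties)

• Value restrictions

• Disjointness, inverse, part-of

• General logical constraints

Examples placed along the spectrum: Gene Ontology, Mouse Anatomy, EcoCyc, PharmGKB, TAMBIS, Arom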

Page 23: Knowing what we’re talking about

What does it all mean anyway?

• To interpret our data we need to know what it is we’re talking about

• We need to decide the things that we’re talking about and agree upon them

• We need to agree on how to recognise those entities

• We need to know how they are related to one another

• Ontologies are a mechanism for describing those entities and their definitions

• There’s more to knowledge representation than ontologies…

Page 24: Knowing what we’re talking about

All this knowledge needs representing

• We want this knowledge in a computational form

• To make the knowledge available for software (and humans)

• To help us develop and manage the (often) complex artefacts

Building ontologies is hard (getting all those relationships in the right place)

The Web Ontology Language (OWL) is a W3C recommendation for ontologies on the Semantic Web and in semantically enabled applications

A knowledge representation language with a strict semantics that is amenable to automated reasoning

Page 25: Knowing what we’re talking about

Web Ontology Language (OWL)

• W3C recommendation for ontologies for the Semantic Web

• OWL-DL maps to a decidable fragment of first-order logic

• Classes, properties and instances

• Boolean operators, plus existential and universal quantification

• Rich class expressions used in restrictions on properties, e.g. hasDomain some (ImmunoglobulinDomain or FibronectinDomain)
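The difference between existential ("some") and universal ("only") quantification can be sketched in plain Python. This is not OWL and not a reasoner: domains are represented as bare strings (with the immunoglobulin class spelled ImmunoglobulinDomain), the function names are assumptions, and the check is closed-world over the listed data, whereas OWL reasons under an open-world assumption.

```python
# The union (ImmunoglobulinDomain or FibronectinDomain) as a set of names.
ALLOWED = {"ImmunoglobulinDomain", "FibronectinDomain"}


def has_domain_some(domains, allowed=ALLOWED) -> bool:
    """Existential restriction: at least one domain is of an allowed kind."""
    return any(d in allowed for d in domains)


def has_domain_only(domains, allowed=ALLOWED) -> bool:
    """Universal restriction: every domain, if there is any, is allowed."""
    return all(d in allowed for d in domains)


mixed = ["FibronectinDomain", "KinaseDomain"]  # hypothetical protein
print(has_domain_some(mixed))  # True
print(has_domain_only(mixed))  # False
print(has_domain_only([]))     # True: "only" does not imply "some"
```

The last line is the classic surprise: a universal restriction is satisfied by something with no domains at all.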

Page 26: Knowing what we’re talking about

What are we saying?

Person

Woman is-a Person

Man is-a Person

• Are all instances of Man instances of Person?

• Can an instance of Person be both an instance of Man and an instance of Woman?

• Can there be any more kinds of Person?
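The three questions can be made concrete by treating each class as a set of instances. The individuals below are invented for the illustration, and the function names are assumptions.

```python
# Hypothetical individuals, invented only for this illustration.
person = {"ann", "bob", "cat"}
woman = {"ann", "cat"}
man = {"bob"}


def is_subclass(sub: set, sup: set) -> bool:
    """is-a: every instance of `sub` is an instance of `sup`."""
    return sub <= sup


def are_disjoint(a: set, b: set) -> bool:
    """No instance belongs to both classes."""
    return a.isdisjoint(b)


def is_covered_by(sup: set, *subs: set) -> bool:
    """Every instance of `sup` belongs to at least one of `subs`."""
    return set().union(*subs) == sup


print(is_subclass(man, person))           # True
print(are_disjoint(man, woman))           # True
print(is_covered_by(person, man, woman))  # True
```

The two is-a arrows only state the first of these. In OWL, disjointness and covering hold only if they are stated as additional axioms.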

Page 27: Knowing what we’re talking about

What are we saying?

• What kinds of class can fill “has chromosome”?

• How many “Y chromosome” are present?

• Does there have to be a “Y chromosome”?

• What properties are sufficient to be a Man and which are simply necessary?

[Figure: Man has-chromosome Y chromosome]

[Figure: Man has-chromosome Y chromosome (1), X chromosome (1), autosome (44)]
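The counts in the figure read naturally as cardinality restrictions on has-chromosome. The sketch below assumes they mean "exactly" (the transcript does not say whether minimum or exact counts are intended), and the names `MAN_CARDINALITIES` and `meets_cardinalities` are hypothetical.

```python
from collections import Counter

# Counts from the figure, ASSUMED here to mean "exactly this many".
MAN_CARDINALITIES = {"Y chromosome": 1, "X chromosome": 1, "autosome": 44}


def meets_cardinalities(chromosomes, required=MAN_CARDINALITIES) -> bool:
    """True if each required kind occurs exactly the required number of times."""
    counts = Counter(chromosomes)
    return all(counts[kind] == n for kind, n in required.items())


xy = ["Y chromosome", "X chromosome"] + ["autosome"] * 44
xx = ["X chromosome", "X chromosome"] + ["autosome"] * 44
print(meets_cardinalities(xy))  # True
print(meets_cardinalities(xx))  # False
```

Writing the counts down forces the questions on the slide to be answered explicitly, which is much of the value of a formal representation.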

Page 28: Knowing what we’re talking about

OWL represents classes of instances

[Figure: three classes labelled A, B and C]

Page 29: Knowing what we’re talking about

Necessity and Sufficiency

• Necessity: what must a class instance have?

• An R2A phosphatase must have a fibronectin domain

• Having a fibronectin domain does not a phosphatase make

• Sufficiency: how is an instance recognised to be a member of a class?

• Any protein that has a phosphatase catalytic domain is a phosphatase enzyme

• All phosphatase enzymes have a catalytic domain
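The two kinds of condition do different jobs, which a short sketch can show. Proteins are reduced to lists of domain names, and the function names are assumptions for illustration.

```python
CATALYTIC = "phosphatase catalytic domain"
FIBRONECTIN = "fibronectin domain"


def recognised_as_phosphatase(domains) -> bool:
    """Sufficient condition: having the catalytic domain is enough
    to recognise a protein as a phosphatase."""
    return CATALYTIC in domains


def consistent_as_r2a_phosphatase(domains) -> bool:
    """Necessary condition: anything asserted to be an R2A phosphatase
    must have a fibronectin domain. This checks an assertion; it cannot
    recognise new members."""
    return FIBRONECTIN in domains


print(recognised_as_phosphatase([CATALYTIC]))      # True
print(recognised_as_phosphatase([FIBRONECTIN]))    # False
print(consistent_as_r2a_phosphatase([CATALYTIC]))  # False
```

Only the sufficient condition lets software classify an unlabelled protein; the necessary condition lets it spot a protein that has been labelled wrongly.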

Page 30: Knowing what we’re talking about

Uses of ontologies

Page 31: Knowing what we’re talking about

Ontologies in software

Page 32: Knowing what we’re talking about

Problems Ontologies in Biology Try To Solve

• Provenance – where did it come from, who did it?

• Reproducibility – can I repeat and find results reported?

• Sharing – can others understand your data?

• Integration – can I readily take multiple (thousands of) data sets and use them without preparation?

• New knowledge – can we infer new knowledge as a sum of current knowledge (computationally)?

Page 33: Knowing what we’re talking about

The rise and rise of ontologies

Page 34: Knowing what we’re talking about

What are the prospects for ontologies?