Post on 31-Dec-2015
1
How to build an ontology 2
Barry Smith
http://ontology.buffalo.edu/smith
2
The 3-level Distinction
Level 1: everything that exists (things, processes, data …);
Level 2: ideas in people’s minds (diagnoses, thoughts, images in your head, expectations, beliefs, fears …)
Level 3: publicly available (published, written down, drawn, recorded, saved) versions of level 2 entities (ontologies, databases, journal articles, newspaper reports, diaries …)
The 3-level Distinction
Level 1:
#120: an incident that happened;
Level 2:
#213: the interpretation by some cognitive agent that #120 is a security breach;
#31: the expectation by some cognitive agent that similar incidents might happen in the future;
Level 3:
#402: an entry in an information system concerning #120;
#1503: an entry in some other information system about #31 for mitigation or prevention purposes.
5
How do we know which general terms designate universals?
Roughly: terms used by scientists to designate entities about which we have a plurality of different kinds of testable proposition
(cell, electron ...)
More precisely: terms which designate universals are:
1. General
2. Used in current scientific textbooks to express laws of nature
3. Logically non-compound (‘non-rabbit’, ‘rabbit or violin’ do not designate universals)
4. Contain no parts designating particulars (‘cat in Leipzig’, ‘Finnish spy’ do not designate universals)
6
7
Class =def. a maximal collection of particulars determined by a general term (‘cell’, ‘electron’, but also: ‘restaurant in Palo Alto’, ‘Italian’)
the class A = the collection of all particulars x for which ‘x is A’ is true
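As a rough illustration of this definition, a class can be computed as the set of all particulars for which 'x is A' is true. The particulars and their labels below are made-up examples, not data from the slides.

```python
# Hypothetical particulars, each tagged with the general terms that are true of it.
particulars = {
    "c1": {"cell"},
    "e1": {"electron"},
    "r1": {"restaurant in Palo Alto"},
    "r2": {"restaurant in Palo Alto"},
}

def class_of(term):
    """Return the class A: all particulars x for which 'x is A' is true."""
    return {x for x, terms in particulars.items() if term in terms}

restaurants = class_of("restaurant in Palo Alto")
```

A general term with no matching particulars simply yields the empty class.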
8
universals vs. their extensions
universals
{a,b,c,...} collections of particulars
9
Extension =def
The extension of a universal A is the class: instance of the universal A
(it is the class of A’s instances)
(the class of all entities to which the term ‘A’ applies)
10
Problem
The same general term can be used to refer both to universals and to collections of particulars. Consider:
HIV is an infectious retrovirus
HIV is spreading very rapidly through Asia
11
universals vs. classes
universals
{c,d,e,...} classes
12
universals vs. classes
universals
defined classes
13
universals vs. classes
universals
populations, ...
14
Defined class =def
a class defined by a general term which does not designate a universal
the class of all diabetic patients in Leipzig on 4 June 1952
15
OWL is a good representation of defined classes
• sibling of Finnish spy
• member of Abba aged > 50 years
16
Terminology =def.
a representational artifact whose representational units are natural language terms (with IDs, synonyms, comments, etc.) which are intended to designate universals together with defined classes.
17
universals, classes, concepts
universals
defined classes
‘concepts’ ?
18
universals < defined classes < ‘concepts’
‘concepts’ which do not correspond to defined classes:
‘Surgical or other procedure not carried out because of patient's decision’
‘Congenital absent nipple’
because they do not correspond to anything
19
(Scientific) Ontology =def.
a representational artifact whose representational units (which may be drawn from a natural or from some formalized language) are intended to represent
1. universals in reality
2. those relations between these universals which obtain universally (= for all instances)
lung is_a anatomical structure
lobe of lung part_of lung
20
Part II: How to Build an Ontology
21
How to build an ontology
work with scientists to create an initial top-level classification
find ~50 most commonly used terms corresponding to universals in reality
arrange these terms into an informal is_a hierarchy according to this Universality principle
A is_a B =def. every instance of A is an instance of B
fill in missing terms to give a complete hierarchy
(leave it to domain scientists to populate the lower levels of the hierarchy)
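A minimal sketch of how the instance-level reading of is_a could be checked against data. The instance identifiers are hypothetical, and a finite data set can only refute an is_a assertion, never prove that it holds universally.

```python
# Hypothetical instance data: universal name -> set of known instances.
instances = {
    "anatomical structure": {"lung#1", "lung#2", "lobe#1"},
    "lung": {"lung#1", "lung#2"},
}

def is_a_consistent(a, b):
    """Check 'A is_a B' against the data: every instance of A is an instance of B."""
    return instances[a] <= instances[b]
```

Here `is_a_consistent("lung", "anatomical structure")` holds, while the reverse direction fails because not every anatomical structure in the data is a lung.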
22
Principle of Low Hanging Fruit
Include even absolutely trivial assertions (assertions you know to be universally true)
pneumococcal virus is_a virus
Computers need to be led by the hand
23
Goal: Each term in an ontology represents exactly one universal
there are universals also of collectivities:
population
complex of cells
24
the use-mention confusion
swimming is healthy and has eight letters
25
Principle
Avoid confusing words with things
Avoid confusing concepts in our minds with entities in reality
Recommendation: avoid the word ‘concept’ entirely
26
Principle
For the sake of interoperability with other ontologies, do not give special meanings to terms with established general meanings
(Don’t use ‘cell’ when you mean ‘plant cell’)
27
Principle
Supply definitions wherever possible
(both human-understandable natural language definitions, and equivalent formal definitions)
28
Principle
Each term should have at most one definition
which may have both natural-language and formal versions
29
The Problem of Circularity
A Person = def. A person with an identity document
cell = def. plant cell, consisting of protoplast and cell wall; ...
30
Principle
Avoid circular definitions
(The term defined should not appear in its own definition)
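A simple check for the direct form of circularity could look like the sketch below. It only tests whether the defined term occurs as a whole phrase in its own definition; circles that run through several definitions would need a different check. The function name is an assumption made for illustration.

```python
import re

def is_circular(term, definition):
    """True if the defined term occurs as a whole phrase in its own definition."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, definition.lower()) is not None
```

Applied to the two problem definitions, both 'person' and 'cell' are flagged, because each term reappears inside its own definition.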
31
Principle
A definition should use terms which are easier to understand than the term defined
32
Principle
Use Aristotelian definitions
An A is a B which C’s.
A human being is an animal which is rational
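One possible way to hold such definitions in a structured form is to store the genus (B) and the differentia (C) separately and generate the sentence from them. The class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass

def article(noun):
    """Pick 'a' or 'an' by first letter (a rough heuristic, not full English grammar)."""
    return "an" if noun[0].lower() in "aeiou" else "a"

@dataclass(frozen=True)
class AristotelianDefinition:
    term: str         # A, the term being defined
    genus: str        # B, the immediate is_a parent
    differentia: str  # C, what marks out the A's among the B's

    def render(self):
        return (f"{article(self.term).capitalize()} {self.term} is "
                f"{article(self.genus)} {self.genus} which {self.differentia}.")

human = AristotelianDefinition("human being", "animal", "is rational")
```

Storing the genus explicitly means every definition also supplies exactly one is_a parent.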
33
Principle
Do not seek to define everything
34
In every ontology
some terms and some relations are primitive = they cannot be defined (on pain of infinite regress)
Examples of primitive relations:
identity
instance_of
35
Rules for formatting terms
• Avoid abbreviations even when it is clear in context what they mean (‘breast’ for ‘breast tumor’)
• Avoid acronyms
• Avoid mass terms (‘tissue’, ‘brain mapping’, ‘clinical research’ ...)
• Treat each term ‘A’ in an ontology as shorthand for a term of the form ‘the universal A’
36
Univocity
Terms should have the same meanings on every occasion of use.
(= They should refer to the same universals)
Basic ontological relations such as is_a and part_of should be used in the same way by all ontologies
37
Universality
Ontologies are made of relational assertions
They should include only those which hold universally
pneumococcal virus causes pneumonia
38
Universality
Often, order will matter:
We can assert
adult transformation_of child
but not
child transforms_into adult
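Why order matters can be sketched by reading 'A R B' as: every instance of A stands in R to some instance of B. That all-some reading is an assumption here, and the individuals are invented for the example.

```python
def holds_universally(a_instances, b_instances, related):
    """'A R B' holds if every instance of A bears R to some instance of B."""
    return all(any((a, b) in related for b in b_instances) for a in a_instances)

adults = {"ann"}
children = {"kid1", "kid2"}            # kid2 never becomes an adult
transformation_of = {("ann", "kid1")}  # ann is a transformation of kid1
transforms_into = {("kid1", "ann")}    # the same fact in the other direction
```

Every adult is a transformation of some child, so the first assertion passes; not every child transforms into an adult, so the second fails.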
39
Universality
viral pneumonia caused by virus
but not
virus causes pneumonia
pneumococcal virus causes pneumonia
40
Universality
results analysis later_than protocol-design
BUT NOT
protocol-design earlier_than results analysis
41
Positivity
Complements of universals are not themselves universals.
Terms such as
non-mammal
non-membrane
other metalworker in New Zealand
do not designate universals in reality
42
Positivity
What about non-smoker?
43
Objectivity
Which universals exist in reality is not a function of our knowledge.
Terms such as
unknown
unclassified
unlocalized
arthropathies not otherwise specified
do not designate universals in reality.
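Terms that violate Positivity or Objectivity often follow recognizable naming patterns, so a curator could flag them for review with a small check like the sketch below. The pattern lists are assumptions built from the examples on these slides, and a flagged term is only a candidate for review, not a proven violation.

```python
SUSPECT_PREFIXES = ("non-", "other ")
SUSPECT_WORDS = ("unknown", "unclassified", "unlocalized", "not otherwise specified")

def suspect_reasons(term):
    """Return the principles a term name may violate, judged by its wording alone."""
    t = term.lower()
    reasons = []
    if t.startswith(SUSPECT_PREFIXES):
        reasons.append("positivity")
    if any(word in t for word in SUSPECT_WORDS):
        reasons.append("objectivity")
    return reasons
```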
44
Keep Epistemology Separate from Ontology
If you want to say that
We do not know where A’s are located
do not invent a new class of
A’s with unknown locations
(A well-constructed ontology should grow linearly; it should not need to delete classes or relations because of increases in knowledge)
45
If you want to say
I surmise that this is a case of pneumonia
do not invent a new class of surmised pneumonias
Confusion of ‘findings’ in medical terminologies
Keep Sentences Separate from Terms
46
Single Inheritance
No kind in a classificatory hierarchy should have more than one is_a parent on the immediate higher level
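A minimal sketch of a single-inheritance check over a list of (child, parent) is_a edges. The edge list is a hypothetical example built around the car / blue thing case.

```python
from collections import defaultdict

def multiple_parents(is_a_edges):
    """Return each term with more than one direct is_a parent, with its sorted parents."""
    parents = defaultdict(set)
    for child, parent in is_a_edges:
        parents[child].add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}

edges = [("car", "thing"), ("blue thing", "thing"),
         ("blue car", "car"), ("blue car", "blue thing")]
violations = multiple_parents(edges)
```

Only 'blue car' is reported, since it is the one term with two immediate parents.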
47
Multiple Inheritance
thing
car        blue thing
blue car
(blue car is_a car; blue car is_a blue thing)
48
Multiple Inheritance
is a source of errors
encourages laziness
serves as obstacle to integration with neighboring ontologies
hampers use of Aristotelian methodology for defining terms
hampers use of statistical search tools
49
Multiple Inheritance
thing
car        blue thing
blue car
(blue car is_a1 car; blue car is_a2 blue thing)
50
is_a Overloading
The success of ontology alignment demands that ontological relations (is_a, part_of, ...) have the same meanings in the different ontologies to be aligned.
51
Multiple Inheritance
thing
car        blue thing
blue car
(blue car is_a1 car; blue car is_a2 blue thing)
52
How to solve this problem
Create two ontologies:
of cars
of colors
Link the two together via cross-products
(= factoring, normalization, modularization)
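The cross-product idea can be sketched as follows: keep two separate single-inheritance term lists and generate compound terms that point back to exactly one term in each. The term lists and the `has_color` relation name are assumptions made for illustration.

```python
from itertools import product

cars = ["car", "sports car"]   # hypothetical ontology of cars
colors = ["blue", "red"]       # hypothetical ontology of colors

def cross_product(car_terms, color_terms):
    """Build compound terms, each linked to one term in each source ontology."""
    return {f"{color} {car}": {"is_a": car, "has_color": color}
            for car, color in product(car_terms, color_terms)}

compound = cross_product(cars, colors)
```

Each compound term keeps a single is_a parent (the car term), and the color is attached through a separate relation instead of a second is_a link.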
53
Compositionality
The meanings of compound terms should be determined
1. by the meanings of component terms
together with
2. the rules governing syntax
54
Why do we need rules/standards for good ontology?
Ontologies must be intelligible both to humans (for annotation and curation) and to machines (for reasoning and error-checking): the lack of rules for classification leads to human error and blocks automatic reasoning and error-checking
Intuitive rules facilitate training of curators and annotators
Common rules allow alignment with other ontologies
think of ontologies as legends for cartoons
56
cartoons, like maps, always have a certain threshold of granularity
but they can be veridical representations of reality nonetheless
Goal: use logically well-structured ontologies to create algorithmic, dynamic cartoons
57
Randomized controlled trials
http://rctbank.ucsf.edu/ontology/outline/index.htm
58
Basic Formal Ontology
What the top level should look like
59
Two kinds of entities
occurrents (processes, events, happenings)
continuants (objects, qualities, states...)
60
Continuants (aka endurants)
have continuous existence in time
preserve their identity through change
exist in toto whenever they exist at all
Occurrents (aka processes)
have temporal parts
unfold themselves in successive phases
exist only in their phases
61
You are a continuant
Your life is an occurrent
You are 3-dimensional
Your life is 4-dimensional
62
Dependent entities
require independent continuants as their bearers
There is no run without a runner
There is no grin without a cat
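In code, this dependence could be modeled by refusing to create a dependent entity without a bearer. This is a sketch under the assumption that bearers are represented as objects; the class names are illustrative and are not taken from BFO itself.

```python
class IndependentContinuant:
    def __init__(self, name):
        self.name = name

class DependentEntity:
    def __init__(self, name, bearer):
        # A dependent entity cannot exist without an independent continuant bearing it.
        if not isinstance(bearer, IndependentContinuant):
            raise TypeError("a dependent entity requires an independent continuant as bearer")
        self.name = name
        self.bearer = bearer

cat = IndependentContinuant("cat")
grin = DependentEntity("grin", bearer=cat)
```

Trying to construct a grin with `bearer=None` raises a `TypeError`, which mirrors "no grin without a cat".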
63
Dependent vs. independent continuants
Independent continuants (organisms, buildings, environments)
Dependent continuants (quality, shape, role, propensity, function, status, power, right)
64
All occurrents are dependent entities
They are dependent on those independent continuants which are their participants (agents, patients, media ...)
65
BFO Top-Level Ontology
Continuant
• Independent Continuant
• Dependent Continuant
Occurrent (always dependent on one or more independent continuants)
66
= A representation of top-level types
Continuant Occurrent
IndependentContinuant
DependentContinuant
cell component
biological process
molecular function
67
Top-Level Ontology
Continuant Occurrent
IndependentContinuant
DependentContinuant
Functioning
Side-Effect, Stochastic Process, ...
Function
68
Top-Level Ontology
Continuant Occurrent
IndependentContinuant
DependentContinuant
Functioning Side-Effect, Stochastic Process, ...
Function
69
Top-Level Ontology
Continuant Occurrent
IndependentContinuant
DependentContinuant
Quality Function Spatial Region
Functioning Side-Effect, Stochastic Process, ...
instances (in space and time)
70
71
72
Towards a Clinical Trial Ontology
To serve merger of data schemas
To serve flexibility of collaborative clinical trial research
To serve management of clinical trial research
To serve data access and reuse
73
CTO will be part of OBI
Ontology of Biomedical Investigations
http://obi.sourceforge.net
which is in turn part of the OBO Foundry
http://obofoundry.org
74
Overview of the Ontology of Biomedical Investigations
with thanks to Trish Whetzel on behalf of the FuGO Working Group
75
OBI
Purpose
Provide a resource for the unambiguous description of the components of biomedical investigations such as the design, protocols and instrumentation, material, data and types of analysis on the data
NOT designed to model biology
Enables
Allow consistent annotation of data across different technological and biological domains
Enable powerful queries
Facilitate semantically-driven data integration
76
Motivation for OBI
Standardization efforts in biological and technological domains
Standard syntax - Data exchange formats
To provide a mechanism for software interoperability, e.g. FuGE Object Model
Standard semantics - Controlled vocabularies or ontology
Centralize commonalities for annotation term needs across domains to describe an investigation/study/experiment, e.g. FuGO
77
Biomedical Investigation Components
Computational/Higher Level Analysis: Describe the type of analysis performed to confirm/deny the hypothesis, e.g. clustering.
Data Pre-Processing: Describe the results from the instrument, e.g. what units are represented.
Instrumental Analysis: Describe the instrument and settings that were used.
Sample Analysis Preparation: Describe how the material was prepared for analysis, e.g. labeling, protein digest, etc.
Treatments: Describe the manipulations or perturbations or observations performed on the material to meet the general aim of the investigation.
Material and Its Characteristics: Describe the material and characteristics.
Investigation Design: Describe the design and purpose or general aim of the investigation.
78
FuGO Development Strategy Decisions
Unified Development
Pros
Overlap of terms is identified early in development
Universal/Common terms are defined by all those collaborating
Additional technological or biological terms can be added as needed by collaborators
Cons
Time needed to develop the ontology
Independent Development
Pros
Develop ‘Ontology’ in a time frame limited only by the community
Cons
Development of different working policies?
Use of different top level classes?
Overlap of terms at lower levels of the ontology tree
79
FuGO Development Process
Collect Use Cases - within community activity
Collect examples of investigations as performed within a community and present Use Cases to developers group
Bottom up approach - within community activity
Identify concepts to describe using controlled terms
Collect terms and their definitions
Bin terms in the top level ontology structure
Top down approach - collaborative activity
Build a top level ontology structure, is_a (vertical) relationships
Make a list of other foreseen (horizontal) relationships
Review how Top Level Nodes fit in with the Upper Level Ontologies
80
FuGO - Top Level Classes
Continuant: an entity that endures/remains the same through time
  Dependent Continuant: depends on another entity
    E.g. Environment (depends on the set of ranges of conditions, e.g. geographic location)
    E.g. Characteristics (entity that can be measured, e.g. temperature, unit)
    - Realizable: an entity that is realizable through a process (executed/run)
      E.g. Software (a set of machine instructions)
      E.g. Design (the plan that can be realized in a process)
      E.g. Role (the part played by an entity within the context of a process)
  Independent Continuant: stands on its own
    E.g. All physical entities (instrument, technology platform, document etc.)
    E.g. Biological material (organism, population etc.)
Occurrent: an entity that occurs/unfolds in time
  E.g. Temporal Regions, Spatio-Temporal Regions (single actions or Event)
  Process
    E.g. Investigation (the entire ‘experimental’ process)
    E.g. Study (process of acquiring and treating the biological material)
    E.g. Assay (process of performing some tests and recording the results)
81
Emerging FuGO Design Principles
OBO Foundry ontology, utilize ontology best practices
  Inherit top level classes from an Upper Level ontology
  Use of the Relation Ontology
  Follow additional OBO Foundry principles
  Facilitates interoperability with other OBO Foundry ontologies
Develop recommendations for naming conventions and metadata
  Format for term names, e.g. underscore vs. camel case, no plurals
  Use of alphanumeric identifier for terms, i.e. something that does not have semantic meaning
  Mechanisms for adding synonyms, etc.
Open source approach
  Protégé/OWL
  Weekly conference calls
  Shared environment using Sourceforge (SF) and SF mailing lists
82
Future Plans
Binning process - ongoing
Reconciliations into one canonical version
Iterative process
Common working practices - established
Each class consists of: unique alphanumeric identifier, human readable string name, definition and comments
Sourceforge tracker in place to collect comments on terms, definitions, relationships
Review ontology so that top level classes meet the needs of all involved ‘communities’
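The per-class record described above could be represented roughly as follows. The field names and the identifier format are assumptions; the slides only require that the identifier be unique, alphanumeric and free of semantic meaning.

```python
from dataclasses import dataclass, field

@dataclass
class TermRecord:
    identifier: str   # unique alphanumeric ID with no semantic meaning
    name: str         # human readable string name
    definition: str
    comments: list = field(default_factory=list)

def duplicate_ids(records):
    """Return the identifiers that are used by more than one record."""
    seen, dups = set(), set()
    for r in records:
        (dups if r.identifier in seen else seen).add(r.identifier)
    return dups

records = [
    TermRecord("T0000001", "assay", "process of performing some tests and recording the results"),
    TermRecord("T0000002", "study", "process of acquiring and treating the biological material"),
    TermRecord("T0000001", "investigation", "the entire 'experimental' process"),
]
```

Running `duplicate_ids(records)` reports the reused identifier, which is the kind of clash a reconciliation into one canonical version would need to resolve.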
83
OBI Collaborating Communities
Crop sciences Generation Challenge Programme (GCP), www.generationcp.org
Environmental genomics MGED RSBI Group, www.mged.org/Workgroups/rsbi
Genomic Standards Consortium (GSC), www.genomics.ceh.ac.uk/genomecatalogue
HUPO Proteomics Standards Initiative (PSI), psidev.sourceforge.net
Immunology Database and Analysis Portal, www.immport.org
Immune Epitope Database and Analysis Resource (IEDB), http://www.immuneepitope.org/home.do
International Society for Analytical Cytology, http://www.isac-net.org/
Metabolomics Standards Initiative (MSI), msi.workgroups.sourceforge.net
Neurogenetics, Biomedical Informatics Research Network (BIRN), www.nbirn.net
Nutrigenomics MGED RSBI Group, www.mged.org/Workgroups/rsbi
Polymorphism
Toxicogenomics MGED RSBI Group, www.mged.org/Workgroups/rsbi
Transcriptomics MGED Ontology Group, mged.sourceforge.net/ontologies
84
http://fugo.sourceforge.net
85
http://obi.sourceforge.net
86
87
88
89
90
91
92
93
Top-Level Class Hierarchy for RCT
Root
Secondary-study
Trial-details
Trial
Concept
• Generic-concept
• Population-concept
• Protocol-concept
• Design-concept
• Outcome-concept
• Administrative-concept
• Intervention-concept
94
Amended Top-Level Class Hierarchy for RCT
Entity
Continuant
  Population
  Protocol
  Design
Occurrent
  Trial
  Secondary-study
  Intervention
?? Trial-details
?? Outcome-concept
?? Administrative-concept
95
Concept
• Generic-concept
  – Term-information
  – Time-entity
  – Rule-concept
    » Clinical-rule
      Exclusion-rule
      Inclusion-rule
    » Rule-entity
      Recursive-rule
      Base-rule
    » Ethnicity-language-rule
    » Age-gender-rule
    » Situation
96
97
98
Concept
• Protocol-concept
  – Follow-up-compliance
  – Follow-up-activity
  – Follow-up
  – Protocol-change
  – Treatment-assignment
  – Protocol
  – Reason
  – Outcomes-followup
  – Secondary-study-protocol
99
Amended Top-Level Class Hierarchy for RCT
Entity
Continuant
  Protocol
  • Secondary-study-protocol
  Reason
Occurrent
• Treatment-assignment
• Follow-up
  – Follow-up-activity
  – Outcomes-follow-up
• Protocol-change
100
Concept
• Population-concept
  – Subgroup
  – Recruitment-flowchart
  – Population
  – Recruitment
  – Site-enrollment
101
Amended Top-Level Class Hierarchy for RCT
Entity
Continuant
  Protocol
  • Secondary-study-protocol
  Recruitment-flowchart
  Reason
  Population
  • Subgroup
Occurrent
• Priors
  – Recruitment
  – Site-enrollment
  – Treatment-assignment
• Follow-up
  – Follow-up-activity
  – Outcomes-follow-up
• Protocol-change
102
Concept
• Administrative-concept
  – Publication-concept
  – Study-site
  – Person
  – Ethics
  – Study-committee
  – Funder
  – Institution
  – Registry-ID
103
Continuant
• Information object
  – Publication
  – Registry-ID
• Study-site
• Person
• Institution
  – Study-committee
  – Funder
??? Ethics
104
Concept
• Intervention-concept
  – Blinding-concept
  – Compliance-details
  – Intervention-step
  – Intervention-arm
  – Co-intervention
  – Intervention
  – Compliance-result
  – Intervention-logic
105
Occurrent
• Intervention
  – Blinding
  – Intervention-step
  – Intervention-arm
  – Co-intervention
• ??? Intervention-logic
• ??? Compliance-result
• ??? Compliance-details
106
107
Test Case: Clinical Trial Ontology
primary outcome
secondary outcome
timepoint
clinical trial
intervention group
control group
assignment of populations to groups
complex experimental design
randomization
placebo
response
efficacy
control
protocol
null hypothesis, confidence interval
108
FuGE idea: use OBI to design datatables
how to solve this problem of converting the ontology to a database schema
what are ‘instances’
annotating images (image repositories)
annotation = shared understanding of a body of knowledge
I run a trial, I stick my data in Excel and create a dataset
design database, design tables – that’s it – no more annotations
metadata is added regarding provenance: this data was added by A and corrected by McB
do rare disease people share their data: here’s my data, here’s my data key, 1 is for males, 0 is for females
sharing is local
but UCSF (Clinical Data Repository) neurodegenerative people: MS talk to Alzheimer’s? They can’t, because (a) of HIPAA, (b) data schemas are so different, (c) response to NIH: they put their Excel spreadsheet out there, well gee whizz, (d) PharmGKB faced problems because of this, (e) the more obtuse the better: I can get another paper out of this data
no possibility of meta-analysis – opposite of biologists’ view
109