Post on 31-Dec-2015
e-SI Theme: Exploiting Diverse Sources of Scientific Data
Exploiting Diverse Sources of Scientific Data: the vision, what has been achieved and what next…
Prof. Jessie Kennedy
Exploiting Diverse Sources of Scientific Data
Science & Scientific Data
Science and Scientific Data are Complex…
[Figure: scientific disciplines (Biochemistry, Climatology, Taxonomy, Meteorology, Nomenclature, Paleontology, Genomics, Proteomics, Hydrology, Morphology, Geology, Oceanography, Geography, Ecology) and the kinds of data they deal with (Organism, Name, Taxon concept, Gene sequence, Pathway, Protein, Location, Temperature, Depth)]
Scientific Community: complex
• Individual Scientist
• Small Scientific Community
• Large Scientific Community
• Scientific Laboratory
[Figure: the disciplines-and-data diagram from the previous slide, repeated four times]
Science & Scientific Data
• Are continually changing
– Conclusions become foundations for new hypotheses
– New experiments invalidate existing knowledge
• Knowledge is open to interpretation
– Different opinions
• World continually changing
[Figure: cycle linking observation, hypothesis, experiment and conclusion]
Exploiting Diverse Sources of Scientific Data: the vision
To provide scientists with technological solutions to exploit the wealth and diversity of Scientific Data
– Discovery
– Access
– Sharing
– Integration/Linking
– Analysis
Which would thereby improve the potential for new scientific discovery
Projects in most sciences:
ESG
SEEK (Scientific Environment for Ecological Knowledge): Vision
• Research, develop, and capitalize upon advances in information technology to radically improve the type and scale of ecological science that can be addressed
– Scalable synthesis
Michener
Data Dispersion Challenges
• Data are massively dispersed
– Ecological field stations and research centers (100’s)
– Natural history museums and biocollection facilities (100’s)
– Agency data collections (10’s to 100’s)
– Individual scientists (1000’s)
– Maintenance must be local
Michener
Data Integration Challenges
• Data are heterogeneous
– Syntax (format)
– Schema (model)
– Semantics (meaning)
Jones
Ecological Modeling Challenges
• Analysis and modeling tools are:
– Specialized
– Disconnected
– Proprietary
• It is:
– Difficult to revise analyses
– Hard to document analyses
– Impossible to reliably publish models to share with colleagues
– Hard to re-use models and analyses from colleagues
– Difficult to use grid-computing for demanding computations
– Labor-intensive to manage data in popular analysis software
Michener
Exploiting Diverse Sources of Scientific Data: the approaches
Data Discovery/Access
• Metadata
– To describe the data sets
• Ontologies
– To define the terminology used
• Standardisation of formats
– For the exchange of data
• Life Science Identifiers (LSIDs)
– To uniquely identify and resolve data objects
• Provenance of data
– To record where the data has come from
– And what has happened to it en route.
• GRID/Web technology
– Distributed data management
Exploiting Diverse Sources of Scientific Data: the approaches
Data Integration/Linking
• Metadata
– To know how to interpret the data sets
• Ontologies
– To know how data in the data sets might be related
– To aid automatic transformation of the data
• Standardisation of formats
– To ease integration
• Life Science Identifiers (LSIDs)
– To know when 2 things are the same
• Workflows
– To enable refinement and repetition of integration
Exploiting Diverse Sources of Scientific Data: the approaches
Data Analysis
• Metadata
– To know how to interpret the data sets
• Ontologies
– To know which analytical/transformation processes are appropriate
• Workflow Tools
– To ease analytical processes
– Recording/reuse of analytical processes
• Provenance
– Recording life history of data
– To enable validation
Exploiting Diverse Sources of Scientific Data: the technologies
• Standardisation of formats
• Metadata
• Ontologies
• Life Science Identifiers (LSIDs)
• Provenance
• Workflow Tools
• GRID/Web technology
Meta Data: the vision
Meta data - "data about data": keywords, title, creator…
If scientists marked up their data with the agreed meta data it would be trivial to find highly relevant data (sub-)sets for analysis…
Meta-utopia….
Meta-utopia
A world of complete, reliable metadata. In meta-utopia,
Everyone uses the same language and means the same thing…
The guardians of epistemology have rationally mapped out a schema or hierarchy of ideas that everyone adheres to…
Scientists accurately describe their methods, processes and results so anyone can do anything with them in the future…
Cory Doctorow
Meta Data: the approach
• Common language
– XML Schemas to describe data/meta data
• Domain specific exchange schemas
– Explosion of these in every domain
– Exchanging data
– Archiving data
Ecological Metadata Language
A look inside the meta-utopia of ecology (knb.ecoinformatics.org)
• Identification: dataset elements
• Identification: resource elements
• Identification: party elements
• Discovery: coverage elements (Geographic, Temporal, Taxonomic)
• Evaluation Level Information
• Evaluation: Method Information
• Evaluation: Project Information
• Access: Permissions Information
• Access: Physical Information
• Access: Physical formatting details
• Access: Distribution Information
• Integration Level Information
• Integration Level: Attribute structure
• Integration Level: attribute domains
• Integration Level: measurementScale
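As a rough illustration of how coverage metadata supports discovery, the sketch below filters a couple of invented records by taxonomic, temporal and geographic coverage. The record layout, field names and sample values are assumptions made for this example; they are a simplified stand-in, not the Ecological Metadata Language schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coverage:
    """Simplified coverage block (hypothetical, not the EML schema)."""
    taxa: frozenset   # taxon names the data set covers
    years: tuple      # (first_year, last_year), inclusive
    bbox: tuple       # (west, south, east, north) in decimal degrees


@dataclass(frozen=True)
class Record:
    title: str
    creator: str
    coverage: Coverage


def covers(record, taxon, year, lon, lat):
    """True if the record's coverage includes the taxon, year and point."""
    cov = record.coverage
    west, south, east, north = cov.bbox
    return (
        taxon in cov.taxa
        and cov.years[0] <= year <= cov.years[1]
        and west <= lon <= east
        and south <= lat <= north
    )


# Invented sample records, for illustration only.
RECORDS = [
    Record("Spruce plot survey", "A. Example",
           Coverage(frozenset({"Picea rubens", "Picea abies"}),
                    (1990, 2000), (-75.0, 40.0, -65.0, 48.0))),
    Record("Seahorse counts", "B. Example",
           Coverage(frozenset({"Hippocampus erectus"}),
                    (1995, 2005), (-82.0, 24.0, -70.0, 42.0))),
]


def discover(taxon, year, lon, lat):
    """Return titles of records whose coverage matches the query."""
    return [r.title for r in RECORDS if covers(r, taxon, year, lon, lat)]


print(discover("Picea rubens", 1995, -70.0, 44.0))
```

The query only works if every record uses the taxon name in the same sense, which is the problem the later slides on taxon concepts return to.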
Meta Data: the approach
• Turned into extensive specifications
– Difficult to know where to stop…
But even this wasn’t enough…
It’s not good enough to have meta-data; we need to know what the terms in the meta-data (schema or data values) mean.
Ontologies – the vision
If we understood the meaning of the schema and the terms used in the meta-data or databases we would be able to:
– find things more reliably
– integrate things more easily
– reason about what things are comparable…
because we have support for automatic inference
Ontologies – the approach
• Common Language… OWL?
– RDF, OWL Lite, OWL DL, OWL Full…
• Domain specific ontologies or project specific?
– Map different ontologies
– Modularise the ontologies
• Reuse
– Build upper ontologies which domain ontologies extend or link to
Biodiversity Base Ontology
Core Layer
BDI Core Taxon Name
BDI Core Taxon Concept
BDI Core BioSpecimen
BDI Core BioObservation
Similar to…
SEEK Observation ontology (Josh Madin)
• Entity: an extension point for domain-specific terms
• Characteristic
• Measurement standard: all the units, scales, indices, classifications, and lists used for ‘measuring’ a characteristic
Semantic Web for Earth and Environmental Terminology (SWEET)
Ontologies revised and validated Jan 26, 2006
Biosphere, Data, Data Center, Human Activity, Material Thing, Numerics, Sensor, Space, Time, Units, Earth Realm, Physical Phenomena, Physical Process, Physical Property, Physical Substance, Sun Realm
Takes us back to…
BDI Taxon Concept Ontology
…is really just a schema for representing
…
Biological Taxonomy
• Classify and name all organisms in the world
– So we can talk about them, experiment with them
– Do life science…
• The longest running attempt at building an ontology?
– Linnaeus binomial system of nomenclature started in 1758
• An attempt to resolve a long standing problem in biology
– Many ways to classify things
– Understanding continually changes with new discoveries & technologies
– Classifications continually being redone
• New things defined, new definitions given for things in existence
– Lots of classifications over time
– Many compete at any one point in time
Taxonomic history of imaginary genus Aus L. 1758
Pyle 1990
5 revisions of Aus, 1 name spelling change
• Linnaeus 1758: Aus L.1758; Aus aus L.1758
• Archer 1965: Aus L.1758; Aus aus L.1758; Aus bea Archer 1965
• Fry 1989: Aus L.1758; Aus aus L.1758; Aus bea Archer 1965; Aus cea Fry 1989
• Tucker 1991: Aus L.1758; Aus aus L.1758; Aus cea Fry 1989
• Pargiter 2003: Aus L.1758; Aus aus L.1758; Aus ceus Fry 1989; Xus Pargiter 2003; Xus beus (Archer) Pargiter 2003
– Aus bea and Aus cea noted as invalid names and replaced with Aus beus and Aus ceus.
8 names: 2 genus, 6 species
Names
N0 - Aus L.1758
N1 - Aus aus L.1758
N2 - Aus bea Archer 1965
N3 - Aus cea Fry 1989
N4 - Aus beus Archer 1965
N5 - Aus ceus Fry 1989
N6 - Xus beus Pargiter 2003
N7 - Xus Pargiter 2003
Concepts
C0.1 - Aus L.1758 sec. Linnaeus 1758
C0.2 - Aus L.1758 sec. Archer 1965
C0.3 - Aus L.1758 sec. Fry 1989
C0.4 - Aus L.1758 sec. Tucker 1991
C0.5 - Aus L.1758 sec. Pargiter 2003
C1.1 - Aus aus L.1758 sec. Linnaeus 1758
C1.2 - Aus aus L.1758 sec. Archer 1965
C1.3 - Aus aus L.1758 sec. Fry 1989
C1.4 - Aus aus L.1758 sec. Tucker 1991
C1.5 - Aus aus L.1758 sec. Pargiter 2003
C2.2 - Aus bea Archer 1965 sec. Archer 1965
C2.3 - Aus bea Archer 1965 sec. Fry 1989
C3.3 - Aus cea Fry 1989 sec. Fry 1989
C3.4 - Aus cea Fry 1989 sec. Tucker 1991
C5.5 - Aus ceus Fry 1989 sec. Pargiter 2003
C6.5 - Xus beus Pargiter 2003 sec. Pargiter 2003
C7.5 - Xus Pargiter 2003 sec. Pargiter 2003
8 Names, 17 Concepts
Results in many concepts for each name
Possible interpretations of Aus aus L. 1758
Request data sets about Aus aus (N1): what’s returned?
• Original concept: C1.1
• Most recent concept: C1.5
• Preferred Authority (e.g. Fry 1989): C1.3
• Everything ever named N1: Union(C1.1,C1.2,C1.3,C1.4,C1.5)
• Best fit according to some matching algorithm: Best(C1.1,C1.2,C1.3,C1.4,C1.5)
• New concept containing only those features common to all concepts with the name N1: Intersection(C1.1,C1.2,C1.3,C1.4,C1.5)
Is it appropriate to link or merge data on this?
• Depends on the user’s purpose
• Level of precision required
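The interpretation options for N1 can be sketched in a few lines of Python. The concept identifiers and authorities are taken from the slides; the specimen sets attached to each concept (`s1` to `s6`) are invented purely so that union and intersection have something to operate on, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    cid: str             # concept identifier from the slides, e.g. "C1.3"
    authority: str       # the "sec." publication that defined the concept
    year: int
    members: frozenset   # invented specimen ids circumscribed by the concept


# Concepts carrying the name N1 (Aus aus L.1758). Members are made up.
N1_CONCEPTS = [
    Concept("C1.1", "Linnaeus 1758", 1758, frozenset({"s1", "s2", "s3", "s4"})),
    Concept("C1.2", "Archer 1965", 1965, frozenset({"s1", "s2", "s3"})),
    Concept("C1.3", "Fry 1989", 1989, frozenset({"s1", "s2"})),
    Concept("C1.4", "Tucker 1991", 1991, frozenset({"s1", "s2", "s5"})),
    Concept("C1.5", "Pargiter 2003", 2003, frozenset({"s1", "s2", "s5", "s6"})),
]


def original(concepts):
    """Earliest concept published under the name."""
    return min(concepts, key=lambda c: c.year)


def most_recent(concepts):
    """Latest concept published under the name."""
    return max(concepts, key=lambda c: c.year)


def by_authority(concepts, authority):
    """Concept according to a preferred authority."""
    return next(c for c in concepts if c.authority == authority)


def union(concepts):
    """Everything ever placed under the name."""
    return frozenset().union(*(c.members for c in concepts))


def intersection(concepts):
    """Only what every concept with the name agrees on."""
    return frozenset.intersection(*(c.members for c in concepts))


print(original(N1_CONCEPTS).cid, most_recent(N1_CONCEPTS).cid)
print(by_authority(N1_CONCEPTS, "Fry 1989").cid)
print(sorted(union(N1_CONCEPTS)), sorted(intersection(N1_CONCEPTS)))
```

The "best fit" option is left out because the slides do not say which matching algorithm would be used.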
Classifications: synonymy relationships between concepts and names.
[Figure: the 17 concepts grouped by revision (C0.1, C1.1; C0.2, C1.2, C2.2; C0.3, C1.3, C2.3, C3.3; C0.4, C1.4, C3.4; C0.5, C1.5, C5.5, C6.5, C7.5) and linked to the names N0 to N7]
• Parent child relationships in 5 revisions
• Names for each of the concepts
• In the literature taxonomists tell us names that are synonymous with their concepts
Classifications: synonymy relationships between concepts and names.
[Figure: the same concept and name diagram]
Which can result in anything being returned for Aus aus by traversing the synonymy links
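A small breadth-first traversal shows why following synonymy links can return almost anything. The name-to-concept assignments come from the slides; the four synonymy links are invented for this sketch because the links in the original diagram are not recoverable, and `reachable_concepts` is a hypothetical helper.

```python
from collections import deque

# Name -> concepts that carry it (identifiers as used on the slides).
NAME_CONCEPTS = {
    "N1": ["C1.1", "C1.2", "C1.3", "C1.4", "C1.5"],
    "N2": ["C2.2", "C2.3"],
    "N3": ["C3.3", "C3.4"],
    "N5": ["C5.5"],
    "N6": ["C6.5"],
}

# (concept, name listed as a synonym of that concept).
# These links are INVENTED for illustration only.
SYNONYMY = [("C1.4", "N2"), ("C3.3", "N2"), ("C5.5", "N3"), ("C6.5", "N2")]


def reachable_concepts(start_name):
    """Concepts reached from a name by following every link in both directions."""
    graph = {}

    def link(a, b):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    for name, concepts in NAME_CONCEPTS.items():
        for concept in concepts:
            link(name, concept)
    for concept, name in SYNONYMY:
        link(concept, name)

    seen = {start_name}
    queue = deque([start_name])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return sorted(n for n in seen if n.startswith("C"))


direct = NAME_CONCEPTS["N1"]
print(len(direct), "concepts carry the name directly")
print(len(reachable_concepts("N1")), "concepts are reachable through synonymy")
```

With these made-up links, a query on N1 grows from the five concepts that actually carry the name to every species-level concept in the example.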
Classifications with set relationships between concepts.
[Figure: the same concept and name diagram, with set relationships (such as =) drawn between concepts in successive revisions]
• What we need are the set relationships from concepts in a revision to earlier concepts
• and name changes related to earlier names
• We can build systems to return data fit for purpose
Real Taxonomic Revisions
• German mosses
– 14 classifications in 73 years covering 1548 taxa
– only 35% thought to be stable concepts
– 65% of names used in legacy data sets are ambiguous, and we don’t know which ones
– we need computers to help understand this…
• Smaller classifications are combined into large classifications
– ITIS – integrated taxonomy (also changing), approx. 250,000 taxa
Taxonomic Revision of genus Alteromonas
34 years: from 1972 to 2006
Thanks to George Garrity, Michigan State Univ.
[Figure: animated timeline of the genus Alteromonas, one frame per revision year (1972, 1973, 1976, 1977, 1978, 1979, 1981, 1982, 1984, 1986, 1987, 1988, 1990, 1992, 1995, 1997, 2000, 2001, 2002, 2004, 2005, 2006). Species are added to Alteromonas over time and moved between the genera Alteromonas, Marinomonas, Oceanospirillum, Shewanella and Pseudoalteromonas; (T) marks a type species, e.g. Alteromonas macleodii.]
[Figure: higher-level classification of Alteromonas and related genera within Alteromonadales (Gammaproteobacteria), compared for May 2004 and November 2004. Families shown: Alteromonadaceae, Colwelliaceae, Ferrimonadaceae, Idiomarinaceae, Moritellaceae, Pseudoalteromonadaceae, Psychromonadaceae, Shewanellaceae; some genera are placed incertae sedis.]
At the species level
– 18 “emendations”
– 21 new species
– 19 species reassigned to 4 genera
– 3 new combinations
– 6 synonyms
– 2 species to subspecies
– 2 subspecies to species
– 50 names, five genera, five families, and two classes but… only 5 validly published species.
At the higher level
– 1 family, 16 genera -> 8 families, 12 genera
– 1 unclassified genus -> 7 unclassified genera
Which is correct?
Which is supported/recorded in the data?
What is the impact on analysis?
Meta-utopia - a pipe dream?
• What is meta-data?
– Your meta data is my data…
– Depends on your perspective
– How you see the world
– What’s important to you
– What you want to do with the “data”
[Figure: an ecological data set with its meta data, alongside taxonomic data. Taxonomic rows (Higher Taxon, Taxon): Pinaceae, Picea; Picea, Picea rubens; Picea, Picea abies. Name: Linnaeus, Year: 1758]
• It’s all data anyway…
– But it’s useful to differentiate for certain purposes
Meta-utopia - a pipe dream?
• Schemas aren't neutral
– Presumes there is a "correct" way of modelling or categorising ideas and that, given enough time and incentive, people can agree on it…
– Any hierarchy of concepts necessarily implies the importance of some axes over others.
• Geographic/cartographic perspective
– Instance of Picea rubens is-a feature that can be mapped
– Features inherently have geospatial coordinates.
• Taxonomic perspective
– Instance of Picea rubens is a specimen of some biological taxon
– Taxa inherently have characteristics used in classification
[Figure: two hierarchies containing Picea rubens: a taxonomic one (Pinaceae, Picea, Picea rubens, Picea abies) and a feature one (Feature, Building, Observation, Organism occurrence, Picea rubens)]
Meta-utopia - a pipe dream?
• There's more than one way to describe something
– Reasonable people can disagree forever on how to describe something.
– Requiring scientists to use the same vocabulary to describe their data enforces homogeneity in ideas.
– Which could limit science…
Meta-utopia - a pipe dream?
• Metrics influence results
– Agreeing to a common metric for measuring important things in a domain necessarily privileges the items that score high on that metric, regardless of those items' overall suitability.
• Ranking axes are mutually exclusive
– Software that scores high for security scores low for convenience.
– Everyone wants to emphasize their high-scoring axes and de-emphasize (or, if possible, ignore altogether) their low-scoring axes.
Meta-utopia - a pipe dream?
• People are not altruistic
– Scientists have their own immediate deliverables
– Doesn’t leave time for thinking about who else might do what with their data
– Metadata exists in a competitive world.
– People want their work cited and will (ab)use meta-data to do so.
• People are busy
– e-Scientists understand the importance of excellent metadata
– Jo-scientist is mainly concerned about publishing the results.
– No time for added extras
Meta-utopia - a pipe dream?
• People make mistakes
– Even when there's a positive benefit to creating good metadata, people don’t exercise enough care and diligence in their metadata creation.
• Mission Impossible?
– Simple observation demonstrates people are poor observers of their own behaviours.
– Therefore any meta data will be a poor representation
Life Science Identifiers (LSIDs): the vision
• WWW provides a globally distributed communication framework
• LSID and the LSID Resolution System
– will provide a simple mechanism to globally resolve locally named objects distributed over the WWW.
– LSIDs will allow us to know what kind of object it is, who originated it, who is responsible for it, how to interface to it and what computations might be carried out on it.
• Adoption of LSIDs
– will facilitate more reliable integration of multiple knowledge bases, each of which has partial information of a shared domain
– will encourage stronger global collaboration in life sciences.
Clark T., Martin S., Liefeld T. Globally Distributed Object Identification for Biological Knowledgebases. Briefings in Bioinformatics 5.1:59-70, March 1, 2004.
Life Science Identifiers
• URI based naming scheme
– urn:lsid:ipni.org:names:1234-1
• Retrieval framework
– http://lsid.sourceforge.net/
[Figure: an LSID resolver answering "Get data" with the data record and "Get metadata" with RDF]
• An LSID has data
– gene sequence in GenBank
– ecological data set (in Excel, or in a text file)
– image
• The data should never change
– can version
• An LSID has metadata
– format of the data
– display title for clients
– Dublin Core metadata
– anything you want
• The metadata can change
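To make the naming scheme concrete, here is a minimal parsing sketch. It assumes the colon-separated layout `urn:lsid:authority:namespace:object` with an optional trailing revision; the `LSID` tuple and `parse_lsid` function are hypothetical names for this example and do not come from the LSID software at lsid.sourceforge.net. No resolution is attempted, only splitting the identifier into its parts.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str                   # who issued the identifier, e.g. "ipni.org"
    namespace: str                   # authority-defined grouping, e.g. "names"
    object_id: str                   # identifier of the object in that namespace
    revision: Optional[str] = None   # optional version of the data


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its parts, or raise ValueError."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(not part for part in parts[2:]):
        raise ValueError(f"empty component in: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# The two identifiers that appear on the slides.
print(parse_lsid("urn:lsid:ipni.org:names:1234-1"))
print(parse_lsid("urn:lsid:biocast.org:concept:347"))
```

Keeping the revision separate from the object identifier fits the rule that the data behind an LSID never changes: changed data would get a new revision rather than overwrite the old one.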
Issues For Each Community
• What gets an LSID?
– Real life objects: biological specimen
– Abstract concepts: taxon concept or name – Bellis perennis
– Electronic representations of things: image of specimen, description of specimen or concept
• For each thing, what’s the data and metadata?
– Data doesn’t change but meta data can
– Should all data become meta data?
– Maybe it implies a temporal database approach
Issues For Each Community
Who issues LSIDs?
• Owner of data
– Not always clear who owns data, especially legacy data
• A central authority
– One authority responsible for issuing LSIDs for specific types of information
– This would help enforce a 1:1 mapping of LSIDs and data items
– It MAY also reduce the likelihood of LSIDs becoming unresolvable
• A respected authority
– This would help enforce a 1:1 mapping for those who use the authority
– It may also be more feasible
• Free for all (possibly with an index)
– List your LSID authority in an index so your LSIDs are easy to find
Perhaps structured delegation has best potential to globally unite science
Exploiting Diverse Sources of Scientific Data
Organizations Using LSIDs
- Biopathways consortium
- National Center for Biotechnology Information (NCBI): PubMed, GenBank
- European Bioinformatics Institute (EBI)
- BioMOBY, a biological database interoperability program (biomoby.org): represent all entities in MOBY Ontologies (Object, Service, and Namespace), as well as all instances of BioMOBY services
- myGrid (mygrid.org.uk): used throughout as object naming device
- TDWG (tdwg.org): IPNI (plant names), Index Fungorum (fungi names)
- US Long Term Ecological Research Network (LTER)
- SEEK (seek.ecoinformatics.org): used in Kepler (actors, components), TOS (taxon concepts)…
Use of LSIDs
Figure: the names Lined seahorse, Hippocampus erectus Perry 1810, Hippocampus marginalis Kaup, 1856, Hippocampus tetragonous Mitchill, 1814 and Hippocampus erectus are each tagged with the same concept LSID, urn:lsid:biocast.org:concept:347 (TAX 347), which is also used in the Ecological Data Sets.
Exploiting Diverse Sources of Scientific Data
Moving to a world of LSIDs
Using LSIDs alone will not address all issues of data sharing
Data repositories must (re)use LSIDs to cross reference data within and outwith their own repository.
- It is important that we use the same LSID to refer to the same entity
- If multiple LSIDs exist for the same entity we would be required to decide whether or not two LSIDs were really the same thing. We would be in a worse situation than we are today, for example when trying to decide if two taxonomic names mean the same.
Generating LSIDs for any self-contained data set is a fairly trivial task
Assigning LSIDs from an authoritative repository to existing data, so that they are re-used, is more challenging
- Investigate what’s involved…
Exploiting Diverse Sources of Scientific Data
Convert Data Provider to use LSIDs
Figure: the Hexacorallia Data Provider (Hexacorallia data in a triple store, covering Specimen, Publication, Concept, Name and Person) is the original data repository (target), whose RDF data is to be updated with LSIDs from authority providers. Both the target and the authority LSID resolution services (source), which return LSID + RDF, are mapped to an ontology. The LinkerTool matches data from the repository with data in LSID resolvers and returns the LSID to the repository.
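The matching step performed by a linker can be sketched roughly as follows. This is not the LinkerTool's actual algorithm, which the slides do not describe; it is a minimal illustration that assumes records are compared by a single text label and uses plain string similarity from Python's `difflib`. The function name, the person labels and the `example.org` LSIDs are made up for the example.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def candidate_lsids(
    local_label: str, authority: Dict[str, str], limit: int = 3
) -> List[Tuple[str, float]]:
    """Rank authority LSIDs by similarity of their label to a local label."""
    scored = [
        (lsid, SequenceMatcher(None, local_label.lower(), label.lower()).ratio())
        for lsid, label in authority.items()
    ]
    # Best match first; only the top few are offered to the user.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Labels as they might be returned by an authority's resolver (invented data).
authority_persons = {
    "urn:lsid:example.org:persons:1": "Smith, Anna",
    "urn:lsid:example.org:persons:2": "Smyth, Ann",
    "urn:lsid:example.org:persons:3": "Jones, Bob",
}

for lsid, score in candidate_lsids("Smith, A.", authority_persons):
    print(f"{score:.2f}  {lsid}")
```

A real linker would likely compare several properties of the mapped ontology classes rather than one string, but the shape is the same: score the candidates, then return the chosen LSID to the target repository.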
Exploiting Diverse Sources of Scientific Data
Linking…
Figure: a Linker Client connects to two providers. The local (“target”) provider is a WASABI Service Request Dispatcher with LSID, SPARQL and OAI services over the Hexacorallia Thematic triple store. The authoritative (“source”) provider & linker is a WASABI Service Request Dispatcher with LSID, SPARQL, Linker and OAI services over the Person triple store.
Step: request linkable classes and select one to be linked
Exploiting Diverse Sources of Scientific Data
Linking…
Figure: same Linker Client, local (“target”) provider and authoritative (“source”) provider & linker as on the previous slide.
Step: select class to be linked
Exploiting Diverse Sources of Scientific Data
Linking…
Figure: same Linker Client, local (“target”) provider and authoritative (“source”) provider & linker as on the previous slide.
Step: request possible LSIDs
Exploiting Diverse Sources of Scientific Data
Confirm/Skip Annotations
Person to find LSID for
Choice of possible persons with LSIDs
Exploiting Diverse Sources of Scientific Data
Issues in converting to LSIDs
Mapping to ontology
- LSIDs RDF schema? Ontology? Is agreement on an ontology a problem?
Replace or annotate existing data?
- If we replace an author with a person LSID, what is returned when resolving that LSID is unlikely to be the data that was stored in the DB for an author.
Dependencies between objects with LSIDs
- If you link via a taxon name LSID, the resolved name should have embedded an LSID for a publication, so there shouldn’t be any need (in principle) to match publications for names
- What about authorities that issue LSIDs but don’t map to other authorities, e.g. name providers not mapping to either publication or specimen providers?
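The "replace or annotate" question can be made concrete with a small sketch. It assumes a repository record is a simple dictionary with an `author` field; the field name `author_lsid`, both function names and the `example.org` LSID are hypothetical.

```python
from typing import Dict


def replace_author(record: Dict[str, str], person_lsid: str) -> Dict[str, str]:
    """Overwrite the stored author text with the LSID (original text is lost)."""
    replaced = dict(record)
    replaced["author"] = person_lsid
    return replaced


def annotate_author(record: Dict[str, str], person_lsid: str) -> Dict[str, str]:
    """Keep the stored author text and add the LSID beside it."""
    annotated = dict(record)
    annotated["author_lsid"] = person_lsid  # new field; "author" is untouched
    return annotated


record = {"title": "A toy publication", "author": "Smith, A."}
person = "urn:lsid:example.org:persons:1"   # made-up identifier

print(replace_author(record, person))
print(annotate_author(record, person))
```

Annotating keeps what the repository originally stored, so nothing is lost if resolving the LSID returns something different from the local author string.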
Exploiting Diverse Sources of Scientific Data
Issues in converting to LSIDs
What support would a linking tool need to provide end users?
- How would users want to process this data?
- How much automation? E.g. above a certain confidence level. Would this be trusted?
- Order of matching: e.g. match all instances of persons at once? Match persons by publication?
Other issues…
Performance of existing linking tool approach
- Lots of data passing going on
- Need more efficient approach which matches user needs
Finding authorities that provide linking services
- How do scientists find out about authorities with linking services?
- How do they know which ones to use?
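One possible answer to "how much automation?" is to accept links automatically only above a confidence level and leave the rest for a person to confirm or skip. The sketch below assumes each proposed link already carries a confidence score between 0 and 1; the `triage` function, the 0.95 default and the sample identifiers are illustrative choices, not values from the project.

```python
from typing import Dict, List, Tuple

# (local record id, proposed LSID, confidence between 0 and 1)
Candidate = Tuple[str, str, float]


def triage(
    candidates: List[Candidate], auto_accept: float = 0.95
) -> Dict[str, List[Candidate]]:
    """Split proposed links into auto-accepted ones and ones needing review."""
    result: Dict[str, List[Candidate]] = {"accepted": [], "review": []}
    for candidate in candidates:
        bucket = "accepted" if candidate[2] >= auto_accept else "review"
        result[bucket].append(candidate)
    # Show the reviewer the most promising links first.
    result["review"].sort(key=lambda c: c[2], reverse=True)
    return result


proposals = [
    ("person-17", "urn:lsid:example.org:persons:1", 0.99),
    ("person-18", "urn:lsid:example.org:persons:2", 0.60),
    ("person-19", "urn:lsid:example.org:persons:3", 0.80),
]
outcome = triage(proposals)
print(len(outcome["accepted"]), "accepted,", len(outcome["review"]), "to review")
```

Whether users would trust the automatic bucket is exactly the open question raised above; the threshold makes that trade-off explicit and adjustable.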
Exploiting Diverse Sources of Scientific Data
To Summarise…
We have seen that (Life) Science is Complex & Changing
- The fundamental challenges of science that have always been there are still here
- Now we have additional opportunities associated with the explosion of scientific information and the move to a virtual world
- And now the challenge is how best to exploit these…
e-Science uses computation to aid scientists
- By providing appropriate infrastructure and tool support
- Speed up scientific processes
- Do them repeatedly
- Re-evaluation
- Can give scientists time for more thoughtful science…
- May require a change of emphasis in how scientists work
Must support the inherent features of science, scientists and scientific data
Exploiting Diverse Sources of Scientific Data
e-Science: Complex Science
Support decomposition of scientific domains, problems and associated data
- Fundamental to data & software analysis and design
Support re-composition, linking or building on the components
- Need to know when components or links have changed
Identify the overlaps/linkages in the different domains
- Need useful approximations of things to simplify linked domain
- Need to understand the approximations or linking points well
Raise level of abstraction
- Artefact of storage mechanisms
- Implies lingua franca
- Need more evaluation of the different approaches
Exploiting Diverse Sources of Scientific Data
e-Science: Changing Science
Science is full of legacy data
- Today’s scientific research is tomorrow’s legacy data
Provide long-term persistent storage
- Any published scientific discovery should store the data as evidence
- Data needs to be accurately annotated
- Sufficient to repeat analyses to test hypotheses
e-Science is already changing the way scientists do science
- But to be effective it needs to change even more…
- More emphasis on well curated, accessible, persistent data
- Evidence for results
Exploiting Diverse Sources of Scientific Data
Meta Data & Ontologies?
Do we throw out meta data/ontologies, then?
- No… To benefit from stored data we need to know what it means!
However, there are no large-scale benefits while there is insufficient coverage of meta data
- If only 10% of data has meta data, people won’t use meta data…
- Need to reach the tipping point…
Controlled vocabularies and schemas have been shown useful for large projects or small communities with a common goal
- Need long-term projects to see if they sustain their value as the community and the science evolve.
Exploiting Diverse Sources of Scientific Data
Describe or Prescribe?
Descriptions become vocabularies used by others
Folksonomy or ontologies?
- Informal versus formal, or free versus constrained
- Informal can be basis for something formal
Move towards common vocabularies with built-in flexibility and extensibility
Issue of what language(s)…
Need more research evaluating these issues…
Exploiting Diverse Sources of Scientific Data
Reliability of Meta Data
Automatic recording of meta data
- From machines, software, workflows…
- Avoids labour
- Starting to happen
- Helps reach critical mass of available meta data
Still need to decide what it is that the machines/software are collecting…
Human input still needed
- Purpose of experiment, deviations from planned protocol etc.
Exploiting Diverse Sources of Scientific Data
Support
Community ontologies need to be easily available to all scientists
- Listing the known ontologies on a web site is not enough
Need to understand when (meta) data is fit for purpose
- Accurate enough, not overly precise
Need collaborative approaches to extending ontologies
- Allow users to be involved to achieve community buy-in
Ontologies are difficult for people to comprehend
- Need good visualisation
- Need to trust system
Exploiting Diverse Sources of Scientific Data
Tools
Simple tools would go a long way to help
Contextual data is consistent for many data sets
- e.g. observer/location
- Tools should support collection and re-use of this data
Make use of (incorporate) existing ontologies into tools
Get the software to do as much work as possible
- Good at repetitive tasks, faster than humans
Personalisation
- How application specific do tools have to be to be useful?
- Generic / domain specific / individual?
- The more generic the more widely applicable
- Pluggable components for personalisation?
Exploiting Diverse Sources of Scientific Data
Finally…
It will take time and commitment for any of these approaches to work.
Focus on central important resources that are reused in many (sub-)domains
- Ensure the data are well managed and curated, identified, described, easily available, lasting and evolving
Observe whether they benefit the community or act as a straitjacket
A good test case for this approach is the development of a taxon concept name resolution service
- To allow scientists to find correct names for the concepts they are working with,
- Mark up their data,
- Resolve their concepts against other scientists’ data so they know they are talking about the same thing.
- Is central to communication in all life sciences
- Poses many computational, social and data research issues
Exploiting Diverse Sources of Scientific Data
Acknowledgements
E-Science Institute for sponsoring theme leadership
Malcolm Atkinson
For support and many interesting discussions on exploiting scientific data.
Collaborators on SEEK project,
Matt Jones, Bill Michener, Aimee Stewart, Robert Gales, Josh Madin, Shaun Bowers
Collaborators in TDWG/GBIF
Robert Kukla, Roger Hyam,
funding, slides, interesting problems