System concept and development by: Tony Rees Divisional Data Centre
Tony Rees: Towards a Hierarchical Classification of All Life
Towards a Hierarchical Classification of All Life – the IRMNG data assembly project
Tony Rees – CSIRO Marine and Atmospheric Research, Australia
October 2011
Tony Rees: Hierarchical Classification of All Life
Why a hierarchical classification?
• Hierarchical classifications assist us to organize knowledge
“borrowed” from R. Page presentation, 2011
• Hierarchical classifications allow us to infer information about lower levels from higher ones (don’t have to explicitly re-specify / verify / know everything)
• Hierarchical classifications allow us to make / test predictions based on degree of “relatedness”
• Hierarchical classifications assist us to construct +/- automated “expert systems”
[Diagram, functional view vs. structural view: genus / species name “X” → the system → useful information on taxon “X”]
What should “the system” ideally hold? – something like…
(etc.)
• Expanded to information on “all life”:
• Animals, plants, fungi, protists, bacteria + archaea (prokaryotes), viruses
• Both extant and fossil organisms
• Aim for comprehensive coverage – no gaps – to desired level of the hierarchy
• Information held in consistent terminology, machine-readable content
• Either human user, or machine user access point (or both)
• Hyperlinked cross-refs for web users
• Continuously updated & upgraded
• Provenance for all content
(probably plus more…)
What should “the system” ideally hold?
x 50+…
• Taxon* scientific names are the preferred units of currency & identity in the world of biology:
• More stable / authoritative than common (vernacular) names
• Indicate the genus to which a species belongs
• Higher classification allows nesting into progressively larger taxa, each with definable characteristics
• “Linnaean” ranks: kingdom through species (NB some intermediate ranks also important, should handle in due course)
(*taxon = named “taxonomic unit”, a defined unit at any rank, i.e. species, genus, family, etc.)
System is based on scientific names of taxa
Kingdoms (5/6/7/8)
Phyla ~140
Classes ~400
Orders ~2k
Families ~10k
Genera ~250k
Species 2+ million
• All life to family level:
• Parker (ed.), 1982, Synopsis and Classification of Living Organisms, print, 2 vols, ~2,300 pp.: ~7k family descriptions in a common hierarchy (extant taxa only)
• Benton (ed.), 1993, The Fossil Record 2, print, ~850 pp.: ~5k family brief treatments, mainly fossil
• Code-specific to genus level:
• Zoology: Nomenclator Zoologicus (to 2004), (print + online) then Zoo. Record / ION, online (NB, Nomen. Zool. has no detailed higher taxonomy)
• Botany: Index Nominum Genericorum (ongoing), online, also IPNI, TROPICOS, etc.
• Bacteriology: List of Prokaryotic names with Standing in Nomenclature (LPSN) (online)
• Viruses: International Committee on Taxonomy of Viruses (ICTV) database, online
Availability of comprehensive treatments
• Taxon specific to species level: Global Species Databases (“GSDs”) exist for specific groups e.g.
• Mammals: Mammal Species of the World (2005, print + online)
• Fishes: Eschmeyer’s Catalog of Fishes (ongoing, online)
• Higher Plants: The Plant List (2010, online) + contributing DB’s
• Fungi: Index Fungorum and Species Fungorum (ongoing, online)
• Algae: AlgaeBase (ongoing, online)
• Others: AntBase, Systema Dipterorum, LepIndex + many more
• (also viruses and prokaryote lists as per previous slide)
…>100 GSDs aggregated into a single Catalogue of Life compilation (annual editions 2000-current) produced by Sp2000 + ITIS (USA)
Availability of comprehensive treatments – cont’d
• A great project BUT…
• ~30% of extant species (plus relevant higher taxa) still missing
• only a subset of species synonyms included, and no genus synonyms stated
• no fossil taxa (although Paleobiology Database has some / many of these)
Can we use Catalogue of Life as a comprehensive resource?
• a few higher tax. conflicts
• no intermediate ranks (e.g. subphylum, infraorder)
• no genus authors or publication info
• latency for new names (esp. in some groups)
• no target completion date
…GBIF experience: only ~30% of incoming species names are in the Catalogue of Life (not much good for data aggregators).
What about “names aggregator” activities?
• Collect names as used in primary + secondary sources, mix of “clean” (verified) and “dirty” (unverified) names
• Authority portion of the names not standardized (same name may appear on the list multiple times)
• Frequently lacking coherent / any higher taxonomy
… Potentially a useful “superset” of (most) “good” names, but requires work to filter these out.
(etc.)
• Answer “yes” BUT…
• Need to knit them all together across Codes, also no single source is complete, even within a Code
• Need to add family allocations where missing, e.g. from Nomenclator Zoologicus, also taxonomic synonyms, consistent hierarchy information, etc. etc.
• Need to deal with inconsistencies / overlaps between data sources (editorial decisions), also “house style” issues
• Need to back-fill residual data gaps
• As desired, also would like to add non-taxonomic “attributes” e.g. extant / fossil status / geologic range, habitat information, geographic distribution, more ???
• Bonus short cut
• Leverage the hierarchy to avoid having to add attributes at every lower level – e.g. inherit genus / species attributes from higher up where these are unambiguous
• Examples: all dinosaurs are extinct, all cephalopods are marine, etc. etc.
• Similarly, all species of a marine-only genus will also be marine, etc.
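The “bonus short cut” above can be sketched as a lookup that walks up the hierarchy until some ancestor states the attribute. This is a minimal illustration only: the `Taxon` class, its field names and the `resolve` method are assumptions made for the example, not the IRMNG schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Taxon:
    """One node in the hierarchy; all names here are hypothetical."""
    name: str
    rank: str
    parent: Optional["Taxon"] = None
    # Only attributes asserted directly at this level are stored here.
    attributes: Dict[str, str] = field(default_factory=dict)

    def resolve(self, key: str) -> str:
        """Walk up the parent chain until an ancestor states the attribute."""
        node: Optional["Taxon"] = self
        while node is not None:
            if key in node.attributes:
                return node.attributes[key]
            node = node.parent
        return "unknown"


# "All cephalopods are marine": stated once, at class level.
cephalopoda = Taxon("Cephalopoda", "class", attributes={"habitat": "marine"})
octopodidae = Taxon("Octopodidae", "family", parent=cephalopoda)
octopus = Taxon("Octopus", "genus", parent=octopodidae)

print(octopus.resolve("habitat"))        # inherited from Cephalopoda
print(octopus.resolve("extant_fossil"))  # stated nowhere, so "unknown"
```

Storing the flag once at the highest unambiguous level keeps the lower records free of redundant entries; a value set lower down would simply override the inherited one.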
Genus level compilations are much more complete, can we use those?
• IRMNG – the Interim Register of Marine and Nonmarine Genera
• Aims to fill the gaps and produce an “interim” hierarchical classification of all life (HCAL), extant + fossil, to at least genus level (species lists to be added as readily accessible)
• Utilizes Parker, 1982 and Benton, 1993 family compilations as starting point for higher classification
• Specific sectors then upgraded through time, also incorporating relevant marine/nonmarine and extant/fossil flags
• Genera added from the most comprehensive available sources (over time)
• “Interim” status used to indicate lesser degree of scrutiny / authoritativeness than e.g. Cat. of Life, however hopefully still useable
• home page: www.obis.org.au/irmng, data access page: www.cmar.csiro.au/datacentre/irmng/
• Will hold more names than valid taxa, due to synonymy:
• Nomenclatural synonyms – add maybe 5% to genera, 300% to species
• Taxonomic synonyms – add maybe 100%-200% to genera and species
The IRMNG concept
• 1 record per every name / publication instance (valid or invalid) including:
• the name itself
• the author and year for the name (1 version only)
• publication details as available
• source/s used, with or without editorial adjustment
• for botanical names, include full (not abbreviated) author name, also year of publication (normally omitted)
• nomenclatural and taxonomic status, as known (plus any relevant comments)
• placement in the tax. hierarchy (every record knows its parent, child records reference this one), plus cross-links as required
• selected attributes, initially:
• Extant/fossil status: Extant / Fossil / both / unknown
• Habitat flag: Marine / Nonmarine / both / unknown
• provenance, degree of verification for all content
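The content list above maps naturally onto one flat record per name. The sketch below is a hypothetical layout only: the class name, field names and the `children_of` helper are assumptions for illustration and are not taken from the actual IRMNG tables.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ExtantFlag(Enum):
    EXTANT = "extant"
    FOSSIL = "fossil"
    BOTH = "both"
    UNKNOWN = "unknown"


class HabitatFlag(Enum):
    MARINE = "marine"
    NONMARINE = "nonmarine"
    BOTH = "both"
    UNKNOWN = "unknown"


@dataclass
class NameRecord:
    record_id: int
    name: str
    authority: str                      # author and year, one version only
    parent_id: Optional[int]            # placement in the hierarchy
    publication: Optional[str] = None   # publication details as available
    sources: List[str] = field(default_factory=list)
    nomenclatural_status: Optional[str] = None
    taxonomic_status: Optional[str] = None
    extant_flag: ExtantFlag = ExtantFlag.UNKNOWN
    habitat_flag: HabitatFlag = HabitatFlag.UNKNOWN
    verified: bool = False              # degree of verification, simplified


def children_of(records: List[NameRecord], parent_id: int) -> List[NameRecord]:
    """Child records are those that reference the given record as parent."""
    return [r for r in records if r.parent_id == parent_id]


# Placeholder names and authorities, invented for the example.
records = [
    NameRecord(1, "Exampleidae", "Author, 1900", parent_id=None),
    NameRecord(2, "Exemplus", "Author, 1901", parent_id=1,
               habitat_flag=HabitatFlag.MARINE),
    NameRecord(3, "Alterus", "Writer, 1950", parent_id=1),
]
print([r.name for r in children_of(records, 1)])
```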
IRMNG desired content
Family placement – editorial decisions may be needed
• e.g. for (botanical) genus “Pachydiscus”:
Data aggregation complicated by genus level homonyms e.g.:
• also by variant authority citations e.g.:
• (etc.)
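One way to separate the two problems above (true homonyms vs. the same name cited with variant authority strings) is to normalize the authority before grouping. The following is a rough sketch under stated assumptions: the normalization rules are deliberately simple, the function names are invented here, and the genus names and authorities are placeholders rather than real records.

```python
import re
from collections import defaultdict


def normalize_authority(authority: str) -> str:
    """Crude normalization: case, '&' / 'et', punctuation and brackets."""
    text = authority.lower().replace("&", " and ")
    text = re.sub(r"\bet\b", " and ", text)
    text = re.sub(r"[^\w ]", " ", text)   # drop punctuation and brackets
    return " ".join(text.split())


def classify(rows):
    """rows: iterable of (genus, authority) pairs.

    Returns (homonym_candidates, variant_citations).
    """
    by_name = defaultdict(lambda: defaultdict(set))
    for genus, authority in rows:
        by_name[genus][normalize_authority(authority)].add(authority)
    # Same name, more than one distinct authority: candidate homonyms.
    homonyms = {g: sorted(keys) for g, keys in by_name.items() if len(keys) > 1}
    # Same name and authority, more than one spelling: variant citations.
    variants = {
        (g, key): sorted(raw)
        for g, keys in by_name.items()
        for key, raw in keys.items()
        if len(raw) > 1
    }
    return homonyms, variants


rows = [
    ("Aus", "Smith & Jones, 1900"),
    ("Aus", "Smith et Jones 1900"),
    ("Aus", "Brown, 1950"),
    ("Bus", "(Lee, 1920)"),
]
homonyms, variants = classify(rows)
print(homonyms)
print(variants)
```

Names flagged this way are only candidates: an editor still has to decide whether two authorities really denote different taxa or one of them is a mis-citation.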
Perseverance produces the following (subset of genus table, 453k names as at Oct 2011):
A glimpse of the IRMNG “master genus” table (currently 452,827 records)
(Mabberley plant names list)
Detail showing example source/s used
• High-level overview + relevant statistics for “all life” (currently possible for names, in future for valid taxa)
• Navigate the hierarchy in any direction
• Generate hierarchical lists
• Generate alphabetic lists
• Sort / filter by any desired criteria
• Generate lists of homonyms, within or across Codes
• Indicate current tax. hierarchy, nomenclatural / taxonomic status, and attributes (to varying degrees) for any input name
• Indicate near match targets to any input name (“did you mean…”) – using TAXAMATCH fuzzy matching (custom solution for tax. databases)
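To give a feel for the “did you mean…” service, here is a generic near-match sketch using plain Levenshtein edit distance. It is not the TAXAMATCH algorithm, which is a custom solution not described in these slides; the function names and the distance threshold are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def did_you_mean(query, names, max_distance=2):
    """Return reference names within max_distance edits, closest first."""
    q = query.strip().lower()
    scored = [(edit_distance(q, n.lower()), n) for n in names]
    return [n for d, n in sorted(scored) if d <= max_distance]


genera = ["Quercus", "Felis", "Canis", "Panthera"]
print(did_you_mean("Qercus", genera))
```

A linear scan like this would be slow against several hundred thousand names; a production service would need some pre-filtering or indexing step in front of the distance test.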
Services / views this currently supports
IRMNG-generated statistics for “all life” (web query 6 Oct 2011)
• (Important note – can actually generate these lists as required, by navigating the hierarchy)
Other services / products e.g. full hierarchical lists
however with caveat: some / many genera may still be classified only at higher level (e.g. “Mammalia – unallocated”) at this time (more work to do).
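Both the per-rank statistics and the full hierarchical lists can be produced by walking parent / child links. A minimal sketch, assuming a simple (name, rank, parent) row layout: the layout and function names are invented here, and “Exemplus” is a hypothetical genus standing in for a name still placed only at class level.

```python
from collections import Counter, defaultdict

# (name, rank, parent name). A real system would key on record IDs,
# since names are not unique across the hierarchy (homonyms).
ROWS = [
    ("Animalia", "kingdom", None),
    ("Chordata", "phylum", "Animalia"),
    ("Mammalia", "class", "Chordata"),
    ("Carnivora", "order", "Mammalia"),
    ("Felidae", "family", "Carnivora"),
    ("Felis", "genus", "Felidae"),
    ("Panthera", "genus", "Felidae"),
    ("Exemplus", "genus", "Mammalia"),  # hypothetical, not yet allocated to family
]


def build_children(rows):
    children = defaultdict(list)
    for name, rank, parent in rows:
        children[parent].append((name, rank))
    return children


def hierarchical_list(children, parent=None, depth=0):
    """Depth-first listing, indented two spaces per level."""
    lines = []
    for name, rank in sorted(children.get(parent, [])):
        lines.append("  " * depth + f"{rank}: {name}")
        lines.extend(hierarchical_list(children, name, depth + 1))
    return lines


def rank_counts(rows):
    return Counter(rank for _, rank, _ in rows)


lines = hierarchical_list(build_children(ROWS))
print("\n".join(lines))
print(dict(rank_counts(ROWS)))
```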
Check batches of entered names
(1,406 genus names…)
(start of IRMNG search result)
Query by taxon name (correctly spelled or misspelled)
Check batches of entered names
• Basically this is then a Taxonomic Name Resolution Service (TNRS), similar to the one developed in 2011 by the (U.S.) iPlant team over TROPICOS, but across all groups:
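A batch check of this kind can be sketched as: try an exact match first, then fall back to near matches, then report the name as unmatched. The example below uses Python's standard `difflib.get_close_matches` as a stand-in matcher; it does not reproduce IRMNG, TAXAMATCH or the iPlant TNRS, and the `resolve_batch` name, the report layout and the cutoff are assumptions.

```python
import difflib


def resolve_batch(queries, reference, cutoff=0.85):
    """Classify each input name as 'exact', 'near' or 'none'."""
    by_lower = {name.lower(): name for name in reference}
    report = {}
    for raw in queries:
        key = raw.strip().lower()
        if key in by_lower:
            report[raw] = ("exact", [by_lower[key]])
            continue
        near = difflib.get_close_matches(key, list(by_lower), n=3, cutoff=cutoff)
        if near:
            report[raw] = ("near", [by_lower[k] for k in near])
        else:
            report[raw] = ("none", [])
    return report


reference = ["Quercus", "Felis", "Canis", "Panthera"]
report = resolve_batch(["Felis", "Pantera", "Xxxxxx"], reference)
for name, (status, hits) in report.items():
    print(name, status, hits)
```

The unmatched names are often the most useful part of the report, since they point either to typing errors beyond the cutoff or to gaps in the reference list.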
Linking names with literature
The “microcitation” (Nomenclator’s favourite…)
• Typically just author name, year, page no. in work, e.g.:
• Would prefer full article-level titles / authors / pagination if possible – i.e. a bibliographic module
• Could optionally offer onward links to page views in BHL, abstracts, full text as pdfs, etc. as available (small sample populated in IRMNG at this time)
Name plus page in work
List of all works as data objects
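The split between a microcitation (“name plus page in work”) and a separate list of works could be modelled roughly as below. This is a sketch of the idea only: the class names, fields and the `expand` helper are hypothetical, and the sample name, authors and title are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Work:
    """A full bibliographic item held once, as its own data object."""
    work_id: int
    authors: str
    year: int
    title: str
    container: Optional[str] = None  # journal or book, if known
    pages: Optional[str] = None


@dataclass(frozen=True)
class Microcitation:
    """Author, year and page for one name, optionally linked to a Work."""
    name: str
    author: str
    year: int
    page: str
    work_id: Optional[int] = None


def expand(cite: Microcitation, works: Dict[int, Work]) -> str:
    base = f"{cite.name} {cite.author}, {cite.year}: {cite.page}"
    work = works.get(cite.work_id)
    if work is None:
        return base  # only the microcitation is known so far
    return f"{base}, in {work.authors} ({work.year}) {work.title}"


works = {1: Work(1, "Author, A.", 1900, "A placeholder monograph")}
linked = Microcitation("Exemplus", "Author", 1900, "12", work_id=1)
unlinked = Microcitation("Alterus", "Writer", 1950, "3")
print(expand(linked, works))
print(expand(unlinked, works))
```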
Expanded citation info in IRMNG - example
IP issues regarding bibliographies, etc.
• Many sources assert copyright over bibliographies, potentially an issue
• Does copyright exist in individual references extracted from a third party collection?
• What about subsets of the collection?
• What about new composite supersets?
• Law may be different in different countries
• Licensing / terms of use may be different from law
… still very unclear (to this author) what is / is not permissible with respect to assembling new bibliographies which include content from elsewhere – including copy/paste vs. re-keying…
• Will be a recurring issue for other bibliography-assembly projects e.g. CiteBank, Mendeley… but think of the value (a “bibliography of life”)
IRMNG content – recent missing genera…
IRMNG content – genus names published by year, 1995-current (as at Oct 2011), excluding virus names (which are undated)
(NB could disaggregate further as desired, e.g. by detailed tax. group, or extant vs. fossil…) … also would expect a small number of residual names missed for ostensibly “complete” years
presumed missing names
IRMNG 2011 content cf. Cat. of Life 2011
Note, Chapman, 2009 estimates c.1.9m described extant species (see earlier slide)
On that basis, CoL has 70% of valid extant species names, maybe 70% of valid extant genera (with subset of genus-level synonyms)
IRMNG is missing est. 10k genera from 2004-2011 (from last slide), maybe further 2-3% overall (say 10k-15k), “complete” list would thus be ~475k at this time (increasing at ~2k/year).
Columns: Cat. of Life 2011 edition; % with auth's; IRMNG Oct 2011, extant + fossil; % with auth's; IRMNG Oct 2011, fossil only

Rank                CoL 2011    % auth's  IRMNG all   % auth's  IRMNG fossil
Kingdoms            8                     7                     0
Phyla               111                   153                   12
Classes             288                   509                   64
Orders              1,233                 2,645                 715
Families            8,071       0%        19,639      22.1%     6,542
Subfamilies
Genera              178,515     0%        452,848     97.1%     90,278
Subgenera
Species (valid)     1,347,224   ~100%     1,020,519   ~100%     16,792
Species (synonyms)  895,441     ~100%     440,738     ~100%     100
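The percentages and the “~475k” figure quoted on this slide can be checked with a few lines of arithmetic. All inputs come from the slide itself; only the variable names are invented.

```python
described_extant_species = 1_900_000  # Chapman, 2009 estimate
col_valid_species = 1_347_224         # Cat. of Life 2011, valid species
irmng_genera = 452_848                # IRMNG Oct 2011, genus names

# Share of described extant species covered by the Catalogue of Life.
col_share = col_valid_species / described_extant_species

missing_recent = 10_000               # est. genera missing from 2004-2011
missing_other = (10_000, 15_000)      # "maybe further 2-3% overall"

low = irmng_genera + missing_recent + missing_other[0]
high = irmng_genera + missing_recent + missing_other[1]

print(f"CoL share of extant species: {col_share:.1%}")
print(f"Estimated complete genus list: {low:,} to {high:,}")
```

The range brackets the ~475k total given in the notes, and 2-3% of 452,848 is roughly 9k to 14k, consistent with the “say 10k-15k” allowance.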
Many unfinished tasks
• Update / standardize the higher classification across groups (start made, much still to do)
• Fill gaps in nomenclatural / taxonomic status, synonym reconciliation, family allocations for significant subset of names
• Legacy names acquisition, where currently missing (i.e., not in major nomenclators)
• New names acquisition (~25k species, 2k+ genera / year…), plus taxonomic reallocations – ongoing task, requires resources or (preferably) automated feeds
• Extension to “all species”… ???
Potential integration / replacement with “GN” components…
• MBL staff and collaborators are currently engaged in constructing components of a “Global Names Architecture” i.e.:
• GNI – Global Names Index
• GNUB – Global Names Usage Bank
• GNITE – Global Names Index Taxonomic Editor
• GNA CLR / GBIF ChecklistBank
• CiteBank – publication citation repository
• ZooBank – register for new / old animal names
• more…
• Some / much of this has potential overlap with IRMNG (present focus of my MBL visit).
D. Patterson et al., from 2010-11 NSF proposal
Proposed “Global Names” infrastructure components:
Contact details
Phone: +61 3 6232 5318
Email: [email protected]
Web: www.cmar.csiro.au/datacentre/
Thank you
Thanks to:
- OBIS, GBIF and Atlas of Living Australia for financial support, numerous data providers for data
- CSIRO for salary and in-kind support, 2006-present
- D. Patterson / MBL / NSF (this trip funding + hosting)
Supplementary slides
Where to from here…
The names publishing / discovery landscape:
New names: potential discovery paths
[Diagram. Columns: publication; discovery; official registers; taxon-specific DB’s; integrated DB’s; “all names”. Rows: new virus names; new prokaryote names; new botanical names – algae & fungi (except fossils); new botanical names – bryophytes through angiosperms (except fossils); new zoological names. Nodes: newly published names in primary literature (print, electronic); ICTV Viruses DB; LPSN (Prokaryote names); ICBN Decisions; ICZN Decisions; journal TOC’s, RSS feeds, text mining; abstracting services; subject bibliographies; reviews, secondary literature; Zoological Record; ION (Index of Organism Names); ChecklistBank; GNI; GNUB; ZooBank?; Catalogue of Life annual editions; ITIS; NCBI Taxonomy; WoRMS etc.; CyanoDB; Index Fungorum; MycoBank; AlgaeBase; Plant GSD’s; PaleoDB; Animal GSD’s; other compilations e.g. regional lists, Wikispecies, Wikipedia, more…; IRMNG]
New names: potential discovery paths
[Same diagram as above, annotated: Lots of manual effort]
New names: potential discovery paths
[Same diagram as above, annotated: Lots of automated feeds + expert curation]
New names: potential discovery paths
[Same diagram as above, annotated: Lots of automated feeds + expert curation; Lots of useful services]
How many taxa?
Valid extant + fossil taxa (est.):
Kingdoms (5/6/7/8)
Phyla ~140
Classes ~400
Orders ~2k
Families ~10k
Genera ~250k
Species 2+ million
How many species? Estimates according to Chapman, 2009 (valid, extant taxa only); “others” comprise c. 54k protists, 10k prokaryotes, 2k viruses
NB inverts. includes “~1,000,000” for Insects – probably +/- 60k
Fossil species – no published estimates – maybe 500k names, 300k valid
Relevant information domain: all life
PROTISTS
Fig. i-1 in Margulis & Schwartz, 1998
How many kingdoms…
7 kingdoms (5 in Margulis & Schwartz, 8 in Cat. of Life…):
Animals, Fungi, Plants: 3 kingdoms
Protists: 1 (or 2 if Stramenopiles [Heterokonts] recognized, = Cavalier-Smith’s Kingdom “Chromista”)
Bacteria + Archaea: 2 (=1 in Margulis & Schwartz)
Viruses: 1 (not in Margulis & Schwartz)
Nomenclature governed by four separate Codes, i.e. Zoological, Botanical, Bacteriological, Viruses
[Figure: Fig. i-1 in Margulis & Schwartz, 1998 (with PROTISTS label), annotated with the scope of each Code: Zoo. Code, Bact. Code, Bot. Code; Vir. Code: viruses (not shown)]
CiteBank as a remote references repository?
Unexplored questions…
• How well populated is CiteBank, either now or in the near future?
• Can third party bibliographies be uploaded into it (with / without infringing IP)?
• Zoo. Record and similar operators do this already on a commercial basis – how to reconcile these activities / avoid redundant effort?
• Would CiteBank IDs / outward links be an adequate substitute for storing / inspecting / displaying this info locally?
Parker, 1982 content example
Benton, 1993 content example
Rees TAXAMATCH fuzzy matching poster (start)
Schematic of TAXAMATCH operation