Reconciling succeeding
taxonomic classifications
Nico M. Franz
School of Life Sciences, Arizona State University
Mingmin Chen, Shizhuo Yu, Bertram Ludäscher *
Department of Computer Science, University of California at Davis
ESA Annual Meeting 2012
November 14, 2012 – Knoxville, TN
* PI – NSF-IIS 1118088: A logic-based, provenance-aware system for merging scientific data under context and classification constraints.
Challenge – describing classification provenance beyond synonymy
Source: Weakley. 2005. Flora of the Carolinas, Virginia, and Georgia. Available at http://www.herbarium.unc.edu/flora.htm
Andropogon spp. in the Carolinas, from Hackel 1889 to Weakley 2005
Individual columns represent past classifications of Andropogon.
Individual rows represent equivalent taxonomic entities, (almost) regardless of their name labels.
Name/synonymy relationships are not sufficiently granular to capture this evolution of taxonomic views of Andropogon species.
Tracking classification provenance with concepts and articulations
Definition: A taxonomic concept is the underlying meaning of a scientific name as stated
by a particular author and publication. It represents the author's full-blown
view of how the name reaches out to un-/observed objects in nature.
Labeling: The abbreviation sec. for the Latin secundum, or "according to", is preceded by
the full Linnaean name and followed by the specific author and publication.
Examples: Andropogon virginicus L. sec. Radford et al. (1968) [earlier, wider concept]
Andropogon virginicus L. sec. Weakley (2005) [later, narrower concept]
Utility: Representing multiple classifications (revisions) through concepts makes it possible
to track their similarities and differences through articulations.
Source: Berendsohn. 1995. The concept of "potential taxa" in databases. Taxon 44: 207–212.
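The labeling convention above can be sketched in a few lines of Python. This is a minimal illustration, not part of CleanTax; the class name `TaxonomicConcept` and the helper `parse_label` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonomicConcept:
    """A name as used by one particular author and publication (hypothetical class)."""
    name: str    # full Linnaean name, e.g. "Andropogon virginicus L."
    source: str  # author and publication, e.g. "Weakley (2005)"

    def label(self) -> str:
        # "sec." (secundum) sits between the name and the source.
        return f"{self.name} sec. {self.source}"


def parse_label(label: str) -> TaxonomicConcept:
    """Split a 'name sec. source' label back into its two parts."""
    name, sep, source = label.partition(" sec. ")
    if not sep:
        raise ValueError(f"not a concept label: {label!r}")
    return TaxonomicConcept(name.strip(), source.strip())


wide = TaxonomicConcept("Andropogon virginicus L.", "Radford et al. (1968)")
narrow = TaxonomicConcept("Andropogon virginicus L.", "Weakley (2005)")

print(wide.label())
print(narrow.label())
print(wide == narrow)  # same name, different concepts
```

The two objects share a name but compare as unequal, which is the point of the concept approach: the name alone does not identify the taxonomic view.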
Five basic articulations between two concepts C1, C2 (set theory)
equivalence
proper inclusion
inverse proper inclusion
overlap
exclusion
Source: Franz & Peet. 2009. Towards a language for mapping relationships among taxonomic concepts. Syst. Biodiv. 7: 5–20.
Use of "OR" to express uncertainty. Example: C1 == OR > C2
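Read as set theory, the five articulations correspond to the five ways two non-empty sets can relate. The sketch below assumes each concept is modeled as a Python set of member identifiers; the function names `articulation` and `satisfies` are hypothetical and only illustrate the definitions.

```python
def articulation(c1: frozenset, c2: frozenset) -> str:
    """Return the one basic articulation that holds between two non-empty sets."""
    if not c1 or not c2:
        raise ValueError("concepts are assumed to be non-empty")
    if c1 == c2:
        return "equivalence"
    if c1 > c2:   # c2 is a proper subset of c1
        return "proper inclusion"
    if c1 < c2:   # c1 is a proper subset of c2
        return "inverse proper inclusion"
    if c1 & c2:   # some shared members, neither contains the other
        return "overlap"
    return "exclusion"


def satisfies(c1: frozenset, c2: frozenset, allowed: set) -> bool:
    """An 'OR' articulation holds if the actual relation is one of the allowed ones."""
    return articulation(c1, c2) in allowed


a = frozenset({1, 2, 3})
b = frozenset({2, 3})
print(articulation(a, b))
print(satisfies(a, b, {"equivalence", "proper inclusion"}))  # C1 == OR > C2
```

Exactly one of the five relations holds for any pair of non-empty sets, which is why an uncertain statement can be written as a disjunction of them.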
How does it work? Connecting Hackel 1889 and Small 1933
Hackel 1889 (1-12)
Small 1933 (13-16)
Step 1: Transcribe two concept hierarchies and add unique IDs
Step 2: Create a table with all concept labels
Step 3: Create a table with corresponding parent/child relationships ('is_a')
Step 4: Create a table with a suitable set of articulations
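The three tables from Steps 2 to 4 might look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the table names, column names, labels, parent/child rows and articulations are all hypothetical placeholders; only the ID ranges (1-12 for Hackel 1889, 13-16 for Small 1933) come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    id INTEGER PRIMARY KEY,
    label TEXT NOT NULL,
    taxonomy TEXT NOT NULL
);
CREATE TABLE is_a (
    child INTEGER REFERENCES concept(id),
    parent INTEGER REFERENCES concept(id)
);
CREATE TABLE articulation (
    c1 INTEGER REFERENCES concept(id),
    relation TEXT NOT NULL,
    c2 INTEGER REFERENCES concept(id)
);
""")

# Step 2: concept labels (placeholder labels, real ones are not reproduced here).
concepts = [(i, f"placeholder concept {i}", "Hackel 1889") for i in range(1, 13)]
concepts += [(i, f"placeholder concept {i}", "Small 1933") for i in range(13, 17)]
conn.executemany("INSERT INTO concept VALUES (?, ?, ?)", concepts)

# Step 3: parent/child rows (hypothetical: one root per taxonomy).
is_a = [(i, 1) for i in range(2, 13)] + [(i, 13) for i in range(14, 17)]
conn.executemany("INSERT INTO is_a VALUES (?, ?)", is_a)

# Step 4: articulations between the two taxonomies (hypothetical examples).
conn.executemany(
    "INSERT INTO articulation VALUES (?, ?, ?)",
    [(1, "==", 13), (2, ">", 14)],
)

counts = dict(conn.execute("SELECT taxonomy, COUNT(*) FROM concept GROUP BY taxonomy"))
cross = conn.execute("""
    SELECT a.c1, a.relation, a.c2
    FROM articulation a
    JOIN concept x ON x.id = a.c1
    JOIN concept y ON y.id = a.c2
    WHERE x.taxonomy != y.taxonomy
""").fetchall()
print(counts)
print(cross)
```

Keeping the hierarchy rows and the articulation rows in separate tables mirrors the split between Step 3 and Step 4: the first describes each classification on its own, the second connects them.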
[Figure: connecting Hackel 1889 and Small 1933; labels: Translation, Congruence, Concept hierarchies, Articulations]
Technical challenges to creating articulations
Input of concept hierarchies
Lack of a server-based platform (e.g. Global Names Architecture)
Lack of user-friendly classification input / visualization tools
Input of articulations (goal: achieve a complete and consistent mapping)
Taxonomic experts will not input ∞ articulations
Taxonomic experts will miss relevant articulations ("mir")
Taxonomic experts could be uncertain of articulations ("possible worlds")
Taxonomic experts could posit logically inconsistent articulations
"CleanTax" is being developed to explore solutions to these challenges. 1
1 There is continuation/overlap with the "Exploring Taxonomic Concepts" project that focuses on character matching (DBI-1147266).
CleanTax – technical specifications
CleanTax = a set of Python programming scripts stored on bitbucket.org
(initially developed by Dave Thau; now being developed further on many fronts)
CleanTax reads in concept/articulation tables from a PostgreSQL database
CleanTax transforms the input for processing by logic reasoners, including:
Prover9 / Mace4 theorem provers – first-order logic [thorough, yet slow]
OWL / HermiT – description logic, knowledge representation [complex]
DLV System – propositional logic, answer set programming [promising!]
CleanTax assesses consistency and completeness of articulations
Output of the set of maximally informative relationships – "mir"
Report, causal explanation, interactive repair of inconsistent articulations
Calculate multiple possible worlds (if ambiguous articulations are present)
CleanTax creates multiple user-preferred views of the input and merge taxonomies
Reduced Containment Graph – RCG; and Directed Acyclic Graph – DAG
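What "consistent" and "maximally informative" mean can be shown on a toy scale without any external reasoner. The sketch below is a brute-force stand-in, not the Prover9, HermiT or DLV encodings CleanTax uses: it enumerates every combination of non-empty Venn regions for a handful of concepts and keeps those that satisfy the given articulations. The symbols `<` (inverse proper inclusion) and `|` (exclusion) are assumptions made for this example, as are all function names and the three-concept input.

```python
from itertools import combinations


def has(region: int, i: int) -> bool:
    """True if Venn region `region` (a bitmask) lies inside concept number i."""
    return bool((region >> i) & 1)


def relation(world, i, j) -> str:
    """Base relation between concepts i and j, given the non-empty regions."""
    both = any(has(r, i) and has(r, j) for r in world)
    only_i = any(has(r, i) and not has(r, j) for r in world)
    only_j = any(has(r, j) and not has(r, i) for r in world)
    if not both:
        return "|"
    if only_i and only_j:
        return "><"
    if only_i:
        return ">"
    if only_j:
        return "<"
    return "=="


def worlds(n):
    """Every set of non-empty Venn regions in which all n concepts are non-empty."""
    regions = list(range(1, 2 ** n))
    for mask in range(1, 2 ** len(regions)):
        world = [r for k, r in enumerate(regions) if (mask >> k) & 1]
        if all(any(has(r, i) for r in world) for i in range(n)):
            yield world


def check(concepts, constraints):
    """Return (consistent?, possible base relations per concept pair)."""
    index = {c: i for i, c in enumerate(concepts)}
    mir, found = {}, False
    for world in worlds(len(concepts)):
        if all(relation(world, index[a], index[b]) in allowed
               for a, allowed, b in constraints):
            found = True
            for a, b in combinations(concepts, 2):
                mir.setdefault((a, b), set()).add(relation(world, index[a], index[b]))
    return found, mir


# Hypothetical input: A is the parent of B in one taxonomy, C is from another.
given = [("A", {">"}, "B"), ("B", {"=="}, "C")]
consistent, mir = check(["A", "B", "C"], given)
print(consistent, mir[("A", "C")])  # the A/C relation was never stated by the expert

broken, _ = check(["A", "B", "C"], given + [("A", {"|"}, "C")])
print(broken)
```

The search space grows doubly exponentially with the number of concepts, so this only works for a few concepts; that scaling problem is one reason dedicated reasoners are used instead.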
'Training' CleanTax on abstract examples
Input: initial expert-made set of articulations (New!)
Output: raw HTML list of articulations ("look-up" + inferred)
Output: 72 maximally informative relationships = mir
Based on the mir, all theoretically possible articulations
of the R32 lattice can be logically deduced.
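The name R32 reflects simple arithmetic: five basic articulations give 2^5 = 32 possible disjunctions, from the empty set to "any of the five". A short sketch makes the lattice concrete; the relation symbols `<` and `|` and the helper `label` are assumptions for illustration only.

```python
from itertools import combinations

# Five base relations; "<" and "|" are assumed symbols for
# inverse proper inclusion and exclusion.
BASE = ("==", ">", "<", "><", "|")

# Every subset of the base relations is one element of the lattice.
lattice = [
    frozenset(subset)
    for size in range(len(BASE) + 1)
    for subset in combinations(BASE, size)
]


def label(element: frozenset) -> str:
    """Write a lattice element as an OR-expression in a fixed symbol order."""
    return " OR ".join(r for r in BASE if r in element) or "(no relation possible)"


print(len(lattice))
print(label(frozenset({"==", ">"})))

# A smaller set is more informative: it rules out more alternatives.
singletons = [e for e in lattice if len(e) == 1]
print(len(singletons))
```

In this reading, a singleton is a fully resolved articulation, and an element is deducible from a more specific one whenever it is a superset of it.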
Abstract Example 1 – Reduced Containment Graph of the merge
Blue circles: shared concepts
Black circles: unique concepts
Black solid arrows: expert input
Grey dashed arrows: deducible
Red solid arrows: newly inferred
More CleanTax training… our infamous Abstract Example 4
Example 4 – representing multiple 'possible worlds'
3 of 5 articulations are disjunctive (OR)
Reduced Containment Graphs of 7 'possible worlds' (combined ORs)
Example 4 – CleanTax infers 7 possible worlds (user can view / select / repair / rerun)
Reduced Containment Graphs (RCGs) – legend: asserted by expert; implied articulations; inferred by CleanTax; shared concepts; unique concepts
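How disjunctive articulations multiply into alternatives can be sketched as a Cartesian product. The input below is hypothetical and is not the slide's Example 4; it only shares the shape "five articulations, three of them with OR". The result is a list of candidate combinations, which a reasoner would still have to check for consistency and for any remaining ambiguity before they count as possible worlds.

```python
from itertools import product

# Hypothetical input: (concept, allowed relations, concept).
# "<" and "|" are assumed symbols for inverse proper inclusion and exclusion.
articulations = [
    ("1.A", ["=="], "2.A"),
    ("1.B", ["==", ">"], "2.B"),
    ("1.C", ["><", "|"], "2.C"),
    ("1.D", ["==", "<"], "2.D"),
    ("1.E", [">"], "2.E"),
]


def candidate_combinations(arts):
    """Yield every way of picking one relation per articulation."""
    option_lists = [options for _, options, _ in arts]
    for choice in product(*option_lists):
        yield [(c1, rel, c2) for (c1, _, c2), rel in zip(arts, choice)]


candidates = list(candidate_combinations(articulations))
print(len(candidates))
for row in candidates[0]:
    print(row)
```

The number of consistent worlds need not equal the number of candidates: some combinations can contradict the hierarchies, which is why the view / select / repair / rerun loop matters.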
Exploring "views" of the merge - circular Euler diagrams of PW1
Table of mir and corresponding Euler diagram (circular): identical information content
Correspondence of circular and Directed Acyclic Diagrams
PW1: typical Euler circles and Euler-DAG of PW1: identical information content
Real-life examples
Real-life examples, I – reconciling two weevil classifications 1
Curculionoidea sec. Kuschel 1995: concepts 117-157
Curculionoidea sec. Marvaldi & Morrone 2000: concepts 348-372
1 Initial articulations provided by NMF.
Merge taxonomy of Kuschel 1995 / Marvaldi & Morrone 2000
CleanTax RCG – 1 newly inferred articulation + several inconsistencies
Microcerinae sec. M&M 2000 [363] are included in Brachycerinae sec. KU 1995 [148]
(yes, I missed that; Kuschel 1995 only mentions it in the text, not in the main taxon list)
Real-life examples, II – reconciling two weevil classifications
Curculionoidea sec. Crowson 1981: concepts 1-17
Curculionoidea sec. Marvaldi & Morrone 2000: concepts 348-372
CleanTax RCG – 4 newly inferred articulations; the RCG does not depict overlap (><)
e.g. {Aglycyderidae [2], Allocorynidae [3], Oxycorynidae [17]} sec. Crowson 1981
are included in Belidae [353] sec. M&M 2000
Merge taxonomy of Crowson 1981 / Marvaldi & Morrone 2000
Euler-DAG of the Crowson / Marvaldi & Morrone merge taxonomy
Solid lines – proper inclusion
Black solid line: given
Green solid line: inferred
Orange solid line: explanatory
[Red solid line: inconsistent]
Dashed lines – overlap
Black dashed line: given
Green dashed line: inferred
Orange dashed line: explanatory
Red dashed line: inconsistent
Concept boxes – concepts
Orange square box: shared
Black square box: unique
Dashed square box: combined
Dashed oval box: inconsistent
DAGs generate "combined concepts": intersections of overlaps
Belidae sec. MM2000
Belidae sec. Cro1981
"Belidae" INT(Cro/MM)
Shared - [2,3,17,357]
New naming/viewing conventions – simple merges (shared, unique) *
Input, Concept A: A = Attelabidae CR81, AttCR81 [9]
Input, Concept B: B = Attelabidae MM00, AttMM00 [55]
Output, Concept A – Concept B: AB = Attelabidae CR81 – Attelabidae MM00, AttCR81.AttMM00
* Simple extension to three or more congruent concepts.
New naming/viewing conventions – combined merges (overlap; T1, T2)
Input, Concept A: A = Belidae CR81, BelCR81 [10]
Input, Concept B: B = Belidae MM00, BelMM00 [353]
Euler / DAG regions: Ab, AB, aB
Ab = BelCR81.belMM00
AB = BelCR81.BelMM00
aB = BelMM00.belCR81
New naming/viewing conventions – combined merges (overlap; T1, T2, T3)
Input, Concept A: A = Curculionidae CR81, CurCR81
Input, Concept B: B = Curculionidae KU95, CurKU95
Input, Concept C: C = Curculionidae s.s. MM00, CurMM00
Euler / DAG regions: Abc, ABc, aBc, AbC, ABC, aBC, abC
Abc = CurCR81.curKU95.curMM00
aBc = CurKU95.curCR81.curMM00
abC = CurMM00.curCR81.curKU95
AbC = CurCR81.CurMM00.curKU95
aBC = CurKU95.CurMM00.curCR81
ABc = CurCR81.CurKU95.curMM00
ABC = CurCR81.CurKU95.CurMM00
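The naming pattern in these examples appears to be mechanical: a capitalized abbreviation marks a region inside that concept, a lower-case one marks a region outside it, with the "inside" parts listed first. Under that reading (an inference from the examples, not a stated rule), the region names can be generated as follows; the function `region_names` is a hypothetical helper.

```python
from itertools import product


def region_names(abbrevs):
    """Map region keys such as 'AbC' to dotted names such as 'CurCR81.CurMM00.curKU95'."""
    letters = [chr(ord("A") + i) for i in range(len(abbrevs))]
    names = {}
    for inside_flags in product([True, False], repeat=len(abbrevs)):
        if not any(inside_flags):
            continue  # the region outside every concept gets no name
        key = "".join(
            letter if flag else letter.lower()
            for letter, flag in zip(letters, inside_flags)
        )
        inside = [a for a, flag in zip(abbrevs, inside_flags) if flag]
        outside = [a[0].lower() + a[1:] for a, flag in zip(abbrevs, inside_flags) if not flag]
        names[key] = ".".join(inside + outside)
    return names


two = region_names(["BelCR81", "BelMM00"])
three = region_names(["CurCR81", "CurKU95", "CurMM00"])
print(two)
print(three["AbC"])
print(len(three))
```

For n overlapping concepts this yields 2^n - 1 named regions, which matches the three names in the two-taxonomy case and the seven names in the three-taxonomy case.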
Future directions
Current workflow / "usability" (CleanTax on "Lore" server, UC Davis)
Input script
Output file
Inconsistency: repair, explanation
Possible worlds
Visualization: Euler-DAG
Interactive reduction of PWs (decision tree)
Shared, real use cases (Perelleschus) with ETC feature-based project
5 taxonomies, 48 concepts, expert articulations, plus textual feature diagnoses
Conclusions and outlook
Improvements to CleanTax will remove many of the technical challenges towards a
full-blown taxon concept approach (improved tracking of classification provenance).
Other technical challenges are being addressed (server platform, algorithmic
scalability, intensional/ostensive articulations, visualization [Euler, combined
concepts], workflow integration).
Many non-technical challenges remain (in short: transparent/consistent use).
The current approach treats concepts as a 'black box' – the input data are simple and
make no reference to type specimens, synapomorphies, diagnostic features, etc.
The "Exploring Taxonomic Concepts" project will develop tools for a balanced view.
Nevertheless, the articulations can expose deep and varied semantic links among
succeeding classifications.
CleanTax may be the first attempt to 'explain' classification provenance to logic
reasoners. This could have considerable implications for future data integration.
Acknowledgments
Shawn Bowers, Dave Thau, Alan Weakley
NSF-IIS 1118088: "III-SMALL: A logic-based, provenance-aware system for merging scientific data under
context and classification constraints"
"Euler" team, UC Davis