Post on 16-Apr-2017
Addressing the name:meaning drift challenge in open biodiversity
information environments
Please
@taxonbytes
Nico M. Franz¹, Salvatore A. Anzaldo¹, Edward E. Gilbert¹,
M. Andrew Jansen¹, M. Andrew Johnston¹ & Bertram Ludäscher²
¹ School of Life Sciences, Arizona State University
² iSchool, University of Illinois at Urbana-Champaign
Symposium: Building the Biodiversity Knowledge Graph for Insects – Components, Progress, and Challenges
2016 XXV International Congress of Entomology, Orlando, FL – September 26, 2016 (#ICE2016)
Presentation available @ SlideShare: http://tinyurl.com/franz-et-al-ice-2016
Our biodiversity informatics research program, summarized
• We are no longer just putting articles and monographs on library shelves.
• This is more than 'just technology'; we must develop new systematic theory
to deal with inherently dynamic, open data systems.
• The concept taxonomy approach has practical implications for strengthening
the roles that individual experts play in big biodiversity data environments.
Products – concept taxonomy in theory and in practice
ZooKeys. doi:10.3897/zookeys.528.6001
Semantic Web. doi:10.3233/SW-160220
Biological Theory (in review). doi:10.1101/022145
PloS ONE. doi:10.1371/journal.pone.0118247
Systematics Biodiv. doi:10.1080/14772000.2013.806371
Systematic Biology. doi:10.1093/sysbio/syw023
Biodiversity Data Journal (in review). #6093
Research Ideas and Outcomes (in review). #6302
Premise: We're lucky that insect revisions are not so frequent
"In biology, there are many taxa that are so under-studied that they are only known from their original description and
none or very few subsequent references […].
The name alone, so long as it is a unique name, is sufficient to locate all related material."
– David Remsen 2016: 213
Source: Remsen. 2016. The use and limits of scientific names […]. ZooKeys 550: 207–223. doi:10.3897/zookeys.550.9546
Diagnosis:
What happens in dynamic, open systems?
Snapshot of a more frequently revised organismal lineage
Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)
• 9 schemata for the NA Cleistes/Cleistesiopsis complex (orchids)
• Vertical sections identify taxonomic concept regions
• Colors identify lineages of taxonomic names (epithets) in use
• There is no consensus! Five incongruent schemata are used concurrently
Premise:
If incongruent taxonomies are endorsed – locally, provisionally, and democratically –
then what is the impact for aggregated biodiversity data?
Conclusion:
Taxonomy becomes a variable that we need to represent,
and control for
Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)
Example: the Cleistes use case
• Query: "Where do these orchid species occur?"
• Same set of 250 orchid specimens, according to 4 taxonomies.
[Figure, "Controlling the taxonomic variable": map panel – The 'consensus']
Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)
[Figure, "Controlling the taxonomic variable": four map panels – The 'consensus'; The 'bible'; The (formerly) federal 'standard'; The 'best', latest regional flora]
Expert views are in conflict
"Just bad"
Impact: Name-based aggregation has created
a novel synthesis that nobody believes in
Expert views are reconciled
Solution: Instead of aggregating
an artificial 'consensus', build translation services
Challenges:
How can we redesign aggregation to yield high-quality biodiversity data packages?
What does this mean for Darwin Core¹
and how we use this aggregation standard?
¹ Wieczorek et al. 2012. Darwin Core: an evolving […]. PLoS ONE 7(1): e29715. doi:10.1371/journal.pone.0029715
Preview of solution with 8 steps
• Darwin Core (DwC) is insufficient, and part of the problem
Step 7:
# 1: Represent only taxonomic concept labels (TCLs)¹
• Syntax (TCL): taxonomic name [author, year, page] sec. source
¹ Multi-taxonomy input/alignment visualizations generated with Euler/X toolkit: https://github.com/EulerProject/EulerX
Cleistes divaricata sec. Gregg & Catling 1993
Pogonia sec. Brown & Wunderlin 1997
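The TCL syntax can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class `TCL` and the helper `parse_tcl` are hypothetical names, and the parser assumes the literal separator " sec. " between name and source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TCL:
    """Taxonomic concept label: a name as used according to (sec.) a source."""
    name: str    # taxonomic name, optionally with author, year, page
    source: str  # the treatment in which this usage of the name appears

    def __str__(self) -> str:
        return f"{self.name} sec. {self.source}"


def parse_tcl(label: str) -> TCL:
    """Split 'name sec. source' at the first ' sec. ' separator (assumed format)."""
    name, sep, source = label.partition(" sec. ")
    if not sep or not name.strip() or not source.strip():
        raise ValueError(f"not a taxonomic concept label: {label!r}")
    return TCL(name.strip(), source.strip())


a = parse_tcl("Cleistes divaricata sec. Gregg & Catling 1993")
b = parse_tcl("Cleistes divaricata sec. Smith et al. 2004")
print(a.name == b.name)  # True: the same taxonomic name ...
print(a == b)            # False: ... but two distinct concept labels
```

Two labels that share a name but differ in source compare as unequal, which is the point of representing TCLs rather than bare names.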
# 1: DwC score keeping – TCLs are optional; < 1% realized?
• TCL ~ DwC: nameAccordingTo
• SCAN: 19,722 of nearly 9 million records have TCLs (0.2%)
• Not enforcing the use of TCLs makes the standard less ready for big data
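A score of this kind can be computed by counting records with a non-empty nameAccordingTo value. The sketch below assumes records are available as plain dictionaries keyed by Darwin Core term names; the function `tcl_coverage` and the four toy records are invented for illustration and are not SCAN data.

```python
def tcl_coverage(records):
    """Return (count, share) of records whose nameAccordingTo is non-empty."""
    with_tcl = sum(1 for r in records if (r.get("nameAccordingTo") or "").strip())
    share = with_tcl / len(records) if records else 0.0
    return with_tcl, share


# Toy records (hypothetical values), using Darwin Core term names as keys.
records = [
    {"catalogNumber": "EX-1", "scientificName": "Aus bus", "nameAccordingTo": "Old 2000"},
    {"catalogNumber": "EX-2", "scientificName": "Aus bus", "nameAccordingTo": ""},
    {"catalogNumber": "EX-3", "scientificName": "Aus bus"},
    {"catalogNumber": "EX-4", "scientificName": "Aus cus", "nameAccordingTo": None},
]

count, share = tcl_coverage(records)
print(f"{count} of {len(records)} records carry a TCL ({share:.1%})")
```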
DwC record with nameAccordingTo (TCL) (BDJ)
"Who authors GBIF's Backbone?" https://storify.com/taxonbytes/who-authors-gbif-s-backbone
# 2: Represent each source coherently (Parent-Child relationships)
• Syntax (PC): TCL1 is a child/parent of TCL2 [where TCL1/2 = same source]
Cleistesiopsis bifaria sec. Pans. & de Barr. 2008
is a child of
Cleistesiopsis sec. Pans. & de Barr. 2008
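The same-source constraint on parent-child (PC) relations can be sketched as a check at insertion time. This is an assumption-laden illustration: `TCL` and `add_parent_child` are hypothetical names, and the hierarchy is stored as a plain dictionary from parent to children.

```python
from typing import NamedTuple


class TCL(NamedTuple):
    name: str
    source: str


def add_parent_child(edges, child, parent):
    """Record 'child is a child of parent', refusing cross-source edges."""
    if child.source != parent.source:
        raise ValueError("parent and child must be sec. the same source")
    edges.setdefault(parent, set()).add(child)


src = "Pans. & de Barr. 2008"
genus = TCL("Cleistesiopsis", src)
species = TCL("Cleistesiopsis bifaria", src)

edges = {}
add_parent_child(edges, species, genus)
print(species in edges[genus])  # True
```

Keeping every edge within one source is what preserves the taxonomic coherence of that source; relations across sources are expressed separately.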
# 2: DwC score keeping – Not (adequately) represented
• PC ~ DwC: genus, family, order (etc.; higherClassification)
• However, higher-level names in DwC are not modeled as TCLs
• Taxonomic coherence of sources cannot be preserved with DwC alone
DwC record with higherClassification (BDJ)
# 3: Do not force a single hierarchy onto all tip-level TCLs
• Syntax (PC): Tip-level TCL1, TCL2, etc. [where TCL1/2 = different sources]
# 3: DwC score keeping – Optional; not (ever?) practiced
• No PC ~ DwC: infra-/specificEpithet only
• Typically, a single, 'unitary' higher-level classification is represented
• Combinations of algorithmic and social practices achieve the single hierarchy
"Who authors GBIF's Backbone?" https://storify.com/taxonbytes/who-authors-gbif-s-backbone
# 4: Link TCLs via expert-provided RCC–5 articulations
• Syntax (RCC–5): TCL1 {==, >, <, ><, !} TCL2 [where TCL1/2 = diff. sources]
• RCC–5 = Region Connection Calculus
• 14 articulations provided by: http://tinyurl.com/Weakley-Flora-2015
Cleistes bifaria "Coastal Populations" sec. Smith et al. 2004
== (is congruent with)
Cleistesiopsis oricamporum sec. Brown & Pans. 2009
Source: Thau, D.M. 2010. Reasoning about taxonomies. Thesis, UC Davis. http://gradworks.proquest.com/3422778.pdf
Region Connection Calculus (semantics: set constraints)
Symbols: ==, <, >, ><, !
• Two regions N, M are either:
• congruent (N == M)
• properly inclusive (N < M)
• inversely properly inclusive (N > M)
• overlapping (N >< M)
• exclusive of each other (N ! M)
• RCC–5 articulations answer the query: "can we join regions N and M?"
• Taxonomies have multiple RCC–5 alignable components: nodes (parents, children), node-associated traits, even node-anchoring specimens
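Read as set constraints, the five relations can be sketched with Python sets. This is only an extensional illustration: the function `rcc5` is a hypothetical helper, and regions are assumed to be non-empty sets of members, whereas expert-provided articulations need not be derived this way.

```python
def rcc5(n: set, m: set) -> str:
    """Return the RCC-5 relation between two non-empty regions modeled as sets."""
    if not n or not m:
        raise ValueError("RCC-5 regions must be non-empty")
    if n == m:
        return "=="   # congruent
    if n < m:
        return "<"    # N is properly included in M
    if n > m:
        return ">"    # N properly includes M
    if n & m:
        return "><"   # overlapping: shared members, and each has members of its own
    return "!"        # exclusive of each other


print(rcc5({1, 2}, {1, 2}))     # ==
print(rcc5({1}, {1, 2}))        # <
print(rcc5({1, 2, 3}, {2, 3}))  # >
print(rcc5({1, 2}, {2, 3}))     # ><
print(rcc5({1}, {2}))           # !
```

For any two non-empty regions exactly one of the five relations holds, which is why an articulation can be written as a subset of the five symbols when the expert is uncertain.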
# 4: DwC score keeping – Not (adequately) represented
• RCC–5 ~ DwC: accepted(Scientific)Name(Usage), relationshipOfResource, taxonomicStatus (etc.; nomenclatural relationships)
• Nomenclatural relationships are type-focused, not region-focused
• "Taxonomic Concept Schema" yes! (however: http://www.tdwg.org/standards/117)
Source: Vane-Wright. 2003. Indifferent philosophy versus […]. Syst. Biodiv. 1: 3–11. doi:10.1017/S1477200003001063
Example: Milkweed butterflies
Oscillating meanings of the epithet hyalites – 1911 to 2003
[Figure labels: Phenotypic diversity; Type-anchored name identity relations]
# 5: Identify occurrence records only to TCLs
Records: EKY39235, MTSU003611, NCSC00040204, …
Records: BOON8098, CLEMS0061133, WILLI39399, …
Records: GMUF-0039355, IBE006808, USCH58399, …
Records: CONV0006268, MDKY00006482, NCU00038930, …
Records: BRYV0023582, BRYV0023584, KHD00032030, MISS0016604, MMNS000227, NCSC00040206, USMS_000002923, USMS_000002924, VSC0053223, VSC0065528, …
Records: ARIZ393087, DBG39049, USCH51217, …
Records: NCU00040710, USCH96248, VSC0053218, …
Records: CLEMS0012881, FUGR0003293, GA023130, …
Records: BOON8100, NCSC00040210, SJNM45487, …
Records: GA023144, LSU00012494, MISS0016608, …
Records: IBE006810, IND-0012374, MMNS000227
Records: NY8654
• Syntax (ID): Occurrence / organism is identified to TCL
"CLEMS0012881" is identified to
Cleistes divaricata sec. Smith et al. 2004
[additional ID metadata]
DwC record with Identification metadata (BDJ)
# 5: DwC score keeping – ID metadata optional; > 50% realized
• ID ~ DwC: Identification, (date)identified(By), identificationReference
• SCAN: 4,715,277 of nearly 9 million records have ID metadata (52.5%)
• Enforcement … would still also require the use of TCLs
# 6: Generate comprehensive, consistent RCC–5 alignments
• Euler/X is a toolkit that infers logically consistent RCC–5 alignments
• Value-added: MIR – set of Maximally Informative Relations containing
the RCC–5 articulation for every possible TCL pair (scalability)
Reasoner inference
# 7: Joining occurrence-to-TCL identifications & RCC–5 alignments
Records: BOON8098, CLEMS0061133, CONV0006268, EKY39235, GMUF-0039355, IBE006808, IBE006810, IND-0012374, MDKY00006482, MMNS000227, MTSU003611, NCSC00040204, NCU00038930, NY8654, USCH58399, WILLI39399, …
Records: ARIZ393087, BRYV0023582, BRYV0023584, DBG39049, KHD00032030, MISS0016604, MMNS000227, NCSC00040206, USMS_000002923, USMS_000002924, VSC0053223, VSC0065528, …
Records: BOON8100, CLEMS0012881, FUGR0003293, GA023130, GA023144, LSU00012494, MISS0016608, NCSC00040210, NCU00040710, SJNM45487, USCH96248, VSC0053218, …
• Specimen integration is fully driven by TCL-to-TCL RCC–5 signals
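The join can be sketched as a lookup from each record's TCL to the target-taxonomy TCLs it is articulated with. Everything here is hypothetical: the function `translate`, the record ids R1–R2, and the placeholder labels "Aus … sec. Old 2000 / New 2015" are invented, and the rule that "==" and "<" give certain placement while ">" and "><" give only possible placement is this sketch's reading of the RCC–5 semantics, not a description of Euler/X.

```python
def translate(identifications, articulations, target_source):
    """Map records to target-source TCLs they certainly or possibly fall under."""
    certain, possible = {}, {}
    suffix = " sec. " + target_source
    for record, tcl in identifications.items():
        for a, rel, b in articulations:
            if a != tcl or not b.endswith(suffix):
                continue
            if rel in ("==", "<"):
                # The source concept lies entirely within the target concept.
                certain.setdefault(b, set()).add(record)
            elif rel in (">", "><"):
                # The target covers only part of the source concept: ambiguous.
                possible.setdefault(b, set()).add(record)
            # "!" contributes nothing: the concepts share no members.
    return certain, possible


# Hypothetical identifications and articulations (placeholder names).
identifications = {"R1": "Aus bus sec. Old 2000", "R2": "Aus dus sec. Old 2000"}
articulations = [
    ("Aus bus sec. Old 2000", ">", "Aus bus sec. New 2015"),
    ("Aus bus sec. Old 2000", ">", "Aus cus sec. New 2015"),
    ("Aus dus sec. Old 2000", "==", "Aus dus sec. New 2015"),
]

certain, possible = translate(identifications, articulations, "New 2015")
print(certain)   # {'Aus dus sec. New 2015': {'R2'}}
print(possible)  # R1 is listed under both narrower New 2015 concepts
```

Records placed only as "possible" are the ones that would need re-identification to the more granular concepts.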
Impact: "Please select your preference (A – D);
we can perform all translations"
# 8: "Do you trust us now?" Aggregation as a translational service
• We can now respond to queries such as:
• "Show all specimens identified to the taxonomic name Cleistes divaricata"
• Returns many records, resolving an incongruent lineage of name usages
• "Now show specimens with the TCL Cleistesiopsis divaricata sec. Weakley 2015"
• Returns record subset resolving only one narrowly circumscribed concept
• "Now show specimens identified to the TCL Cleistes divaricata sec. RAB 1968,
yet translated into the more granular TCLs sec. Weakley 2015"
• Returns (again) many records, yet represents and contrasts two treatments,
as opposed to providing the ambiguous lineage view (above)
• "Show all specimens with ambiguous 2010/2015 TCL identifications…" (etc.)
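The difference between the first two queries can be sketched over a toy identification table. The record ids, names, and sources below are hypothetical placeholders, and `by_name` and `by_tcl` are illustrative helpers rather than an existing service API.

```python
# Hypothetical identifications: (record id, taxonomic name, source of the TCL).
IDENTIFICATIONS = [
    ("R1", "Aus bus", "Old 2000"),
    ("R2", "Aus bus", "New 2015"),
    ("R3", "Aus cus", "New 2015"),
]


def by_name(name):
    """Name-based query: pools every usage of the name, whatever its source."""
    return sorted(r for r, n, _ in IDENTIFICATIONS if n == name)


def by_tcl(name, source):
    """TCL-based query: only records identified to this one concept."""
    return sorted(r for r, n, s in IDENTIFICATIONS if (n, s) == (name, source))


print(by_name("Aus bus"))             # ['R1', 'R2']
print(by_tcl("Aus bus", "New 2015"))  # ['R2']
```

The name-based result mixes two circumscriptions; the TCL-based result returns the subset tied to a single treatment.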
Conclusions – designing trusted biodiversity data services
• The Darwin Core standard for aggregating biodiversity data:
(1) Has under-utilized options for better representing taxonomic expertise
(2) Is part of a design paradigm that undermines the plurality of expertise
• We are developing new solutions – including TCLs, PC relations, RCC–5,
and scalable logic applications – that realize data aggregation via
translational services, without disrupting the formation of expert-licensed,
high-quality biodiversity data packages
• All of us – not just aggregators – "own" the responsibility of designing
systems where the plurality of taxonomic expertise is fairly accommodated
Acknowledgments & links to products
• Cleistes use case: Alan Weakley (UNC)
• Euler/X toolkit: Shizhuo Yu (UC Davis)
• Data trajectories: Beckett Sterner (ASU)
• OBKMS design: Viktor Senderov (Pensoft)
• NSF DEB–1155984, DBI–1342595 (PI Franz)
• NSF IIS–118088, DBI–1147273 (PI Ludäscher)
• Euler/X code @ https://github.com/EulerProject/EulerX
• Franz et al. 2016. Two influential primate classifications logically aligned. Systematic Biology 65(4): 561–582.
Interested in exploring multi-taxonomy and/or -phylogeny alignments?
Please contact me.
nico.franz@asu.edu
@taxonbytes
https://biokic.asu.edu/