![Page 1: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/1.jpg)
BIS TDWG Conference
28 October 2013, Florence
Documenting data quality in a global network: the challenge for GBIF
Éamonn Ó Tuama, Andrea Hahn, Markus Döring
Global Biodiversity Information Facility (GBIF)
![Page 2: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/2.jpg)
Outline
1. The GBIF network and the Data Quality challenge
2. Current DQ processes in GBIF Portal
3. DQ and GBIF Nodes
4. Addressing DQ in GBIF work programme 2014-2016
![Page 3: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/3.jpg)
GBIF is …
- a connected community
- an informatics infrastructure
- a window on biodiversity
- a tool for science and society
http://www.gbif.org/resources/2311
![Page 4: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/4.jpg)
Addressing data quality
Meeting the challenge of documenting data quality as the network and volume of data grow …
![Page 5: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/5.jpg)
[Chart: primary biodiversity records (millions), y-axis from 80 to 440, quarterly from Aug-07 to May-13]
As of August 2013: >405,720,500 indexed records from 10,139 datasets from 493 publishers and spanning a wide range of geospatial, temporal and taxonomic coverages.
http://tinyurl.com/gbifMap
Current GBIF Network Data Coverage
![Page 6: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/6.jpg)
DQ processes in GBIF portal
• Minimum obligatory metadata
• Check geographic values
• Check taxonomic values
![Page 7: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/7.jpg)
Packaging metadata with data
![Page 8: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/8.jpg)
Verbatim data asserted to originate in USA as shared on the network
Geographic attributes
![Page 9: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/9.jpg)
Data following quality check
• Coastal regions recognised
• Offshore islands recognised
Geographic attributes
85% (355/417 mil) georeferenced records
2.7% (9.4 million) georeferenced with issues
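A check of geographic values against the asserted country can be sketched as follows. This is an illustration only and not the GBIF portal's code: the bounding box for the contiguous USA is a rough, hypothetical approximation, and the function and issue names are invented for the example. The sketch tests whether a point falls inside the box and, if not, whether a common data-entry slip (negated sign, swapped latitude and longitude) would explain it.

```python
from typing import Optional, Tuple

# Rough, hypothetical bounding box for the contiguous USA:
# (min_lat, max_lat, min_lon, max_lon). A realistic check would use country
# polygons with a coastal buffer so that coastal and offshore-island records
# are still recognised, as described on the slide.
USA_BOX: Tuple[float, float, float, float] = (24.0, 50.0, -125.0, -66.0)


def in_box(lat: float, lon: float, box=USA_BOX) -> bool:
    """True if the point lies inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def check_coordinates(lat: float, lon: float, box=USA_BOX) -> Optional[str]:
    """Return None if the point is plausible, else a short issue label."""
    if lat == 0 and lon == 0:
        return "zero_coordinates"
    if in_box(lat, lon, box):
        return None
    # Try the usual slips, in order, and report the first one that fits.
    if in_box(lat, -lon, box):
        return "negated_longitude"
    if in_box(-lat, lon, box):
        return "negated_latitude"
    if in_box(lon, lat, box):
        return "swapped_lat_lon"
    return "outside_country"
```

A bounding box is deliberately crude; it is used here only to keep the example self-contained.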
![Page 10: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/10.jpg)
Trochilidae (Hummingbirds)
Using verbatim higher classification
Taxonomic attributes
![Page 11: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/11.jpg)
Taxonomic attributes
Trochilidae (Hummingbirds)
Classification based on authoritative sources
56% of name usages also found in CoL
![Page 12: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/12.jpg)
Authoritative checklists
• Fill gaps in the GBIF taxonomic backbone
• Increase list of known synonyms
• Increase the number of common names known to GBIF
![Page 13: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/13.jpg)
New improved algorithm for GBIF backbone taxonomy
• Some taxa (mainly autonyms) do not have stable IDs
• Too many accepted species created because of lack of a good database of taxonomic synonyms
![Page 14: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/14.jpg)
Working with Catalogue of Life
GBIF backbone taxonomy
Catalogue of Life
Global Species Databases
GBIF ChecklistBank
DwC-A Checklists
![Page 15: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/15.jpg)
Working with Catalogue of Life
GBIF backbone taxonomy
Catalogue of Life
Global Species Databases
GBIF ChecklistBank
DwC-A Checklists
The first two GSDs have already provided annotations:
International Legume Database & Information Service (ILDIS)
• 8188 names annotated
• 6825 rejected names
• 541 placed names (added to ILDIS)
• remaining have syntactical problems (CoL issue, not ILDIS)
Scarabs: World Scarabaeidae Database
• 1339 names annotated
• 0 rejected names
First backbone based on CoL feedback loop expected around December 2013
![Page 16: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/16.jpg)
Data Quality issues
Non-standardised values
Example: dwc:country (http://rs.tdwg.org/dwc/terms/country)
29,052 distinct values for country names
Of these, 18,704 (concerning 2.2 mil records) could not be mapped to an ISO country code.
Typical issues:
• Variants: 126 different values for “Italy”
• Mismappings: taxon names instead of country names
• Incorrect level of detail: sub-national units, non-country geographical entities
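Mapping verbatim country values to ISO codes can be sketched as a normalise-then-look-up step. The variant table below is a tiny, hypothetical sample invented for the example (a real table would need to cover all 126 observed variants of "Italy" and every other country), and the function names are assumptions, not part of any GBIF tool. Values that are taxon names or sub-national units simply fail the lookup and are returned as unmapped.

```python
import unicodedata
from typing import Optional

# Tiny, hypothetical sample of known variants keyed by normalised form.
COUNTRY_VARIANTS = {
    "italy": "IT",
    "italia": "IT",
    "italie": "IT",
    "italien": "IT",
    "it": "IT",
    "ita": "IT",
    "denmark": "DK",
    "danmark": "DK",
}


def normalise(value: str) -> str:
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents)
    return " ".join(cleaned.lower().split())


def to_iso_country(value: Optional[str]) -> Optional[str]:
    """Return an ISO 3166-1 alpha-2 code, or None if the value is unmapped."""
    if not value:
        return None
    return COUNTRY_VARIANTS.get(normalise(value))
```

Keeping the unmapped values (rather than guessing) makes it possible to count them and review the most frequent ones by hand.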
![Page 17: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/17.jpg)
Data Quality issues
Non-standardised values
Example: dwc:basisOfRecord (http://rs.tdwg.org/dwc/terms/basisOfRecord)
625 values that cannot be interpreted at all (accounting for 13.3 mil records)
Typical issues:
• Spelling variants / language variants
• Mismappings
• Misunderstanding definition
30 mil records with no value or “unknown”
Interpretable values quite varied
e.g. 31 values mapped to “observation”, 146 to “specimen”
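The interpretation of basisOfRecord values can be sketched in the same way: many verbatim variants collapse onto a few controlled terms, and everything else is counted as unknown or uninterpretable. The variant table and the four result labels below are assumptions made for the example; they are not the mapping used by GBIF, and a real table would also have to cover language variants.

```python
from collections import Counter
from typing import Iterable, Optional

# Small, hypothetical variant table keyed by letters-only lower-case form.
BASIS_VARIANTS = {
    "observation": "observation",
    "humanobservation": "observation",
    "o": "observation",
    "specimen": "specimen",
    "preservedspecimen": "specimen",
    "s": "specimen",
}


def interpret_basis(value: Optional[str]) -> str:
    """Map a verbatim value to a controlled term or a failure label."""
    if value is None or not value.strip():
        return "unknown"
    # Drop case, spaces, underscores and digits before the lookup.
    key = "".join(c for c in value.lower() if c.isalpha())
    if key == "unknown":
        return "unknown"
    return BASIS_VARIANTS.get(key, "uninterpretable")


def summarise(values: Iterable[Optional[str]]) -> Counter:
    """Count how many records fall under each interpreted term."""
    return Counter(interpret_basis(v) for v in values)
```

Running `summarise` over a dataset gives the kind of tally reported on the slide: how many records are interpretable, empty or "unknown", and not interpretable at all.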
![Page 18: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/18.jpg)
DQ and GBIF Nodes
Desirable improvements
• Better metadata
• Persistent IDs
• Controlled vocabularies
• Annotations
• Independently validated datasets
• Genetic validation of taxonomy
![Page 19: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/19.jpg)
DQ and GBIF Nodes
Implementing improvements
• Collate experiences of all Nodes and share best practices
• Build reusable DQ components (e.g., tools, vocabularies, workflows)
![Page 20: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/20.jpg)
DQ and GBIF Nodes
Next steps
• Expand Data Quality Interest Group
• Establish a collaboration platform
![Page 21: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/21.jpg)
Addressing Data Quality in GBIF Work Programme 2014-2016
![Page 22: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/22.jpg)
GBIF Work Programme 2014-2016
Essential Infrastructure to support Data Quality
• Ensure stable identifiers for datasets and records
• Provide a method for citation of data sets
• Enable annotation of data
![Page 23: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/23.jpg)
Engagement of expert communities to form fitness-for-use working groups
• enhancements to data standards and classes of data in use in GBIF
• criteria and algorithms for evaluating data quality, fitness-for-use, coverage and completeness
• content mobilisation priorities (inc. improving already mobilised data)
• identification and curation of reference data sets
GBIF Work Programme 2014-2016
![Page 24: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/24.jpg)
Guidelines and supporting tools to assess and improve metadata completeness for all data
• Evaluation and reporting on metadata completeness and quality
• Seeking to ensure that the basis of record is clear for each data record
GBIF Work Programme 2014-2016
Criteria from fitness-for-use working groups
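An evaluation of metadata completeness could, in its simplest form, report the share of expected fields that are filled in and list the ones that are missing. The sketch below assumes a flat dictionary of metadata and a hypothetical list of expected fields; the actual criteria are to come from the fitness-for-use working groups and are not defined in this presentation.

```python
from typing import List, Mapping, Tuple

# Hypothetical list of expected metadata fields, chosen only for illustration.
EXPECTED_FIELDS = (
    "title",
    "description",
    "contact",
    "geographic_coverage",
    "temporal_coverage",
    "taxonomic_coverage",
)


def metadata_completeness(metadata: Mapping[str, object]) -> Tuple[float, List[str]]:
    """Return (fraction of expected fields filled, list of missing fields)."""
    # A field counts as missing if it is absent, None, or blank text.
    missing = [
        field
        for field in EXPECTED_FIELDS
        if not str(metadata.get(field) or "").strip()
    ]
    score = 1 - len(missing) / len(EXPECTED_FIELDS)
    return score, missing
```

Returning the list of missing fields alongside the score lets a report tell the publisher what to fix, not just how complete the record is.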
![Page 25: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/25.jpg)
GBIF portal upgrades to report data quality and fitness-for-use for each data set and species
Standards compliance
Metadata completeness
Presence of key data elements
Automated checks for issues and outliers
Endorsements of data publishers and data sets by Nodes, fitness-for-use working groups and other stakeholders
GBIF Work Programme 2014-2016
Criteria from fitness-for-use working groups
![Page 26: BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d9f5503460f94a89f9e/html5/thumbnails/26.jpg)
Thank you
GBIF Secretariat
Universitetsparken 15
DK-2100 Copenhagen Ø
Denmark
www.gbif.org
E-mail: [email protected]
Tel: +45 3532 1470
Fax: +45 3532 1480