Darwin Core extension for germplasm (11th December 2013)
e-Conference on Germplasm Data Interoperability, December 11th 2013. Dag Endresen, GBIF-Norway (UiO).
Why did we make a germplasm extension for Darwin Core?
→ Upgrade germplasm data pathways to use web services
The objective was to enable sharing of germplasm information using the standard web-service based biodiversity data publishing toolkits maintained by the Global Biodiversity Information Facility (GBIF) and the Biodiversity Information Standards (TDWG).
→ Upgrade data types to include trait data
The objective was to expand the germplasm data types published to germplasm data portals from basic passport data to include, in particular, crop trait information.
The compatibility of data standards between PGR and biodiversity collections made it possible to integrate the worldwide germplasm collections into the biodiversity community (TDWG, GBIF).
Potential of the GBIF technology
http://data.gbif.org/datasets/network/2
http://www.gbif.org/network/ae3a42e4-5829-4210-8d8a-84b0cbda47bc
Using GBIF/TDWG technology (and contributing to its development), the PGR community can more easily establish specific PGR networks without duplicating GBIF's work.
2,106,765 records of germplasm data (status 2013)
Genebank dataset
Global Crop Registries
European EURISCO Catalog
European Crop Databases
GBIF
Multiple-purpose data export services
genesys-pgr.org
The GENESYS gateway to genetic resources provides access to information on more than 2.3 million genebank accessions, http://www.genesys-pgr.org/
2,348,549 records of germplasm accessions
The European Genetic Resources Search Catalogue (EURISCO) receives data from the National Inventories (NI) and provides access to all ex situ PGR accessions in Europe, http://eurisco.ecpgr.org
1,074,136 records of germplasm accessions
A total of 64 ECPGR Central Crop Databases have been established by individual institutes and the ECPGR Working Groups. The databases hold passport data and, to varying degrees, characterization and primary evaluation data of the major collections of the respective crops in Europe, http://www.ecpgr.cgiar.org/germplasm_databases/central_crop_databases.html
[Figure: the central crop databases by crop group, with 8, 10, 6, 10, 8 and 22 databases per group; the group labels are not preserved in this transcript.]
Possible Upgraded PGR Network Model
• Each dataset is shared from the holding gene bank.
• The National Inventory (NI) endorses all national gene banks for EURISCO.
• ECPGR Crop databases can access passport data from EURISCO and additional crop-specific data from the gene bank IPT interface.
• Standard data sharing tools ensure that the genebank dataset is available to other relevant decentralized thematic, regional or global networks.
Illustration from the GBIF annual report 2009, page 47.
Background and context
MCPD revisions
1997 2001 2012
May 2009
Some of the data publishing toolkits
• ICIS (Java, 1996 →)
• BioMOBY (Perl, 2001 →)
• EURISCO (tab-delimited, 2003 →)
• DiGIR (PHP, 2001 - 2006)
• TapirLink (PHP, 2007 →)
• BioCASE (Python, 2001 →)
• TAPIR PyWrapper3 (Python, 2006 - 2008)
• GBIF IPT (Java, 2009 →)
Demo project in 2005 using BioCASE
Mapping of MCPD → ABCD v2.06 was required before using BioCASE
• National Inventory Code
• Institute Code
• Accession Number
• Collecting Number
• Collecting Institute Code
• Genus
• Species
• Species Authority
• "Subtaxa"
• "Subtaxa" Authority
• Common Crop Name
• Accession Name
• Acquisition Date
• Country of Origin
• Location of Collection Site
• Latitude of CS
• Longitude of CS
• Elevation of CS
• Collecting Date of Sample
• Breeding Institute Code
• Biological Status of Accession
• Ancestral Data
• Collecting/Acquisition Source
• Donor Institute Code
• Donor Accession Number
• Other Identification (Number) associated with the accession
• Location of Safety Duplicates
• Type of Germplasm Storage
• Remarks
• Decoded Collecting Institute
• Decoded Breeding Institute
• Decoded Donor Institute
• Decoded Safety Duplication Location
• Accession URL
Helmut Knüpffer IPK Gatersleben
Walter Berendsohn BGBM, Berlin
Berendsohn, W. and H. Knüpffer (2004 - 2006). Draft mapping of Eurisco descriptors to ABCD 2.06. Available at http://www.bgbm.org/tdwg/codata/Schema/Mappings/EURISCO-2-ABCD.pdf
Highlighting: green marks a good match, orange an acceptable match, and red no match (these descriptors were included as the PGR extension in ABCD v2.06).
2005: BioCASE demo
Genebank/germplasm extension to the ABCD v2.06
Demo project in 2010 using the GBIF IPT
The Darwin Core germplasm extension was required for meaningful description of germplasm data sets using Darwin Core and the GBIF IPT. It provides a mapping of MCPD terms to Darwin Core, plus some additional terms to describe germplasm:
• breeding/cultivation event (source: MCPD),
• crop trait experiments (source: EPGRIS3/ECPGR),
• and international crop treaty regulations.
The first DRAFT version was released in August 2009.
Mapping of MCPD → Darwin Core was required before using the GBIF IPT
• EURISCO
• NordGen (Nordic countries)
• Bioversity-Montpellier (France)
• IPK Gatersleben (Germany)
• BLE (Germany)
• WUR CGN (The Netherlands)
• CRI (Czech Republic)
• VIR (Russian Federation)
• SeedNET (Balkan)
• Baltic (Estonia, Latvia, Lithuania)
2010: IPT installations for EURISCO
Darwin Core
“The Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, and samples, and related information.”
• a well-defined standard core vocabulary
• a flexible framework to maximize re-usability
• approved as TDWG standard 2009
http://rs.tdwg.org/dwc/
Wieczorek J., D. Bloom, R. Guralnick, S. Blum, M. Döring, R. Giovanni, T. Robertson, D. Vieglais (2012). Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS ONE 7(1): e29715. doi:10.1371/journal.pone.0029715
Darwin Core star schema
Can relate elements one-to-one or one-to-many.
[Diagram: core record linked to the extensions Germplasm, Breeder, Trait and Audubon Core; links labelled 1:1 or 1:many.]
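The one-to-many relation of the star schema can be sketched in Python as grouping extension rows by the identifier of the core record they point to. This is only an illustration: the record values and the `coreid` column name are assumptions, not data from the presentation.

```python
from collections import defaultdict

# Hypothetical core records (one row per accession), keyed by occurrenceID.
core = [
    {"occurrenceID": "acc-1", "catalogNumber": "EX 0001"},
    {"occurrenceID": "acc-2", "catalogNumber": "EX 0002"},
]

# Hypothetical trait extension rows; several rows may point at one core record.
traits = [
    {"coreid": "acc-1", "measurementType": "plant height", "measurementValue": "95"},
    {"coreid": "acc-1", "measurementType": "days to heading", "measurementValue": "62"},
]


def attach_extension(core_rows, ext_rows, core_key="occurrenceID", ext_key="coreid"):
    """Return copies of the core rows, each with a list of its extension rows."""
    by_core = defaultdict(list)
    for row in ext_rows:
        by_core[row[ext_key]].append(row)
    # A core record without extension rows simply gets an empty list.
    return [dict(row, extension=by_core.get(row[core_key], [])) for row in core_rows]


joined = attach_extension(core, traits)
```

Here `joined[0]` carries two trait rows and `joined[1]` carries none, which is the 1:many case from the diagram.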
Darwin Core Archive (DwC-A)
• DwC-A publishes Darwin Core records, including extensions
• Simple text-based format
• Zipped single-file archive
Germplasm.txt
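As a rough sketch of what "simple text-based format, zipped single-file archive" means in practice, the Python below writes a tab-delimited core file, a germplasm extension file and a `meta.xml` descriptor into one zip. The file names, the sample values and the extension `rowType` are assumptions made for illustration; a real archive produced by the IPT will contain more fields and metadata.

```python
import csv
import io
import zipfile

DWC = "http://rs.tdwg.org/dwc/terms/"
G = "http://purl.org/germplasm/germplasmTerm#"

# meta.xml tells a reader which file is the core, which is the extension,
# and which term each column holds. The germplasm rowType is an assumption.
META_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="{DWC}Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="{DWC}institutionCode"/>
    <field index="2" term="{DWC}catalogNumber"/>
  </core>
  <extension encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
             ignoreHeaderLines="1" rowType="{G}GermplasmAccession">
    <files><location>germplasm.txt</location></files>
    <coreid index="0"/>
    <field index="1" term="{G}biologicalStatus"/>
  </extension>
</archive>
"""


def to_tsv(header, rows):
    """Serialize a header and rows as tab-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def build_archive():
    """Return the bytes of a zip holding meta.xml, a core file and an extension file."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("meta.xml", META_XML)
        # Hypothetical accession record, for illustration only.
        zf.writestr("occurrence.txt", to_tsv(
            ["id", "institutionCode", "catalogNumber"],
            [["acc-1", "XXX000", "EX 0001"]]))
        zf.writestr("germplasm.txt", to_tsv(
            ["coreid", "biologicalStatus"],
            [["acc-1", "landrace"]]))
    return out.getvalue()


archive_bytes = build_archive()
```

The extension row refers back to the core row through the shared identifier in column 0, which is how the star schema is carried in flat files.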
The Darwin Core extension for genebanks is an extension to the Darwin Core standard. It provides a mapping between MCPD terms and Darwin Core terms, and it includes additional terms required for describing germplasm resources that were missing in Darwin Core.
• Endresen, D., S. Gaiji, and T. Robertson (2009). Darwin Core Germplasm extension and deployment in the GBIF infrastructure. Proceedings of TDWG 2009, Montpellier, France. Biodiversity Information Standards (TDWG).
• Endresen, D.T.F. and H. Knüpffer (2012). The Darwin Core extension for genebanks opens up new opportunities for sharing genebank data sets. Biodiversity Informatics 8:11-29.
Darwin Core extension for genebanks
Namespace (SKOS/RDF) (stable version) http://purl.org/germplasm/germplasmTerm#
Code repository (stable version) http://code.google.com/p/darwincore-germplasm
Community discussion (development version) http://terms.tdwg.org/wiki/Germplasm
MCPD (2012)              Darwin Core
(missing)                dwc.datasetID
(missing)                dwc.occurrenceID
1 INSTCODE               dwc.institutionCode
2 ACCENUMB               dwc.catalogNumber
3 COLLNUMB               dwc.recordNumber
4 COLLCODE               g.collectingInstituteCode
4.1 COLLNAME             dwc.recordedBy
4.1.1 COLLINSTADDRESS    (dwc.recordedBy)
4.2 COLLMISSID           dwc.collectionCode
5 GENUS                  dwc.genus
6 SPECIES                dwc.specificEpithet
7 SPAUTHOR               dwc.scientificNameAuthorship
8 SUBTAXA                dwc.infraspecificEpithet
9 SUBTAUTHOR             (dwc.scientificNameAuthorship)
10 CROPNAME              dwc.vernacularName
11 ACCENAME              g.breedingIdentifier
12 ACQDATE               g.acquisitionDate
13 ORIGCTY               dwc.countryCode
14 COLLSITE              dwc.locality
15.1 DECLATITUDE         dwc.decimalLatitude
15.2 LATITUDE            dwc.verbatimLatitude
15.3 DECLONGITUDE        dwc.decimalLongitude
15.4 LONGITUDE           dwc.verbatimLongitude
15.5 COORDUNCERT         dwc.coordinateUncertaintyInMeters
15.6 COORDDATUM          dwc.geodeticDatum
15.7 GEOREFMETH          dwc.georeferenceSources
16 ELEVATION             dwc.minimumElevationInMeters
17 COLLDATE              dwc.eventDate
18 BREDCODE              g.breederInstituteID
18.1 BREDNAME            g.breedingInstitute
19 SAMPSTAT              g.biologicalStatus
20 ANCEST                g.ancestralData, g.purdyPedigree
21 COLLSRC               g.acquisitionSource
22 DONORCODE             g.donorInstituteID
22.1 DONORNAME           g.donorInstitute
23 DONORNUMB             g.donorsIdentifier
24 OTHERNUMB             dwc.otherCatalogNumbers
25 DUPLSITE              g.safetyDuplicationInstituteID
25.1 DUPLINSTNAME        g.safetyDuplicationInstitute
26 STORAGE               g.storageCondition
27 MLSSTAT               g.mlsStatus
28 REMARKS               dwc.occurrenceRemarks
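A mapping table like this one lends itself to a simple lookup when converting passport records. The sketch below uses a subset of the rows, written with the colon-prefixed form (`dwc:`, `g:`) used on the later slides; the sample record and the choice to return unmapped descriptors separately are illustrative assumptions, not part of the extension itself.

```python
# Subset of the MCPD (2012) to Darwin Core mapping from the table.
MCPD_TO_DWC = {
    "INSTCODE": "dwc:institutionCode",
    "ACCENUMB": "dwc:catalogNumber",
    "COLLNUMB": "dwc:recordNumber",
    "GENUS": "dwc:genus",
    "SPECIES": "dwc:specificEpithet",
    "CROPNAME": "dwc:vernacularName",
    "ORIGCTY": "dwc:countryCode",
    "DECLATITUDE": "dwc:decimalLatitude",
    "DECLONGITUDE": "dwc:decimalLongitude",
    "SAMPSTAT": "g:biologicalStatus",
    "STORAGE": "g:storageCondition",
    "MLSSTAT": "g:mlsStatus",
}


def mcpd_to_dwc(record):
    """Rename MCPD descriptors to Darwin Core terms; return (mapped, unmapped)."""
    mapped, unmapped = {}, {}
    for key, value in record.items():
        if value in ("", None):
            continue  # skip empty descriptors
        target = MCPD_TO_DWC.get(key.upper())
        if target is None:
            unmapped[key] = value  # keep for manual review instead of dropping
        else:
            mapped[target] = value
    return mapped, unmapped


# Hypothetical passport record, for illustration only.
mapped, unmapped = mcpd_to_dwc({
    "INSTCODE": "XXX000",
    "ACCENUMB": "EX 0001",
    "GENUS": "Hordeum",
    "SAMPSTAT": "300",
    "FOO": "bar",
})
```

Returning the unmapped descriptors makes gaps visible, which matters because some MCPD fields have no direct Darwin Core counterpart.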
Mapping of DwC to MCPD
Data set
dcmitype:Dataset (Darwin Core: Record-‐level terms)
dwc:datasetID (missing in MCPD)
dwc:datasetName (eurisco: NICODE)
dwc:collectionID
dwc:collectionCode mcpd: COLLMISSID
dwc:institutionID
dwc:institutionCode mcpd: INSTCODE
dwc = http://rs.tdwg.org/dwc/terms/
g = http://purl.org/germplasm/germplasmTerm#
dcmitype = http://purl.org/dc/dcmitype/
dsw = http://purl.org/dsw/
mcpd = http://www.bioversityinternational.org/index.php?id=244&tx_news_pi1%5Bnews%5D=1350&cHash=d953e45ada3ab285d635593b5068a38f
epgris3 = http://www.epgris3.eu/docs/activities/2-05/Inclusion%20of%20C&E%20data.pdf
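The prefixes in this legend abbreviate full term URIs. A minimal Python sketch of expanding a prefixed term is shown below; it covers only the four prefixes that are true namespaces, since `mcpd` and `epgris3` in the legend point at documents rather than term namespaces. The function name `expand` is a hypothetical helper.

```python
# Namespace prefixes taken from the legend.
PREFIXES = {
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "g": "http://purl.org/germplasm/germplasmTerm#",
    "dcmitype": "http://purl.org/dc/dcmitype/",
    "dsw": "http://purl.org/dsw/",
}


def expand(term):
    """Expand a prefixed term such as 'dwc:catalogNumber' to its full URI."""
    prefix, sep, local = term.partition(":")
    if not sep or prefix not in PREFIXES or not local:
        raise ValueError(f"cannot expand term: {term!r}")
    return PREFIXES[prefix] + local


catalog_number_uri = expand("dwc:catalogNumber")
biological_status_uri = expand("g:biologicalStatus")
```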
Nomenclature
dwctype:Taxon
dwc:taxonID (missing in MCPD)
dwc:scientificNameID
dwc:scientificName
dwc:genus mcpd: GENUS
dwc:specificEpithet mcpd: SPECIES
dwc:scientificNameAuthorship mcpd: SPAUTHOR, SUBTAUTHOR
dwc:vernacularName mcpd: CROPNAME
Germplasm accession
g:GermplasmAccession (see also: dsw:Specimen)
dwc:occurrenceID (missing in MCPD)
g:germplasmID (epgris3: GENOTYPE_NUMBER)
dwc:catalogNumber mcpd: ACCENUMB
g:germplasmIdentifier mcpd: ACCENAME
g:biologicalStatus mcpd: SAMPSTAT
g:storageCondition mcpd: STORAGE
dwc:otherCatalogNumbers mcpd: OTHERNUMB
dwc:occurrenceDetails (eurisco: ACCEURL)
dwc:occurrenceRemarks mcpd: REMARKS
Collecting event
g:CollectingEvent (dwc:Event, dcmitype:Event)
dwc:eventID
dwc:recordNumber mcpd: COLLNUMB
dwc:decimalLatitude mcpd: DECLATITUDE [geo:lat]
dwc:decimalLongitude mcpd: DECLONGITUDE [geo:long]
dwc:geodeticDatum mcpd: COORDDATUM
dwc:minimumElevationInMeters mcpd: ELEVATION [geo:alt]
dwc:eventDate mcpd: COLLDATE
dwc:locality mcpd: COLLSITE [geo:location]
dwc:countryCode mcpd: ORIGCTY [mcpd: ISO 3166-1 alpha-3]
dwc:verbatimLatitude mcpd: LATITUDE
dwc:verbatimLongitude mcpd: LONGITUDE
dwc:georeferenceSources mcpd: GEOREFMETH
g:collectingInstituteID mcpd: COLLCODE
dwc:recordedBy mcpd: COLLNAME
dwc:eventRemarks
Breeding event
g:BreedingEvent (see also dcmitype:Event)
g:breedingID
g:breedingIdentifier mcpd: ACCENAME
g:breedingYear
g:breedingCountry
g:breedingCountryCode
g:breedingInstituteID mcpd: BREDCODE
g:breedingInstitute mcpd: BREDNAME
g:breedingPerson
g:ancestralData mcpd: ANCEST
g:purdyPedigree (mcpd: ANCEST)
g:breedingRemarks
Acquisition event
g:AcquisitionEvent (see also dcmitype:Event)
g:acquisitionID
g:donorsID
g:donorsIdentifier mcpd: DONORNUMB
g:donorInstituteID mcpd: DONORCODE
g:donorInstitute mcpd: DONORNAME
g:acquisitionDate mcpd: ACQDATE
g:acquisitionSource mcpd: COLLSRC
g:acquisitionRemarks
Safety duplication
g:SafetyDuplication (see also dcmitype:Event)
g:safetyDuplicationID
g:safetyDuplicationDate
g:safetyDuplicationInstituteID mcpd: DUPLSITE
g:safetyDuplicationInstitute mcpd: DUPLINSTNAME
g:safetyDuplicationRemarks
Treaty or legislation
g:TreatyOrRegulation (see also dcmitype:Text, foaf:Document)
g:treatyOrRegulationID
g:treatyOrRegulationName
g:treatyOrRegulationGoverningBody
g:mlsStatus mcpd: MLSSTAT
Measurement method (trait)
g:MeasurementMethod (see also dwc:MeasurementOrFact)
g:measurementMethodID epgris3: TRAIT_NUMBER
g:measurementMethodName epgris3: TRAIT_NAME
g:measurementMethodCategory
g:measurementMethodScale
g:measurementMethodSource
g:measurementMethodRemarks epgris3: TRAIT_REMARK
dwc:measurementType
dwc:measurementMethod
Measurement experiment
g:MeasurementExperiment (see also dcmitype:Event)
g:measurementExperimentID
g:measurementExperimentIdentifier
g:measurementExperimentYear
g:measurementExperimentReport
g:measurementExperimentRemarks
Measurement or fact
dwc:MeasurementOrFact
dwc:measurementID
dwc:measurementValue
dwc:measurementUnit
dwc:measurementAccuracy
dwc:measurementDeterminedDate
dwc:measurementDeterminedBy
g:measurementByInstituteID
g:measurementGrowthStage
Controlled value vocabulary
Biological status type
wild (100) | natural (110) | semiNaturalWild (120) | semiNaturalSown (130) | weedy (200) | landrace (300) | breedingResearchMaterial (400) | breedersLine (410) | syntheticPopulation (411) | hybrid (412) | founderStock (413) | inbredLine (414) | segregatingPopulation (415) | clonalSelection (416) | geneticStock (420) | mutant (421) | cytogeneticStock (422) | otherGeneticStock (423) | advancedCultivar (500) | GMO (600) | otherBiologicalStatus (999)
Acquisition type
wildHabitat (10) | forest (11) | shrubland (12) | grassland (13) | desertOrTundra (14) | aquaticHabitat (15) | cultivatedHabitat (20) | field (21) | orchard (22) | backyard (23) | fallowLand (24) | pasture (25) | farmStore (26) | threshingFloor (27) | park (28) | marketOrShop (30) | instituteOrGenebank (40) | seedCompany (50) | ruderalHabitat (60) | roadside (61) | fieldMargin (62) | otherAcquisition (99)
[Most of these could perhaps be replaced by their respective term from the Environmental Ontology (EnvO).]
Storage type
seedCollection (10) | shortTerm (11) | mediumTerm (12) | longTerm (13) | fieldCollection (20) | inVitro (30) | cryopreserved (40) | DNA (50) | otherStorage (99)
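The numeric codes in these vocabularies can be decoded with plain lookup tables. The Python sketch below covers the biological status and storage vocabularies listed on the slide; the helper names, the choice to return `None` for unknown codes, and the assumption that several storage codes may arrive in one semicolon-separated value are illustrative, not part of the extension.

```python
# Code-to-term tables transcribed from the vocabularies on the slide.
BIOLOGICAL_STATUS = {
    100: "wild", 110: "natural", 120: "semiNaturalWild", 130: "semiNaturalSown",
    200: "weedy", 300: "landrace", 400: "breedingResearchMaterial",
    410: "breedersLine", 411: "syntheticPopulation", 412: "hybrid",
    413: "founderStock", 414: "inbredLine", 415: "segregatingPopulation",
    416: "clonalSelection", 420: "geneticStock", 421: "mutant",
    422: "cytogeneticStock", 423: "otherGeneticStock",
    500: "advancedCultivar", 600: "GMO", 999: "otherBiologicalStatus",
}

STORAGE_TYPE = {
    10: "seedCollection", 11: "shortTerm", 12: "mediumTerm", 13: "longTerm",
    20: "fieldCollection", 30: "inVitro", 40: "cryopreserved", 50: "DNA",
    99: "otherStorage",
}


def decode(code, vocabulary):
    """Return the vocabulary term for a numeric code, or None if unknown."""
    try:
        return vocabulary.get(int(code))
    except (TypeError, ValueError):
        return None


def decode_storage(value):
    """Decode a storage value; assumes multiple codes are separated by ';'."""
    terms = [decode(part.strip(), STORAGE_TYPE) for part in str(value).split(";")]
    return [term for term in terms if term is not None]


status = decode("300", BIOLOGICAL_STATUS)
storage = decode_storage("13;40")
```

Returning `None` rather than falling back to an "other" term keeps unrecognized codes distinguishable from records that were deliberately coded as "other".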
Some proposed additions
In situ conservation (proposed)
IUCNCategory, numberOfSeeds, bioRegion, inSituCountry, inSituRecoveryDateStarted, inSituRecoveryInstitute, inSituRecoveryRemarks
Germplasm distribution
Perhaps add new terms to facilitate the reporting of germplasm distribution and standard material transfer agreements (SMTA) under the International Treaty on Plant Genetic Resources for Food and Agriculture (ITPGRFA).
Germplasm management
The Millennium Seed Bank (Kew) contributed feedback to the DwC-G modeling and proposed to include terminology for seed management.
• Seed processing terms
• Seed cleaning
• Seed germination testing
Germplasm vocabulary of terms (RDF/SKOS) …
… http://purl.org/germplasm/germplasmTerm#
Darwin Core Archive extension for IPT
…
http://rs.gbif.org/extension/germplasm/20120911/GermplasmAccession.xml
Term Wiki
http://terms.tdwg.org/wiki/
Concept Vocabulary (rdf, skos)
Term Wiki For vocabulary development
Resources Repository
1. Mint and maintain concepts and terms, in domain-expert working groups.
2. Release final version as a Concept Vocabulary.
3. Publish at the GBIF Resources Repository.
REUSE terms from published concept vocabularies and ontologies when designing new application schemas such as DwC-A controlled term and value vocabularies.
Workflow for vocabulary management
http://rs.gbif.org/terms/
http://terms.tdwg.org/wiki/
Concept Vocabulary (rdf, skos)
Ontologies (rdf, owl)
Biodiversity ontology development
REUSE terms from concept vocabularies whenever possible.
Biodiversity ontology repository
http://bis.bioportal.bioontology.org/ontologies?filter=BIS
Example: master SKOS/RDF resource
http://rs.gbif.org/terms/dwc/dwc_translations.rdf
[Figure: term labels tagged with the languages en, es, zh and ja.]
• Provide a shared understanding of what we mean when describing biodiversity entities.
• What kind of thing or property.
• A list of things we as a community can agree upon the meaning of.
• “Concept repository” with terms identified by URIs.
Vocabularies/ontologies
TDWG Technical Roadmap 2008 (convened by Roger Hyam). Photo CC-by-3.0 by Hannes Grobe/AWI. Palaeoclimate archives.
• Vocabularies/ontologies are one of the three core components in the TDWG technical architecture.
Vocabulary management
Hyam, R (2006). A technical architecture for TDWG standards.
GBIF, Global Biodiversity Information Facility http://www.gbif.org
TDWG, Biodiversity Information Standards http://www.tdwg.org
BioCASE, The Biological Collection Access Service for Europe http://www.biocase.org
Bioversity International http://www.bioversityinternational.org
NordGen, The Nordic Genetic Resources Center http://www.nordgen.org
“Things can happen in a band, or any type of collaboration, that would not otherwise happen” (Jim Coleman, jazz musician).