2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

Data publishing and the Darwin Core data standard

Transcript of 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

Page 1: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

Data publishing and the Darwin Core data standard

Page 2: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

GBIF provides a data publishing infrastructure

Page 3: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

GBIF provides a service for data discovery

global registry, data portal

that is dependent on resolvable stable identifiers for efficient functionality

Page 4: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

Research institute

Biodiversity Conservation

Biodiversity Analysis

GBIF portal

Global information systems

Scientific Research

MULTIPLE-PURPOSE DATA SERVICES

Page 5: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

Darwin Core data exchange standard

Page 6: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

WHAT IS BIODIVERSITY DATA?

Digital text or multimedia data record detailing facts about the instance of occurrence of an organism, i.e. on the what, where, when, how and by whom of the occurrence and the recording.

Page 7: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

BIODIVERSITY DATA TYPES

http://www.gbif.org/publishing-data/summary#datatypes

Checklists (of taxon names)

Occurrences

Metadata (dataset description)

Page 8: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

BIODIVERSITY DATA TYPES – SAMPLE DATA

http://www.gbif.org/newsroom/news/sample-based-data

Samples

Introduction of the Event core in March-October 2015

Page 9: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

DATA STANDARDS

Slide source: GB23 Nodes Madagascar October 2015 & iDigBio Florida January 2015 - http://www.tdwg.org/standards/

ABCD: Access to Biological Collection Data (2005)

DwC: Darwin Core (2009)

AC: Audubon Core Multimedia Resources Metadata Schema (2013)

NCD: Natural Collection Descriptions (Draft 2008)

EML: Ecological Metadata Language (Ecological Society of America)

Page 10: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

Darwin Core – a vocabulary of terms

Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, De Giovanni R, Robertson T, and Vieglais D (2012) Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS ONE 7(1): e29715. (doi:10.1371/journal.pone.0029715)

Page 11: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

http://rs.tdwg.org/dwc/terms/

Record-level Terms: dcterms:type | dcterms:modified | dcterms:language | dcterms:rights | dcterms:rightsHolder | dcterms:accessRights | dcterms:bibliographicCitation | dcterms:references | institutionID | collectionID | datasetID | institutionCode | collectionCode | datasetName | ownerInstitutionCode | basisOfRecord | informationWithheld | dataGeneralizations | dynamicProperties

Occurrence: occurrenceID | catalogNumber | recordNumber | recordedBy | individualCount | organismQuantity | organismQuantityType | sex | lifeStage | reproductiveCondition | behavior | establishmentMeans | occurrenceStatus | preparations | disposition | associatedMedia | associatedReferences | associatedSequences | associatedTaxa | otherCatalogNumbers | occurrenceRemarks

Organism: organismID | organismName | organismScope | associatedOccurrences | associatedOrganisms | previousIdentifications | organismRemarks

MaterialSample | LivingSpecimen | PreservedSpecimen | FossilSpecimen: materialSampleID

Event | HumanObservation | MachineObservation: eventID | parentEventID | fieldNumber | eventDate | eventTime | startDayOfYear | endDayOfYear | year | month | day | verbatimEventDate | habitat | samplingProtocol | sampleSizeValue | sampleSizeUnit | samplingEffort | fieldNotes | eventRemarks

Location: locationID | higherGeographyID | higherGeography | continent | waterBody | islandGroup | island | country | countryCode | stateProvince | county | municipality | locality | verbatimLocality | verbatimElevation | minimumElevationInMeters | maximumElevationInMeters | verbatimDepth | minimumDepthInMeters | maximumDepthInMeters | minimumDistanceAboveSurfaceInMeters | maximumDistanceAboveSurfaceInMeters | locationAccordingTo | locationRemarks | verbatimCoordinates | verbatimLatitude | verbatimLongitude | verbatimCoordinateSystem | verbatimSRS | decimalLatitude | decimalLongitude | geodeticDatum | coordinateUncertaintyInMeters | coordinatePrecision | pointRadiusSpatialFit | footprintWKT | footprintSRS | footprintSpatialFit | georeferencedBy | georeferencedDate | georeferenceProtocol | georeferenceSources | georeferenceVerificationStatus | georeferenceRemarks

GeologicalContext: geologicalContextID | earliestEonOrLowestEonothem | latestEonOrHighestEonothem | earliestEraOrLowestErathem | latestEraOrHighestErathem | earliestPeriodOrLowestSystem | latestPeriodOrHighestSystem | earliestEpochOrLowestSeries | latestEpochOrHighestSeries | earliestAgeOrLowestStage | latestAgeOrHighestStage | lowestBiostratigraphicZone | highestBiostratigraphicZone | lithostratigraphicTerms | group | formation | member | bed

Identification: identificationID | identifiedBy | typeStatus | identificationQualifier | dateIdentified | identificationReferences | identificationVerificationStatus | identificationRemarks

Taxon: taxonID | scientificNameID | acceptedNameUsageID | parentNameUsageID | originalNameUsageID | nameAccordingToID | namePublishedInID | taxonConceptID | scientificName | acceptedNameUsage | parentNameUsage | originalNameUsage | nameAccordingTo | namePublishedIn | namePublishedInYear | higherClassification | kingdom | phylum | class | order | family | genus | subgenus | specificEpithet | infraspecificEpithet | taxonRank | verbatimTaxonRank | scientificNameAuthorship | vernacularName | nomenclaturalCode | taxonomicStatus | nomenclaturalStatus | taxonRemarks

ResourceRelationship (Auxiliary Terms): resourceRelationshipID | resourceID | relatedResourceID | relationshipOfResource | relationshipAccordingTo | relationshipEstablishedDate | relationshipRemarks

MeasurementOrFact (Auxiliary Terms): measurementID | measurementType | measurementValue | measurementAccuracy | measurementUnit | measurementDeterminedDate | measurementDeterminedBy | measurementMethod | measurementRemarks
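To illustrate how the vocabulary is used, a single occurrence record can be treated as a set of term/value pairs covering the what, where, when, how and by whom. The Python sketch below is only an illustration: the record values, the small `KNOWN_TERMS` subset and the helper name `unknown_terms` are assumptions made for this example and are not part of the Darwin Core standard itself.

```python
# A small subset of Darwin Core terms (not the full vocabulary).
KNOWN_TERMS = {
    "occurrenceID", "basisOfRecord", "scientificName", "eventDate",
    "recordedBy", "decimalLatitude", "decimalLongitude", "countryCode",
}

# Hypothetical record: all values are invented for illustration only.
record = {
    "occurrenceID": "urn:example:occurrence:1",  # what (identifier)
    "basisOfRecord": "HumanObservation",         # how
    "scientificName": "Parus major",             # what
    "eventDate": "2016-12-14",                   # when
    "recordedBy": "A. Observer",                 # by whom
    "decimalLatitude": "60.39",                  # where
    "decimalLongitude": "5.32",                  # where
    "countryCode": "NO",                         # where
}


def unknown_terms(rec, known=KNOWN_TERMS):
    """Return the keys of rec that are not in the known term set, sorted."""
    return sorted(key for key in rec if key not in known)


print(unknown_terms(record))
print(unknown_terms({"scientificName": "Parus major", "lattitude": "60.39"}))
```

A check like this catches misspelled column names before a dataset is mapped to the standard.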

Page 12: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

DARWIN CORE ARCHIVE (DWC-A)

DwC-A publishes DwC records, including terms from DwC-A extensions.

Simple text-based format.

Zipped single-file archive.

occurrence.txt

Page 13: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

DARWIN CORE ARCHIVE

A Darwin Core Archive (DwC-A) is the text representation of data formatted to Darwin Core. A DwC-A is a compressed file containing a minimum of three files.

http://rs.tdwg.org/dwc/terms/guides/text/index.htm
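As a rough illustration of the three-file layout, the Python sketch below builds a tiny archive in memory with a data file (occurrence.txt), a descriptor (meta.xml) and a metadata file (eml.xml). It is a sketch under stated assumptions: the record values are invented, the function names are hypothetical, and the EML stub is a placeholder rather than a schema-valid EML document.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

DWC = "http://rs.tdwg.org/dwc/terms/"
TERMS = ["occurrenceID", "scientificName", "eventDate"]


def build_meta_xml(terms):
    """Describe the columns of occurrence.txt as Darwin Core term URIs."""
    archive = ET.Element("archive", {
        "xmlns": "http://rs.tdwg.org/dwc/text/",
        "metadata": "eml.xml",
    })
    core = ET.SubElement(archive, "core", {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": "\\t",   # literal backslash-t, as the descriptor expects
        "linesTerminatedBy": "\\n",
        "ignoreHeaderLines": "1",
        "rowType": DWC + "Occurrence",
    })
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = "occurrence.txt"
    ET.SubElement(core, "id", {"index": "0"})
    for index, term in enumerate(terms):
        ET.SubElement(core, "field", {"index": str(index), "term": DWC + term})
    return ET.tostring(archive, encoding="unicode")


def build_eml_stub(title):
    """Placeholder metadata; a real archive needs a full EML document."""
    eml = ET.Element("eml")
    dataset = ET.SubElement(eml, "dataset")
    ET.SubElement(dataset, "title").text = title
    return ET.tostring(eml, encoding="unicode")


def build_archive(rows):
    """Return the bytes of a zip archive holding the three files."""
    lines = ["\t".join(TERMS)] + ["\t".join(row) for row in rows]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("occurrence.txt", "\n".join(lines) + "\n")
        zf.writestr("meta.xml", build_meta_xml(TERMS))
        zf.writestr("eml.xml", build_eml_stub("Example dataset"))
    return buffer.getvalue()


# Invented example row.
archive = build_archive([("urn:example:1", "Parus major", "2016-12-14")])
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    print(sorted(zf.namelist()))
```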

Page 14: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

STAR SCHEMA EXAMPLE - OCCURRENCE

Media

Occurrence Core

Geographical

Determination

meta.xml

EML.xml

+

DwC Archive Occurrence

Germplasm

Page 15: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

STAR SCHEMA EXAMPLE - CHECKLIST

Literature

Taxon Core

Description

Occurrences

meta.xml

EML.xml

+

DwC Archive Checklist

Vernacular

Distribution

Types

Page 16: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

STAR SCHEMA EXAMPLE - EVENT

Event Core

Occurrences

MeasurementOrFact

meta.xml

EML.xml

+

DwC Archive Samples Relevé
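The star schema can be sketched in a few lines of Python: each row of an extension file points back to one row of the core file through a shared identifier. The example below assumes an Event core with an Occurrence extension linked by eventID; the file contents and the function names `read_tsv` and `attach_extension` are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented sample data: a core file and one extension file, tab-delimited.
EVENT_TXT = (
    "eventID\teventDate\tsamplingProtocol\n"
    "E1\t2016-06-01\tpitfall trap\n"
    "E2\t2016-06-02\tpitfall trap\n"
)
OCCURRENCE_TXT = (
    "eventID\toccurrenceID\tscientificName\n"
    "E1\tO1\tCarabus nemoralis\n"
    "E1\tO2\tPterostichus niger\n"
    "E2\tO3\tCarabus nemoralis\n"
)


def read_tsv(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def attach_extension(core_rows, extension_rows, key="eventID"):
    """Group extension rows under the core row that shares the same key."""
    grouped = defaultdict(list)
    for row in extension_rows:
        grouped[row[key]].append(row)
    return {
        row[key]: {"core": row, "occurrences": grouped[row[key]]}
        for row in core_rows
    }


star = attach_extension(read_tsv(EVENT_TXT), read_tsv(OCCURRENCE_TXT))
for event_id, entry in star.items():
    names = [occ["scientificName"] for occ in entry["occurrences"]]
    print(event_id, entry["core"]["eventDate"], names)
```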

Page 17: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

DATA NORMALIZATION

What is data normalization?

Reasons to normalize a database

Normal forms

http://www.essentialsql.com/get-ready-to-learn-sql-database-normalization-explained-in-simple-english/, http://databases.about.com/od/specificproducts/a/normalization.htm, http://www.dotnet-tricks.com/Tutorial/sqlserver/756N210512-Database-Normalization-Basics.html

"Data normalization is the process of reducing data to its canonical form. For instance, Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency" (Wikipedia).

"Denormalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data" (Wikipedia).
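A small Python sketch may make the two definitions concrete. It assumes a flat table in which event details are repeated on every occurrence row, splits it into an event table and an occurrence table (normalization), and then joins them back (denormalization). The sample rows and the function names are invented for this example.

```python
# Invented flat table: event fields are repeated for every occurrence.
flat = [
    {"eventID": "E1", "eventDate": "2016-06-01", "locality": "Bergen",
     "scientificName": "Carabus nemoralis"},
    {"eventID": "E1", "eventDate": "2016-06-01", "locality": "Bergen",
     "scientificName": "Pterostichus niger"},
    {"eventID": "E2", "eventDate": "2016-06-02", "locality": "Oslo",
     "scientificName": "Carabus nemoralis"},
]

EVENT_FIELDS = ("eventID", "eventDate", "locality")


def normalize(rows):
    """Split flat rows into one row per event plus one row per occurrence."""
    events = {}
    occurrences = []
    for row in rows:
        events[row["eventID"]] = {field: row[field] for field in EVENT_FIELDS}
        occurrences.append({
            "eventID": row["eventID"],
            "scientificName": row["scientificName"],
        })
    return list(events.values()), occurrences


def denormalize(events, occurrences):
    """Join each occurrence with its event to rebuild the flat rows."""
    by_id = {event["eventID"]: event for event in events}
    return [{**by_id[occ["eventID"]], **occ} for occ in occurrences]


events, occurrences = normalize(flat)
print(len(events), "events,", len(occurrences), "occurrences")
print(denormalize(events, occurrences) == flat)
```

In the normalized form each event date and locality is stored once, which is the same idea as the core and extension files of a Darwin Core Archive.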

Page 18: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

Publish your biodiversity data with GBIF

Page 19: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

PUBLISH DATA IN GBIF

Step 1: Data-holding research institutes seek endorsement as approved data publishers.

Step 2: Datasets are identified and converted to the standard Darwin Core format.

Step 3: Datasets can be published directly from the data node and/or with assistance from a national GBIF node.

Citizen science data platforms also publish in GBIF.
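The conversion in Step 2 often comes down to renaming local column headings to Darwin Core terms. The Python sketch below shows one way this might look; the local column names, the `COLUMN_MAP` mapping and the function name `to_darwin_core` are assumptions made for this example, and the sample row is invented.

```python
import csv
import io

# Hypothetical mapping from a local spreadsheet's columns to Darwin Core terms.
COLUMN_MAP = {
    "id": "occurrenceID",
    "species": "scientificName",
    "date": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "collector": "recordedBy",
}


def to_darwin_core(local_csv_text):
    """Rewrite comma-separated local data as tab-delimited Darwin Core text."""
    reader = csv.DictReader(io.StringIO(local_csv_text))
    output = io.StringIO()
    writer = csv.DictWriter(
        output,
        fieldnames=list(COLUMN_MAP.values()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        # Missing local columns become empty values rather than errors.
        writer.writerow({dwc: row.get(local, "") for local, dwc in COLUMN_MAP.items()})
    return output.getvalue()


# Invented sample input.
local = (
    "id,species,date,lat,lon,collector\n"
    "1,Parus major,2016-12-14,60.39,5.32,A. Observer\n"
)
print(to_darwin_core(local))
```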

Page 20: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

Data publishing guidelines

http://www.gbif.org/resources?f[0]=gr_purpose%3A955

Page 21: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

WHAT IS DATA PUBLISHING?

“Publishing” refers to making biodiversity datasets publicly accessible and discoverable, in a standardized form, via an access point, typically a web address (a URL).

IPT

Page 22: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

The GBIF Integrated Publishing Toolkit (IPT) is a free open source software tool written in Java that is used to publish and share biodiversity datasets through the GBIF network.

http://www.gbif.org/ipt

IPT User Manual:

https://github.com/gbif/ipt/wiki/IPT2ManualNotes.wiki

Robertson T, Döring M, Guralnick R, Bloom D, Wieczorek J, Braak K, Otegui J, Russell L, Desmet P (2014). The GBIF integrated publishing toolkit: Facilitating the efficient publishing of biodiversity data on the internet. PLoS One 9(8). doi:10.1371/journal.pone.0102623

Page 23: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

DATA PUBLISHING METHODS

Page 24: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

DATA PUBLISHING LANDSCAPE

Timeline axis: 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015

DiGIR (2001), BioCASE (2001) and TapirLink (2007) in use for publishing biodiversity data.

Idea for a simple, compressed text-based file for publishing introduced at TDWG.

GBIF introduces IPT 1.0.

GBIF redevelops IPT with lower memory requirements.

GBIF introduces IPT 2.0.

IPT: more than 100 installations, serving more than 800 datasets.

Nodes and aggregators (including GBIF Norway) begin to install and use IPTs.

Demo/test Event core developed by GBIF and EU BON.

Event core is released for use (October 2015).

Dataset DOIs with DataCite (March 2015).

IPT becomes the dominant data-publishing solution in GBIF.

Page 25: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

Research grade: an observation must have media, coordinates, a date, and pass quality metrics, but now the community ID must be finer than family.

Needs ID: Any observation that could become “Research” grade but needs more identifications.

Casual: Any observation that cannot become “Research” grade.
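The three grades can be read as a simple decision rule. The Python sketch below is a simplified reading of the definitions above, not the actual algorithm of any platform; the function name `quality_grade`, its parameters and the `RANKS` ordering are assumptions made for this example.

```python
# Taxonomic ranks from coarse to fine; a simplified ordering assumed for this sketch.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def quality_grade(has_media, has_coordinates, has_date,
                  passes_quality_metrics, community_id_rank):
    """Classify an observation as "Research", "Needs ID" or "Casual".

    community_id_rank is the rank of the community identification,
    or None when no community identification exists yet.
    """
    complete = (has_media and has_coordinates and has_date
                and passes_quality_metrics)
    if not complete:
        # Missing evidence means the record cannot reach Research grade.
        return "Casual"
    finer_than_family = (
        community_id_rank is not None
        and RANKS.index(community_id_rank) > RANKS.index("family")
    )
    if finer_than_family:
        return "Research"
    # Complete record whose identification is still too coarse.
    return "Needs ID"


print(quality_grade(True, True, True, True, "species"))
print(quality_grade(True, True, True, True, "family"))
print(quality_grade(False, True, True, True, "species"))
```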

More than 17 million Norwegian occurrence records published in GBIF

Page 26: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

DATA PUBLISHING LANDSCAPE: STATUS 2016

The continued GBIF commitment to improving access to biodiversity data.

Refinement and expansion of standards and publishing software.

Evolving social norms.

Most data still published with simple occurrence core.

Portals do not contain the features to support richer data.

Many institutions still need convincing to publish biodiversity data.

http://www.gbif.org/page/82104

Page 27: 2016 -12-14 GBIF data publishing. GBIF seminar in Bergen.

Node team at NHM, University of Oslo

Dag Endresen, Node manager

Christian Svindseth, Database manager

Fridtjof Mehlum, Research director

Einar Timdal, Associate professor

Geir Søli, Associate professor

Vidar Bakken, Consultant

Artsdatabanken, Trondheim

Wouter Koch

Nils Valland

NTNU University Museum

Anders Finstad, GBIF Science committee

Research Council of Norway

Per Backe-Hansen, Head of delegation

Contact us at: [email protected]