2016-12-14 GBIF data publishing. GBIF seminar in Bergen.

Post on 13-Apr-2017



Data publishing and the Darwin Core data standard

GBIF provides a data publishing infrastructure

GBIF provides a service for data discovery (global registry, data portal) that is dependent on resolvable, stable identifiers for efficient functionality.
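One way to keep identifiers stable across repeated exports is to derive them deterministically from a key that never changes, such as a catalogue number. The following is a minimal sketch under that assumption; the namespace URL and catalogue number are hypothetical, and a `urn:uuid` value is stable but only becomes resolvable if the publisher also runs a service that resolves it.

```python
import uuid

# Hypothetical namespace for one collection; any fixed UUID would do.
COLLECTION_NAMESPACE = uuid.uuid5(
    uuid.NAMESPACE_URL, "https://example.org/collection/birds"
)


def stable_occurrence_id(catalog_number: str) -> str:
    """Return the same identifier every time the same record is exported."""
    return uuid.uuid5(COLLECTION_NAMESPACE, catalog_number).urn


first = stable_occurrence_id("O-12345")
second = stable_occurrence_id("O-12345")
print(first == second)  # identical on every run for the same input
```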

Research institute

Biodiversity Conservation

Biodiversity Analysis

GBIF portal

Global information systems

Scientific Research

MULTIPLE-PURPOSE DATA SERVICES

Darwin Core data exchange standard

WHAT IS BIODIVERSITY DATA?

Digital text or multimedia data record detailing facts about the instance of occurrence of an organism, i.e. on the what, where, when, how and by whom of the occurrence and the recording.

BIODIVERSITY DATA TYPES

http://www.gbif.org/publishing-data/summary#datatypes

Checklists (of taxon names)

Occurrences

Metadata (dataset description)

BIODIVERSITY DATA TYPES – SAMPLE DATA

http://www.gbif.org/newsroom/news/sample-based-data

Samples

Introduction of the Event core in March-October 2015

DATA STANDARDS

Slide source: GB23 Nodes Madagascar October 2015 & iDigBio Florida January 2015 - http://www.tdwg.org/standards/

ABCD: Access to Biological Collection Data (2005)

DwC: Darwin Core (2009)

AC: Audubon Core Multimedia Resources Metadata Schema (2013)

NCD: Natural Collections Descriptions (Draft 2008)

EML: Ecological Metadata Language (Ecological Society of America)

Darwin Core – a vocabulary of terms

Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, De Giovanni R, Robertson T, and Vieglais D (2012) Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS ONE 7(1): e29715. (doi:10.1371/journal.pone.0029715)

http://rs.tdwg.org/dwc/terms/

Record-level Terms: dcterms:type | dcterms:modified | dcterms:language | dcterms:rights | dcterms:rightsHolder | dcterms:accessRights | dcterms:bibliographicCitation | dcterms:references | institutionID | collectionID | datasetID | institutionCode | collectionCode | datasetName | ownerInstitutionCode | basisOfRecord | informationWithheld | dataGeneralizations | dynamicProperties

Occurrence: occurrenceID | catalogNumber | recordNumber | recordedBy | individualCount | organismQuantity | organismQuantityType | sex | lifeStage | reproductiveCondition | behavior | establishmentMeans | occurrenceStatus | preparations | disposition | associatedMedia | associatedReferences | associatedSequences | associatedTaxa | otherCatalogNumbers | occurrenceRemarks

Organism: organismID | organismName | organismScope | associatedOccurrences | associatedOrganisms | previousIdentifications | organismRemarks

MaterialSample | LivingSpecimen | PreservedSpecimen | FossilSpecimen: materialSampleID

Event | HumanObservation | MachineObservation: eventID | parentEventID | fieldNumber | eventDate | eventTime | startDayOfYear | endDayOfYear | year | month | day | verbatimEventDate | habitat | samplingProtocol | sampleSizeValue | sampleSizeUnit | samplingEffort | fieldNotes | eventRemarks

Location: locationID | higherGeographyID | higherGeography | continent | waterBody | islandGroup | island | country | countryCode | stateProvince | county | municipality | locality | verbatimLocality | verbatimElevation | minimumElevationInMeters | maximumElevationInMeters | verbatimDepth | minimumDepthInMeters | maximumDepthInMeters | minimumDistanceAboveSurfaceInMeters | maximumDistanceAboveSurfaceInMeters | locationAccordingTo | locationRemarks | verbatimCoordinates | verbatimLatitude | verbatimLongitude | verbatimCoordinateSystem | verbatimSRS | decimalLatitude | decimalLongitude | geodeticDatum | coordinateUncertaintyInMeters | coordinatePrecision | pointRadiusSpatialFit | footprintWKT | footprintSRS | footprintSpatialFit | georeferencedBy | georeferencedDate | georeferenceProtocol | georeferenceSources | georeferenceVerificationStatus | georeferenceRemarks

GeologicalContext: geologicalContextID | earliestEonOrLowestEonothem | latestEonOrHighestEonothem | earliestEraOrLowestErathem | latestEraOrHighestErathem | earliestPeriodOrLowestSystem | latestPeriodOrHighestSystem | earliestEpochOrLowestSeries | latestEpochOrHighestSeries | earliestAgeOrLowestStage | latestAgeOrHighestStage | lowestBiostratigraphicZone | highestBiostratigraphicZone | lithostratigraphicTerms | group | formation | member | bed

Identification: identificationID | identifiedBy | typeStatus | identificationQualifier | dateIdentified | identificationReferences | identificationVerificationStatus | identificationRemarks

Taxon: taxonID | scientificNameID | acceptedNameUsageID | parentNameUsageID | originalNameUsageID | nameAccordingToID | namePublishedInID | taxonConceptID | scientificName | acceptedNameUsage | parentNameUsage | originalNameUsage | nameAccordingTo | namePublishedIn | namePublishedInYear | higherClassification | kingdom | phylum | class | order | family | genus | subgenus | specificEpithet | infraspecificEpithet | taxonRank | verbatimTaxonRank | scientificNameAuthorship | vernacularName | nomenclaturalCode | taxonomicStatus | nomenclaturalStatus | taxonRemarks

ResourceRelationship (Auxiliary Terms): resourceRelationshipID | resourceID | relatedResourceID | relationshipOfResource | relationshipAccordingTo | relationshipEstablishedDate | relationshipRemarks

MeasurementOrFact (Auxiliary Terms): measurementID | measurementType | measurementValue | measurementAccuracy | measurementUnit | measurementDeterminedDate | measurementDeterminedBy | measurementMethod | measurementRemarks
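To show how a handful of these terms describe the what, where, when and by whom of a single occurrence, here is a minimal sketch. All values are invented for illustration, and the helper `qualify` is a hypothetical name; it only prefixes each short term name with the Darwin Core term namespace given above.

```python
DWC = "http://rs.tdwg.org/dwc/terms/"

# Illustrative values only; this is not a real record.
record = {
    "occurrenceID": "example-occ-1",        # stable identifier of the record
    "basisOfRecord": "HumanObservation",    # how the record was made
    "scientificName": "Parus major",        # what
    "recordedBy": "A. Observer",            # by whom
    "eventDate": "2016-12-14",              # when
    "countryCode": "NO",                    # where
    "decimalLatitude": "60.39",
    "decimalLongitude": "5.32",
}


def qualify(rec):
    """Map short term names to full Darwin Core term URIs."""
    return {DWC + term: value for term, value in rec.items()}


for uri, value in qualify(record).items():
    print(uri, "=", value)
```

Only terms from the Darwin Core namespace are used here; the `dcterms:` record-level terms live in the Dublin Core namespace and would need a different prefix.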

DARWIN CORE ARCHIVE (DWC-A)

A DwC-A publishes DwC records, including terms from DwC-A extensions.

Simple text-based format.

Zipped single-file archive.

occurrence.txt

DARWIN CORE ARCHIVE

A Darwin Core Archive (DwC-A) is the text representation of data formatted to Darwin Core. A DwC-A is a compressed file containing a minimum of three files.

http://rs.tdwg.org/dwc/terms/guides/text/index.htm
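The three files usually meant here are a core data file, a `meta.xml` descriptor and an EML metadata document. As a minimal sketch, the archive can be assembled with the Python standard library. The record values are invented, the function names are hypothetical, and the EML document is a bare placeholder that has not been validated against the EML schema, so it would need completing before real publication.

```python
import io
import zipfile

DWC = "http://rs.tdwg.org/dwc/terms/"
TERMS = ["occurrenceID", "basisOfRecord", "scientificName", "eventDate", "countryCode"]
ROWS = [["example-occ-1", "HumanObservation", "Parus major", "2016-12-14", "NO"]]


def occurrence_txt():
    """Tab-delimited core file with one header line."""
    lines = ["\t".join(TERMS)] + ["\t".join(row) for row in ROWS]
    return "\n".join(lines) + "\n"


def meta_xml():
    """Descriptor mapping each column index to a Darwin Core term."""
    fields = "\n".join(
        f'    <field index="{i}" term="{DWC}{term}"/>' for i, term in enumerate(TERMS)
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">\n'
        '  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"\n'
        f'        ignoreHeaderLines="1" rowType="{DWC}Occurrence">\n'
        "    <files><location>occurrence.txt</location></files>\n"
        '    <id index="0"/>\n'
        f"{fields}\n"
        "  </core>\n"
        "</archive>\n"
    )


# Placeholder metadata only (assumption: not schema-valid EML).
EML_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"\n'
    '         packageId="example-dataset" system="example">\n'
    "  <dataset><title>Example occurrence dataset</title></dataset>\n"
    "</eml:eml>\n"
)


def build_archive() -> bytes:
    """Zip the three files into a single in-memory archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.txt", occurrence_txt())
        archive.writestr("meta.xml", meta_xml())
        archive.writestr("eml.xml", EML_XML)
    return buffer.getvalue()


with zipfile.ZipFile(io.BytesIO(build_archive())) as archive:
    print(sorted(archive.namelist()))
```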

STAR SCHEMA EXAMPLE - OCCURRENCE

[Diagram: an Occurrence core linked to the extensions Media, Geographical, Determination and Germplasm. Core and extensions + meta.xml + EML.xml = DwC Archive (Occurrence).]

STAR SCHEMA EXAMPLE - CHECKLIST

[Diagram: a Taxon core linked to the extensions Literature, Description, Occurrences, Vernacular, Distribution and Types. Core and extensions + meta.xml + EML.xml = DwC Archive (Checklist).]

STAR SCHEMA EXAMPLE - EVENT

[Diagram: an Event core linked to the extensions Occurrences, MeasurementOrFact and Relevé. Core and extensions + meta.xml + EML.xml = DwC Archive (Samples).]
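In a star schema, every extension row points back to exactly one core row through the core identifier. As a rough sketch of that idea, the snippet below reads an Event core and an Occurrence extension from tab-delimited text and attaches the occurrences to their events. The data values and the helper names `read_tsv` and `attach_extension` are made up for illustration.

```python
import csv
import io
from collections import defaultdict

EVENT_TXT = (
    "eventID\teventDate\tsamplingProtocol\n"
    "ev-1\t2016-06-01\tpitfall trap\n"
    "ev-2\t2016-06-02\tpitfall trap\n"
)
OCCURRENCE_TXT = (
    "eventID\toccurrenceID\tscientificName\tindividualCount\n"
    "ev-1\tocc-1\tCarabus nemoralis\t3\n"
    "ev-1\tocc-2\tPterostichus niger\t1\n"
    "ev-2\tocc-3\tCarabus nemoralis\t2\n"
)


def read_tsv(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def attach_extension(core_rows, extension_rows, key):
    """Group extension rows under the core row that shares the same key."""
    by_core_id = defaultdict(list)
    for row in extension_rows:
        by_core_id[row[key]].append(row)
    return [{**core, "occurrences": by_core_id[core[key]]} for core in core_rows]


events = attach_extension(read_tsv(EVENT_TXT), read_tsv(OCCURRENCE_TXT), "eventID")
for event in events:
    names = [occ["scientificName"] for occ in event["occurrences"]]
    print(event["eventID"], event["eventDate"], names)
```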

DATA NORMALIZATION

What is data normalization?

Reasons to normalize a database

Normal forms

http://www.essentialsql.com/get-ready-to-learn-sql-database-normalization-explained-in-simple-english/, http://databases.about.com/od/specificproducts/a/normalization.htm, http://www.dotnet-tricks.com/Tutorial/sqlserver/756N210512-Database-Normalization-Basics.html

"Data normalization is the process of reducing data to its canonical form. For instance, Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency" (Wikipedia).

"Denormalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data" (Wikipedia).

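As a small illustration of minimizing redundancy, the sketch below splits a flat table, in which the sampling event details are repeated on every row, into an event table and an occurrence table linked by an `eventID`. The rows are invented, and the generated `ev-N` identifiers are an assumption made for the example only.

```python
FLAT_ROWS = [
    {"eventDate": "2016-06-01", "locality": "Site A",
     "scientificName": "Carabus nemoralis", "individualCount": 3},
    {"eventDate": "2016-06-01", "locality": "Site A",
     "scientificName": "Pterostichus niger", "individualCount": 1},
    {"eventDate": "2016-06-02", "locality": "Site B",
     "scientificName": "Carabus nemoralis", "individualCount": 2},
]


def normalize(flat_rows):
    """Store each event once and let occurrences refer to it by eventID."""
    event_ids = {}   # (eventDate, locality) -> eventID
    occurrences = []
    for row in flat_rows:
        key = (row["eventDate"], row["locality"])
        if key not in event_ids:
            event_ids[key] = f"ev-{len(event_ids) + 1}"
        occurrences.append({
            "eventID": event_ids[key],
            "scientificName": row["scientificName"],
            "individualCount": row["individualCount"],
        })
    events = [
        {"eventID": event_id, "eventDate": date, "locality": locality}
        for (date, locality), event_id in event_ids.items()
    ]
    return events, occurrences


events, occurrences = normalize(FLAT_ROWS)
print(events)
print(occurrences)
```

Joining the two tables back together on `eventID` would reproduce the flat, denormalized form.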
Publish your biodiversity data with GBIF

PUBLISH DATA IN GBIF

Step 1: Data-holding research institutes seek endorsement as an approved data publisher.

Step 2: Datasets are identified and converted to the standard Darwin Core format.

Step 3: Datasets can be published directly from the data node and/or with assistance from a national GBIF node.
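The conversion in Step 2 often starts as a plain column mapping from local field names to Darwin Core terms. The following is a minimal sketch under that assumption: the local column names on the left are hypothetical, and columns without a mapping are reported rather than silently dropped.

```python
# Hypothetical local column names (left) mapped to Darwin Core terms (right).
COLUMN_MAP = {
    "species": "scientificName",
    "collector": "recordedBy",
    "date": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
}


def to_darwin_core(local_row):
    """Rename mapped columns and list the ones that still need a decision."""
    converted = {}
    unmapped = []
    for column, value in local_row.items():
        term = COLUMN_MAP.get(column)
        if term is None:
            unmapped.append(column)
        else:
            converted[term] = value
    return converted, unmapped


row = {"species": "Parus major", "date": "2016-12-14", "lat": "60.39", "notes": "feeder"}
converted, unmapped = to_darwin_core(row)
print(converted)
print("unmapped columns:", unmapped)
```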

Citizen science data platforms also publish in GBIF.

Data publishing guidelines

http://www.gbif.org/resources?f[0]=gr_purpose%3A955

WHAT IS DATA PUBLISHING?

“Publishing” refers to making biodiversity datasets publicly accessible and discoverable, in a standardized form, via an access point, typically a web address (a URL).

IPT

The GBIF Integrated Publishing Toolkit (IPT) is a free, open-source software tool written in Java that is used to publish and share biodiversity datasets through the GBIF network.

http://www.gbif.org/ipt

IPT User Manual:

https://github.com/gbif/ipt/wiki/IPT2ManualNotes.wiki

Robertson T, Döring M, Guralnick R, Bloom D, Wieczorek J, Braak K, Otegui J, Russell L, Desmet P (2014). The GBIF integrated publishing toolkit: Facilitating the efficient publishing of biodiversity data on the internet. PLoS ONE 9(8). doi:10.1371/journal.pone.0102623

DATA PUBLISHING METHODS

DATA PUBLISHING LANDSCAPE

[Timeline figure, axis 2007 to 2015; events as labelled on the slide:]

DiGIR (2001), BioCASE (2001), TapirLink (2007) in use for publishing biodiversity data.

Idea for a simple, compressed text-based file for publishing introduced at TDWG.

GBIF introduces IPT 1.0.

GBIF redevelops IPT with lower memory requirements.

GBIF introduces IPT 2.0.

IPT has more than 100 installations, serving more than 800 datasets.

Nodes and aggregators (including GBIF Norway) begin to install and use IPTs.

Demo/test Event core developed by GBIF and EU BON.

Event core is released for use (October 2015).

Dataset DOIs with DataCite (March 2015).

IPT becomes the dominant data-publishing solution in GBIF.

Research grade: an observation must have media, coordinates, a date, and pass quality metrics, but now the community ID must be finer than family.

Needs ID: Any observation that could become "Research" grade but needs more identifications.

Casual: Any observation that cannot become "Research" grade.
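The three grades above can be read as a small decision rule. The sketch below is one interpretation of the slide text, not the implementation of any particular platform: the field names, the simplified rank ordering, and the reading of "cannot become Research grade" as "is missing media, coordinates, a date, or fails quality metrics" are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Simplified ordering from coarse to fine (assumption for this sketch).
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


@dataclass
class Observation:
    has_media: bool
    has_coordinates: bool
    has_date: bool
    passes_quality_metrics: bool
    community_id_rank: Optional[str] = None  # None when there is no community ID


def quality_grade(obs: Observation) -> str:
    """Classify an observation as Research, Needs ID or Casual."""
    verifiable = (
        obs.has_media
        and obs.has_coordinates
        and obs.has_date
        and obs.passes_quality_metrics
    )
    if not verifiable:
        return "Casual"
    rank = obs.community_id_rank
    if rank is not None and RANKS.index(rank) > RANKS.index("family"):
        return "Research"
    return "Needs ID"


print(quality_grade(Observation(True, True, True, True, "species")))
print(quality_grade(Observation(True, True, True, True, "family")))
print(quality_grade(Observation(False, True, True, True, "species")))
```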

More than 17 million Norwegian occurrence records published in GBIF

DATA PUBLISHING LANDSCAPE: STATUS 2016

The continued GBIF commitment to improving access to biodiversity data.

Refinement and expansion of standards and publishing software.

Evolving social norms.

Most data still published with simple occurrence core.

Portals do not contain the features to support richer data.

Many institutions still need convincing to publish biodiversity data.

http://www.gbif.org/page/82104

Node team at NHM, University of Oslo
Dag Endresen, Node manager
Christian Svindseth, Database manager
Fridtjof Mehlum, Research director
Einar Timdal, Associate professor
Geir Søli, Associate professor
Vidar Bakken, Consultant

Artsdatabanken, Trondheim
Wouter Koch
Nils Valland

NTNU University Museum
Anders Finstad, GBIF Science committee

Research Council of Norway
Per Backe-Hansen, Head of delegation

Contact us at: gbif-drift@nhm.uio.no