2016-12-14 GBIF data publishing. GBIF seminar in Bergen.
Uploaded by dag-endresen
Data publishing and the Darwin Core data standard
GBIF provides a data publishing infrastructure
GBIF provides a service for data discovery
global registry, data portal
that is dependent on resolvable stable identifiers for efficient functionality
Research institute
Biodiversity Conservation
Biodiversity Analysis
GBIF portal
Global information systems
Scientific Research
MULTIPLE-PURPOSE DATA SERVICES
Darwin Core data exchange standard
WHAT IS BIODIVERSITY DATA?
A digital text or multimedia data record detailing facts about an instance of occurrence of an organism, i.e. the what, where, when, how and by whom of the occurrence and the recording.
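As a minimal sketch of the definition above, one occurrence record can be written as a set of Darwin Core terms grouped by the question each term answers. The record values are invented for illustration, and the grouping of terms under "what", "where", "when", "how" and "by whom" is an assumption made for this example, not something defined by Darwin Core.

```python
# Hypothetical occurrence record: every value below is invented for illustration.
occurrence = {
    "occurrenceID": "urn:example:occurrence:0001",  # placeholder identifier
    "scientificName": "Parus major",
    "decimalLatitude": 60.39,
    "decimalLongitude": 5.32,
    "countryCode": "NO",
    "eventDate": "2016-12-14",
    "basisOfRecord": "HumanObservation",
    "recordedBy": "A. Observer",
}

# Assumed grouping of Darwin Core terms by the question they answer.
QUESTION_TERMS = {
    "what": ("scientificName",),
    "where": ("decimalLatitude", "decimalLongitude", "countryCode"),
    "when": ("eventDate",),
    "how": ("basisOfRecord",),
    "by whom": ("recordedBy",),
}


def answer(record, question):
    """Return the terms of `record` that answer the given question."""
    return {term: record[term] for term in QUESTION_TERMS[question] if term in record}


if __name__ == "__main__":
    for question in QUESTION_TERMS:
        print(question, "->", answer(occurrence, question))
```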
BIODIVERSITY DATA TYPES
http://www.gbif.org/publishing-data/summary#datatypes
Checklists (of taxon names)
Occurrences
Metadata (dataset description)
BIODIVERSITY DATA TYPES – SAMPLE DATA
http://www.gbif.org/newsroom/news/sample-based-data
Samples
Introduction of the Event core in March-October 2015
DATA STANDARDS
Slide source: GB23 Nodes Madagascar October 2015 & iDigBio Florida January 2015 - http://www.tdwg.org/standards/
ABCD Access to Biological Collection Data (2005)
DwC Darwin Core (2009)
AC Audubon Core Multimedia Resources Metadata Schema (2013)
NCD Natural Collection Descriptions (Draft 2008)
EML Ecological Metadata Language (Ecological Society of America)
Darwin Core – a vocabulary of terms
Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, De Giovanni R, Robertson T, and Vieglais D (2012) Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS ONE 7(1): e29715. (doi:10.1371/journal.pone.0029715)
http://rs.tdwg.org/dwc/terms/
Record-level Terms: dcterms:type | dcterms:modified | dcterms:language | dcterms:rights | dcterms:rightsHolder | dcterms:accessRights | dcterms:bibliographicCitation | dcterms:references | institutionID | collectionID | datasetID | institutionCode | collectionCode | datasetName | ownerInstitutionCode | basisOfRecord | informationWithheld | dataGeneralizations | dynamicProperties
Occurrence: occurrenceID | catalogNumber | recordNumber | recordedBy | individualCount | organismQuantity | organismQuantityType | sex | lifeStage | reproductiveCondition | behavior | establishmentMeans | occurrenceStatus | preparations | disposition | associatedMedia | associatedReferences | associatedSequences | associatedTaxa | otherCatalogNumbers | occurrenceRemarks
Organism: organismID | organismName | organismScope | associatedOccurrences | associatedOrganisms | previousIdentifications | organismRemarks
MaterialSample | LivingSpecimen | PreservedSpecimen | FossilSpecimen: materialSampleID
Event | HumanObservation | MachineObservation: eventID | parentEventID | fieldNumber | eventDate | eventTime | startDayOfYear | endDayOfYear | year | month | day | verbatimEventDate | habitat | samplingProtocol | sampleSizeValue | sampleSizeUnit | samplingEffort | fieldNotes | eventRemarks
Location: locationID | higherGeographyID | higherGeography | continent | waterBody | islandGroup | island | country | countryCode | stateProvince | county | municipality | locality | verbatimLocality | verbatimElevation | minimumElevationInMeters | maximumElevationInMeters | verbatimDepth | minimumDepthInMeters | maximumDepthInMeters | minimumDistanceAboveSurfaceInMeters | maximumDistanceAboveSurfaceInMeters | locationAccordingTo | locationRemarks | verbatimCoordinates | verbatimLatitude | verbatimLongitude | verbatimCoordinateSystem | verbatimSRS | decimalLatitude | decimalLongitude | geodeticDatum | coordinateUncertaintyInMeters | coordinatePrecision | pointRadiusSpatialFit | footprintWKT | footprintSRS | footprintSpatialFit | georeferencedBy | georeferencedDate | georeferenceProtocol | georeferenceSources | georeferenceVerificationStatus | georeferenceRemarks
GeologicalContext: geologicalContextID | earliestEonOrLowestEonothem | latestEonOrHighestEonothem | earliestEraOrLowestErathem | latestEraOrHighestErathem | earliestPeriodOrLowestSystem | latestPeriodOrHighestSystem | earliestEpochOrLowestSeries | latestEpochOrHighestSeries | earliestAgeOrLowestStage | latestAgeOrHighestStage | lowestBiostratigraphicZone | highestBiostratigraphicZone | lithostratigraphicTerms | group | formation | member | bed
Identification: identificationID | identifiedBy | typeStatus | identificationQualifier | dateIdentified | identificationReferences | identificationVerificationStatus | identificationRemarks
Taxon: taxonID | scientificNameID | acceptedNameUsageID | parentNameUsageID | originalNameUsageID | nameAccordingToID | namePublishedInID | taxonConceptID | scientificName | acceptedNameUsage | parentNameUsage | originalNameUsage | nameAccordingTo | namePublishedIn | namePublishedInYear | higherClassification | kingdom | phylum | class | order | family | genus | subgenus | specificEpithet | infraspecificEpithet | taxonRank | verbatimTaxonRank | scientificNameAuthorship | vernacularName | nomenclaturalCode | taxonomicStatus | nomenclaturalStatus | taxonRemarks
ResourceRelationship (Auxiliary Terms): resourceRelationshipID | resourceID | relatedResourceID | relationshipOfResource | relationshipAccordingTo | relationshipEstablishedDate | relationshipRemarks
MeasurementOrFact (Auxiliary Terms): measurementID | measurementType | measurementValue | measurementAccuracy | measurementUnit | measurementDeterminedDate | measurementDeterminedBy | measurementMethod | measurementRemarks
DARWIN CORE ARCHIVE (DWC-A)
DwC-A publishes DwC records, including terms from DwC-A extensions.
Simple text-based format.
Zipped single-file archive.
occurrence.txt
DARWIN CORE ARCHIVE
A Darwin Core Archive (DwC-A) is the text representation of data formatted to Darwin Core. A DwC-A is a compressed file containing a minimum of three files.
http://rs.tdwg.org/dwc/terms/guides/text/index.htm
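The archive layout described above can be sketched with Python's standard `zipfile` module. This is a minimal illustration, assuming the three files are a data file (`occurrence.txt`), a descriptor (`meta.xml`) and a metadata document (here named `eml.xml`). The record is invented, the `eml.xml` content is only a placeholder rather than valid EML, and the helper names `build_archive` and `list_files` are hypothetical.

```python
import io
import zipfile

# Tab-delimited core data file with one header line and one invented record.
OCCURRENCE_TXT = (
    "occurrenceID\tscientificName\teventDate\n"
    "occ-1\tParus major\t2016-12-14\n"
)

# Descriptor mapping each column of occurrence.txt to a Darwin Core term.
# A raw string keeps the literal \t and \n that the descriptor expects.
META_XML = r"""<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
  </core>
</archive>
"""

# Placeholder only: a real archive carries an EML dataset description here.
EML_XML = "<!-- placeholder for the EML dataset description -->\n"


def build_archive():
    """Return the bytes of a zipped archive holding the three files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.txt", OCCURRENCE_TXT)
        archive.writestr("meta.xml", META_XML)
        archive.writestr("eml.xml", EML_XML)
    return buffer.getvalue()


def list_files(archive_bytes):
    """Return the sorted file names stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return sorted(archive.namelist())


if __name__ == "__main__":
    print(list_files(build_archive()))
```

In practice the IPT generates these files, so a sketch like this is mainly useful for inspecting what an archive contains.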
STAR SCHEMA EXAMPLE - OCCURRENCE
Occurrence Core with extensions: Media, Geographical, Determination, Germplasm
+ meta.xml + EML.xml
DwC Archive: Occurrence
STAR SCHEMA EXAMPLE - CHECKLIST
Taxon Core with extensions: Literature, Description, Occurrences, Vernacular, Distribution, Types
+ meta.xml + EML.xml
DwC Archive: Checklist
STAR SCHEMA EXAMPLE - EVENT
Event Core with extensions: Occurrences, Measurement or Fact
+ meta.xml + EML.xml
DwC Archive: Samples, Relevé
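The Event core star schema can be sketched as two tables linked by `eventID`: each sampling event is stored once in the core, and every occurrence row in the extension points back to its event. The rows, field choices and function names below are assumptions made for illustration, not the layout of any real dataset.

```python
# Flat input: event details are repeated on every occurrence row (invented data).
flat_rows = [
    {"eventID": "ev-1", "eventDate": "2015-06-01", "locality": "Plot A",
     "scientificName": "Betula pubescens"},
    {"eventID": "ev-1", "eventDate": "2015-06-01", "locality": "Plot A",
     "scientificName": "Vaccinium myrtillus"},
    {"eventID": "ev-2", "eventDate": "2015-06-02", "locality": "Plot B",
     "scientificName": "Betula pubescens"},
]

# Fields assumed to belong to the event core rather than to the occurrence.
EVENT_FIELDS = ("eventDate", "locality")


def split_core_and_extension(rows):
    """Split flat rows into an event core and an occurrence extension."""
    events = {}
    occurrences = []
    for row in rows:
        # Each event is stored once, keyed by its eventID.
        events[row["eventID"]] = {field: row[field] for field in EVENT_FIELDS}
        # The extension row keeps only the link and the occurrence data.
        occurrences.append(
            {"eventID": row["eventID"], "scientificName": row["scientificName"]}
        )
    return events, occurrences


def join_core_and_extension(events, occurrences):
    """Rebuild the flat rows by following each eventID back to its event."""
    return [{**occ, **events[occ["eventID"]]} for occ in occurrences]


if __name__ == "__main__":
    events, occurrences = split_core_and_extension(flat_rows)
    print(len(events), "events,", len(occurrences), "occurrences")
```

Storing event details once in the core removes the repetition in the flat rows, and joining the two tables recovers the original rows when a flat view is needed.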
DATA NORMALIZATION
What is data normalization?
Reasons to normalize a database
Normal forms
http://www.essentialsql.com/get-ready-to-learn-sql-database-normalization-explained-in-simple-english/, http://databases.about.com/od/specificproducts/a/normalization.htm, http://www.dotnet-tricks.com/Tutorial/sqlserver/756N210512-Database-Normalization-Basics.html
"Data normalization is the process of reducing data to its canonical form. For instance, database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency" (Wikipedia).
"Denormalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data" (Wikipedia).
Publish your biodiversity data with GBIF
PUBLISH DATA IN GBIF
data publishing
Step 1: Data-holding research institutes seek endorsement as an approved data publisher.
Step 2: Datasets are identified and converted to the standard Darwin Core format.
Step 3: Datasets can be published directly from the data node and/or with assistance from a national GBIF node.
Citizen science data platforms also publish in GBIF.
Data publishing guidelines
http://www.gbif.org/resources?f[0]=gr_purpose%3A955
WHAT IS DATA PUBLISHING?
“Publishing” refers to making biodiversity datasets publicly accessible and discoverable, in a standardized form, via an access point, typically a web address (a URL).
IPT
The GBIF Integrated data Publishing Toolkit (IPT) is a free open source software tool written in Java that is used to publish and share biodiversity datasets through the GBIF network.
http://www.gbif.org/ipt
IPT User Manual:
https://github.com/gbif/ipt/wiki/IPT2ManualNotes.wiki
Robertson T, Döring M, Guralnick R, Bloom D, Wieczorek J, Braak K, Otegui J, Russell L, Desmet P (2014). The GBIF integrated publishing toolkit: Facilitating the efficient publishing of biodiversity data on the internet. PLoS ONE 9(8). doi:10.1371/journal.pone.0102623
DATA PUBLISHING METHODS
DATA PUBLISHING LANDSCAPE
DiGIR (2001), BioCASE (2001), TapirLink (2007) in use for publishing biodiversity data
Idea for a simple, compressed text-based file for publishing introduced at TDWG
GBIF introduces IPT 1.0
GBIF redevelops IPT with lower memory requirements
GBIF introduces IPT 2.0
IPT: more than 100 installations, serving more than 800 datasets
Nodes and aggregators (including GBIF Norway) begin to install and use IPTs
Demo/test Event core developed by GBIF and EU BON
2007 2008 2009 2010 2011 2012 2013 2014 2015
Event core is released for use (October 2015).
Dataset DOIs with DataCite (March 2015).
IPT becomes the dominant data-publishing solution in GBIF.
Research grade: an observation must have media, coordinates, a date, and pass quality metrics, but now the community ID must be finer than family.
Needs ID: Any observation that could become “Research” grade but needs more identifications.
Casual: Any observation that cannot become “Research” grade.
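The three grades above can be read as a small decision rule. The sketch below is one possible reading, not the actual logic of any platform: the `Observation` fields, the simplified rank list, and the assumption that an observation lacking media, coordinates, a date or passing quality metrics can never reach "Research" grade are all introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Simplified, assumed rank order from coarse to fine.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


@dataclass
class Observation:
    has_media: bool
    has_coordinates: bool
    has_date: bool
    passes_quality_metrics: bool
    community_rank: Optional[str] = None  # rank of the community ID, if any


def quality_grade(obs):
    """Classify an observation as 'Research', 'Needs ID' or 'Casual'."""
    verifiable = (
        obs.has_media
        and obs.has_coordinates
        and obs.has_date
        and obs.passes_quality_metrics
    )
    if not verifiable:
        # Assumption: missing evidence means it cannot become Research grade.
        return "Casual"
    finer_than_family = (
        obs.community_rank is not None
        and RANKS.index(obs.community_rank) > RANKS.index("family")
    )
    return "Research" if finer_than_family else "Needs ID"


if __name__ == "__main__":
    print(quality_grade(Observation(True, True, True, True, "species")))
```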
More than 17 million Norwegian occurrence records published in GBIF
DATA PUBLISHING LANDSCAPE: STATUS 2016
The continued GBIF commitment to improving access to biodiversity data.
Refinement and expansion of standards and publishing software.
Evolving social norms.
Most data still published with simple occurrence core.
Portals do not contain the features to support richer data.
Many institutions still need convincing to publish biodiversity data.
http://www.gbif.org/page/82104
Node team at NHM, University of Oslo
Dag Endresen, Node manager
Christian Svindseth, Database manager
Fridtjof Mehlum, Research director
Einar Timdal, Associate professor
Geir Søli, Associate professor
Vidar Bakken, Consultant
Artsdatabanken, Trondheim
Wouter Koch
Nils Valland
NTNU University Museum
Anders Finstad, GBIF Science committee
Research Council of Norway
Per Backe-Hansen, Head of delegation
Contact us at: [email protected]