doi>
RAL, 16 Feb 2005
Norman Paskin, International DOI Foundation
Digital Object Identifiers for Science Data
doi>Outline
• What is a DOI
• Persistent identification and resolution
• Functional components
• Example applications in science data (2)
• Open Q&A
• Slides are available electronically
– some slides are hidden (run in presentation view)
– sequence of builds 29 - 49
• References are in penultimate slide
– See especially CODATA 2004 article
• Print handout of Internet Registries article
• Possible other topics for discussion: semantic interoperability (data integration)
doi>Digital Object Identifier = DOI
• A name (not a location) for an entity on digital networks AND/OR
• A system for persistent and actionable identification and interoperable exchange of managed information on digital networks
– Standards-based components
• Developed as cross-industry, cross-sector, not-for-profit effort managed by an open membership collaborative development body
– International DOI Foundation (IDF)
• In widespread use now:
– Over 15 million assigned, over 1000 naming authorities (users)
– Key feature of scientific primary publishing as part of CrossRef system
– Adopted for government documents (EC, OECD, UK, etc)
• In use, it is a “behind the scenes” mechanism
– e.g. looks like a URL in a web context
• One application is an interoperable common system for identifying science data; two projects are considered as examples:
– TIB project (citation of primary data sets)
– Names for Life (biological taxonomy)
Identification and resolution
• Identifier:
• A unique label for an entity involved in a transaction
• Note the ambiguity of “identifier”:
– Label alone (e.g. ISBN)
– Specification alone (e.g. URN)
– Implemented specification (e.g. DOI, Bar code)
• Ideally, persistent
• Ideally, actionable …
• Resolution:
• The process in which an identifier is the input (a request) to a network service to receive in return a specific output
• Both concepts are in principle neutral as to technology implementation
• Abstract concepts, but implementations typically at least “internet” TCP/IP (the more general the better, e.g. not just “Web”)
Persistence
• "It is intended that the lifetime of a [persistent identifier] be permanent. That is, the [persistent identifier] will be globally unique forever, and may well be used as a reference to a resource well beyond the lifetime of the resource it identifies or of any naming authority involved in the assignment of its name."
• [Persistent Identifier] = URN in IETF RFC 1737: Functional Requirements for Uniform Resource Names. (http://www.ietf.org/rfc/rfc1737.txt)
Technical and social infrastructure issues
Interoperability
• Persistence is one dimension of interoperability:
– “persistence is interoperability with the future”
• We know what we mean, but others may not.
– Identifiers assigned in one context may be encountered, and may be re-used, in another place (or time), without consulting the assigner. You can’t assume that your assumptions will be known to someone else.
– Interoperability = the possibility of use in services outside the direct control of the issuing assigner
• Identifiers may be opaque or may be meaningful, but meaning only makes sense in context
– Normally, opaque string is the safest assumption
– User communities define rules (social infrastructure; namespaces)
– Interoperability guarantees others can interpret even if they do not know rules
• NB: The recent chemistry identifier proposal INChI is an interesting meaningful identifier
Two principles for persistent identification
1. Obvious: Assign ID to resource
Once assigned, the number must identify the same resource
– Beyond the lifetime of the resource, or the assigner
2. Less obvious: Assign Resource to ID
The resource must be “identified”
– Must ensure it is always the same thing (bound)
– Describe the resource “content” [with precision]
– Failure to do this will ultimately break interoperability
How far do we go in each? Depends on what we think is “good enough”
– Technologists have focussed on (1) [and “bags of bits/data structures”]
– The content/rights world has focussed on (2) [and “intellectual content”]
– Both viewpoints valid
– (2) is now becoming more relevant
Resolution and “What are we identifying?”
• Resolution: The process in which an identifier is the input (a request) to a network service to receive in return a specific output
• “Point and click” is what I do (URL model), so:
• “what I point to (resolve to and get) is what is identified”, right?
• No
– Point and click “get” is not referencing
– Can identify but not “get” directly things that are intangible (works), or fugitive (performances), or that change (“Today’s NY Times”), or people and concepts…
– Pointing and clicking can return different things in different contexts, or give multiple options
• Identifier identifies an entity. Pointing and clicking is a service about that entity
• Entities can be physical, abstract, tangible, intangible, things, people, concepts, instances, …
• Resolution provides a mechanism to describe the resource “content” through a service which delivers a description
What are we identifying?
Document on screen:
– Abstract work?
– Manifestation of abstract work?
– Version?
– This HTML file?
– All/some of these?
Identification and resolution
• Resolvable identifiers must specify:
– Agreed numbering syntax
– Resolution mechanism
– Data model to define “what it is we are identifying”
– Technical and social infrastructure to implement
• (compare physical world bar codes)
• These could be assembled ad hoc, or offered as a packaged system (e.g. DOI)
• DOI is the combination of these four components: numbering scheme, Internet resolution, data model, policies
doi>DOI syntax can include any existing identifier “label”, formal or informal, of any entity
• An identifier “container”, e.g.
– 10.1234/NP5678
– 10.5678/ISBN-0-7645-4889-4
– 10.2224/2004-10-ISO-DOI
• NISO Z39.84, DOI Syntax
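The “container” idea can be sketched in a few lines of Python. This is a minimal illustration, not a validator for the full Z39.84 syntax: it assumes only that a DOI has a prefix beginning with "10." and a suffix after the first "/", and that the suffix may carry any existing label. The names `Doi` and `parse_doi` are made up for this sketch.

```python
from typing import NamedTuple


class Doi(NamedTuple):
    prefix: str  # naming authority part, e.g. "10.1234"
    suffix: str  # label chosen by the registrant, e.g. "NP5678"


def parse_doi(text: str) -> Doi:
    """Split a DOI string into prefix and suffix at the first "/"."""
    text = text.strip()
    # Tolerate a leading "doi:" label, as used in citations.
    if text.lower().startswith("doi:"):
        text = text[4:].strip()
    prefix, sep, suffix = text.partition("/")
    if not sep or not suffix:
        raise ValueError(f"no suffix found in {text!r}")
    if not prefix.startswith("10.") or len(prefix) <= 3:
        raise ValueError(f"prefix must start with '10.': {prefix!r}")
    return Doi(prefix, suffix)


for example in ("10.1234/NP5678",
                "10.5678/ISBN-0-7645-4889-4",
                "10.2224/2004-10-ISO-DOI"):
    print(parse_doi(example))
```

Only the first "/" is treated as a separator, so a suffix is free to contain an ISBN, a date-based label, or further slashes.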
doi>Internet resolution allows a DOI to link to any & multiple pieces of current data
• Resolve from DOI to data
– initially to location (URL) – persistence
• May be to multiple data:
– Multiple locations
– Metadata
– Services
– Extensible, user-defined
• Uses the Handle system
– Implementing URI/URN concept
– Running on TCP/IP (common co-inventor)
– IETF RFCs 3650, 3651, 3652
– To be in GRID Globus tool kit
– Full Unicode compliance
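Multiple resolution can be pictured as one identifier bound to a set of typed values. The sketch below is a toy in-memory model, not the Handle System protocol or its client API; the record contents, the type name "METADATA" and the example.org addresses are invented for illustration.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Value:
    index: int  # position of the value within the record
    type: str   # kind of data held, e.g. "URL"
    data: str


# Hypothetical record: one DOI bound to several typed values.
RECORDS: Dict[str, List[Value]] = {
    "10.1234/NP5678": [
        Value(1, "URL", "http://example.org/papers/np5678"),
        Value(2, "URL", "http://mirror.example.org/np5678"),
        Value(3, "METADATA", "http://example.org/meta/np5678.xml"),
    ],
}


def resolve(doi: str, wanted_type: Optional[str] = None) -> List[Value]:
    """Return every value bound to the DOI, optionally filtered by type."""
    values = RECORDS.get(doi, [])
    if wanted_type is None:
        return list(values)
    return [v for v in values if v.type == wanted_type]


def update_value(doi: str, index: int, new_data: str) -> None:
    """Change what the DOI points to; the DOI itself never changes."""
    RECORDS[doi] = [
        replace(v, data=new_data) if v.index == index else v
        for v in RECORDS[doi]
    ]


print([v.data for v in resolve("10.1234/NP5678", "URL")])
```

Persistence in this picture comes from `update_value`: when the data moves, the record is edited and every citation of the DOI keeps working.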
doi><indecs> Data Dictionary + DOI AP framework
• DOI Data Model = Metadata tools:
– a data dictionary to define +
– a grouping mechanism to relate
• Necessary for interoperability
– “Enabling information that originates in one context to be used in another in ways that are as highly automated as possible”.
• Able to use existing metadata
– Mapped using a standard dictionary
– Can describe any entity at any level of granularity
– indecsDD which incorporates ISO MPEG 21 RDD
• IDF is the MPEG21 RDD registration authority
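Mapping through a shared dictionary can be sketched as follows. This is an illustration only: `creator` and `title` are Dublin Core element names, but the "proprietary" scheme and the `dd:` terms are invented here and are not actual entries of the <indecs> Data Dictionary.

```python
from typing import Dict

# Each scheme maps its own field names to shared dictionary terms.
# The "dd:" terms are hypothetical placeholders, not real iDD entries.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "dublin_core": {"creator": "dd:creator", "title": "dd:title"},
    "proprietary": {"author_name": "dd:creator", "work_title": "dd:title"},
}


def to_dictionary_terms(scheme: str, record: Dict[str, str]) -> Dict[str, str]:
    """Re-key a record from its own scheme to the shared dictionary."""
    mapping = MAPPINGS[scheme]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def translate(record: Dict[str, str], source: str, target: str) -> Dict[str, str]:
    """Convert a record between two schemes via the shared dictionary."""
    common = to_dictionary_terms(source, record)
    reverse = {term: name for name, term in MAPPINGS[target].items()}
    return {reverse[t]: v for t, v in common.items() if t in reverse}


dc_record = {"creator": "Weather", "title": "Weather in Hannover for 2003"}
print(translate(dc_record, "dublin_core", "proprietary"))
```

Each scheme needs only one mapping, to the dictionary, rather than one mapping per other scheme; that is the practical point of a common dictionary.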
doi>DOI policies allow any model for practical implementations
• Implementation through IDF
– Governance and agreed scope, policy, “rules of the road”
– Technical infrastructure: resolution mechanism, proxy servers, mirrors, back-up, central dictionary
– Social infrastructure: persistence commitments, fall-back procedures, cost-recovery (self-sustaining), shared use of system
– Not a standard but a Registration Authority/maintenance agency
• IDF delegates through Registration Agencies
– Each can develop own applications
– Use in “own brand” ways appropriate for their community (e.g. CrossRef)
doi>DOI combination of components
Resolve
The Handle resolution technology allows you to access any kind of Service associated with your DOI, e.g. services can include metadata services
Identify
DOI syntax can include any existing identifier, formal or informal, of any entity, e.g.
– 10.2341/0-7645-4889-1
– 10.5678/978-0-7645-4889-4
– 10.1000/ISBN 0764548891
– 10.1234/Norman_presentation
– 10.2224/2004-10-28-ISO-DOI
Describe
DOI metadata can be of any type, standard or proprietary, e.g.
– OnixForBooks
– OnixForSerials
– IEEE/LOM
– MARC
– Dublin Core
– Proprietary scheme
A package of services is an Application Profile
(to interoperate with anyone else in the DOI network, map to the <indecs> Data Dictionary (iDD))
doi>DOI and scientific data
• DOI is already the core technology for maintaining cross-reference
– persistent links between a citation and internet access to article
• CrossRef system used by 350+ publishers representing bulk of STM articles (as pre-publication link builder), both for profit and not for profit, OA: www.crossref.org
• 9,000 DOIs per day added to CrossRef
– Over 12 million DOIs now registered with CrossRef
– Over 850,000 assigned to books and conference proceedings
• Several projects suggested to IDF using DOIs for data (not connected with CrossRef)
– physico-chemical property data; biological microscopy images
– See Paskin, ICSTI 2002 paper
• Some sectors have developed their own identifiers
– e.g. Life Science Identifier (I3C/IBM): simple URN mechanism, non-generic, non-global, but very useful in bio-informatics
– These can be incorporated into a DOI if needed to make globally interoperable and extensible
• Two projects in particular have developed DOI applications:
doi>(1) TIB: Citation of Primary Data
• Problem: re-use of existing data sets
– Attribution of data source: make data publications citable in a standard way (cf. articles Citation Index)
– Archiving of data in context so as to be discoverable and interoperable (usable by others)
• Background
– CODATA National Committee WG, grant-aided by DFG (Sept 2001 to May 2002): Report "Concept of Citing Scientific Primary Data"
– Continuation as project for pilot implementation funded by DFG Oct 2003 to Oct 2005 at TIB (German National Library of Science & Technology)
– Development of DOI registration agency for Data
• Solution:
– DOIs for data sets, with associated metadata
– Core management metadata applicable to all datasets
– Structured metadata extensible to specific science disciplines
– Follows principles of DOI Data Model
doi>(1) Citation of Primary Data: illustration of solution
• During her research for the World Data Center Climate (WDCC) Dr. Weather gathers primary data about the weather in Hannover in the year 2003.
– Primary data is tested, evaluated, stored and administered at the WDCC
– Primary data is registered and allocated a DOI at the TIB
– With quality control of metadata, no change once allocated, etc
• Dr. Weather can now cite this with a resolvable DOI, e.g. DOI:10.1594/WDCC/W_Han_2003_MMB_2
– 10.1594 (Prefix) = TIB as the registration agency
– WDCC = research institute
– W_Han_2003_MMB_2 = internal name of the Data
• DOI is resolvable directly, or via http as http://dx.doi.org/10.1594/WDCC/W_Han_2003_MMB_2
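The breakdown of the example DOI and its http form can be sketched as below. The split of the suffix into institute and internal name follows the convention shown in this TIB example only; it is not a general rule of DOI syntax. The proxy address is the one given on the slide, and the function names are made up for this sketch.

```python
from typing import Dict
from urllib.parse import quote

PROXY = "http://dx.doi.org/"  # proxy address as given on the slide


def describe(doi: str) -> Dict[str, str]:
    """Split a TIB-style data DOI into the three parts named on the slide."""
    prefix, _, suffix = doi.partition("/")
    # Assumption: the suffix has the form <institute>/<internal name>.
    institute, _, internal_name = suffix.partition("/")
    return {
        "prefix": prefix,
        "institute": institute,
        "internal_name": internal_name,
    }


def proxy_url(doi: str) -> str:
    """Build the http form of a DOI, percent-encoding unsafe characters."""
    return PROXY + quote(doi, safe="/")


doi = "10.1594/WDCC/W_Han_2003_MMB_2"
print(describe(doi))
print(proxy_url(doi))
```

For this DOI nothing needs encoding, so the second line printed is the same address as on the slide; a suffix containing spaces or other reserved characters would be percent-encoded.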
doi>(1) Citation of Primary Data: illustration of solution
Usage scenario 1: attribution of source
• Dr. Storm is reading publications from Dr. Weather in a journal and would like to analyse her data under different aspects.
• Can resolve the DOI to obtain the data set for use
• In his publication “Comparison of the weather from Hannover and Miami” Dr. Storm cites Dr. Weather’s data using its DOI, referring to the uniqueness and own identity of the original data.
• Citation example: Weather, 2003: “Weather in Hannover for 2003” doi:10.1594/WDCC/W_Han_2003_MMB_2
Usage scenario 2: archiving for re-use
• Mr. Nice is writing a paper about the sales figures of ice cream in Hannover in 2003, but he has no information about the weather.
• Searches via TIB central registration agency metadata search
• Result is doi:10.1594/WDCC/W_Han_2003_MMB_2
• He resolves the DOI to find the data.
• The metadata refers him to the WDCC as publisher and data archive.
• In his paper he cites the data using the DOI.
doi>(2) Names for life: Biological taxonomy
• Problem: “Future-proofing biological nomenclature”
– See Garrity and Lyons, OMICS, 2003
• For a given nomenclature in a biological taxonomy, change occurs
– e.g. new species recognised; species reassigned as the founding species of new genera; synonyms; species split into subspecies which later became separate species
– resulting in changes of names, genera, families, classes, relationships over time
– How does researcher keep track?
• Solution: DOI proposed as tool
– a data model of nomenclature and taxonomy
– enabling disambiguation of synonyms and competing taxonomies
– a metadata resolution service
– enabling dissemination of archived and updated information objects through persistent links
[Figure, build sequence: nomenclature of the genus Alteromonas (type species macleodii) on a timeline from 1972 to 2004. Successive builds add newly named species and show species being moved between Alteromonas and the genera Marinomonas, Oceanospirillum, Shewanella and Pseudoalteromonas.]
(2) Names for Life: illustration of solution
[Figure: information objects (name, taxon, combined name, exemplar, nomos) labelled with DOIs, with links from the web and links to the web: journal articles, strain records, gene annotations, any online information.]
(2) Names for Life: illustration of solution
[Figure: dissemination of the information objects name, taxon, combined name, exemplar, nomos.]
By reasoning over information objects, construct services that can be offered through multiple resolution.
• Look up this name and all its synonyms in PubMed
• Determine whether this exemplar is part of a taxon in another nomos
• Compare this name to the current state (contents) of the taxon
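A “look up this name and all its synonyms” service can be sketched with a toy model in which the taxon, not the name, carries the persistent identifier. This is not the Names for Life data model: the `Taxon` class, the DOI value and the prefix 10.9999 are hypothetical, and the years are read off the slide's timeline for the species that moves from Alteromonas to Shewanella.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Taxon:
    doi: str  # persistent identifier for the taxon (hypothetical value)
    names: List[Tuple[int, str]] = field(default_factory=list)  # (year, name)

    def current_name(self) -> str:
        """The most recently assigned name."""
        return max(self.names)[1]

    def synonyms(self) -> List[str]:
        """Every name the taxon has carried, oldest first."""
        return [name for _, name in sorted(self.names)]


TAXA = [
    Taxon("10.9999/example-taxon-1", [
        (1981, "Alteromonas putrefaciens"),
        (1986, "Shewanella putrefaciens"),
    ]),
]

# Any name, old or new, leads to the same taxon and so to the same DOI.
NAME_INDEX: Dict[str, Taxon] = {
    name: taxon for taxon in TAXA for _, name in taxon.names
}


def lookup(name: str) -> Dict[str, object]:
    taxon = NAME_INDEX[name]
    return {
        "doi": taxon.doi,
        "current": taxon.current_name(),
        "synonyms": taxon.synonyms(),
    }


print(lookup("Alteromonas putrefaciens"))
```

A search service could then query a literature database once per synonym, so that papers published under the older name are not missed.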
doi>
Further reading
• Paskin, Norman. "Digital Object Identifiers for scientific data". Paper presented at 19th International CODATA Conference, Berlin, 10 November 2004. http://www.doi.org/topics/041110CODATAarticleDOI.pdf
• Project announced to develop DOIs for scientific data: http://www.doi.org/news/TIBNews.html
• Garrity, G. M.; Lyons, C. "Future-proofing biological nomenclature". Omics, 2003, Volume 7, Number 1, pp. 31-33. Pre version at http://www.eecs.umich.edu/~jag/wdmbio/garrity.htm
• Harris, Jerald D. ""Published Works" in the electronic age: recommended amendments to Articles 8 and 9 of the Code". Bulletin of Zoological Nomenclature, 61(3), September 2004, pp. 138-148.
• Dyson, Esther. "Online Registries: The DNS and Beyond...". Release 1.0, September 2003. [print only: summary at doi:10.1340/309registries]
• DOI progress report: D-Lib Magazine (online) [http://www.dlib.org/dlib/june03/paskin/06paskin.html]
• Paskin, Norman. "Identification and Metadata: Components of DRM Systems". In E. Becker et al (eds), "Digital Rights Management", Lecture Notes in Computer Science (Springer-Verlag, 2003), pp. 26-61 [http://www.doi.org/topics/drm_paskin_20030113_b1.pdf]
• DOI factsheets etc. http://www.doi.org/factsheets.html