IETF BOF Data Set Identifier Interoperability
IETF BOF DSII, July 2012
Beth Plale
Director, Data To Insight Center
Indiana University
The DSII BOF
Discussion of persistent identifier solutions (part I) and steps toward achieving interoperability among persistent identifiers (part II) for data sets made available on the Internet.
The initial use case: scientific data sets produced by different research teams.
Other use cases: media developed by different sources and combined into a common collection.
This BoF is not intended to form a working group at this session.
Science Data Deluge
• Much of the data being generated is in the sciences: through ocean instruments, air quality sensors, gene sequencing machines, climate models …
• Research funding agencies want research data from funded efforts to be available for reuse, today and decades into the future:
  – “The National Science Foundation is committed to the principle that the various forms of data collected with public funds belong in the public domain.” (Data Archiving Policy, NSF Social, Behavioral and Economic Sciences)
Problem acute in Long Tail
Power law graph showing popularity ranking. To the right (in yellow) is the long tail; to the left are the few that dominate. Note that the areas of both regions are equal.
Long tail and on-line business
• Chris Anderson (Wired, 2004) popularized the term “long tail”. It has two complementary ideas:
  – First, that merchandise assortments can grow because goods are not limited by shelf space, and
  – Second, that online venues change the demand curve because consumers value niche products.
• These complementary forces result in a tail that steadily grows both longer, as more obscure products are made available, and fatter, as consumers discover products better suited to their tastes.
Long tail and data
• Emerging trend in science of inexpensive instruments producing huge volumes of data.
  – E.g., a genetic sequencing machine, inexpensive enough for purchase by a research lab, yet producing terabytes of data with every run.
• The long tail of science and scholarly activity goes beyond simply project size to encompass the set of sub-disciplines that carry out “small or localized science”.
• These are researchers whose collective numbers actually account for an enormous amount of data-driven science.
Key role of Metadata in Science Data
• Metadata must be preserved when scientific data is generated because metadata is ephemeral (Jim Gray)
• “The management, organization, access, and preservation of digital data is arguably a ‘grand challenge’ of the information age” (Fran Berman, 2008)
• If annotation is left to the scientist, it is not done (U.K. e-Science Core)
• The greater the distance between the data producer and the re-user, the more detailed the metadata that is required.
Generalizing to Needs for Tracking “the Object”
• Definition of “objects”: information resources that could be
  – Data sets
  – Digital documents
  – Software
  – Websites
  – Physical objects: books, bones, statues, etc.
  – Intangible objects: chemicals, diseases, vocabulary terms, performances
Area of largest concern
Metadata Associated with Identifier
Includes: checksums, pointer to metadata, rights information, also:
    C: [opens session]
    C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu? HTTP/1.1
    C:
    S: HTTP/1.1 200 OK
    S: <snip>
    S: erc:
    S: who: Lederberg, Joshua
    S: what: Studies of Human Families for Genetic Linkage
    S: when: 1974
    S: where: http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
    S: [closes session]
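The session above returns the identifier's metadata as a short "erc:" record of label/value lines. As a rough sketch of how a client might consume such a record, the Python below parses the lines that follow the "erc:" marker into a dictionary. The function name `parse_erc` and the simplified one-line-per-field handling are assumptions made for illustration; the sketch makes no network request and works only on the text shown on the slide.

```python
def parse_erc(response_body: str) -> dict[str, str]:
    """Collect 'label: value' lines that follow an 'erc:' marker.

    Simplified on purpose: one field per line, no continuation lines.
    """
    record: dict[str, str] = {}
    in_record = False
    for raw_line in response_body.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "erc:":
            in_record = True
            continue
        if in_record and ":" in line:
            # Split on the first colon only, so URLs in the value stay intact.
            label, _, value = line.partition(":")
            record[label.strip()] = value.strip()
    return record


# Body text copied from the slide's example session (without the "S:" prefixes).
SAMPLE_BODY = """\
erc:
who: Lederberg, Joshua
what: Studies of Human Families for Genetic Linkage
when: 1974
where: http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
"""

record = parse_erc(SAMPLE_BODY)
print(record["who"])   # Lederberg, Joshua
print(record["when"])  # 1974
```

Splitting on the first colon only is what keeps the "where" URL whole; a production parser would also need to handle whatever the snipped headers and multi-line values look like.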
Operations performed upon identifiers
• discovery,
• data access,
• access control, and
• logical arrangement.
We find cases for all of these operations, implying the need for multiple identifiers.
Governance and Cost
• Where are resolvers/assigners run?
• Is the distribution model for resolvers scalable to the levels needed by data object discovery and use?
• What organization(s) have long-term oversight over the continued existence of resolving/assigning/interoperability services?
Part II: Data set identifier Interoperability
• Metadata interoperability
• Relationship interoperability
• Service interoperability
Metadata Interoperability
• One solution: universal implementation of a common metadata scheme for all identifier schemes
• Otherwise: mechanisms through which it is possible to
  – Use descriptive metadata associated with one identifier in the context of another identifier;
  – Aggregate descriptive metadata associated with several different identifiers in a single context.
• And do so without loss of semantic value (meaning).
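One way to picture the two mechanisms above is a per-scheme crosswalk into a shared vocabulary, followed by a merge. The Python sketch below is illustrative only: the scheme names, the common field names (creator, title, date, location), and the second record's fields are all hypothetical and are not taken from any identifier standard. Fields with no mapping are kept under a scheme-qualified name rather than dropped, as a simple nod to the "without loss of meaning" requirement.

```python
# Hypothetical crosswalk tables: native field name -> common field name.
CROSSWALKS = {
    "ark-erc": {"who": "creator", "what": "title", "when": "date", "where": "location"},
    "example-scheme": {"author": "creator", "name": "title", "year": "date", "url": "location"},
}


def to_common(scheme: str, native: dict[str, str]) -> dict[str, str]:
    """Express one scheme's metadata in the common vocabulary."""
    mapping = CROSSWALKS[scheme]
    common: dict[str, str] = {}
    for field, value in native.items():
        if field in mapping:
            common[mapping[field]] = value
        else:
            # No known equivalent: keep it, qualified by scheme, instead of losing it.
            common[f"{scheme}:{field}"] = value
    return common


def aggregate(records: list[tuple[str, dict[str, str]]]) -> dict[str, list[str]]:
    """Merge metadata from several identifiers, keeping distinct values per field."""
    merged: dict[str, list[str]] = {}
    for scheme, native in records:
        for field, value in to_common(scheme, native).items():
            values = merged.setdefault(field, [])
            if value not in values:
                values.append(value)
    return merged


ark_record = {"who": "Lederberg, Joshua", "when": "1974"}
# Hypothetical second record describing the same object under another scheme.
other_record = {"author": "Lederberg, Joshua", "year": "1974", "format": "application/pdf"}

merged = aggregate([("ark-erc", ark_record), ("example-scheme", other_record)])
print(merged["creator"])  # ['Lederberg, Joshua']
```

A flat field-to-field table is the simplest possible crosswalk; real schemes often differ in granularity, which is where loss of semantic value tends to creep in.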
Relationship Interoperability
• Standard mechanisms for expressing relationships between the objects identified under different identifier schemes
  – "The publisher identified with this [standard party identifier] is the publisher of this journal identified with this ISSN."
• This implies development of a standard set of typed relationships between identifiers, with well-defined semantics.
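A typed relationship of the kind described above can be modeled as a subject identifier, a relation name drawn from an agreed vocabulary, and an object identifier. The Python sketch below assumes a tiny, made-up vocabulary (`ALLOWED_RELATIONS`) and placeholder identifier values; none of these names come from a published standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifier:
    scheme: str  # e.g. "issn", "ark", "party-id" (labels chosen for illustration)
    value: str


@dataclass(frozen=True)
class Relationship:
    subject: Identifier
    relation: str
    obj: Identifier


# Hypothetical controlled vocabulary of relation types.
ALLOWED_RELATIONS = {"isPublisherOf", "isPartOf", "isVersionOf"}


def make_relationship(subject: Identifier, relation: str, obj: Identifier) -> Relationship:
    """Build a relationship, rejecting relation types outside the vocabulary."""
    if relation not in ALLOWED_RELATIONS:
        raise ValueError(f"unknown relation type: {relation}")
    return Relationship(subject, relation, obj)


def related(relationships: list[Relationship], subject: Identifier, relation: str) -> list[Identifier]:
    """Return every object linked to `subject` by `relation`."""
    return [r.obj for r in relationships if r.subject == subject and r.relation == relation]


# Placeholder values, not real identifiers.
publisher = Identifier("party-id", "EXAMPLE-PUBLISHER-001")
journal = Identifier("issn", "0000-0000")

links = [make_relationship(publisher, "isPublisherOf", journal)]
print(related(links, publisher, "isPublisherOf") == [journal])  # True
```

Restricting `relation` to a fixed set is what stands in for "well-defined semantics" here: two systems can only exchange a relationship if both recognize its type.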
Service Interoperability
• The creation of common services:
  – "...the use of shared syntax or physical interface for request/response for provision of services and/or data."
• Types of services might include:
  – Metadata look-up services: user resolves identifier to set of metadata about object
  – Identifier discovery services: user with limited set of metadata can discover identifier or identifiers for that object.
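The two service types listed above could share one request/response interface regardless of the identifier scheme behind it. The Python sketch below expresses that idea as a `typing.Protocol` with an in-memory stand-in; the interface name, method names, and the exact-match discovery rule are assumptions for illustration, not a proposed API.

```python
from typing import Protocol


class IdentifierService(Protocol):
    """Hypothetical shared interface that any identifier scheme could implement."""

    def lookup_metadata(self, identifier: str) -> dict[str, str]: ...

    def discover_identifiers(self, partial: dict[str, str]) -> list[str]: ...


class InMemoryService:
    """Toy implementation backed by a dictionary, for demonstration only."""

    def __init__(self, records: dict[str, dict[str, str]]) -> None:
        self._records = dict(records)

    def lookup_metadata(self, identifier: str) -> dict[str, str]:
        # Metadata look-up: identifier -> metadata about the object.
        return dict(self._records[identifier])

    def discover_identifiers(self, partial: dict[str, str]) -> list[str]:
        # Identifier discovery: every identifier whose metadata matches all given fields.
        return sorted(
            identifier
            for identifier, metadata in self._records.items()
            if all(metadata.get(key) == value for key, value in partial.items())
        )


def describe(service: IdentifierService, identifier: str) -> str:
    """Client code depends only on the shared interface, not on the scheme."""
    metadata = service.lookup_metadata(identifier)
    return f"{identifier}: {metadata.get('what', 'unknown')}"


service = InMemoryService({
    "ark:/12025/psbbantu": {
        "who": "Lederberg, Joshua",
        "what": "Studies of Human Families for Genetic Linkage",
        "when": "1974",
    },
})

print(service.discover_identifiers({"who": "Lederberg, Joshua", "when": "1974"}))
# ['ark:/12025/psbbantu']
print(describe(service, "ark:/12025/psbbantu"))
```

Because `describe` is typed against the protocol, a Handle-backed or DOI-backed implementation could be swapped in without changing client code, which is the point of service interoperability.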
References
• EPIC: European based. Works with Handle System, http://www.pidconsortium.eu
• EZID: long term identifiers made easy, works with both DataCite DOI and ARK, http://n2t.net/ezid
• The ARK Identifier Scheme, Internet-Draft, 2012-04, http://www.ietf.org/internet-drafts/draft-kunze-ark-16.txt
• The Handle System, http://www.handle.net/
  – Handle System Overview, Nov 03, RFC 3650
  – Handle System Namespace and Service Definition, Nov 03, RFC 3651
  – Handle System Protocol (v2.1), Nov 03, RFC 3652
• Terminology and Use Cases for Interoperability of Identifier Resolution Systems, Internet-Draft, 2012-07, https://datatracker.ietf.org/doc/draft-kahn-dsii-id-res-sys/
• On the utility of identification schemes for digital earth science data: an assessment and recommendations, http://rd.springer.com/article/10.1007/s12145-011-0083-6/fulltext.html
• Identifier Interoperability: A Report on Two Recent ISO Activities, http://www.dlib.org/dlib/april06/paskin/04paskin.html