Ndsa 2013-abrams-integrating-repositories-for-data-sharing

NDSA Digital Preservation 2013
Integrating Repositories for Research Data Sharing
Stephen Abrams, California Digital Library
Angela Rizk-Jackson and Julia Kochi, University of California, San Francisco
Noah Wittman, University of California, Berkeley

description

The thorough integration of information technology and resources into scientific workflows has nurtured a new paradigm of data-intensive science. However, far too much research activity still takes place in silos, to the detriment of open scientific inquiry and advancement. Data-intensive science would be facilitated by more universal adoption of good data management practices ensuring the ongoing viability and usability of all legitimate research outputs, including data, and the encouragement of data publication and sharing for reuse. The centerpiece of such data sharing is the digital repository, acting as the foundation for external value-added services supporting and promoting effective data acquisition, publication, discovery, and dissemination. Since a general-purpose curation repository will not be able to offer the same level of specialized user experience provided by disciplinary tools and portals, a layered model built on a stable repository core is an appropriate division of labor, taking best advantage of the relative strengths of the concerned systems. The Merritt repository, operated by the University of California Curation Center (UC3) at the California Digital Library (CDL), functions as a curation core for several data sharing initiatives, including the eScholarship open access publishing platform, the DataONE network, and the Open Context archaeological portal. This presentation will highlight two recent examples of external integration for purposes of research data sharing: DataShare, an open portal for biomedical data at UC, San Francisco; and Research Hub, an Alfresco-based content management system at UC, Berkeley. They both significantly extend Merritt’s coverage of the full research data lifecycle and workflows, both upstream, with augmented capabilities for data description, packaging, and deposit; and downstream, with enhanced domain-specific discovery.
These efforts showcase the catalyzing effect that coupled integration of curation repositories and well-known public disciplinary search environments can have on research data sharing and scientific advancement.

Transcript of Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Page 1: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

NDSA Digital Preservation 2013

Integrating Repositories for Research Data Sharing

Stephen Abrams, California Digital Library

Angela Rizk-Jackson and Julia Kochi, University of California, San Francisco

Noah Wittman, University of California, Berkeley

Page 2: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Why is data curation important?

- Accelerating scientific progress
- Enabling appropriate scrutiny and verification of results
- Promoting integrity and debate
- Facilitating new collaborations
- Avoiding needless duplication of effort
- Increasingly, complying with institutional policies, publication requirements, and funder mandates

Cf. Whyte and Tedds (2011), “Making the case for research data management,” DCC briefing paper, www.dcc.ac.uk/resources/briefing-papers/making-case-rdm

Page 3: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Merritt

- Curation repository available to the UC community and external partners
- Preservation and access
- Content agnostic, model free
- Highly decentralized micro-services architecture

Cf. Abrams, Cruse, Kunze, and Minor (2011), “Curation micro-services: A pipeline metaphor for repositories,” Journal of Digital Information 12(2), journals.tdl.org/jodi/article/view/1605

26 curatorial units, 271 collections, 325,000 objects, 450,000 versions, 4,500,000 files, 13 TB

www.cdlib.org/uc3/merritt
merritt.cdlib.org

Page 4: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Merritt

[Architecture diagram. Labeled components: User agent; Load balancer (×3); UI/API (×3); LDAP (×3); Ingest (×3); Message queue; Inventory; Fixity; RDBMS (×4); No-SQL; Storage broker; Storage node (×2); ONEShare UNM storage node; SAN; SDSC cloud; EZID; DataCite; IDF; DataONE member node; DataONE coordinating node; Web of Knowledge; Primo.]

Page 5: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

(Some) issues to address

- Scale
  - Individual objects ranging from 0 to 47,000 files
  - Individual files ranging from 0 to 14 GB
- Maintaining control
  - Concern over potential loss of control over dissemination and use of data
- User experience
  - Switch from organizational to individual interaction

www.flickr.com/photos/vixon/116447718 www.flickr.com/photos/traftery/4319529821 www.flickr.com/photos/32195273@N05/51076852642

Page 6: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

(Some) issues to address

- Scale
  - Individual objects ranging from 0 to 47,000 files
  - Individual files ranging from 0 to 14 GB
- Maintaining control
  - Concern over potential loss of control over dissemination and use of data
- User experience
  - Switch from organizational to individual interaction
- Augment repository function by composition (when possible) and addition (when necessary)
  - Loosely-coupled integration with external community supported systems and services

Page 7: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Scale

- Avoiding client timeout
  - ≤ 2 GB: File-based stream-based AIP-to-DIP processing
  - > 2 GB: Asynchronous delivery
    - Email notification with personalized, time-limited URL
- Streamlined storage provisioning
  - SDSC cloud, cloud.sdsc.edu

www.kevatron.co.uk/converting-8-24-bit-samples-in-coreaudio-on-ios www.flickr.com/photos/paulbhartzog/680749585
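The size-based delivery rule on this slide can be sketched as follows. This is only an illustration under stated assumptions, not Merritt's implementation: the function names, the HMAC-signed URL format, and the reading of "2 GB" as 2 × 1024³ bytes are all hypothetical. Only the 2 GB threshold and the idea of a personalized, time-limited URL come from the slide.

```python
import hashlib
import hmac
from urllib.parse import urlencode

# From the slide: objects up to 2 GB are streamed, larger ones are delivered
# asynchronously. Reading "2 GB" as 2 * 1024**3 bytes is an assumption.
SYNC_LIMIT_BYTES = 2 * 1024**3


def delivery_mode(size_bytes: int) -> str:
    """Return "stream" at or below the limit, "async" above it."""
    return "stream" if size_bytes <= SYNC_LIMIT_BYTES else "async"


def _signature(base_url: str, user: str, expires_at: int, secret: bytes) -> str:
    # Hypothetical signing scheme: HMAC-SHA256 over URL, user, and expiry.
    message = f"{base_url}|{user}|{expires_at}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def signed_url(base_url: str, user: str, expires_at: int, secret: bytes) -> str:
    """Build a personalized, time-limited download URL (illustrative format)."""
    query = urlencode({
        "user": user,
        "expires": expires_at,
        "sig": _signature(base_url, user, expires_at, secret),
    })
    return f"{base_url}?{query}"


def is_valid(base_url: str, user: str, expires_at: int, sig: str,
             secret: bytes, now: int) -> bool:
    """Accept the link only if it is unexpired and the signature matches."""
    expected = _signature(base_url, user, expires_at, secret)
    return now <= expires_at and hmac.compare_digest(expected, sig)
```

In a design like this, the link emailed to the requester carries its own expiry and is bound to one user, so the server needs no per-request state to reject stale or altered links.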

Page 8: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Control

- Data use agreements (DUAs)
  - Explicit assertion of license requirements and terms of use
  - Curatorial and consumer notification of acceptance

Cf. Brazhnik and Jones (2007), “Anatomy of data integration,” Journal of Biomedical Informatics 40(3): 252-69, doi:10.1016/j.jbi.2006.09.001

From: [email protected]
Subject: Merritt DUA acceptance

Name: Stephen Abrams
Affiliation: California Digital Library
Collection: UCSF DataShare
Object: Frontotemporal Lobar Degeneration (FTLD)
Date: 2013-05-31 09:50:34 PDT
Terms of use: As part of this agreement, Consumer submits to the following statements:
(1) I will receive access to de-identified data and will not attempt to establish the identity of any of the study subjects.
(2) I will share these data only with my immediate co-workers, and I will not transfer these data to other research groups. I understand that these data are available to other research groups through the process by which I obtain them.
(3) I will require anyone in my group who utilizes these data, or anyone with whom I share these data to comply with this data use agreement

...
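The acceptance notice shown on this slide could be produced by something like the sketch below. The field labels follow the sample message; the class, function names, and the in-memory set used to gate downloads are hypothetical and say nothing about how Merritt actually records acceptances.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DuaAcceptance:
    """One consumer's acceptance of a data use agreement (hypothetical model)."""
    name: str
    affiliation: str
    collection: str
    obj: str
    accepted_at: str
    terms: List[str]


def render_notification(a: DuaAcceptance) -> str:
    """Format the acceptance as a plain-text message body."""
    lines = [
        f"Name: {a.name}",
        f"Affiliation: {a.affiliation}",
        f"Collection: {a.collection}",
        f"Object: {a.obj}",
        f"Date: {a.accepted_at}",
        "Terms of use:",
    ]
    # Number the accepted statements in the order they were presented.
    lines += [f"({i}) {term}" for i, term in enumerate(a.terms, start=1)]
    return "\n".join(lines)


def may_download(accepted: set, user: str, obj: str) -> bool:
    """Allow download only after this user has accepted this object's DUA."""
    return (user, obj) in accepted
```

The same rendered body can go to both the curator and the consumer, which matches the slide's "curatorial and consumer notification of acceptance".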

Page 9: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

User experience

- Due to its open eligibility policy, Merritt will always provide a more generic UX than special-purpose or disciplinary systems
- Shifting user roles, shifting expectations
  - Institutional → individual researcher
  - Behavioral expectations set by the commercial/mobile web

Page 10: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

User experience

- Due to its open eligibility policy, Merritt will always provide a more generic UX than special-purpose or disciplinary systems
- Shifting user roles, shifting expectations
  - Institutional → individual researcher
  - Behavioral expectations set by the commercial web
- Integration with extant services that better provide the desired UX
  - DataShare
  - Research Hub

Page 11: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

DataShare

“The goal of the DataShare project is to catalyze widespread sharing of scientific research data”
datashare.ucsf.edu

- UCSF Clinical and Translational Science Institute, ctsi.ucsf.edu
- UCSF Library, www.library.ucsf.edu
- UCSF Center for Imaging of Neurodegenerative Disease, www.radiology.ucsf.edu/cind

Architecture
- DataShare submission client (Ruby/Rails)
- Merritt curation repository
- DataShare discovery portal (XTF/Java)

Page 12: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

DataShare

Prepare Describe Upload Curate Discover Share

Page 13: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

DataShare

Prepare

Best practice advice

Describe Upload Curate Discover Share

Page 14: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

DataShare

Prepare Describe

Schema-directed metadata editor

DataCite schema, schema.datacite.org

Upload Curate Discover Share
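A schema-directed editor typically refuses to submit a record until the schema's required fields are filled in. The sketch below illustrates that check under assumptions: it takes the mandatory DataCite properties to be identifier, creator, title, publisher, and publication year (later schema versions add more), and the dictionary keys are illustrative names, not the schema's XML element names or DataShare's actual data model.

```python
# Mandatory properties of the DataCite kernel as assumed here; the key names
# are illustrative and do not reproduce the schema's XML element names.
MANDATORY = ("identifier", "creators", "titles", "publisher", "publication_year")


def missing_properties(record: dict) -> list:
    """Return the mandatory properties that are absent or empty."""
    return [key for key in MANDATORY if not record.get(key)]


# Placeholder record for illustration only.
example = {
    "identifier": "doi:10.5072/EXAMPLE",
    "creators": ["Researcher, A."],
    "titles": ["Example dataset"],
    "publisher": "",
    "publication_year": 2013,
}

print(missing_properties(example))
```

An editor built this way can list the outstanding fields next to the form instead of rejecting the whole deposit.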

Page 15: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

DataShare

Prepare Describe Upload

File browse or drag-n-drop

Curate Discover Share

Page 16: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

DataShare

Prepare Describe Upload Curate

Manage datasets

Discover Share

Page 17: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

DataShare

Prepare Describe Upload Curate Discover

Faceted search and browse

Share

Page 18: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

DataShare

Prepare Describe Upload Curate Discover Share

DataONE DataCite (soon) Primo

Web of Knowledge SEO

Page 19: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Merritt + DataShare

[Architecture diagram. Labeled components: User agent; Load balancer (×3); UI/API (×3); LDAP (×3); Ingest (×3); Message queue; Inventory; Fixity; RDBMS (×4); No-SQL; Storage broker; Storage node (×2); ONEShare UNM storage node; SAN; SDSC cloud; EZID; DataCite; IDF; DataONE member node; DataONE coordinating node; Web of Knowledge; Primo; DataShare upload; Collection Atom feed; XTF (xtf.cdlib.org); DataShare portal; Lucene.]
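The diagram labels suggest that the DataShare portal is populated from a per-collection Atom feed and indexed with XTF and Lucene. As a rough illustration of the harvesting side only, the sketch below reads entries from an Atom document using the Python standard library. The sample feed, its identifiers and URLs, and the choice of extracted fields are invented for the example; the actual Merritt feed layout and the XTF indexing step are not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Atom namespace as defined by RFC 4287.
ATOM = "{http://www.w3.org/2005/Atom}"

# Invented sample feed, for illustration only.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example collection</title>
  <entry>
    <id>urn:example:1</id>
    <title>First dataset</title>
    <updated>2013-05-31T09:50:34-07:00</updated>
    <link href="https://example.org/d/1"/>
  </entry>
  <entry>
    <id>urn:example:2</id>
    <title>Second dataset</title>
    <updated>2013-06-01T12:00:00-07:00</updated>
  </entry>
</feed>
"""


def parse_entries(feed_xml: str) -> list:
    """Extract id, title, updated, and first link href from each Atom entry."""
    root = ET.fromstring(feed_xml.strip())
    items = []
    for entry in root.findall(f"{ATOM}entry"):
        link = entry.find(f"{ATOM}link")
        items.append({
            "id": entry.findtext(f"{ATOM}id"),
            "title": entry.findtext(f"{ATOM}title"),
            "updated": entry.findtext(f"{ATOM}updated"),
            "href": link.get("href") if link is not None else None,
        })
    return items
```

A harvester along these lines could compare each entry's `updated` value with the last run to decide which objects need re-indexing.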

Page 20: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Research Hub

“Research Hub provides powerful tools for content management and collaboration”
hub.berkeley.edu

Alfresco CMS, www.alfresco.com

- 770 projects, 3,900 users
- Personal file management
- Project collaboration
- Departmental resource pooling
- Research data management
- Desktop sync, mobile app, Adobe Creative Suite

UC Berkeley Information Services and Technology, ist.berkeley.edu

Page 21: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Research Hub

Prepare

Acquire and arrange

Describe Upload Curate Discover Share

Page 22: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Research Hub

Prepare Describe

Schema-directed metadata editors

Upload Curate Discover Share

Page 23: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Research Hub

Prepare Describe Upload

Direct action

Curate Discover Share

Page 24: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Research Hub

Prepare Describe Upload

Policy-based workflow rules

Curate Discover Share

Page 25: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Research Hub

Prepare Describe Upload

Drag-and-drop

Curate Discover Share

Page 26: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Research Hub

Prepare Describe Upload Curate

Manage datasets

Discover Share

Page 27: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Research Hub

Prepare Describe Upload Curate Discover

Search / browse

Share

Page 28: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Research Hub

Prepare Describe Upload Curate Discover Share

Curatorial invitation

Page 29: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Merritt + DataShare + Research Hub

[Architecture diagram. Labeled components: User agent; Load balancer (×3); UI/API (×3); LDAP (×3); Ingest (×3); Message queue; Inventory; Fixity; RDBMS (×4); No-SQL; Storage broker; Storage node (×2); ONEShare UNM storage node; SAN; SDSC cloud; EZID; DataCite; IDF; DataONE member node; DataONE coordinating node; Web of Knowledge; Primo; DataShare upload; Collection Atom feed; XTF (xtf.cdlib.org); DataShare portal; Lucene; Research Hub.]

Page 30: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Future integrations

UCTrust/InCommon federation, Incommon.org

Open Context archaeological portal, opencontext.org

Nuxeo, www.nuxeo.com

UC system-wide DAMS

Islandora, islandora.ca

Fedora Merritt

DPN, www.dpn.org

Page 31: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Sharing research through repositories

- Conform to institutional policy, publication requirements, and funder mandates
- Pro-active curation of valuable research outputs
- Stable citation and access
- High visibility publication and discovery
- Use metrics

Page 32: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

Sharing research through repositories

- Conform to institutional policy, publication requirements, and funder mandates
- Pro-active curation of valuable research outputs
- Stable citation and access
- High visibility publication and discovery
- Use metrics
- Repository layering as an appropriate division of labor
- Exploiting existing capabilities already in local use

Page 33: Ndsa 2013-abrams-integrating-repositories-for-data-sharing

For more information

Merritt
www.cdlib.org/uc3/[email protected] Abrams, David Loy, Patricia Cruse, Mark Reyes, Shirin Faenza, Joan Starr, Scott Fisher, Carly Strasser, Erik Hetzner, Marisa Strong, Joshua Hubbard, Bhavitavya Vedula, Greg Janée, Kenneth Weiss, John Kunze, Perry Willet, Rosalie Lack

DataShare
datashare.ucsf.edu
Geoffrey Boushey, Julia Kochi, Anirvan Chatterjee, Angela Rizk-Jackson, Maninder Kahlon, Michael Weiner

Research Hub
hub.berkeley.edu
Ian Crew, Michael McCarthy (Tribloom), Noah Wittman, Patrick McGrath

www.slideshare.net/UC3/ndsa-2013abramsintegratingrepositoriesfordatasharing