NEUROIMAGING RESEARCH DATA IFE CYCLE MANAGEMENT · The data repository • a iRODS-based ICT system...
Transcript of NEUROIMAGING RESEARCH DATA IFE CYCLE MANAGEMENT · The data repository • a iRODS-based ICT system...
NEUROIMAGING RESEARCH DATA LIFE-CYCLE MANAGEMENT
Hurng-Chun (Hong) Lee, Robert Oostenveld, Erik van den Boogert, Eric Maris
Outlines• Lifecycle RDM: objectives and challenges
• The method
• the RDM protocol
• Donders Research Data Repository (DRDR) - usage of iRODS
• Strength and weakness of DRDR
• Future focuses
2
Lifecycle RDM
• RDM spans the entire research lifecycle
• Objectives:
• long-term data preservation
• scientific-process documentation
• data publication
conception of researchdata acquisition
data analysispublication
3
Challenges• large institute with heterogeneous scientific-administrative workflows
• 600 researchers in PI groups ➞ more than 150 projects per year
• 3 centres with 4 administrative domains
• data complexity:
• text, audio/video, imaging or signal data, etc.
• sensitive data
• size ranges from a few large (>2GB) files to a huge amount of small files (<1MB)
• user expectation
4
The RDM protocoldata acquisition
data analysis
data publication
conception of research
5 * Box icons are made by Roundicons and Pixel Buddha from www.flaticon.com are licensed by CC 3.0 BY.
The RDM protocoldata acquisition
data analysis
data publication
conception of research
research documentation collection
data sharing collection
collection: a container of ✓ data (files/folders) ✓ metadata has a “state” attribute
data acquisition collection
5 * Box icons are made by Roundicons and Pixel Buddha from www.flaticon.com are licensed by CC 3.0 BY.
The RDM protocoldata acquisition
data analysis
data publication
conception of research
research documentation collection
data sharing collection
collection: a container of ✓ data (files/folders) ✓ metadata has a “state” attribute
data acquisition collection
PID
5 * Box icons are made by Roundicons and Pixel Buddha from www.flaticon.com are licensed by CC 3.0 BY.
The RDM protocoldata acquisition
data analysis
data publication
conception of research
research documentation collection
data sharing collection
research administrator
(RA)
manager
contributor
viewer
collection: a container of ✓ data (files/folders) ✓ metadata has a “state” attribute
data acquisition collection
PID
5 * Box icons are made by Roundicons and Pixel Buddha from www.flaticon.com are licensed by CC 3.0 BY.
The RDM protocoldata acquisition
data analysis
data publication
conception of research
research documentation collection
data sharing collection
research administrator
(RA)
manager
contributor
viewer
collection: a container of ✓ data (files/folders) ✓ metadata has a “state” attribute
workflow
responsibility
eligibility
data acquisition collection
PID
5 * Box icons are made by Roundicons and Pixel Buddha from www.flaticon.com are licensed by CC 3.0 BY.
The RDM protocolOrganisational Unit (OU)
data acquisition
data analysis
data publication
conception of research
research documentation collection
data sharing collection
research administrator
(RA)
manager
contributor
viewer
collection: a container of ✓ data (files/folders) ✓ metadata has a “state” attribute
workflow
responsibility
eligibility
data acquisition collection
PID
5 * Box icons are made by Roundicons and Pixel Buddha from www.flaticon.com are licensed by CC 3.0 BY.
The data repository• a iRODS-based ICT system implementing the workflow defined by the protocol
• a single data-management system enabling internal collaboration, and external data sharing
storage system
data management middleware
interfaces
WebDAV Stagermanagement portal
iRODS management rules ELK stack
✓file-based system ✓single and uniform namespace ✓access-controlled metadata management ✓authentication via trusted identity providers ✓role-based authorisation ✓data replication for disaster recovery ✓workflow automation and policy
enforcement
6
Storage resources
vault_ou1_1 vault_ou1_2 vault_ou2_1 vault_ou2_2
resc_ou1 resc_ou2
vault_nl_1
resc_nl
NFS export NFS export
Location A (first replica) Location B (second replica)
stor
age
data of collections of OU1data of collections of OU2
asynchronousdata replication
iRO
DS re
sour
ces
quot
a
quot
a
quot
a
quot
a
files
yste
m
dataflow controle
load-balancing/OU-level quota
two identical copies of data (disaster recovery)
7
Collection namespace
/rdm/DI/DCCN/DAC_3010000.01_123/
iRODS zone
organisation
organisational unit
DRDR collection
admin viewer
manager contributor
viewer
• namespace reflects administrative hierarchy
• metadata in KVU triplets
• role-based authorisation with iRODS groups
8
Management rules• a RPC-like interface for collection management
rdmUpdateCollectionMetadata
Filter out attributes the client doesn’t have right to set
Verify attribute value
Update collection attributes
return up-to-date collection attributes
Server (core.re)Clientinputs: *collName, *kvp
output: *errorcode, *collectionAttrs
9
User provisioning and authentication
• authentication via a national federated IdP
• user is provisioned upon sign-up to the management portal
• IdP attributes are stored as user KVU-triplets in iRODS
• setup PAM authentication on OTP (one-time password) for data access
10
Event logging
iCATPEP
filebeatrodsLog
reporting
auditing
• essential user actions are logged as events
• non-blocking way streaming events to Elastic stack via “filebeat”
11
User interfaces• separating data-access from collection management
• web portal for collection management
• WebDAV (Davrods) for “easy” data transfer
• file stager: a service “intelligently” managing bulk file transfer between a local storage and the repository
actual transfer
file stager
transfer job manager/queue
local storage DRDRirsync
transfer agents
web interface
12
Strength and weakness👍 It fits to a combined scientific-administrative workflow of a
large and heterogeneous institute
👍 It provides sufficient functionality for
1. sharing data for publication
2. implementing Data Management Plan (DMP)
👎 It has weak integration with data analysis facility
👎 It doesn’t implement standard way of organising collection content
13
Future focuses
• seamless integration with computing (data-analysis) facility
• FAIR-ness of published data collections
• adoption beyond neuroimaging
14
Summary• We structured a RDM workflow
• which covers the entire research lifecycle
• in which both researcher and administrator take part of responsibility
• which is specified by protocol; implemented by a iRODS-based digital repository
https://data.donders.ru.nl
15