Research Data Management from a disciplinary perspective
Sarah Jones, Digital Curation Centre
Twitter: @sjDCC
Stéphane Goldstein, Research Information Network
Twitter: @stephgold7
Disclaimer
Practice varies greatly by discipline and sub-discipline so it’s hard to generalise
Apologies for any sweeping statements and groupings that don’t fit your model
Image credit: Sweep by Judy Van der Velden CC-BY-NC-ND www.flickr.com/photos/judy-van-der-velden/6757403261
Case studies on disciplinary practice
RIN Information Seeking and Sharing Behaviour www.rin.ac.uk/our-work/using-and-accessing-information-resources
– Life sciences
– Humanities
– Physical sciences
RIN Open Science Case studies www.rin.ac.uk/our-work/data-management-and-curation/open-science-case-studies
SCARP case studies www.dcc.ac.uk/resources/case-studies/scarp
Knowledge Exchange Incentives and motivations for sharing research data (forthcoming)
RLUK research data typology (more from Stéphane)
Groups and disciplines
Arts & Humanities
– Creative arts, languages, philosophy, archaeology…
Social Science
– Economics, history, politics, business, psychology…
Sciences & Engineering
– Physics, astronomy, earth sciences, computing…
Life Sciences
– Biology, ecology, medical and veterinary science…
Arts & Humanities
Outputs may not be termed ‘data’ e.g. sketches, writing, performance, artefacts, ‘work’
Focus on literary outputs & manuscripts in some disciplines
More use of standard tools e.g. Word, Excel – less likely to adapt technologies to fit
Arguably lower awareness and uptake of RDM overall
Creative Arts
Several RDM projects in the creative arts e.g. Kultivate, KAPTUR, VADS4R, CAiRO training...
Resistance to term ‘data’ – too scientific
Importance of personal websites for profile as work is also conducted outside of academia
Visual Arts Data Service - www.vads.ac.uk
Institutional repositories at arts schools accept a broader range of outputs and display content more visually to fill the void e.g. http://research.gold.ac.uk
Sonic Arts Research Unit
Collaboration with IR as a result of losing data
Tension between providing access in a visual / usable way and preserving data
Still use soundcloud and personal websites for access, but these link to ‘master’ copy of data held in IR for preservation
www.dcc.ac.uk/resources/developing-rdm-services/repository-radar
Digital Humanities
Intentional creation of resources rather than just data as by-product of research process
More use of standards e.g. XML & TEI in language resources, image standards and capture quality for digitisation, Dublin Core metadata…
Often include technical experts in project team
Links with cultural heritage collections
Negotiating copyright often a major issue
Sustainability a big challenge
Mapping Edinburgh’s Social History
Historical maps overlaid with all kinds of open data to chart how the town has changed through time
Uses open source tools
Allows you to overlay maps
Picks up on common themes
www.mesh.ed.ac.uk
Social Sciences
Greater awareness and acceptance of RDM by community
Methodology is as much a factor in determining difference as discipline
Nature of data often poses challenges for sharing
Lots of reuse of large survey data
Established metadata standards e.g. Data Documentation Initiative (DDI)
Strong international data centre infrastructure
Public health
Ethics predominant concern
– How to negotiate consent
– How to store, transfer & handle data securely
– How to anonymise and share data
Data integration / linking and curation of longitudinal studies is major concern as data added to over decades
Need for data havens to help control access to data – role for unis e.g. Grampian Data Safe Haven
UK Data Service - http://ukdataservice.ac.uk
Twenty-07: Public health study
Longitudinal study following 4510 people from West of Scotland over 20 years to investigate the reasons for differences in health
Undertook interviews, questionnaires, physical measurements, blood samples etc
Strict access controls and guidelines for data collection
Data managed within the MRC Social and Public Health Sciences Unit and accessible under a data sharing agreement - http://2007study.sphsu.mrc.ac.uk/Revised-Data-Sharing-Policy-has-been-launched.html
Life Sciences
Funders arguably more demanding in terms of data sharing policy
Sharing can be problematic / resisted given the nature of the data, fear of misuse or loss of control over IPR
Data sharing agreements and access committees more common
Data integration & mining key drivers
Research is well-resourced so greater capacity to fund local solutions and tools for RDM during projects
Genetics
Vast quantities of data and rapid growth
– DNA sequence data is doubling every 6-8 months
Well established public databases for gene sequences e.g. GenBank www.ncbi.nlm.nih.gov/genbank
– However even this is on short-term project funding!
Need accession number to publish so driver for sharing and established workflow
European Data Infrastructure projects too e.g. ELIXIR
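To put the doubling figure in perspective, the short sketch below converts a doubling time into a yearly growth factor. It is only an illustration of the arithmetic: the function names are hypothetical, and it assumes steady exponential growth, which the slide does not claim.

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Factor by which a volume grows in 12 months, given its doubling time."""
    return 2 ** (12 / doubling_months)


def projected_volume(start: float, months: float, doubling_months: float) -> float:
    """Volume after `months`, assuming steady exponential growth (an assumption)."""
    return start * 2 ** (months / doubling_months)


if __name__ == "__main__":
    # The slide quotes a doubling time of 6-8 months for DNA sequence data.
    for doubling in (6, 8):
        factor = annual_growth_factor(doubling)
        print(f"doubling every {doubling} months: x{factor:.2f} per year")
```

On these assumptions, a 6-8 month doubling time means sequence data grows roughly three- to four-fold each year, so storage planned for today's holdings is outgrown quickly.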
Neuroscience
Large data volumes due to use of medical imaging
Moving towards larger cohort studies integrating wider range of data types, which strains the balance with ethical requirements around personal data
Costs of data gathering and advances in analysis technology are making field more data intensive - computational methods
Small interdisciplinary teams provide the human infrastructure for RDM, but historically low funder investment in data management at lab level
Disciplinary archives are immature, which has encouraged a tendency for labs to treat longitudinal datasets as intellectual capital
OMERO – Open Microscopy Environment
Monash e-Research Centre helps groups to adopt (and if needed adapt) existing technological solutions
Partnered with a research group to implement OMERO, a secure central repository to help researchers organise, analyse and share images
Resulting tool more sustainable as tailored to specific community need
www.dcc.ac.uk/resources/developing-rdm-services/improving-rdm-monash
Science & Engineering
Large scale can mean RDM is built in as standard and sharing part of workflow e.g. facilities science
Often early adopters and advocates of new technologies e.g. the Grid, wikis & arXiv in particle physics
Archiving established in some cases as data can’t be recreated e.g. NERC data centres for Earth Sciences
Commercial sensitivities can place restrictions on sharing in some fields
Industry partners
Mechanical Engineering
Several RDM projects at Bath e.g. ERIM, REDm-MED
Concept of repository well established in industrial engineering – Product Lifecycle Management (PLM) systems
Preservation issues as data is challenging e.g. CAD files
Less information sharing than other disciplines
– Commercial sensitivities preclude sharing
– Consultancy-style research can lead to internal-only results
– Data generated from private systems, so less applicable to others
Crystallography
X-ray examinations, images and videos of crystal structures, chemical crystallography diffraction images
Established metadata standards e.g. Crystallographic Information Framework (CIF)
Advocates of open science and use of related tools
– UsefulChem - http://usefulchem.wikispaces.com
– LabTrove - www.labtrove.org
eCrystals Archive and Crystallography Open Database (COD)
National Crystallography Service - www.ncs.ac.uk
Astronomy
Established data standards (e.g. FITS and NOA) maintained by community
Access to facilities requires the deposit of raw data, although this can be embargoed
International data centres e.g. Sloan Digital Sky Survey - www.sdss.org
Large volumes of data so transfer can be difficult
Few IPR issues compared to other disciplines
Data products are not always shared
Galaxy Zoo
Citizen Science project started to classify a million galaxies imaged by the Sloan Digital Sky Survey
Over 50 million classifications in the first year, contributed by more than 150,000 people
Classifications were as good as those from professional astronomers
Further projects in astronomy, climatology, biology, humanities… www.galaxyzoo.org
Research data typology
Commissioned by RLUK
Aim: to help librarians improve their ability to engage with researchers on RDM matters, and to enable them to acquire a better understanding of the needs of researchers
A resource structured around a suggested typology of research data, looking at different ways in which data might be categorised
Broad data types
1. How do researchers generate and process data, and for what purpose?
1.1 Method of creation and collection of research data: where the data comes from
1.2 Readiness of research data: extent to which data has been processed
1.3 Use of research data: researchers' main purpose for accessing and using data
2. In what file formats, media and volumes do researchers generate data?
2.1 Medium and format for research data: objects in which data is captured and recorded, electronic storage and file types
2.2 Electronic data volumes: size of files (this is subjective, and based largely on the perception of researchers)
3. How do researchers manage and store their data?
3.1 Storage of research data: where and how data is kept
3.2 Types of metadata: not an exhaustive list, but these are widely-recognised metadata standards
3.3 Metadata standards
3.4 Degree of openness: founded on Royal Society's categorisation of 'intelligent openness'
3.5 Licensing of research data: legal rights appertaining to the use of the data
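One way a librarian might use the typology as a checklist is sketched below: a record with one field per facet, plus a helper that reports which facets are still unanswered for a given dataset. This is a hypothetical illustration, not part of the RLUK resource; the class, field names and example values are all assumptions.

```python
from dataclasses import dataclass, field, asdict

# Facets grouped under the three typology questions (field names are assumed).
TYPOLOGY = {
    "1. Generation and processing": ["creation_method", "readiness", "use"],
    "2. Formats, media and volumes": ["medium_and_format", "volume"],
    "3. Management and storage": [
        "storage", "metadata_types", "metadata_standards", "openness", "licence",
    ],
}


@dataclass
class DatasetProfile:
    """One dataset described against the typology; empty means 'not yet known'."""
    creation_method: str = ""
    readiness: str = ""
    use: str = ""
    medium_and_format: str = ""
    volume: str = ""
    storage: str = ""
    metadata_types: list = field(default_factory=list)
    metadata_standards: list = field(default_factory=list)
    openness: str = ""
    licence: str = ""


def missing_facets(profile: DatasetProfile) -> dict:
    """Return the unanswered facets, grouped by typology question."""
    values = asdict(profile)
    gaps = {}
    for group, facets in TYPOLOGY.items():
        empty = [name for name in facets if not values[name]]
        if empty:
            gaps[group] = empty
    return gaps


if __name__ == "__main__":
    # Illustrative values only: a survey dataset documented with DDI.
    example = DatasetProfile(creation_method="survey", metadata_standards=["DDI"])
    for group, facets in missing_facets(example).items():
        print(group, "->", ", ".join(facets))
```

Listing the gaps by question gives a simple prompt sheet for a conversation with a researcher about what is still undecided.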
An expandable resource
A scaffold onto which disciplinary examples can be hung
Dynamic resource: community input (from librarians, but maybe others too?), crowdsourcing
Turning it into an online interactive tool
Refreshing, curating, adapting the resource
Basic introduction at http://www.powtoon.com/show/fZDm1s0W6TI/research-data-typology-for-rluk-draft/
Conclusions
Lots of work still to do!
Domains different in all respects: data, methods, key RDM concerns, level of infrastructure and support…
Differences exist at sub-discipline level
Need to understand the area
Developing and using RLUK’s typology
How to plug the gaps?
Dozens of different repositories or databases specialising in sub-domains or data types, but still major gaps
– Shared services?
– Institutional services – specialising rather than generic?
– Role of publishers and learned societies?
– Funder calls for domain specific infrastructure?
– Unis to support ground-up development of tools / services?
How can the sector help domain-specific solutions to mature and thrive?