Research Data Management and Sharing for the Social Sciences and Humanities

Research data management and sharing for social and behavioral sciences and humanities
Rebekah Cummings, Research Data Management Librarian
J. Willard Marriott Library
September 15, 2015

In the next two hours…

• Introductions
• What is data management?
• Why manage and share data?
• Data management plans
• Data organization
• Describing data (metadata!)
• Data ownership
• Data storage and security
• Data archiving and sharing
• Data services at the University of Utah
• Wrap-up and Questions

Your Research Data Management Librarian

• Provide guidance on data management to the University of Utah community
• Data management plan consultations
• Help you find a repository to store and share your data
• Help you locate data services on campus
• Provide data management training
• Pilot library data services

• Name
• Department
• Why you’re here

What is data management?

The process of controlling the information (read: data) generated during a research project.

https://www.libraries.psu.edu/psul/pubcur/what_is_dm.html

What are data?

“The recorded factual material commonly accepted in the research community as necessary to validate research findings.”

- U.S. OMB Circular A-110

http://www.whitehouse.gov/omb/circulars/a110/a110.html

Data are diverse

SOCIAL SCIENCE DATA

• Well-established data practices, archives, and standards
• Types of data: surveys, interviews, audio-video, observations, census records, government records, opinion polls
• Long history of capturing and organizing certain types of data (e.g. Census data, CLOSER, ICPSR)

HUMANITIES DATA

• Emerging data practices, archives, and standards
• Types of data: text, photographs, newspapers, letters, birth and death records, records of human history
• More recent history of capturing and organizing data (e.g. HathiTrust, Chronicling America)

What do social science and humanities data have in common?

• Everyone is working digitally now
• Quantitative and qualitative data
• Often work with data that was not created for our purposes
• We all have more data than ever before
• New mandates for DMPs and sharing
• All of us could improve our data management practices

Why manage data?

The two bears’ data management problems

1.  Didn’t know where he stored the data

2.  Saved one copy of the data on a USB drive

3.  Data was in a format that could only be read by outdated, proprietary software

4.  No codebook to explain the variable names

5.  Variable names were not descriptive

6.  No contact information for the co-author Sam Lee

Why manage data?

Your best collaborator is yourself six months from now, and your past self doesn’t answer emails.

Why else manage data?

• Save time and work more efficiently
• Meet grant requirements
• Promote reproducible research
• Enable new discoveries from your data
• Make the results of publicly funded research publicly available

Grant requirements and federal mandates

• National Institutes of Health (2003) – Requires data management plans for grants over $500,000; all manuscripts must be available in PubMed Central within 12 months of publication.
• National Science Foundation (2011) – All NSF grant proposals must include a data management plan.
• The White House OSTP memo (2013) – Federal agencies with over $100 million/year in R&D must develop a plan to support public access to research.

As of 2015…

• NEH Office of Digital Humanities – Requires a two-page data management plan similar to the NSF requirement.
• Bill and Melinda Gates Foundation – Data Access Plan

Hypothetical Scenario

You are working on a research project with a small to medium-sized research team. You uncover something notable in your field and write up the results of that research, which are then accepted by a reputable journal. People start citing your work! Three years later someone accuses you of falsifying your work.

• Would you be able to prove that you did the work as described in the article?
• What would you need to prove you hadn’t falsified the data?
• What should you have done throughout your research study to be able to prove you did the work as described?

Questions from MANTRA training module

Data Management Plans

• What data are generated by your research?
• What is your plan for managing the data?
• How will your data be shared?

Research Data Lifecycle

[Figure: research data lifecycle diagram; visible stage label: PLANNING. Source: http://www.data-archive.ac.uk/create-manage/life-cycle]

Elements of a DMP

• Types of data, including file formats
• Data description
• Data storage
• Data sharing, including confidentiality or security restrictions
• Data archiving and responsibility
• Data management costs

DMPTool – CDL

Data organization

Data are messy

Managing Research Assets

• Identify: Make an audit of what you have and where
• Decide which of your assets you want to keep and which you don’t need
• Organize your assets: Give them descriptive file names, organize them into a logical file structure, and write down your organizational scheme
• Make copies: Keep multiple copies of your research assets. Back up your reference library frequently. Every few years, check your copies to see if you need to export them to a newer format.

From Miriam Posner’s “Managing Research Assets” http://bit.ly/manageresearch

File naming

MyData.xls

MeetingNotes.doc

Presentation.ppt

Assignment1.pdf

Behold! The humanist dataset

File naming best practices

1.  Be descriptive

2.  Don’t be generic

3.  Appropriate length

4.  Be consistent

5.  Think critically about your file names

File naming best practices

• File names should include only letters, numbers, and underscores/dashes.
• No special characters
• No spaces; use dashes, underscores, or camel case (like-this, like_this, or likeThis)
• Not all systems are case sensitive. Assume this, THIS, and tHiS are the same.
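The rules above can be checked mechanically. The following is a minimal Python sketch, not an official tool: the function names `is_safe_name` and `case_collisions` are hypothetical, and the pattern assumes a file name is a stem plus at most one extension.

```python
import re

# Letters, digits, underscores, and dashes, plus an optional extension.
# Assumption for illustration: at most one dot-separated extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)?")


def is_safe_name(file_name):
    """Return True if the name has no spaces or special characters."""
    return SAFE_NAME.fullmatch(file_name) is not None


def case_collisions(file_names):
    """Return pairs of names that differ only by letter case."""
    seen = {}
    collisions = []
    for name in file_names:
        key = name.lower()
        if key in seen and seen[key] != name:
            collisions.append((seen[key], name))
        else:
            seen[key] = name
    return collisions
```

For example, `is_safe_name("July 24 2014_SoilSamples%_v6")` is rejected because of the spaces and the percent sign, while `20140724_NSF_SoilSamples_Cummings` passes.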

Version Control - Numbering

001  002  003  009  010  099

1  10  2  3  9  99

Use leading zeros for scalability

Bonus Tip: Use ordinal numbers (v1,v2,v3) for major version changes and decimals for minor changes (v1.1, v2.6)
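A short Python sketch of why leading zeros matter: file browsers usually sort names as text, so unpadded numbers fall out of order. The stem `scan` and the helper `numbered_name` are hypothetical names used only for illustration.

```python
def numbered_name(stem, number, width=3):
    """Build a name such as scan_007, zero-padded to a fixed width."""
    return f"{stem}_{number:0{width}d}"


numbers = (1, 2, 10, 99, 110)

# Text sorting puts 10 and 110 before 2 when numbers are not padded.
unpadded = sorted(f"scan_{n}" for n in numbers)

# With padding, text order and numeric order agree.
padded = sorted(numbered_name("scan", n) for n in numbers)

print(unpadded)  # ['scan_1', 'scan_10', 'scan_110', 'scan_2', 'scan_99']
print(padded)    # ['scan_001', 'scan_002', 'scan_010', 'scan_099', 'scan_110']
```

Pick a width that leaves room to grow: three digits covers up to 999 files.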

Version Control - Dates

If using dates use YYYYMMDD

June2015 = BAD!

06-18-2015 = BAD!

20150618 = GREAT!

2015-06-18 = This is fine too :)
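If dates are generated by a script, the standard library can produce the YYYYMMDD form directly. This is a small Python sketch; the helper names `date_stamp` and `restamp` are hypothetical.

```python
from datetime import date, datetime


def date_stamp(day, dashes=False):
    """Format a date as YYYYMMDD, or YYYY-MM-DD when dashes=True."""
    return day.strftime("%Y-%m-%d" if dashes else "%Y%m%d")


def restamp(us_style):
    """Convert an MM-DD-YYYY string such as 06-18-2015 to YYYYMMDD."""
    return datetime.strptime(us_style, "%m-%d-%Y").strftime("%Y%m%d")


print(date_stamp(date(2015, 6, 18)))               # 20150618
print(date_stamp(date(2015, 6, 18), dashes=True))  # 2015-06-18
print(restamp("06-18-2015"))                       # 20150618
```

Because the year comes first, sorting these names as text also sorts them chronologically.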

Common elements in a file name

• Project name
• Name of creator
• Description of content
• Name of research team/department
• Date of creation
• Version number

From a DMP: “Each file name, for all types of data, will contain the project acronym PUCCUK; a reference to the file content (survey, interview, media); and the date of an event (such as the date of an interview).”
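One way to keep names consistent is to assemble them from the same elements every time. The Python sketch below is illustrative only: the function `build_file_name`, the element order, and the example date are assumptions, not a convention prescribed in this workshop.

```python
from datetime import date


def build_file_name(project, description, created, version, extension):
    """Join project, content description, date, and version into one name."""
    stamp = created.strftime("%Y%m%d")
    return f"{project}_{description}_{stamp}_v{version:02d}.{extension}"


# Hypothetical example values.
name = build_file_name("PLPP", "EvaluationData", date(2014, 7, 24), 2, "xlsx")
print(name)  # PLPP_EvaluationData_20140724_v02.xlsx
```

Whatever order you choose, write it down in your README so collaborators can follow it.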

Who filed better?

1.  PLPP_EvaluationData_Workshop2_2014.xlsx
2.  MyData.xlsx
3.  publiclibrarypartnershipsprojectevaluationdataworkshop22014CummingsHelenaMontana.xlsx

Who filed better?

1.  July 24 2014_SoilSamples%_v6

2.  20140724_NSF_SoilSamples_Cummings

3.  SoilSamples_FINAL

File organization best practices

• Top level folder should include project title and date.
• Sub-structure should have a clear and consistent naming convention.
• Document your folder structure in a README text file.
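A README that documents the folder structure can be started automatically and then annotated by hand. The following Python sketch makes several assumptions: the function `describe_tree` and the folder names are hypothetical, and the demonstration runs in a temporary folder so it touches no real files.

```python
import tempfile
from pathlib import Path


def describe_tree(root):
    """Return an indented text listing of every folder and file under root."""
    lines = [f"Folder structure of {root.name}", ""]
    for path in sorted(root.rglob("*")):
        depth = len(path.relative_to(root).parts) - 1
        marker = "/" if path.is_dir() else ""
        lines.append("  " * depth + path.name + marker)
    return "\n".join(lines) + "\n"


# Demonstration on a throwaway project folder with hypothetical names.
with tempfile.TemporaryDirectory() as scratch:
    project = Path(scratch) / "PLPP_2014"
    (project / "data").mkdir(parents=True)
    (project / "data" / "PLPP_EvaluationData_Workshop2_2014.xlsx").touch()
    (project / "docs").mkdir()
    listing = describe_tree(project)
    readme_path = project / "README.txt"
    readme_path.write_text(listing, encoding="utf-8")
    readme_written = readme_path.exists()

print(listing)
```

The generated listing is only a starting point; a useful README also explains what each folder is for and how the files relate to one another.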

README files

File Organization Exercise

1.  Is there a better way to organize these files?
2.  Can you spot any problems with the way these files are named?
3.  What files might be missing from this folder?

Describing data

Why describe your data?

Research Documentation

• Grant proposals and related reports
• Applications and approvals (e.g. IRB)
• Codebooks, data dictionaries
• Consent forms
• Surveys, questionnaires, interview protocols
• Transcripts, hard copies of audio and video files
• Any software or code you used (no matter how insignificant or buggy)

Three levels of documentation

• Project level – what the study set out to do, research questions, methods, sampling frames, instruments, protocols, members of the research team
• File or database level – How all the files relate to one another. A README file is a classic way of capturing this information.
• Variable or item level – Full label explaining the meaning of each variable.

http://datalib.edina.ac.uk/mantra/documentation_metadata_citation/

FNAME?

IJ?

Codebooks

Codebooks provide information on the structure, contents, and layout of a data file.

• Column locations and widths for each variable
• Response codes for each variable
• Codes used to indicate nonresponse and missing data
• Questions and skip patterns used in a survey
• Data types
• Variable names

http://www.icpsr.umich.edu/files/deposit/Guide-to-Codebooks_v1.pdf
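To make the idea concrete, here is a minimal Python sketch of a codebook kept as a data structure next to the data. Everything in it is hypothetical: the variable names `Q1_AGE` and `Q2_SAT`, the response codes, and the missing-data code -9 are invented for illustration and do not come from any real survey.

```python
# Hypothetical codebook: one entry per variable, with label, type,
# response codes, and the codes that mean "missing".
CODEBOOK = {
    "Q1_AGE": {
        "label": "Age of respondent in years",
        "type": "integer",
        "missing": [-9],
    },
    "Q2_SAT": {
        "label": "Satisfaction with the workshop",
        "type": "categorical",
        "codes": {1: "Dissatisfied", 2: "Neutral", 3: "Satisfied"},
        "missing": [-9],
    },
}


def decode(variable, value):
    """Translate a stored value into its meaning; missing codes become None."""
    entry = CODEBOOK[variable]
    if value in entry.get("missing", []):
        return None
    return entry.get("codes", {}).get(value, value)


print(CODEBOOK["Q2_SAT"]["label"])  # Satisfaction with the workshop
print(decode("Q2_SAT", 3))          # Satisfied
print(decode("Q1_AGE", -9))         # None
```

Without the codebook, a value of 3 in a column named `Q2_SAT` tells a future reader nothing.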

Structured Data (Metadata)

Unstructured Data

There was a study put out by Dr. Gary Bradshaw from the University of Nebraska Medical Center in 1982 called “Growth of Rodent Kidney Cells in Serum Media and the Effect of Viral Transformation On Growth”. It concerns the cytology of kidney cells.

Structured Data

Title: Growth of rodent kidney cells in serum media and the effect of viral transformations on growth.
Author: Gary Bradshaw
Date: 1982
Publisher: University of Nebraska Medical Center
Subject: Kidney -- Cytology
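The practical payoff of structured data is that software can search and exchange it. The Python sketch below stores the same record as named fields and serializes it to JSON. The lowercase field names follow Dublin Core element names, so mapping "Author" to `creator` is an assumption made here; the helper `matching` is hypothetical.

```python
import json

# The record from the slide, stored as named fields.
record = {
    "title": "Growth of rodent kidney cells in serum media and the effect of viral transformations on growth.",
    "creator": "Gary Bradshaw",
    "date": "1982",
    "publisher": "University of Nebraska Medical Center",
    "subject": "Kidney -- Cytology",
}


def matching(records, field, value):
    """Return the records whose given field equals the given value."""
    return [item for item in records if item.get(field) == value]


# Structured records can be filtered by field and saved in a portable format.
hits = matching([record], "date", "1982")
as_json = json.dumps(record, indent=2)
print(as_json)
```

The unstructured sentence holds the same facts, but a program cannot reliably tell which words are the author and which are the publisher.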

Metadata Fields - Video

• Type/Format
  • .mp4, .avi, .mov
• Run time
• Title
• Producer/author
• Date(s)
• Location(s)
  • Place of production
• Content
  • Annotations
• System requirements for access
  • Windows, QuickTime, RealPlayer
• Download requirement
  • Size of file
  • Software needed
• Contact Info
• Persistent Identifier
• Other documentation

http://www.slideshare.net/RebekahCummings/data-management-for-education-research

Type of Metadata - Audio

• Structural
  • Relationship to other audio files in the same project
• Descriptive
  • Title, creator, subject, description of project, date, content
• Administrative
  • Rights, licensing, contact person
• Technical
  • Equipment used, file format (MP3, WAV, FLAC), software for recording and editing
• Embedded
  • Some files have embedded metadata – date, file format, etc. Do not rely on this as metadata

http://www.slideshare.net/RebekahCummings/data-management-for-education-research

Data Documentation Initiative

• The most recognized standard for describing social science data; often recommended for humanities data as well
• Used by many data repositories
• Extremely mature, XML-based standard
• Hundreds of elements for data description
  • AnalysisUnit = the entity being analyzed in the study or variable
  • DataType = specifies type of data being collected

Dublin Core

Disciplinary Metadata

Digital Curation Centre’s list of subject-specific metadata schemas - http://www.dcc.ac.uk/resources/metadata-standards

Data citation

• Enables easy reuse and verification of your data
• Allows the impact of your data to be tracked
• Creates a scholarly structure that rewards data producers
• Increases citation rate for related publications (Pienta, 2010)

Data ownership

Data Management Rollout Survey (2013)

JISC Data Management Rollout Project Survey Results- 2012- http://damaro.oucs.ox.ac.uk/outputs.xml

UNC Data Ownership Survey (2012)

Table from “Research Data Stewardship at UNC,” 2012

https://www.insidehighered.com/news/2015/07/27/ucsd-wins-key-round-legal-fight-usc-over-huge-research-project

Academic Research Data

[Figure: spectrum running from “More Open” to “Less Open”, with the labels Gov’t Data, Academic Research Data, and Proprietary Data]

Complication #1 - Stakeholders

1.  Researchers
2.  Universities
3.  Funding Agencies
4.  Public

Complication #2 - Terminology

•  Data ownership
•  Data governance
•  Data stewardship

Complication #3 – Data and IP

“The discoverer of a scientific fact as to the nature of the physical world, an historical fact, a contemporary news event, or any other ‘fact’ may not claim to be the ‘author’ of that fact. If anyone may claim authorship of facts, it must be the Supreme Author of us all. The discoverer merely finds and records.”

- Melville Nimmer, 1963

University policy

“The University of Utah retains ownership and stewardship of the scientific data and records for projects conducted at the University or that use University personnel or resources.”

- Research Handbook, Section 9.9

University policy (cont.)

“Except where precluded by the specific terms of a sponsored agreement, tangible research property, including the scientific data and other records of research conducted by the faculty or staff of the University, belongs to the University.”

- Research Handbook, Section 9.9

But what about IP?

University IP includes “the tangible and intangible results of research (including for example data, lab notebooks, charts, etc.)”

- Employee Intellectual Property Assignment Agreement

Intellectual Property – Copyright and Patents

• Faculty members retain copyright over their “traditional scholarly products,” but that term is fairly narrowly defined and would have to be evaluated on a case-by-case basis.
• If you plan on commercializing your data, you must speak with TVC (Technology & Venture Commercialization).

Recap - Who owns the data?

• The University
• The project sponsor if that was negotiated in the contract
• Another institution with which you are collaborating.
• IF you are a faculty member and IF your data can be defined as a “traditional scholarly work” you would retain copyright of your data.

Data responsibility

“The P.I. is responsible for the collection, management, maintenance, and retention of research data accumulated under a research project. The University must retain research data in sufficient detail and for an adequate period of time to enable appropriate responses to questions about accuracy, authenticity, privacy, and compliance with laws and regulations governing the conduct of research. It is the P.I.’s responsibility to determine what records need to be retained to comply with sponsor requirements.”

Research Handbook 9.9.2

Data responsibility (cont.)

“Research data must be archived for a minimum of three years after the final project closeout.”

“The P.I. should develop appropriate procedures for proper archiving and tracking of research data.”

Research Handbook 9.9.4

Data Storage

LOCKSS (Lots of Copies Keep Stuff Safe)

Options for data storage

Personal computers or laptops

Networked drives

External storage devices

Language from a DMP

“All data files will be stored on the University server that is backed up nightly. The University's computing network is protected from viruses by a firewall and anti-virus software. Digital recordings will be copied to the server each day after interviews.

Signed consent forms will be stored in a locked cabinet in the office. Interview recordings and transcripts, which may contain personal information, will be password protected at file-level and stored on the server.

Original versions of the files will always be kept on the server. If copies of files are held on a laptop and edits made, their file names will be changed.”

Ubox – box.utah.edu

Storing Sensitive Data

What kind of sensitive data?

• Human subject data
• Patient information
• Environmental data
• Potentially patentable data

Working with sensitive data

• If possible, collect the necessary data without using direct identifiers
• Otherwise, remove all direct identifiers upon collection or immediately afterwards
• Be careful with indirect identifiers
• Avoid storing or sharing unencrypted personal data electronically
• Talk to the IRB and check HIPAA guidelines

Sensitive data (cont.)

• Include information in your consent forms about how the data will be shared and what steps will be taken to prevent identity disclosure.
• During the data collection phase, do not share sensitive data beyond the research group
• If data will not remain usable with identifiers removed, consider depositing data in an archive with controlled access.

HIPAA “Safe Harbor” de-identification protocol

• 18 HIPAA identifiers – remove these pieces of information for data exports.
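Removing direct identifiers before export can be scripted. The Python sketch below is a simplified illustration, not a compliant Safe Harbor implementation: the set `DIRECT_IDENTIFIERS` lists only four hypothetical column names rather than all 18 identifier types, and dropping columns does nothing about indirect identifiers or identifying details inside free text.

```python
# Illustrative subset only; column names are hypothetical and this is
# NOT the full list of 18 HIPAA identifiers.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "ssn"}


def strip_direct_identifiers(row):
    """Return a copy of the row without the listed identifier columns."""
    return {
        column: value
        for column, value in row.items()
        if column.lower() not in DIRECT_IDENTIFIERS
    }


participant = {
    "name": "Sam Lee",
    "Email": "sam@example.org",
    "age_group": "30-39",
    "response": "Agree",
}
export_row = strip_direct_identifiers(participant)
print(export_row)  # {'age_group': '30-39', 'response': 'Agree'}
```

Treat a script like this as one step in a reviewed process, and confirm the result with your IRB before sharing anything.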

Tools for Working w/ Sensitive Data

• ICPSR Guide to Social Science Data Preparation and Archiving (Chapters 5 & 6) - http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf
• Managing and Sharing Data: UK Data Archive (Ethics and Consent, pages 22-27) - http://www.data-archive.ac.uk/media/2894/managingsharing.pdf
• Identity Finder, Simple Data Masking, Spider, SSN Scanning Tools
• QualAnon – Tool for anonymizing interview transcripts, typed field notes, or other qualitative data. Changes identified names into specified pseudonyms.

Thinking long-term

Archiving ≠ Storage

Data archive services may include:

• Storage redundancy
• Security/confidentiality
• Long term preservation (fixity checks, forward migration)
• Persistent identifiers
• Metadata preparation
• Wider visibility of research
• Secondary analysis tools
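A fixity check means recording a checksum for each file and recomputing it later to confirm the file has not changed or been corrupted. The Python sketch below shows the idea with SHA-256 from the standard library. The function names and the sample file are hypothetical, and real archives may use different algorithms and schedules.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_checksum):
    """Return True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == recorded_checksum


# Demonstration on a throwaway file with a hypothetical name.
with tempfile.TemporaryDirectory() as scratch:
    sample = Path(scratch) / "interview_001.txt"
    sample.write_bytes(b"transcript text")
    recorded = sha256_of(sample)
    unchanged = fixity_ok(sample, recorded)      # file untouched
    sample.write_bytes(b"transcript text, edited")
    changed = not fixity_ok(sample, recorded)    # any edit breaks the match

print(unchanged, changed)
```

Storing the recorded checksums alongside your README lets anyone who receives a copy verify it.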

Archiving Options

• Domain specific repository – ICPSR; GenBank; FlyBase
• General Purpose Data Repository – FigShare; Dryad; Dataverse
• Institutional repository – USpace

Arts/Humanities Data Repository

How to choose a data repository

• Requirements of funding agency/journal
• Subject or discipline options
• Size of dataset
• File formats accepted
• Accessibility of data
• Budget
• Time

Recommended Repositories

• Re3data - index of data repositories at http://www.re3data.org/browse
• PLOS’s guide - http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories
• Princeton’s guide - http://libguides.princeton.edu/c.php?g=84261&p=541339
• Scientific Data’s guide - http://www.nature.com/sdata/data-policies/repositories

Data sharing

Rules for Sharing Your Data

• Publish your data online with a persistent identifier (DOI or ARK)
• Publish your data in a reputable data repository
• Convert your data to stable, non-proprietary formats for long-term access
• Publish enough context to make your data understandable (metadata, code, workflows)
• Link your data to your publications as often as possible

Rules for Sharing Data (cont.)

• State how you want to get credit for your data
• Always cite the sources of data that you use and include data citations with your datasets
• Include datasets in your NSF Biosketch or Faculty Profile

Content from “Ten Simple Rules for the Care and Feeding of Scientific Data” http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003542

University Data Resources

Office of the VP for Research

Research Integrity and Compliance

• Institutional Review Board (IRB)
• Conflict of Interest
• Research Education Training
• Human Subjects
• Animal Subjects
• Lab Safety

Marriott & Eccles Libraries

Daureen Nesdill, Research Data Management Librarian, Sciences

Darell Schmick, Research Librarian, Health Sciences

Rebekah Cummings, Research Data Management Librarian, Social Sciences & Humanities

Marriott Library - Software

• Quantitative and Qualitative Analysis Software – Student Computing Services
  • SPSS
  • Stata
  • NVivo – qualitative data analysis
  • ATLAS.ti – qualitative data analysis
  • MATLAB
  • R
  • SAS

Marriott Library – Subject Guides

Data subject guides: http://campusguides.lib.utah.edu/dataanddatasets

Marriott Library – Digital Humanities

• Explore tools with you and help connect you with other digital humanists.
• Find available digitized source material
• Secure data for text and data mining
• Bookworm – word frequency; visualize trends in historical texts (built off Google Ngram Viewer)
• MALLET – topic modeling
• APIs for programmatic access to large corpora
  • HathiTrust Research Center
  • JSTOR Data for Research
  • Getty Research Institute

Writing a DMP

• DMPTool – https://dmp.cdlib.org/
• ICPSR website - https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/index.html
• Once again… call a data librarian!

Data Creation/Collection

• REDCap – a browser-based tool that allows investigators to create and administer surveys. Data is stored on HIPAA/FERPA-compliant servers.
• LabArchives – electronic lab notebooks being implemented on campus in research labs and classes. [email protected]
• Create audio/video recordings – Faculty Center; [email protected] and [email protected]

Data ownership and commercialization

Technology & Venture Commercialization Office
http://www.tvc.utah.edu/

Dave Morrison, Patent Librarian
Marriott Library, Room 2110K
801-585-6802
[email protected]

Data Visualization

• SCI Institute – Scientific Computing and Imaging Institute - https://www.sci.utah.edu/
• GIS assistance at Marriott Library – [email protected]
  • Creation of interactive mapping projects
  • Locating and creating geospatial data
  • How to work with various GIS platforms (ArcGIS, Google Earth, etc.)

Data Storage & Archiving

• Ubox – HIPAA/FERPA compliant; easy to create account
  • 50 GB free - http://box.utah.edu/
• USpace – Institutional repository
  • 40 GB per data submission - http://uspace.utah.edu/
• Center for High Performance Computing
  • HIPAA/FERPA compliant ($210/TB for 5 years, more for quarterly backups) - https://www.chpc.utah.edu/
• ICPSR – U of U is an institutional member – [email protected]

Online Data Management Training

• ICPSR - https://www.icpsr.umich.edu/icpsrweb/landing.jsp
• UK Data Archive - http://www.data-archive.ac.uk/help/user-faq#3
• MANTRA Data Management Training - http://datalib.edina.ac.uk/mantra/
• RDM Rose - http://rdmrose.group.shef.ac.uk/
• Data Q - http://researchdataq.org/

Major takeaways

• Data management starts at the beginning of a project
• Document your data with a certain level of reuse in mind
• Consider archiving and sharing options when you are done with your project
• Don’t overlook campus resources!

Thank you! Questions? [email protected]

@RebekahCummings

(801) 581-7701

Marriott Library, 1705Y

…or ask now!!