Research Data Management and Sharing for the Social Sciences and Humanities
Research data management and sharing for social and behavioral sciences and humanities
Rebekah Cummings, Research Data Management Librarian
J. Willard Marriott Library
September 15, 2015
In the next two hours…
• Introductions
• What is data management?
• Why manage and share data?
• Data management plans
• Data organization
• Describing data (metadata!)
• Data ownership
• Data storage and security
• Data archiving and sharing
• Data services at the University of Utah
• Wrap-up and Questions
Your Research Data Management Librarian
• Provide guidance on data management to the University of Utah community
• Data management plan consultations
• Help you find a repository to store and share your data
• Help you locate data services on campus
• Provide data management training
• Pilot library data services
What is data management?
The process of controlling the information (read: data)
generated during a research project.
https://www.libraries.psu.edu/psul/pubcur/what_is_dm.html
What are data?
“The recorded factual material commonly accepted in the research community as necessary to validate
research findings.”
- U.S. OMB Circular A-110
http://www.whitehouse.gov/omb/circulars/a110/a110.html
SOCIAL SCIENCE DATA
• Well-established data practices, archives, and standards
• Types of data: surveys, interviews, audio-video, observations, census records, government records, opinion polls
• Long history of capturing and organizing certain types of data (e.g. Census data, CLOSER, ICPSR)
HUMANITIES DATA
• Emerging data practices, archives, and standards
• Types of data: text, photographs, newspapers, letters, birth and death records, records of human history
• More recent history of capturing and organizing data (e.g. HathiTrust, Chronicling America)
What do social science and humanities data have in common?
• Everyone is working digitally now
• Quantitative and qualitative data
• We often work with data that was not created for our purposes
• We all have more data than ever before
• New mandates for DMPs and sharing
• All of us could improve our data management practices
Two bears data management problems
1. Didn’t know where he stored the data
2. Saved one copy of the data on a USB drive
3. Data was in a format that could only be read by outdated, proprietary software
4. No codebook to explain the variable names
5. Variable names were not descriptive
6. No contact information for the co-author Sam Lee
Why manage data?
Your best collaborator is yourself six months from now,
and your past self doesn’t answer emails.
Why else manage data?
• Save time and increase efficiency
• Meet grant requirements
• Promote reproducible research
• Enable new discoveries from your data
• Make the results of publicly funded research publicly available
Grant requirements and federal mandates
• National Institutes of Health (2003) – Required data management plans for grants over $500,000; all manuscripts in PubMed Central within 12 months of publication.
• National Science Foundation (2011) – All NSF grants must have a data management plan.
• The White House OSTP memo (2013) – Federal agencies with over $100 million/year in R&D must develop a plan to support public access to research.
As of 2015…
• NEH Office of Digital Humanities – Requires a two-page data management plan similar to the NSF requirement.
• Bill and Melinda Gates Foundation – Data Access Plan
Hypothetical Scenario
You are working on a research project with a small- to medium-sized research team. You uncover something notable in your field and write up the results of that research, which are then accepted by a reputable journal. People start citing your work! Three years later someone accuses you of falsifying your work.
• Would you be able to prove that you did the work as described in the article?
• What would you need to prove you hadn’t falsified the data?
• What should you have done throughout your research study to be able to prove you did the work as described?
Questions from MANTRA training module
Data Management Plans
• What data are generated by your research?
• What is your plan for managing the data?
• How will your data be shared?
Research Data Lifecycle: PLANNING
Source: http://www.data-archive.ac.uk/create-manage/life-cycle
Elements of a DMP
• Types of data, including file formats
• Data description
• Data storage
• Data sharing, including confidentiality or security restrictions
• Data archiving and responsibility
• Data management costs
Managing Research Assets
• Identify: Make an audit of what you have and where
• Decide which of your assets you want to keep and which you don’t need
• Organize your assets: Give them descriptive file names, organize them into a logical file structure, and write down your organizational scheme
• Make copies: Keep multiple copies of your research assets. Back up your reference library frequently. Every few years, check your copies to see if you need to export them to a newer format.
From Miriam Posner’s “Managing Research Assets” http://bit.ly/manageresearch
File naming best practices
1. Be descriptive
2. Don’t be generic
3. Appropriate length
4. Be consistent
5. Think critically about your file names
File naming best practices
• File names should include only letters, numbers, and underscores/dashes.
• No special characters
• No spaces; use dashes, underscores, or camel case (like-this, like_this, or likeThis)
• Not all systems are case sensitive. Assume this, THIS, and tHiS are the same.
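The character rules above can be checked automatically. The following is a minimal sketch, not a standard tool: the function names `is_safe_file_name` and `case_collisions` are made up for illustration, and the check covers only the allowed characters, not whether a name is descriptive.

```python
import re

# Letters, digits, underscores, and dashes, with an optional single extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)?")


def is_safe_file_name(name):
    """Return True if the name has no spaces or special characters."""
    return SAFE_NAME.fullmatch(name) is not None


def case_collisions(names):
    """Return groups of names that differ only by upper/lower case."""
    groups = {}
    for name in names:
        groups.setdefault(name.lower(), []).append(name)
    return [group for group in groups.values() if len(group) > 1]
```

For example, `is_safe_file_name("July 24 2014_SoilSamples%_v6")` fails because of the spaces and the percent sign, while `is_safe_file_name("MyData.xlsx")` passes even though the name is generic. A check like this catches bad characters, not bad naming decisions.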
Version Control - Numbering
With leading zeros: 001, 002, 003, 009, 010, 099
Without leading zeros: 1, 10, 2, 3, 9, 99
Use leading zeros for scalability
Bonus Tip: Use ordinal numbers (v1, v2, v3) for major version changes and decimals for minor changes (v1.1, v2.6)
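Why leading zeros matter is easiest to see by sorting names as text, which is how most file browsers list them. This is a small sketch; the helper `numbered_name` and the three-digit width are assumptions chosen for illustration.

```python
def numbered_name(stem, number, width=3):
    """Append a zero-padded sequence number, e.g. interview_007."""
    return f"{stem}_{number:0{width}d}"


numbers = [1, 2, 3, 9, 10, 99]

# Sorted as text, unpadded numbers fall out of numeric order.
unpadded = sorted(str(n) for n in numbers)

# Zero-padded numbers sort the same way as text and as numbers.
padded = sorted(f"{n:03d}" for n in numbers)
```

Here `unpadded` comes out as 1, 10, 2, 3, 9, 99, while `padded` stays in the intended order. Pick a width that leaves room for the largest number you expect.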
Version Control - Dates
If using dates, use YYYYMMDD
June2015 = BAD!
06-18-2015 = BAD!
20150618 = GREAT!
2015-06-18 = This is fine too :)
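A short sketch of producing these date stamps with Python's standard `datetime` module. The helper name `date_stamp` is an assumption for illustration. The two sorted lists show why year-first dates are preferred: they sort chronologically as plain text.

```python
from datetime import date


def date_stamp(day, dashed=False):
    """Format a date as YYYYMMDD, or YYYY-MM-DD when dashed is True."""
    return day.strftime("%Y-%m-%d" if dashed else "%Y%m%d")


# Month-first dates sort by month, so January 2016 lands before June 2015.
us_style = sorted(["06-18-2015", "01-05-2016"])

# Year-first dates sort in true chronological order.
iso_style = sorted([date_stamp(date(2015, 6, 18)), date_stamp(date(2016, 1, 5))])
```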
Common elements in a file name
• Project name
• Name of creator
• Description of content
• Name of research team/department
• Date of creation
• Version number
From a DMP: “Each file name, for all types of data, will contain the project acronym PUCCUK; a reference to the file content (survey, interview, media) and the date of an event (such as the date of an interview).”
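One way to keep these elements consistent is to assemble every file name with the same small function. This is only a sketch: the element order, the underscore separator, and the two-digit version are choices made for illustration, not a required convention.

```python
from datetime import date


def build_file_name(project, content, event_date, version, extension):
    """Join project, content, date (YYYYMMDD), and version into one name."""
    parts = [project, content, event_date.strftime("%Y%m%d"), f"v{version:02d}"]
    return "_".join(parts) + "." + extension


example = build_file_name("PLPP", "EvaluationData", date(2014, 7, 24), 1, "xlsx")
```

With these hypothetical inputs, `example` is `PLPP_EvaluationData_20140724_v01.xlsx`. Whatever scheme you choose, write it down in your documentation.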
Who filed better?
1. PLPP_EvaluationData_Workshop2_2014.xlsx
2. MyData.xlsx
3. publiclibrarypartnershipsprojectevaluationdataworkshop22014CummingsHelenaMontana.xlsx
Who filed better?
1. July 24 2014_SoilSamples%_v6
2. 20140724_NSF_SoilSamples_Cummings
3. SoilSamples_FINAL
File organization best practices
• Top level folder should include project title and date.
• Sub-structure should have a clear and consistent naming convention.
• Document your folder structure in a README text file.
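The three practices above can be combined in a few lines. The sketch below creates a top-level folder named with project title and date, adds subfolders, and writes a README that documents the structure. The project title, date, and subfolder names are hypothetical, and the demonstration runs in a temporary directory so nothing is left on disk.

```python
import tempfile
from pathlib import Path


def create_project(root, title, start_date, subfolders):
    """Create <title>_<start_date>/ with subfolders and a README.txt listing them."""
    project = Path(root) / f"{title}_{start_date}"
    project.mkdir(parents=True, exist_ok=True)
    for name in subfolders:
        (project / name).mkdir(exist_ok=True)
    lines = [f"Project: {title}", f"Start date: {start_date}", "", "Folder structure:"]
    lines += [f"  {name}/" for name in subfolders]
    (project / "README.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return project


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as scratch:
    project = create_project(scratch, "SoilSamples", "20140724",
                             ["data_raw", "data_clean", "docs"])
    created = sorted(p.name for p in project.iterdir())
    readme = (project / "README.txt").read_text(encoding="utf-8")
```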
File Organization Exercise
1. Is there a better way to organize these files?
2. Can you spot any problems with the way these files are named?
3. What files might be missing from this folder?
Research Documentation
• Grant proposals and related reports
• Applications and approvals (e.g. IRB)
• Codebooks, data dictionaries
• Consent forms
• Surveys, questionnaires, interview protocols
• Transcripts, hard copies of audio and video files
• Any software or code you used (no matter how insignificant or buggy)
Three levels of documentation
• Project level – what the study set out to do, research questions, methods, sampling frames, instruments, protocols, members of the research team
• File or database level – How all the files relate to one another. A README file is a classic way of capturing this information.
• Variable or item level – Full label explaining the meaning of each variable.
http://datalib.edina.ac.uk/mantra/documentation_metadata_citation/
Codebooks
Codebooks provide information on the structure, contents, and layout of a data file.
• Column locations and widths for each variable
• Response codes for each variable
• Codes used to indicate nonresponse and missing data
• Questions and skip patterns used in a survey
• Data types
• Variable names
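To make the idea concrete, here is a minimal sketch of a codebook kept as a machine-readable structure. Every variable name, label, and code below is invented for illustration; a real codebook would come from your own instrument.

```python
# Hypothetical codebook: one entry per variable in the data file.
CODEBOOK = {
    "Q1_AGE": {
        "label": "Age of respondent in years",
        "type": "integer",
        "missing": [-9],
    },
    "Q2_EMPLOY": {
        "label": "Current employment status",
        "type": "categorical",
        "codes": {1: "Employed", 2: "Unemployed", 3: "Not in labor force"},
        "missing": [-9],
    },
}


def decode(variable, value):
    """Translate a stored value into its meaning; missing codes become None."""
    entry = CODEBOOK[variable]
    if value in entry.get("missing", []):
        return None
    return entry.get("codes", {}).get(value, value)
```

With a structure like this, `decode("Q2_EMPLOY", 2)` gives "Unemployed" and `decode("Q1_AGE", -9)` gives `None`, so a future reader does not have to guess what a 2 or a -9 means.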
Structured Data (Metadata)
Unstructured Data
There was a study put out by Dr. Gary Bradshaw from the University of Nebraska Medical Center in 1982 called “Growth of Rodent Kidney Cells in Serum Media and the Effect of Viral Transformation On Growth”. It concerns the cytology of kidney cells.
Structured Data
Title: Growth of rodent kidney cells in serum media and the effect of viral transformations on growth.
Author: Gary Bradshaw
Date: 1982
Publisher: University of Nebraska Medical Center
Subject: Kidney -- Cytology
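The practical benefit of structured metadata is that software can search and exchange it. The sketch below stores the same record as named fields and saves it as JSON. The lowercase field names and the helper `find_by` are assumptions for illustration, not a formal metadata standard.

```python
import json

# The example record, stored as named fields instead of a sentence.
record = {
    "title": "Growth of rodent kidney cells in serum media and the effect "
             "of viral transformations on growth",
    "author": "Gary Bradshaw",
    "date": "1982",
    "publisher": "University of Nebraska Medical Center",
    "subject": "Kidney -- Cytology",
}


def find_by(records, field, value):
    """Return every record whose field exactly matches the value."""
    return [r for r in records if r.get(field) == value]


# JSON is a plain-text format that other tools can read back.
as_json = json.dumps(record, indent=2)
```

A search such as `find_by([record], "date", "1982")` works reliably here, whereas finding the date inside the unstructured sentence would require guessing at its wording.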
Metadata Fields - Video
• Type/Format
  ◦ .mp4, .avi, .mov
• Run time
• Title
• Producer/author
• Date(s)
• Location(s)
  ◦ Place of production
• Content
  ◦ Annotations
• System requirements for access
  ◦ Windows, QuickTime, RealPlayer
• Download requirements
  ◦ Size of file
  ◦ Software needed
• Contact Info
• Persistent Identifier
• Other documentation
http://www.slideshare.net/RebekahCummings/data-management-for-education-research
Types of Metadata - Audio
• Structural
  ◦ Relationship to other audio files in the same project
• Descriptive
  ◦ Title, creator, subject, description of project, date, content
• Administrative
  ◦ Rights, licensing, contact person
• Technical
  ◦ Equipment used, file format (MP3, WAV, FLAC), software for recording and editing
• Embedded
  ◦ Some files have embedded metadata – date, file format, etc. Do not rely on this as metadata
http://www.slideshare.net/RebekahCummings/data-management-for-education-research
Data Documentation Initiative
• Most recognized standard for describing social science data; often recommended for humanities data as well.
• Used by many data repositories
• Extremely mature, XML-based standard
• Hundreds of elements for data description
  ◦ AnalysisUnit = the entity being analyzed in the study or variable
  ◦ DataType = specifies type of data being collected
Disciplinary Metadata
Digital Curation Centre’s list of subject-specific metadata schemas - http://www.dcc.ac.uk/resources/metadata-standards
Data citation
• Enables easy reuse and verification of your data
• Allows the impact of your data to be tracked
• Creates a scholarly structure that rewards data producers
• Increases citation rate for related publications (Pienta, 2010)
Data Management Rollout Survey (2013)
JISC Data Management Rollout Project Survey Results, 2012 - http://damaro.oucs.ox.ac.uk/outputs.xml
https://www.insidehighered.com/news/2015/07/27/ucsd-wins-key-round-legal-fight-usc-over-huge-research-project
Complication #3 – Data and IP
“The discoverer of a scientific fact as to the nature of the physical world, an historical fact, a contemporary news event, or any other ‘fact’ may not claim to be the ‘author’ of that fact. If anyone may claim authorship of facts, it must be the Supreme Author of us all. The discoverer merely finds and records.”
- Melville Nimmer, 1963
University policy
“The University of Utah retains ownership and stewardship of the scientific data and records for projects conducted at the University or that use University personnel or resources.”
- Research Handbook, Section 9.9
University policy (cont.)
“Except where precluded by the specific terms of a sponsored agreement, tangible research property, including the scientific data and other records of research conducted by the faculty or staff of the University, belongs to the University.”
- Research Handbook, Section 9.9
But what about IP?
University IP includes “the tangible and intangible results of research (including for example data, lab notebooks, charts, etc.)”
- Employee Intellectual Property Assignment Agreement
Intellectual Property – Copyright and Patents
• Faculty members retain copyright over their “traditional scholarly products,” but that term is fairly narrowly defined and would have to be evaluated on a case-by-case basis.
• If you plan on commercializing your data, you must speak with TVC (Technology & Venture Commercialization).
Recap - Who owns the data?
• The University
• The project sponsor, if that was negotiated in the contract
• Another institution with which you are collaborating
• IF you are a faculty member and IF your data can be defined as a “traditional scholarly work,” you would retain copyright of your data.
Data responsibility
“The P.I. is responsible for the collection, management, maintenance, and retention of research data accumulated under a research project. The University must retain research data in sufficient detail and for an adequate period of time to enable appropriate responses to questions about accuracy, authenticity, privacy, and compliance with laws and regulations governing the conduct of research. It is the P.I.’s responsibility to determine what records need to be retained to comply with sponsor requirements.”
Research Handbook 9.9.2
Data responsibility (cont.)
“Research data must be archived for a minimum of three years after the final project closeout.”
“The P.I. should develop appropriate procedures for proper archiving and tracking of research data.”
Research Handbook 9.9.4
Language from a DMP
“All data files will be stored on the University server that is backed up nightly. The University's computing network is protected from viruses by a firewall and anti-virus software. Digital recordings will be copied to the server each day after interviews.
Signed consent forms will be stored in a locked cabinet in the office. Interview recordings and transcripts, which may contain personal information, will be password protected at file-level and stored on the server.
Original versions of the files will always be kept on the server. If copies of files are held on a laptop and edits made, their file names will be changed.”
What kind of sensitive data?
• Human subject data
• Patient information
• Environmental data
• Potentially patentable data
Working with sensitive data
• If possible, collect the necessary data without using direct identifiers
• Otherwise, remove all direct identifiers upon collection or immediately afterwards
• Be careful with indirect identifiers
• Avoid storing or sharing unencrypted personal data electronically
• Talk to IRB / check HIPAA guidelines
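Removing direct identifiers from a tabular export can be as simple as dropping columns before the file leaves the research group. The sketch below is illustrative only: the column names in `DIRECT_IDENTIFIERS` are hypothetical, dropping columns does nothing about indirect identifiers, and this is not a substitute for IRB guidance or a full de-identification protocol.

```python
import csv
import io

# Hypothetical column names treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def strip_identifiers(csv_text, identifiers=DIRECT_IDENTIFIERS):
    """Return the CSV text with identifier columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [f for f in (reader.fieldnames or []) if f.lower() not in identifiers]
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=kept, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow({field: row[field] for field in kept})
    return output.getvalue()


raw = "name,age,email,response\nSam Lee,41,sam@example.org,Agree\n"
cleaned = strip_identifiers(raw)
```

After this step, `cleaned` keeps only the `age` and `response` columns. Age combined with other fields can still identify someone in a small sample, which is why indirect identifiers need a separate review.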
Sensitive data (cont.)
• Include information in your consent forms about how the data will be shared and what steps will be taken to prevent identity disclosure.
• During the data collection phase, do not share sensitive data beyond the research group.
• If data will not remain usable with identifiers removed, consider depositing data in an archive with controlled access.
HIPAA “Safe Harbor” de-identification protocol
• 18 HIPAA Identifiers – remove these pieces of information for data exports.
Tools for Working w/ Sensitive Data
• ICPSR Guide to Social Science Data Preparation and Archiving (Chapters 5 & 6) - http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf
• Managing and Sharing Data: UK Data Archive (Ethics and Consent, pages 22-27) http://www.data-archive.ac.uk/media/2894/managingsharing.pdf
• Identity Finder, Simple Data Masking, Spider, SSN Scanning Tools
• QualAnon – Tool for anonymizing interview transcripts, typed field notes, or other qualitative data. Changes identified names into specified pseudonyms.
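The general idea of replacing identified names with specified pseudonyms can be sketched in a few lines. This is not how QualAnon itself is implemented; the function name `pseudonymize` and the name-to-pseudonym mapping are assumptions for illustration. It only replaces exact names you list, so nicknames, misspellings, and other identifying details still need manual review.

```python
import re


def pseudonymize(text, mapping):
    """Replace each listed name with its pseudonym, matching whole words only."""
    if not mapping:
        return text
    # Try longer names first so "Sam Lee" is matched before "Sam".
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text)


transcript = "Sam Lee said the survey was long. Later Sam agreed to a follow-up."
masked = pseudonymize(transcript, {"Sam Lee": "Participant A", "Sam": "Participant A"})
```

Keep the mapping file itself as sensitive data, stored separately from the masked transcripts.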
Archiving ≠ Storage
Data archive services may include:
• Storage redundancy
• Security/confidentiality
• Long-term preservation (fixity checks, forward migration)
• Persistent identifiers
• Metadata preparation
• Wider visibility of research
• Secondary analysis tools
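A fixity check, one of the preservation services named on this slide, confirms that a file has not changed since it was deposited by comparing checksums. The sketch below uses SHA-256 from Python's standard library. The manifest layout (file path mapped to recorded checksum) and the function names are assumptions for illustration, and the demonstration runs in a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_fixity(manifest):
    """Return the paths whose current checksum differs from the recorded one."""
    return [path for path, recorded in manifest.items() if sha256_of(path) != recorded]


# Demonstration: record a checksum, then alter the file and check again.
with tempfile.TemporaryDirectory() as scratch:
    data_file = Path(scratch) / "survey.csv"
    data_file.write_text("id,score\n1,4\n", encoding="utf-8")
    manifest = {str(data_file): sha256_of(data_file)}
    unchanged = failed_fixity(manifest)
    data_file.write_text("id,score\n1,5\n", encoding="utf-8")
    changed = failed_fixity(manifest)
```

The first check reports nothing, and the second reports the altered file. Archives run checks like this on a schedule; plain storage typically does not.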
Archiving Options
• Domain specific repository – ICPSR; GenBank; FlyBase
• General Purpose Data Repository – FigShare; Dryad; Dataverse
• Institutional repository - USpace
How to choose a data repository
• Requirements of funding agency/journal
• Subject or discipline options
• Size of dataset
• File formats accepted
• Accessibility of data
• Budget
• Time
Recommended Repositories
• Re3data - index of data repositories at http://www.re3data.org/browse
• PLOS’s guide - http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories
• Princeton’s guide - http://libguides.princeton.edu/c.php?g=84261&p=541339
• Scientific Data’s guide - http://www.nature.com/sdata/data-policies/repositories
Rules for Sharing Your Data
• Publish your data online with a persistent identifier (DOI or ARK)
• Publish your data in a reputable data repository
• Convert your data to stable, non-proprietary formats for long-term access
• Publish enough context to make your data understandable (metadata, code, workflows)
• Link your data to your publications as often as possible
Rules for Sharing Your Data (cont.)
• State how you want to get credit for your data
• Always cite the sources of data that you use and include data citations with your datasets
• Include datasets in your NSF Biosketch or Faculty Profile
Content from “Ten Simple Rules for the Care and Feeding of Scientific Data” http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003542
Office of the VP for Research
Research Integrity and Compliance
• Institutional Review Board (IRB)
• Conflict of Interest
• Research Education Training
• Human Subjects
• Animal Subjects
• Lab Safety
Marriott & Eccles Libraries
Daureen Nesdill, Research Data Management Librarian, Sciences
Darell Schmick, Research Librarian, Health Sciences
Rebekah Cummings, Research Data Management Librarian, Social Sciences & Humanities
Marriott Library - Software
• Quantitative and Qualitative Analysis Software – Student Computing Services
  ◦ SPSS
  ◦ Stata
  ◦ NVivo – qualitative data analysis
  ◦ ATLAS.ti – qualitative data analysis
  ◦ MATLAB
  ◦ R
  ◦ SAS
Marriott Library – Subject Guides
Data subject guides: http://campusguides.lib.utah.edu/dataanddatasets
Marriott Library – Digital Humanities
• Explore tools with you and help connect you with other digital humanists
• Find available digitized source material
• Secure data for text and data mining
• Bookworm – word frequency; visualize trends in historical texts (built off Google Ngram Viewer)
• MALLET – topic modeling
• APIs for programmatic access to large corpora
  ◦ HathiTrust Research Center
  ◦ JSTOR Data for Research
  ◦ Getty Research Institute
Writing a DMP
• DMPTool – https://dmp.cdlib.org/
• ICPSR website - https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/index.html
• Once again… call a data librarian!
Data Creation/Collection
• REDCap – a browser-based tool that allows investigators to create and administer surveys. Data is stored on HIPAA/FERPA compliant servers.
• LabArchives – electronic lab notebooks being implemented on campus in research labs and classes. [email protected]
• Create audio/video recordings – Faculty Center; [email protected] and [email protected]
Data ownership and commercialization
Technology & Venture Commercialization Office
http://www.tvc.utah.edu/
Dave Morrison, Patent Librarian
Marriott Library, Room 2110K
801-585-6802
[email protected]
Data Visualization
• SCI Institute – Scientific Computing and Imaging Institute https://www.sci.utah.edu/
• GIS assistance at Marriott Library – [email protected]
  ◦ Creation of interactive mapping projects
  ◦ Locating and creating geospatial data
  ◦ How to work with various GIS platforms (ArcGIS, Google Earth, etc.)
Data Storage & Archiving
• Ubox – HIPAA/FERPA compliant; easy to create account
  ◦ 50 GB free - http://box.utah.edu/
• Uspace – Institutional repository
  ◦ 40 GB per data submission - http://uspace.utah.edu/
• Center for High Performance Computing
  ◦ HIPAA/FERPA compliant ($210/TB for 5 years, more for quarterly backups) - https://www.chpc.utah.edu/
• ICPSR – U of U is an institutional member – [email protected]
Online Data Management Training
• ICPSR - https://www.icpsr.umich.edu/icpsrweb/landing.jsp
• UK Data Archive - http://www.data-archive.ac.uk/help/user-faq#3
• MANTRA Data Management Training - http://datalib.edina.ac.uk/mantra/
• RDM Rose - http://rdmrose.group.shef.ac.uk/
• Data Q - http://researchdataq.org/
Major takeaways
• Data management starts at the beginning of a project
• Document your data with a certain level of reuse in mind
• Consider archiving and sharing options when you are done with your project
• Don’t overlook campus resources!
Thank you! Questions? [email protected]
@RebekahCummings
(801) 581-7701
Marriott Library, 1705Y
…or ask now!!