Responsible conduct of research: Data Management
Responsible Conduct of Research: Managing Data
Tobin Magle, Data Management Specialist
Nicole Kaplan, Information Manager
Daniel Draper, Digital Repositories Unit Coordinator
Responsible Conduct of Research: The data management firehose!
C. Tobin Magle, PhD
Please ask me for help with data management!
My Background: molecular microbiology
(1) Magle CT et al. Infect Immun. 2014 Feb;82(2):618-25. doi: 10.1128/IAI.00444-13. Epub 2013 Nov 25.
(2) Sun W, Tanaka TQ, Magle CT, et al. Sci Rep. 2014 Jan 17;4:3743. doi: 10.1038/srep03743.
Data Workshops
Individual help for ANY data topic
How do I write a DMP?
How do I organize my data?
How do I clean and format my data?
How do I use R?
How do I get my data ready to share?
How do I comply with funder mandates?
What DM tools are there for collaboration?
Data Management Services
https://lib.colostate.edu/services/data-management
What is data management?
The policies, practices, and procedures needed to manage the storage, access, and preservation of data produced from a research project
data management != data sharing
• but the same principles apply to both
• Everything* is digital
• Data are ephemeral
• Big, complex data
• CAN share
*ok not everything, but most things
-> Need new skills
More researchers
https://www.nsf.gov/statistics/2016/nsf16300/digest/nsf16300.pdf
[Figure from doi:10.1016/j.cub.2013.11.014. Panels: Working email; Response (if email working); Status of data (if response); Data are extant (if status known)]
We are losing vast amounts of data
Who is responsible?
CSU data policy
General points:
• The university owns, and is therefore ultimately responsible for, research data
• Researchers are the data managers
• The university promotes openness
http://policylibrary.colostate.edu/policy.aspx?id=737
You’re a Data Manager
http://www.phdcomics.com/comics/archive.php?comicid=382
CSU data policy
Research Data Associated with Theses and Dissertations
To preserve the complete scholarly record of the author, data sets must be incorporated. Therefore, a student depositing their thesis or dissertation is required to make discoverable, accessible and available their associated data sets in accordance with this policy and provisions of the University’s Digital Repository. Access and rights management (embargo period, access limited to specific IP addresses) shall be the same for the associated data sets as it is for the thesis or the dissertation.
http://policylibrary.colostate.edu/policy.aspx?id=737
When should data management happen?
Throughout the whole research cycle
The research cycle
HypothesisRaw data
Experimental design
Tidy Data
ResultsArticle
Data Management Plans
Cleaning
Sharing
Analysis
Open Data
Code Reproducible Research
Reuse
The research cycle
What is research data?
• “The recorded factual material commonly accepted in the scientific community as necessary to validate research findings”
- White House Office of Management and Budget
• Reality: Applies to any research product
Working data vs. archived data
• Working
• Archived
What is a data management plan?
A description of how you plan to describe, preserve and share your research data.
Often required by funding agencies
Successful DMPs include
• A data inventory, including type(s) and size
• A strategy for describing the data
• A plan for preserving the data
• A method for access to the data
Always make sure to follow funder requirements
Tool: DMPTool
• Review requirements from different agencies
• https://dmptool.org/guidance
• Create new DMPs based on funding agency templates
• Search public DMPs
Data inventory
• What type of data are you going to collect?
• What file type will be produced?
• What size will these files be? How many files?
• How will you organize the data?
• What other research outputs will be produced?
  • Code/Software?
  • Templates/protocols?
Data inventory
• What type of data are you going to collect? miRNA sequences
• What file type will be produced? FASTQ files
• What size will these files be? How many files? 1 GB per file × 64 strains × 3 replicates = ~200 GB
• What other research outputs will be produced?
  • Code/Software? R scripts for analysis and visualization
  • Templates/protocols? Data use tutorials
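The size estimate above is simple multiplication, and it can be worth scripting so the numbers are easy to update when the design changes. A minimal sketch, assuming one file per strain per replicate; the function and variable names are illustrative, not part of the original slides:

```python
# Values taken from the slide; names are illustrative assumptions.
GB_PER_FILE = 1
STRAINS = 64
REPLICATES = 3


def estimate_storage_gb(gb_per_file, n_strains, n_replicates):
    """Return (number of files, total size in GB).

    Assumes exactly one file per strain per replicate.
    """
    n_files = n_strains * n_replicates
    return n_files, n_files * gb_per_file


n_files, total_gb = estimate_storage_gb(GB_PER_FILE, STRAINS, REPLICATES)
print(f"{n_files} files, about {total_gb} GB")  # 192 files, about 192 GB
```

The exact product is 192 GB, which the slide rounds to ~200 GB. Rounding up is a reasonable habit when requesting storage.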
Data formats
• Avoid proprietary formats
• Know what software can read your data

Proprietary format | Open format
Excel (.xls, .xlsx) | Comma-separated values (.csv)
Word (.doc, .docx) | Plain text (.txt)
PowerPoint (.ppt, .pptx) | PDF/A (.pdf)
Photoshop (.psd) | TIFF (.tif, .tiff)
QuickTime (.mov) | MPEG-4 (.mp4)
MPEG-4 protected audio (.m4p) | MP3 (.mp3)
Q’s: Data Inventory
What kind of data are you going to collect?
What file type will be produced?
What size will these files be? How many files?
What other research outputs will be produced?
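For data that already exist on disk, the file type, size, and count questions can be answered by scanning a project folder. The sketch below is one possible approach using only the Python standard library; the function name `inventory` and the output layout are assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import Path


def inventory(root):
    """Count files and total bytes per file extension under root."""
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Group by lower-cased extension so .FASTQ and .fastq count together.
            ext = path.suffix.lower() or "(no extension)"
            summary[ext]["files"] += 1
            summary[ext]["bytes"] += path.stat().st_size
    return dict(summary)


def print_inventory(root):
    """Print one line per file extension, sorted alphabetically."""
    for ext, info in sorted(inventory(root).items()):
        print(f"{ext}: {info['files']} files, {info['bytes']} bytes")
```

Calling `print_inventory("my_thesis")` on a hypothetical project folder would list each extension with its file count and total size, which can be pasted into the data inventory section of a DMP.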
Folder systems
• Identify ways to divide your data into categories (Attributes)
• Top level organization is the most important attribute
• Provide documentation
Hierarchical Organization
my_thesis/
  chapter1/
    raw_data/
      replicate1/
      replicate2/
    processed_data/
    code/
      processing/
      cleaning/
    results/
      tables/
      figures/
  chapter2/
  chapter3/
  chapter4/
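A folder hierarchy like the one above is easier to keep consistent if it is created by a script rather than by hand. This is a minimal sketch using the folder names from the slide. It assumes that every chapter gets the same subfolders and that names are lower case, neither of which the slide states.

```python
from pathlib import Path

# Subfolder names come from the slide; applying them to every chapter is an assumption.
SUBDIRS = [
    "raw_data/replicate1",
    "raw_data/replicate2",
    "processed_data",
    "code/processing",
    "code/cleaning",
    "results/tables",
    "results/figures",
]
CHAPTERS = ("chapter1", "chapter2", "chapter3", "chapter4")


def make_project(root, chapters=CHAPTERS, subdirs=SUBDIRS):
    """Create the chapter/subfolder hierarchy under root and return root as a Path."""
    root = Path(root)
    for chapter in chapters:
        for sub in subdirs:
            # exist_ok=True makes the script safe to re-run on an existing project.
            (root / chapter / sub).mkdir(parents=True, exist_ok=True)
    return root
```

Running `make_project("my_thesis")` creates the tree in the current directory. Because existing folders are left alone, the same call can be repeated after adding a chapter to the list.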
Q’s: Data Organization
• What kinds of files are there? (See data inventory)
• How could you group them?
  • Project?
  • Time?
  • Location?
  • File type?
• What are the most important attributes?
Tool: Open Science Framework
• Components
• Add-ons
• Contributors
• Wiki
http://help.osf.io/m/collaborating/l/524109-using-the-wiki
http://www.slideshare.net/DuraSpace/121014-slides-roadmap-to-the-future-of-share
Organization rules
• Be consistent
• One directory per project
• Separate subdirectories for
  • Raw data
  • Processed data
  • Code (processing and analysis)
  • Output
• Make raw data read-only
• Make README files
http://help.osf.io/m/60347/l/611391-organizing-files
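One way to apply the "make raw data read-only" rule is to remove write permission from every file in the raw data folder once collection is finished. The sketch below uses the standard library only; the function name is illustrative. On Windows, `chmod` only toggles the read-only flag, so behavior differs slightly from Linux and macOS.

```python
import stat
from pathlib import Path

# Permission bits that allow writing by owner, group, or others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(raw_dir):
    """Remove write permission from every file under raw_dir.

    Returns the list of files whose permissions were updated.
    """
    changed = []
    for path in sorted(Path(raw_dir).rglob("*")):
        if path.is_file():
            mode = path.stat().st_mode
            path.chmod(mode & ~WRITE_BITS)
            changed.append(path)
    return changed
```

This protects against accidental edits, not deliberate ones: the file owner can always restore write permission. It complements backups rather than replacing them.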
Example: Temperature data
A strategy for describing the data
• Metadata: Relevant information for re-creation and re-use
  • Contact info
  • How data was collected
  • Details about collection
  • Date, location of collection
  • Units
• Can be as simple as a text file
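As a sketch of how simple a metadata text file can be, the function below writes one "field: value" line per item. The field names follow the bullets above. Every value shown in the usage note is a placeholder, and the function name and file name are assumptions.

```python
from pathlib import Path


def write_metadata(path, fields):
    """Write metadata as plain 'key: value' lines, one field per line."""
    lines = [f"{key}: {value}" for key, value in fields.items()]
    path = Path(path)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

For example, `write_metadata("README.txt", {"Contact": "name and email here", "Collection method": "describe instrument or protocol", "Date": "YYYY-MM-DD", "Location": "site name", "Units": "e.g. degrees Celsius"})` produces a five-line text file that any software can read.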
Metadata standards
• Dublin Core: http://dublincore.org/documents/dcmi-terms/
  • Can be applied to anything
• Many discipline-specific metadata standards
  • EML: https://knb.ecoinformatics.org/#external//emlparser/docs/index.html
  • MIAME: http://fged.org/projects/miame/
• Search for other standards:
  • http://www.dcc.ac.uk/resources/metadata-standards
  • https://biosharing.org/standards/
Genomics example (NCBI template)
Q’s: Describe your data
What do people need to know to reuse your data?
Are there any discipline-specific metadata standards?
What format will you describe your data in (text, XML, tabular)?
What fields will you include (author, date, format, identifier)?
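If XML is the chosen format, the fields listed above map onto Dublin Core terms (`creator`, `date`, `format`, `identifier`). The sketch below builds a small record with Python's standard `xml.etree.ElementTree`. The wrapper element name `metadata` and the function name are assumptions for illustration; a repository may require a different wrapper or schema.

```python
import xml.etree.ElementTree as ET

# Namespace for DCMI Metadata Terms.
DCTERMS = "http://purl.org/dc/terms/"


def dublin_core_record(fields):
    """Return an XML string with one dcterms element per field.

    `fields` maps Dublin Core term names (e.g. "creator") to text values.
    The <metadata> wrapper element is a hypothetical choice.
    """
    ET.register_namespace("dcterms", DCTERMS)
    root = ET.Element("metadata")
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DCTERMS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")
```

Calling `dublin_core_record({"title": "placeholder title", "creator": "placeholder name", "date": "YYYY-MM-DD", "format": "text/csv", "identifier": "placeholder identifier"})` returns a single XML string that can be saved alongside the data.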
A plan for preserving the data
• Where will it be stored?
  • Backups
• Necessary metadata and other products
• Who is responsible?
• How long?
Ellin, A. Rutgers Student Offers $1,000 for Data on Stolen Laptop. ABC News via Good Morning America. April 26, 2013. http://abcnews.go.com/blogs/business/2013/04/rutgers-student-offers-1000-for-data-on-stolen-laptop/
Backup
Backup recommendations
• Store in geographically distinct locations
• How often?
• Automation: Will you remember to do it manually?
• Security: Are you working with PHI?
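The automation point can be addressed with a small script run on a schedule. The sketch below copies a folder to a timestamped backup and verifies each copy with a SHA-256 checksum. Function names and the folder naming scheme are assumptions. It does not handle encryption or access control, so it is not sufficient on its own for PHI, and the backup location should be a separate physical site to meet the geographic recommendation.

```python
import hashlib
import shutil
from datetime import datetime
from pathlib import Path


def sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source_dir, backup_root):
    """Copy source_dir into a new timestamped folder under backup_root.

    Raises OSError if any copied file does not match its original.
    backup_root must not be inside source_dir.
    """
    source_dir = Path(source_dir)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source_dir.name}-{stamp}"
    shutil.copytree(source_dir, target)
    for original in source_dir.rglob("*"):
        if original.is_file():
            copy = target / original.relative_to(source_dir)
            if sha256(original) != sha256(copy):
                raise OSError(f"Checksum mismatch for {copy}")
    return target
```

A scheduler such as cron or Windows Task Scheduler can run a script that calls `back_up` nightly, which removes the need to remember to do it manually.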
Q’s: Preservation plan
What will you store?
Who will be responsible for the data (person or position)?
How long will you store it?
Where will you store it?
How will you back it up?
*Differentiate between working and archived data
A method to access the data
• Important to funding agencies
  • Reproduce existing research
  • Promote further research
• Must be easily available:
  • No “by request only”
  • Embargoes are “ok”
• Data security: consider privacy and IP issues before sharing
Data access and sharing best practices
• Non-proprietary formats
• Include metadata
• As open as possible
• Follow CSU research data policy
Trusted Repositories: store and share
• Discipline specific
  • Search: http://service.re3data.org/browse/by-subject/
• Generic
  • Figshare - https://figshare.com/
  • Dryad - http://datadryad.org/
• CSU Digital Repository
  • http://lib.colostate.edu/digital-collections/
http://67.media.tumblr.com/6228cbe58a9652f1a85e8ab1ed08d715/tumblr_inline_n6oukhNlZW1qf11bs.png
Tool: CSU digital repository
• Over 100 Datasets
• Satisfy requirements for manuscripts and grants
• At no cost <1 TB
• $150/TB for 5 years
• $300/TB for >5 years
Theses and Dissertation Data
1. Submit to ProQuest with thesis or dissertation
  • Supplemental data file
  • Only discoverable through thesis or dissertation
2. Submit to CSU Library separately
  • Requires distinctive descriptive metadata
  • Linked with thesis or dissertation
  • Data discoverable globally
Q’s: Access methods
Where will people be able to access the data?
Does your discipline have a repository?
Are you complying with CSU’s data policy?
How will you format the data for CSU digital repository?
Need help?
• General: [email protected]
• Direct: [email protected]
• DMPTool: http://dmptool.org/
• Data Management Services website: http://lib.colostate.edu/services/data-management