Responsible conduct of research: Data Management

Responsible Conduct of Research: Managing Data Tobin Magle Data Management Specialist Nicole Kaplan Information Manager Daniel Draper Digital Repositories Unit Coordinator

Transcript of Responsible conduct of research: Data Management

Page 1: Responsible conduct of research: Data Management

Responsible Conduct of Research: Managing Data

Tobin Magle, Data Management Specialist

Nicole Kaplan, Information Manager

Daniel Draper, Digital Repositories Unit Coordinator

Page 2: Responsible conduct of research: Data Management

Responsible Conduct of Research: The data management firehose!

C. Tobin Magle, PhD

Please ask me for help with data management!

Page 3: Responsible conduct of research: Data Management

My Background: molecular microbiology

(1) Magle CT et al. Infect Immun. 2014 Feb;82(2):618-25. doi: 10.1128/IAI.00444-13. Epub 2013 Nov 25.

(2) Sun W, Tanaka TQ, Magle CT, et al. Sci Rep. 2014 Jan 17;4:3743. doi: 10.1038/srep03743.

Page 4: Responsible conduct of research: Data Management

Data Workshops

Page 5: Responsible conduct of research: Data Management

Individual help for ANY data topic

How do I write a DMP?

How do I organize my data?

How do I clean and format my data?

How do I use R?

How do I get my data ready to share?

How do I comply with funder mandates?

What DM tools are there for collaboration?

Page 6: Responsible conduct of research: Data Management

Data Management Services

https://lib.colostate.edu/services/data-management

Page 7: Responsible conduct of research: Data Management

What is data management?

The policies, practices and procedures needed to manage the storage, access and preservation of data produced from a research project

Page 8: Responsible conduct of research: Data Management

data management != data sharing

• but the same principles apply to both

Page 9: Responsible conduct of research: Data Management

Everything* is digital

• Data are ephemeral

• Big, complex data

• CAN share

*ok not everything, but most things

-> Need new skills

Page 10: Responsible conduct of research: Data Management

More researchers

https://www.nsf.gov/statistics/2016/nsf16300/digest/nsf16300.pdf

Page 11: Responsible conduct of research: Data Management

See arXiv:1402.4578 for details

Page 12: Responsible conduct of research: Data Management

[Figure panels: Working email; Response (if email working); Status of data (if response); Data are extant (if status known)]

doi:10.1016/j.cub.2013.11.014

Page 13: Responsible conduct of research: Data Management

We are losing vast amounts of data


Who is responsible?

Page 14: Responsible conduct of research: Data Management

CSU data policy

General points:

• The university owns, and is therefore ultimately responsible for, research data

• Researchers are the data managers

• The university promotes openness

http://policylibrary.colostate.edu/policy.aspx?id=737

Page 15: Responsible conduct of research: Data Management

You’re a Data Manager

http://www.phdcomics.com/comics/archive.php?comicid=382

Page 16: Responsible conduct of research: Data Management

CSU data policy

Research Data Associated with Theses and Dissertations

To preserve the complete scholarly record of the author, data sets must be incorporated. Therefore, a student depositing their thesis or dissertation is required to make discoverable, accessible and available their associated data sets in accordance with this policy and provisions of the University’s Digital Repository. Access and rights management (embargo period, access limited to specific IP addresses) shall be the same for the associated data sets as it is for the thesis or the dissertation.

http://policylibrary.colostate.edu/policy.aspx?id=737

Page 17: Responsible conduct of research: Data Management

When should data management happen?

Throughout the whole research cycle

Page 18: Responsible conduct of research: Data Management

The research cycle

[Diagram labels: Hypothesis]

Page 19: Responsible conduct of research: Data Management

The research cycle

[Diagram labels: Hypothesis, Experimental design]

Page 20: Responsible conduct of research: Data Management

The research cycle

[Diagram labels: Hypothesis, Experimental design, Data]

Page 21: Responsible conduct of research: Data Management

The research cycle

[Diagram labels: Hypothesis, Experimental design, Data, Results]

Page 22: Responsible conduct of research: Data Management

The research cycle

[Diagram labels: Hypothesis, Experimental design, Data, Results, Article]

Page 23: Responsible conduct of research: Data Management

The research cycle

[Diagram labels: Hypothesis, Experimental design, Data, Results, Article]

Page 24: Responsible conduct of research: Data Management

The research cycle

[Diagram labels: Hypothesis, Experimental design, Data, Results, Article; Data Management Plans]

Page 25: Responsible conduct of research: Data Management

The research cycle

[Diagram labels: Hypothesis, Experimental design, Raw data, Cleaning, Tidy Data, Analysis, Results, Article; Data Management Plans]

Page 26: Responsible conduct of research: Data Management

The research cycle

[Diagram labels: Hypothesis, Experimental design, Raw data, Cleaning, Tidy Data, Analysis, Results, Article; Data Management Plans; Sharing, Open Data]

Page 27: Responsible conduct of research: Data Management

The research cycle

[Diagram labels: Hypothesis, Experimental design, Raw data, Cleaning, Tidy Data, Analysis, Results, Article; Data Management Plans; Sharing, Open Data; Code, Reproducible Research]

Page 28: Responsible conduct of research: Data Management

The research cycle

[Diagram labels: Hypothesis, Experimental design, Raw data, Cleaning, Tidy Data, Analysis, Results, Article; Data Management Plans; Sharing, Open Data; Code, Reproducible Research; Reuse]

Page 29: Responsible conduct of research: Data Management

The research cycle

[Diagram labels: Hypothesis, Experimental design, Raw data, Cleaning, Tidy Data, Analysis, Results, Article; Data Management Plans; Sharing, Open Data; Code, Reproducible Research; Reuse]

Page 30: Responsible conduct of research: Data Management

What is research data?

• “The recorded factual material commonly accepted in the scientific community as necessary to validate research findings”

- White House Office of Management and Budget

• Reality: Applies to any research product

Page 31: Responsible conduct of research: Data Management

Working data vs. archived data

[Diagram labels: Hypothesis, Experimental design, Raw data, Cleaning, Tidy Data, Analysis, Results, Article; Data Management Plans; Sharing, Open Data; Code, Reproducible Research; Reuse; regions marked Working and Archived]

Page 32: Responsible conduct of research: Data Management

What is a data management plan?

A description of how you plan to describe, preserve and share your research data.

Often required by funding agencies

Page 33: Responsible conduct of research: Data Management

Successful DMPs include

• A data inventory, including type(s) and size

• A strategy for describing the data

• A plan for preserving the data

• A method for access to the data

Always make sure to follow funder requirements

Page 34: Responsible conduct of research: Data Management

Tool: DMPTool

• Review requirements from different agencies

• https://dmptool.org/guidance

• Create new DMPs based on funding agency templates

• Search public DMPs

Page 35: Responsible conduct of research: Data Management

Data inventory

• What type of data are you going to collect?

• What file type will be produced?

• What size will these files be? How many files?

• How will you organize the data?

• What other research outputs will be produced?
  • Code/Software?
  • Templates/protocols?
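For data that already exists on disk, the inventory questions above can be answered approximately with a short script. The following is a minimal Python sketch, not part of the original slides; the function name `inventory` is hypothetical, and it assumes that file extensions are a reasonable proxy for file type.

```python
from collections import defaultdict
from pathlib import Path


def inventory(root):
    """Return {extension: (file_count, total_bytes)} for all files under root."""
    totals = defaultdict(lambda: [0, 0])
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Lower-case the extension so ".CSV" and ".csv" are counted together.
            ext = path.suffix.lower() or "(no extension)"
            totals[ext][0] += 1
            totals[ext][1] += path.stat().st_size
    return {ext: tuple(values) for ext, values in totals.items()}


if __name__ == "__main__":
    # Summarize the current working directory (an arbitrary choice for illustration).
    for ext, (count, size) in sorted(inventory(".").items()):
        print(f"{ext:>16}  {count:6d} files  {size / 1e6:10.2f} MB")
```

The per-extension counts and sizes map directly onto the "what file type" and "what size, how many files" questions.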

Page 36: Responsible conduct of research: Data Management

Data inventory

• What type of data are you going to collect? miRNA sequences

• What file type will be produced? FASTQ files

• What size will these files be? How many files? 1 GB per file x 64 strains x 3 replicates = ~200 GB

• What other research outputs will be produced?
  • Code/Software? R scripts for analysis and visualization
  • Templates/protocols? Data use tutorials
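The size estimate in this example is simple multiplication, sketched here in Python. It assumes one FASTQ file per strain and replicate, which the slide implies but does not state; the variable names are illustrative.

```python
# Figures taken from the slide; one file per strain/replicate is an assumption.
GB_PER_FILE = 1
STRAINS = 64
REPLICATES = 3

n_files = STRAINS * REPLICATES
total_gb = n_files * GB_PER_FILE

print(f"{n_files} files, about {total_gb} GB")  # prints "192 files, about 192 GB"
```

The exact product is 192 GB, which the slide rounds up to roughly 200 GB.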

Page 37: Responsible conduct of research: Data Management

Data formats

• Avoid proprietary formats

• Know what software can read your data

Proprietary Format               | Open Format
Excel (.xls, .xlsx)              | Comma Separated Values (.csv)
Word (.doc, .docx)               | plain text (.txt)
PowerPoint (.ppt, .pptx)         | PDF/A (.pdf)
Photoshop (.psd)                 | TIFF (.tif, .tiff)
Quicktime (.mov)                 | MPEG-4 (.mp4)
MPEG 4 Protected audio (.m4p)    | MP3 (.mp3)
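The table above can be turned into a small lookup that flags proprietary files in a project. This is a hedged Python sketch: the mapping restates the table, while the function name `suggest_open_format` is hypothetical. It only suggests a target extension and does not convert anything.

```python
from pathlib import Path

# Proprietary extension -> open alternative, restating the table above.
OPEN_ALTERNATIVES = {
    ".xls": ".csv", ".xlsx": ".csv",
    ".doc": ".txt", ".docx": ".txt",
    ".ppt": ".pdf", ".pptx": ".pdf",
    ".psd": ".tif",
    ".mov": ".mp4",
    ".m4p": ".mp3",
}


def suggest_open_format(filename):
    """Return the suggested open extension, or None if no suggestion applies."""
    ext = Path(filename).suffix.lower()
    return OPEN_ALTERNATIVES.get(ext)


print(suggest_open_format("counts.XLSX"))  # prints ".csv"
print(suggest_open_format("notes.txt"))    # prints "None"
```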

Page 38: Responsible conduct of research: Data Management

Q’s: Data Inventory

What kind of data are you going to collect?

What file type will be produced?

What size will these files be? How many files?

What other research outputs will be produced?

Page 39: Responsible conduct of research: Data Management

Folder systems

• Identify ways to divide your data into categories (Attributes)

• Top level organization is the most important attribute

• Provide documentation

Page 40: Responsible conduct of research: Data Management

Hierarchical Organization

my_thesis
  chapter1, chapter2, chapter3, chapter4
    raw_data
      replicate1
      replicate2
    processed_data
    code
      Processing
      cleaning
    results
      tables
      figures
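A hierarchy like this one can be created in one step so that every chapter starts with the same layout. The sketch below is in Python and rests on assumptions: the extracted slide does not show which chapter the subfolders belong to, so it gives every chapter the same set, and it lower-cases `processing` for consistency. The function name `make_skeleton` is hypothetical.

```python
from pathlib import Path

# Subfolders taken from the slide; their nesting is an assumed reading of the diagram.
SUBFOLDERS = [
    "raw_data/replicate1",
    "raw_data/replicate2",
    "processed_data",
    "code/processing",
    "code/cleaning",
    "results/tables",
    "results/figures",
]


def make_skeleton(root, chapters=("chapter1", "chapter2", "chapter3", "chapter4")):
    """Create the same folder layout under each chapter and return the folders."""
    root = Path(root)
    created = []
    for chapter in chapters:
        for sub in SUBFOLDERS:
            folder = root / chapter / sub
            # exist_ok=True makes the function safe to re-run on an existing project.
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created
```

Calling `make_skeleton("my_thesis")` would create 28 folders (7 per chapter) under `my_thesis` in the current directory.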

Page 41: Responsible conduct of research: Data Management

Q’s: Data Organization

• What kinds of files are there? (See data inventory)

• How could you group them?
  • Project?
  • Time?
  • Location?
  • File type?

• What are the most important attributes?

Page 42: Responsible conduct of research: Data Management

Tool: Open Science Framework

• Components

• Add-ons

• Contributors

• Wiki

http://help.osf.io/m/collaborating/l/524109-using-the-wiki

http://www.slideshare.net/DuraSpace/121014-slides-roadmap-to-the-future-of-share

Page 43: Responsible conduct of research: Data Management

Organization rules

• Be consistent

• One directory per project

• Separate subdirectories for
  • Raw data
  • Processed data
  • Code (processing and analysis)
  • Output

• Make raw data read-only

• Make README files

http://help.osf.io/m/60347/l/611391-organizing-files
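Two of these rules, read-only raw data and README files, can be applied with the Python standard library. This is a minimal sketch rather than a prescribed procedure; the function names and the `README.txt` filename are assumptions. Clearing permission bits behaves as described on Unix-like systems, while Windows honors only a single read-only flag.

```python
import stat
from pathlib import Path


def protect_raw_data(raw_dir):
    """Remove write permission from every file under raw_dir."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    protected = []
    for path in Path(raw_dir).rglob("*"):
        if path.is_file():
            # Keep all existing permission bits except the write bits.
            path.chmod(path.stat().st_mode & ~write_bits)
            protected.append(path)
    return protected


def write_readme(project_dir, text):
    """Create README.txt in project_dir unless one already exists."""
    readme = Path(project_dir) / "README.txt"
    if not readme.exists():
        readme.write_text(text, encoding="utf-8")
    return readme
```

Running `protect_raw_data` right after data collection reduces the risk of overwriting originals during cleaning, since processed copies then have to be written elsewhere.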

Page 44: Responsible conduct of research: Data Management

Example: Temperature data

Page 45: Responsible conduct of research: Data Management

A strategy for describing the data

• Metadata: Relevant information for re-creation and re-use

  • Contact info
  • How data was collected
  • Details about collection
  • Date, location of collection
  • Units

• Can be as simple as a text file
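As an illustration of the "simple text file" option, the Python sketch below writes one `key: value` line per metadata field. The function name `write_metadata` and every value in `example` are hypothetical placeholders, not taken from the slides.

```python
from pathlib import Path


def write_metadata(path, fields):
    """Write one 'key: value' line per field and return the lines written."""
    lines = [f"{key}: {value}" for key, value in fields.items()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Hypothetical values for a temperature data set; replace with real details.
example = {
    "Contact": "Jane Researcher <jane@example.org>",
    "Collection method": "Hourly readings from a data logger",
    "Date collected": "2016-06-01 to 2016-08-31",
    "Location": "Fort Collins, CO",
    "Units": "degrees Celsius",
}
```

Calling `write_metadata("README_metadata.txt", example)` would place the description next to the data, where a future reader is most likely to find it.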

Page 46: Responsible conduct of research: Data Management

Metadata standards

• Dublin Core: http://dublincore.org/documents/dcmi-terms/
  • Can be applied to anything

• Many discipline specific metadata standards
  • EML: https://knb.ecoinformatics.org/#external//emlparser/docs/index.html
  • MIAME: http://fged.org/projects/miame/

• Search for other standards:
  • http://www.dcc.ac.uk/resources/metadata-standards
  • https://biosharing.org/standards/
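To show what a Dublin Core description might look like in practice, here is a minimal Python sketch that serializes a few elements as XML. The wrapper element `metadata` and the function name are assumptions; repositories usually define their own record wrapper. The namespace is the standard Dublin Core elements 1.1 namespace.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields):
    """Serialize {element_name: value} as a simple Dublin Core XML string."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # assumed wrapper element, not part of Dublin Core
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")
```

For example, `dublin_core_record({"title": "Temperature data", "creator": "A. Researcher"})` yields a record containing `<dc:title>` and `<dc:creator>` elements. Names such as `title`, `creator`, `date`, `format` and `identifier` are existing Dublin Core elements.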

Page 47: Responsible conduct of research: Data Management

Genomics example (NCBI template)

Page 48: Responsible conduct of research: Data Management

Q’s: Describe your data

What do people need to know to reuse your data?

Are there any discipline-specific metadata standards?

What format will you describe your data in (text, XML, tabular)?

What fields will you include (author, date, format, identifier)?

Page 49: Responsible conduct of research: Data Management

A plan for preserving the data

• Where will it be stored?
  - Backups

• Necessary metadata and other products

• Who is responsible?

• How long?

Page 50: Responsible conduct of research: Data Management

Ellin, A. Rutgers Student Offers $1,000 for Data on Stolen Laptop. abcNEWS via Good Morning America. April 26, 2013. http://abcnews.go.com/blogs/business/2013/04/rutgers-student-offers-1000-for-data-on-stolen-laptop/

Backup

Page 51: Responsible conduct of research: Data Management

Backup recommendations

• Store in geographically distinct locations

• How often?

• Automation: Will you remember to do it manually?

• Security: Are you working with PHI?
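The automation point can be illustrated with a small Python script that copies a file and verifies the copy by checksum. This is a sketch under assumptions: `backup_dir` stands in for a mounted remote or off-site location, scheduling (for example via cron) is left out, and the function names are hypothetical. It does not address security requirements for PHI.

```python
import hashlib
import shutil
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, backup_dir):
    """Copy source into backup_dir and verify the copy by checksum."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256sum(source) != sha256sum(target):
        raise IOError(f"Checksum mismatch for {target}")
    return target
```

Verifying the checksum catches silent copy errors, which matters because a backup is only useful if it can actually be restored.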

Page 52: Responsible conduct of research: Data Management

Q’s: Preservation plan

What will you store?

Who will be responsible for the data (person or position)?

How long will you store it?

Where will you store it?

How will you back it up?

*Differentiate between working and archived data

Page 53: Responsible conduct of research: Data Management

A method to access the data

• Important to funding agencies
  • Reproduce existing research
  • Promote further research

• Must be easily available:
  • No “by request only”
  • Embargoes are “ok”

• Data security: consider privacy and IP issues before sharing

Page 54: Responsible conduct of research: Data Management

Data access and sharing best practices

• Non-proprietary formats

• Include metadata

• As open as possible

• Follow CSU research data policy

Page 55: Responsible conduct of research: Data Management

Trusted Repositories: store and share

• Discipline specific
  • Search: http://service.re3data.org/browse/by-subject/

• Generic
  • Figshare - https://figshare.com/
  • Dryad - http://datadryad.org/

• CSU Digital Repository
  • http://lib.colostate.edu/digital-collections/

http://67.media.tumblr.com/6228cbe58a9652f1a85e8ab1ed08d715/tumblr_inline_n6oukhNlZW1qf11bs.png

Page 56: Responsible conduct of research: Data Management

Tool: CSU digital repository

• Over 100 Datasets

• Satisfy requirements for manuscripts and grants

• At no cost <1 TB

• $150/TB for 5 years

• $300/TB for >5 years
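For budgeting in a DMP, the pricing above can be expressed as a small function. The Python sketch below uses one possible reading of the slide, and its assumptions should be checked with the repository staff: data sets under 1 TB are free, larger data sets are charged for their full size, and a retention period of exactly 5 years falls under the $150/TB rate. The function name `storage_cost` is hypothetical.

```python
def storage_cost(size_tb, years):
    """Estimate the deposit cost in US dollars under the assumptions stated above."""
    if size_tb < 0 or years <= 0:
        raise ValueError("size_tb must be non-negative and years must be positive")
    if size_tb < 1:
        return 0.0  # assumed: no charge below 1 TB
    rate_per_tb = 150.0 if years <= 5 else 300.0
    return size_tb * rate_per_tb  # assumed: the rate applies to the full size


print(storage_cost(0.2, 10))  # prints "0.0"
print(storage_cost(2, 5))     # prints "300.0"
print(storage_cost(2, 10))    # prints "600.0"
```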

Page 57: Responsible conduct of research: Data Management

Theses and Dissertation Data

1. Submit to ProQuest with thesis or dissertation
  • Supplemental data file
  • Only discoverable through thesis or dissertation

2. Submit to CSU Library separately
  • Requires distinctive descriptive metadata
  • Linked with thesis or dissertation
  • Data discoverable globally

Page 58: Responsible conduct of research: Data Management

Q’s: Access methods

Where will people be able to access the data?

Does your discipline have a repository?

Are you complying with CSU’s data policy?

How will you format the data for CSU digital repository?

Page 59: Responsible conduct of research: Data Management

Need help?

• General: [email protected]

• Direct: [email protected]

• DMPTool: http://dmptool.org/

• Data Management Services website: http://lib.colostate.edu/services/data-management