Post on 01-Nov-2014
Data Management for Research
Aaron Collie, MSU Libraries
Lisa Schmidt, University Archives
Introductions
• Please tell us your name and department
• A brief description of your primary research area
• What do you consider to be your research data?
• Experience and/or comfort level with managing research data?
cc http://www.flickr.com/photos/quinnanya/
http://tiny.cc/msudataseminar
Data Management. Isn’t that… trivial?
Not so much. Data is a primary output of research; it is very expensive to produce high quality data. Data may be collected in nanoseconds, but it takes the expert application of research protocol and design to generate data.
CC-BY-SA-3.0 Rob Lavinsky
Even more consequential, data is the input of a process that generates higher orders of understanding.
Wisdom
Knowledge
Information
Data
Understanding is hierarchical!
Russell Ackoff
This is the engine of the academic industry…
[Cycle diagram: Define a question → Gather information → Form a hypothesis → Test the hypothesis → Analyze the data → Interpret the data → Publish results → Retest]
So, things can get a little messy.
The scientific method “is often misrepresented as a fixed sequence of steps,” rather than being seen for what it truly is, “a highly variable and creative process” (AAAS 2000:18).
Gauch, Hugh G. Scientific Method in Practice. New York: Cambridge University Press, 2010. Print. (Emphasis added)
The Research Depth Chart
(More Generic)
Scientific Method
Research Design
Research Method
Research Tasks
(More Specific)
Problem Identification
Study Concept
Literature Review
Environmental Scan
Funding & Proposal
Research Design
Research Methodology
Research Workflow
Hypothesis Formation
Design Validation
Research Activity
Data Management
Data Organization
Data Storage
Data Description
Data Sharing
Scholarly Communication
Report Findings
Publish
Peer Review
Upfront Decisions for Researchers
• How are the data described and organized?
• Who are the expected and potential audiences for the datasets?
• What publications or discoveries have resulted from the datasets?
• How should the data be made accessible?
• How might the data be used, reused, and repurposed?
Upfront Decisions for Researchers
• What is the expected lifespan of the data?
• Besides the researcher(s) on the project, who else should be given access to the data?
• Does the dataset include any sensitive information?
• Who owns or controls the research data?
• Should any restrictions be placed on the dataset?
• How are the data stored and preserved?
Agenda
• Introduction
• Background
  • The Impetus: NSF Data Management Plan Mandate
  • The Effect: Policy to Practice
  • The Response: Changing Data Landscape
• Fundamental Practices
  • File Organization
  • Data Documentation
  • Reliable Backup
  • Data Publishing, Sharing, & Reuse
  • Protecting Data & Responsible Reuse
• Data Lifecycle Resources
But why are we really here?
Impetus: NSF has mandated that all grant applications submitted after January 18th, 2011 must include a supplemental “Data Management Plan.”
Effect: The original NSF mandate has had a domino effect, and many funders now require or state guidelines for data management of grant-funded research.
Response: Data management has not traditionally received a full treatment in (many) graduate and doctoral curricula; intervention is necessary.
Impetus: NSF Data Management Plan
• Policies for re-use, re-distribution, and creation of derivatives
• Plans for archiving data, samples, and other research outcomes, maintaining access
• Types of data, samples, physical collections, software generated
• Standards for data and metadata format and content
• Access and sharing policies, with stipulations for privacy, confidentiality, security, intellectual property, or other rights or requirements
Impetus: NSF Data Management Plan
• NSF will not evaluate any proposal missing a DMP
• PI may state that project will not generate data
• DMP is reviewed as part of intellectual merit or broader impacts of application, or both
• Costs to implement DMP may be included in proposal’s budget
• May be up to two pages long
Effect: Funder Policies
NASA “promotes the full and open sharing of all data”
“requires that data…be submitted to and archived by designated national data centers.”
“expects the timely release and sharing of final research data”
“IMLS encourages sharing of research data.”
“…should describe how the project team will manage and disseminate data generated by the project”
Effect: More is on the way
Presidential Memorandum on Managing Government Records (August 24, 2012)
• Managing Government Records Directive: All permanent electronic records in Federal agencies will be managed electronically to the fullest extent possible for eventual transfer and accessioning by NARA in an electronic format.
White House policy memo (February 22, 2013)
• Increasing Access to the Results of Federally Funded Scientific Research: Federal agencies with more than $100M in R&D expenditures must develop plans to make the published results of federally funded research freely available to the public within one year of publication.
Effect: Local Policy
University Research Council Best Practices
Research Data: Management, Control, and Access
• To assure that research data are appropriately recorded, archived for a reasonable period of time, and available for review under the appropriate circumstances.
  – Ownership = MSU
  – “Stewardship” = You
  – Period of Retention = 3 years
  – Transfer of Responsibility = Written Request
Response: Changing Data Landscape
• Data Management Competencies
• Standards & Best Practices
• Discipline Specific Discourse

• Data sharing and open data
• Data sets as publications
• Data journals
• Citations for data (e.g., used in secondary analysis)
• Data as supplementary materials to traditional articles
• Data repositories and archives
Data Sharing Impacts
• Reinforces open scientific inquiry
• Encourages diversity of analysis and opinion
• Promotes new research, testing of new or alternative hypotheses and methods of analysis
• Supports studies on data collection methods and measurement
Cc http://www.flickr.com/photos/pinchof_10/
Data Sharing Impacts
Facilitates education of new researchers
Enables exploration of topics not envisioned by initial investigators
Permits creation of new datasets by combining data from multiple sources
Research Data Management Fundamentals
• File Organization
• Documentation
• Storage & Backup
• Data Publishing, Sharing, & Reuse
• Protecting Data & Responsible Reuse
File Organization Practices: Overview
1. Design a file plan for your research project
2. Use file naming conventions that work for your project
3. Choose file formats to maximize usefulness
“When I was a freshmen I named my assignments Paper Paperr Paperrr Paperrrr”-Undergrad
Design a File Plan
• File structure is the framework
• Classification system makes it easier to locate folders/files
• Benefits:
  • Simple organization intuitive to team members and colleagues
  • Reduces duplicate copies in personal drives and e-mail attachments
Design a File Plan
• Choose a sortable directory hierarchy
• Example 1: Investigator, Process, Date
  Collie / TEI_Encoding / 20110117
• Example 2: Instrument, Date, Sample
  Usability Survey / 20120430 / Sample 1
Design a File Plan
Example documentation of Directory Hierarchy: /[Project]/[Grant Number]/[Event]/[Investigator/Date]
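A minimal sketch of how the documented hierarchy above could be generated rather than typed by hand. It assumes the last slot, [Investigator/Date], is split into two levels; the function name and all sample values (project, grant number, event) are hypothetical.

```python
from pathlib import PurePosixPath


def project_dir(project, grant_number, event, investigator, date_stamp):
    """Build /[Project]/[Grant Number]/[Event]/[Investigator]/[Date].

    Assumption: investigator and date are separate directory levels.
    """
    parts = [project, grant_number, event, investigator, date_stamp]
    # Replace spaces so every level sorts and types cleanly on any system.
    cleaned = [str(part).strip().replace(" ", "_") for part in parts]
    return PurePosixPath("/", *cleaned)


# Hypothetical values, for illustration only.
example = project_dir("KrillStudy", "GRANT-0001", "Field Season 1", "sharpeW", "20110117")
print(example)
```

Writing the date as YYYYMMDD at the deepest level keeps the folders in chronological order when sorted by name.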
Use File Naming Conventions
• Enable better access/retrieval of files
• Create logical sequences for file sorting
• More easily identify what you’re searching for
• Meaningful but short (255 character limit)
• Use alphanumeric characters
  Example: abc123
• Capital letters or underscores differentiate between words
• Surname first followed by initials of first name
Use File Naming Conventions
• Year-month-day format for dates, with or without hyphens
  Example 1: 2006-03-13
  Example 2: 20060313
• Decide on a simple versioning method
  Example: file_v001
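The date and versioning conventions above can be applied consistently with two small helpers. This is only a sketch: the function names are hypothetical, and the zero-padded `_vNNN` suffix is one reading of the `file_v001` example.

```python
import re
from datetime import date


def date_stamp(day, hyphens=False):
    """Format a date as YYYYMMDD, or YYYY-MM-DD when hyphens=True."""
    return day.strftime("%Y-%m-%d" if hyphens else "%Y%m%d")


def next_version(stem):
    """Increment a trailing _vNNN suffix, or start at _v001 if there is none."""
    match = re.search(r"_v(\d+)$", stem)
    if match is None:
        return f"{stem}_v001"
    number = int(match.group(1)) + 1
    # Keep at least three digits so versions sort correctly by name.
    return f"{stem[:match.start()]}_v{number:03d}"


print(date_stamp(date(2006, 3, 13)))
print(date_stamp(date(2006, 3, 13), hyphens=True))
print(next_version("file"))
print(next_version("file_v001"))
```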
Use File Naming Conventions
To create consistent file names, specify a template such as:
[investigator]_[descriptor]_[YYYYMMDD].[ext]
Use File Naming Conventions
This                                                 Not This
sharpeW_krillMicrograph_backscatter3_20110117.tif    KrillData2011.tif
borgesJ_collocation_20080414.xml                     Borges_Textbase.xml
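One way to keep names consistent is to generate them from the [investigator]_[descriptor]_[YYYYMMDD].[ext] template instead of typing them. The sketch below is an illustration, not a prescribed tool: `build_filename` is a hypothetical helper, and the character check reflects one reading of the alphanumeric and underscore guidance.

```python
import re
from datetime import date


def build_filename(investigator, descriptor, day, ext):
    """Fill the template [investigator]_[descriptor]_[YYYYMMDD].[ext]."""
    for field in (investigator, descriptor):
        # Letters, digits, and underscores only, per the naming conventions.
        if not re.fullmatch(r"[A-Za-z0-9_]+", field):
            raise ValueError(f"use only letters, digits, and underscores: {field!r}")
    extension = ext.lstrip(".").lower()
    stamp = day.strftime("%Y%m%d")
    name = f"{investigator}_{descriptor}_{stamp}.{extension}"
    if len(name) > 255:
        raise ValueError("file name exceeds the 255-character limit")
    return name


print(build_filename("sharpeW", "krillMicrograph_backscatter3", date(2011, 1, 17), "tif"))
print(build_filename("borgesJ", "collocation", date(2008, 4, 14), "xml"))
```

Both calls reproduce the "This" names shown above.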
Choose Appropriate File Formats
• Non-proprietary
• Open, documented standard
• Common usage by research community
• Standard representation (ASCII, Unicode)
• Unencrypted
• Uncompressed
Choose Appropriate File Formats
Format Genre    Optimal Standards
TEXT            .txt; .odt; .xml; .html
AUDIO           .flac; .wav
VIDEO           .mp2/.mp4; .mkv
IMAGE           .tif; .png; .svg; .jpg
DATA            .sql; .csv
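The format list above can double as a quick check on files before they are archived. A minimal sketch, assuming the extensions are copied exactly from the list; the `PREFERRED` mapping and `format_genre` function are hypothetical names.

```python
from pathlib import Path

# Extensions copied from the format list above; extend for your own discipline.
PREFERRED = {
    "TEXT": {".txt", ".odt", ".xml", ".html"},
    "AUDIO": {".flac", ".wav"},
    "VIDEO": {".mp2", ".mp4", ".mkv"},
    "IMAGE": {".tif", ".png", ".svg", ".jpg"},
    "DATA": {".sql", ".csv"},
}


def format_genre(filename):
    """Return the genre whose list contains the file's extension, else None."""
    extension = Path(filename).suffix.lower()
    for genre, extensions in PREFERRED.items():
        if extension in extensions:
            return genre
    return None


for name in ["sharpeW_krillMicrograph_backscatter3_20110117.tif", "notes.docx"]:
    print(name, "->", format_genre(name) or "not on the preferred list")
```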
Documentation Practices: Overview
Even researchers require proper documentation to decipher or reuse their datasets
Documentation = accessible, intelligible datasets
Documentation Practices: Overview
1. At minimum create a README file that you can use to document your project
2. Utilize standards for describing data including Metadata Standards
3. If applicable, use in-line code commentary to explain code
(cc) Will Scullin
Create a README file
• At minimum, store documentation in readme.txt file or equivalent, with data
  • What data consists of
  • How it was collected
  • Restrictions to distribution or use
  • Other descriptive information
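A README with the four parts listed above can be stamped out from a template so that every dataset folder gets the same headings. The sketch below is illustrative: the heading wording, the `write_readme` helper, and the sample answers are all assumptions.

```python
import tempfile
from pathlib import Path

README_TEMPLATE = """\
WHAT THE DATA CONSISTS OF
{contents}

HOW IT WAS COLLECTED
{collection}

RESTRICTIONS ON DISTRIBUTION OR USE
{restrictions}

OTHER DESCRIPTIVE INFORMATION
{other}
"""


def write_readme(data_dir, contents, collection, restrictions, other=""):
    """Write readme.txt next to the data and return its path."""
    path = Path(data_dir) / "readme.txt"
    text = README_TEMPLATE.format(
        contents=contents,
        collection=collection,
        restrictions=restrictions,
        other=other or "None recorded.",
    )
    path.write_text(text, encoding="utf-8")
    return path


# Demonstration in a throwaway directory; the answers are hypothetical.
with tempfile.TemporaryDirectory() as folder:
    readme = write_readme(
        folder,
        contents="Backscatter micrographs, one .tif per sample.",
        collection="Imaged on the lab microscope during January 2011.",
        restrictions="None.",
    )
    print(readme.read_text(encoding="utf-8"))
```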
Use Metadata Standards
• “Data about data”
• Standardized way of describing data
• Explains who, what, where, when of data creation and methods of use
• Data more easily found
• Data more easily compared to other data sets
Use Metadata Standards
Basic project metadata:
• Title • Language • File Formats
• Creator • Dates • File Structure
• Identifier • Location • Variable List
• Subject • Methodology • Code Lists
• Funders • Data Processing • Versions
• Rights • Sources • Checksums
• Access Information
• List of File Names
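Several of the items above (list of file names, checksums) can be produced automatically. The following sketch builds a small project record with a SHA-256 checksum per file; the record layout, the function names, and the sample values are assumptions rather than a formal metadata standard.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_files(data_dir):
    """List every file under data_dir with its size and checksum."""
    root = Path(data_dir)
    return [
        {
            "name": path.relative_to(root).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
        }
        for path in sorted(root.rglob("*"))
        if path.is_file()
    ]


def project_metadata(data_dir, title, creator, **extra):
    """Combine descriptive fields with the generated file list."""
    record = {"title": title, "creator": creator, **extra}
    record["files"] = describe_files(data_dir)
    return record


# Demonstration with a throwaway file; all values are hypothetical.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "sample.csv").write_text("id,value\n1,2\n", encoding="utf-8")
    record = project_metadata(folder, "Example dataset", "Sharpe, W.", language="en")
    print(json.dumps(record, indent=2))
```

Re-running `sha256_of` later and comparing against the stored value is a simple way to detect silent corruption.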
Use Metadata Standards
• Dublin Core: Commonly-used descriptive metadata format facilitates dataset discovery across the Web.
• Data Documentation Initiative (DDI): Defines metadata content, presentation, transport, and preservation for the social and behavioral sciences.
• ISO 19115:2003: Describes geographic data such as maps and charts.
• More examples: http://www.lib.msu.edu/about/diginfo/collect.jsp
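To give a sense of what a Dublin Core description looks like, here is a minimal sketch that serializes a few fields as XML using the standard Dublin Core element namespace. The `metadata` wrapper element, the function name, and the sample values are assumptions; a repository will usually specify its own record wrapper.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements of the simple Dublin Core element set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields):
    """Serialize (element, value) pairs as dc: elements in one XML record."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields:
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, for illustration only.
print(dublin_core_record([
    ("title", "Krill micrographs"),
    ("creator", "Sharpe, W."),
    ("date", "2011-01-17"),
    ("format", "image/tiff"),
]))
```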
Use In-Line Code Commentary
Example of R code commentary
# Cumulative normal density
pnorm(c(-1.96,0,1.96))
If applicable, in-line code commentary helps explain code
Storage & Backup Practices
1. Avoid single points of failure
2. Ensure data redundancy & replication
3. Understand common types of storage
(cc) George Ornbo
Data at significant risk of loss without storage and backup plan
Avoid Single Points of Failure
A single point of failure occurs when it would only take one event to destroy all data on a device
• Use managed networked storage when possible
• Move data off of portable media
• Never rely on one copy of data
• Do not rely on CD or DVD copies to be readable
• Be wary of software lifespans
Ensure Data Redundancy
• Effective data storage plan provides for 3 copies:
  • Primary authoritative copy
  • Secondary local backup
  • Tertiary remote backup
• Geographically distribute and secure
• Local vs. remote, depending on needed recovery time
• Personal computer, external hard drives, departmental, or university servers may be used
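The three-copy plan above can be sketched as a copy-and-verify routine. This is an illustration only: `replicate` is a hypothetical helper, and the two backup directories stand in for a local drive and a mounted remote location; real remote backup usually goes through institutional storage or dedicated backup software.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Checksum of a file (reads it whole; stream in chunks for large files)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def replicate(primary, backup_dirs):
    """Copy the primary file into each backup directory and verify each copy."""
    primary = Path(primary)
    expected = sha256_of(primary)
    copies = []
    for directory in backup_dirs:
        target_dir = Path(directory)
        target_dir.mkdir(parents=True, exist_ok=True)
        target = Path(shutil.copy2(primary, target_dir / primary.name))
        # A copy only counts as a backup once its checksum matches the primary.
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies


# Demonstration with throwaway directories standing in for real storage.
with tempfile.TemporaryDirectory() as folder:
    source = Path(folder) / "primary" / "survey_20120430.csv"
    source.parent.mkdir()
    source.write_text("id,score\n1,5\n", encoding="utf-8")
    made = replicate(source, [Path(folder) / "local_backup", Path(folder) / "remote_backup"])
    print(f"primary plus {len(made)} verified copies")
```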
Ensure Data Redundancy
Cloud storage
• Amazon S3
• Google
• MS Azure
• DuraCloud
• Rackspace
• Glacier
Note that many enterprise cloud storage services include a charge for in/out of data transfers ($$$)
Understand Common Types of Storage
• Optical Media
• Portable Flash Media
• Commercial Hard Drives
• Commercial NAS
• Cloud Storage
• Enterprise Network Storage
• Trusted Archival Storage
Understand Common Types of Storage
• Features of storage types:
  • Portable data transfers
  • Short-term storage
  • Project term storage
  • Networked data transfer
  • Long-term storage
  • Reliable backup option
Understand Common Types of Storage

Storage Type               | Portable Data Transfer | Short Term Storage | Project Term Storage | Networked Data Transfer | Long Term Storage | Reliable Backup Option
Optical Media              | ✔ | ✗ | ✗ | ✗ | ✗ | ✗
Portable Flash Media       | ✔ | ✔ | ✗ | ✗ | ✗ | ✗
Commercial Hard Drives     | ✔ | ✔ | ✔ | ✗ | ✗ | ✗
Commercial NAS             | ✗ | ✔ | ✔ | ✔ | ✗ | ✗
Cloud Storage              | ✗ | ✔ | ✔ | ✔ | ✗ | ✗
Enterprise Network Storage | ✗ | ✔ | ✔ | ✔ | ✔ | ✔
Trusted Archival Storage   | ✗ | ✗ | ✗ | ✔ | ✔ | ✔
Understand Common Types of Storage
Media Storage @ MSU

Optical Media
• MSU Computer Store: Sells Optical Media and hardware accessories
• UAHC Media Storage Service: Offers physical lock-box like storage for MSU

Flash Media
• MSU Computer Store: Sells Optical Media and hardware accessories
• UAHC Media Storage Service: Offers physical lock-box like storage for MSU

Commercial Hard Drives
• MSU Computer Store: Sells Optical Media and hardware accessories
• UAHC Media Storage Service: Offers physical lock-box like storage for MSU

Enterprise Cloud Storage
• Angel: Free. Ideal for collaboration; not storage space. Phase out 2015
• Desire2Learn: Free. Ideal for collaboration; not storage space. Replaces Angel
• GoogleApps: Free. Ideal for collaboration; not intended as storage space

Enterprise Network Storage
• AFS Space: Free to 1GB, add’l space can be purchased w/dept. account
• IT Services Individual, Mid-Tier and Enterprise Storage: Fee based
• HPCC Home or Research: Free up to 1TB. Fee based additions available

Trusted Archival Storage
• Disciplinary Repositories: Disciplinary repositories offer archival services for pertinent research data.
Data Publishing, Sharing, Reuse
1. Time-intensive, with potentially high return on investment
2. Publish data in several data publication venues to more broadly share results of research
Research datasets on par with peer-reviewed journal articles as first-class scholarly contributions
Sharing & Publishing Data
• Data preparation for sharing and publication is a time-intensive process
• Potential positive outcomes:
  • Increased research impact and citations
  • Enable additional scientific inquiry
  • Opportunities for co-authorship and collaboration
  • Enhance your grant proposal’s competitiveness
Data Publication Venues
• Multiple ways to publish research data
  • Faculty or project website
  • Journal supplementary materials
  • Disciplinary data repository (data archive)
• Varying levels of support for indexing, access controls, and long-term curation
Data Publication Venues
• Disciplinary Data Repository
  • Securely share data, ensure long-term access
  • High visibility
  • Often offer persistent citations
  • Availability varies across domains
  • Databib.org directory
Protecting Data & Responsible Reuse
1. Consider how to protect data and intellectual property rights while encouraging reuse
2. Keep in mind ethical concerns when sharing data
(cc) Will Scullin
Intellectual Property
• IP refers to exclusive rights of creators of works
• Individual data cannot be protected by US copyright
• Organization of data such as database, creative work produced by data, and research instruments used may be protected
©
Intellectual Property
• Principal investigator’s institution holds IP rights
• Provide clearly stated license for producing derivatives, reusing, and redistributing datasets
  • License under Creative Commons
  • State if any restrictions or embargos on use
• Provide example of how work should be cited to encourage proper attribution on reuse
• Document any IP / copyright issues
Ethics & Data Sharing
• Keep in mind the following ethical concerns when sharing your data:
  • Privacy
  • Confidentiality
  • Security and integrity of the data
• For data involving human subjects, obtain written permission or consent stating how the data may be reused
Best Practices = High Impact Data
• File organization ensures easier access and retrieval of data
• Documentation makes datasets accessible and intelligible to users
• Storage and backup safeguards data
• Data publishing and sharing encourages the most widespread reuse of data
• Data protection ensures responsible reuse
Contact

Lisa M. Schmidt
Electronic Records Archivist
University Archives & Historical Collections
lschmidt@ais.msu.edu

Aaron Collie
Digital Curation Librarian
MSU Libraries
collie@msu.edu