Scientific data management
Data management
Responsible Conduct of Research Seminar Series
UC Berkeley
April 16, 2012
Who are you?
Jeffery Loo, PhD
“Flying books”
Installation by J. Ignacio Diaz de Rabago
UC Berkeley Library
NSF data management plan
Requirement as of January 18, 2011
Your plans to organize, store, and share data
http://www.nsf.gov/bfa/dias/policy/dmp.jsp
“My Data Management Plan – a satire”
Dr. C. Titus Brown
Assistant Professor
Michigan State University
Source
Dear NSF,
I am happy to respond to your request for a 2-page Data Management Plan.
First of all, let me say how enthusiastic I am that you have embraced this new field of "large scale data analysis". Ever since I started working with large Avida data sets in 1993, […] I have seen the need for a systematic plan to manage the data. It is nice to see NSF stepping up to the plate in such a timely manner, and I am happy to comply.
Now, as to my actual data management plan, here is how I plan to deal with research data in the future.
I will store all data on at least one, and possibly up to 50, hard drives in my lab.
The directory structure will be custom, not self-explanatory, and in no way documented or described. Students working with the data will be encouraged to make their own copies and modify them as they please, in order to ensure that no one can ever figure out what the actual real raw data is.
Backups will rarely, if ever, be done.
When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter.
[….]
Note, we didn't use a version control system, either. […] And our repository is not publicly available - you have to beg for permission. Note, I only answer e-mail on every other Tuesday.
Any design notes on the data analysis are in our private e-mail, and we will fight to the death -- up to and including ignoring FOIA requests -- to prevent you from obtaining them.
Meanwhile we will continue publishing exciting sounding (but irreproducible) analyses, and submitting grants based on them, because that's the only thing that the reviewers care about.
sincerely yours,
--titus
(representing every computational scientist in the world.)
Data challenges
Distributed, uncoordinated effort
Concerns about data re-use
Data management may be ad lib
“Can’t you ever relax?”
Informal data management practices
Lots to do!
Ensure long-term access
Facilitate sharing
Prepare for future re-use
Data activities in the research workflow
Source: http://www2.lib.virginia.edu/brown/data/lifecycle.html
Lots of different research products
Models and computational simulations
Instrument readings
Software
Physical collections
Images, photographs, audio, and video
Maps
Artifacts and samples
And more …
Goal for this lunch hour
Review “first steps” in data management
• Saving data
• Describing/documenting data
• Sharing data
• Data management planning
• Data ethics
Common sense versus common practice
Saving data
Hall of fame anecdote
http://www.youtube.com/watch?v=J6HtRWyiL98
Where do you store data safely?
Traditional storage not always sufficient
• Personal computers
• Departmental/university servers
Two additional types of storage
• Archives and repositories
• Cloud storage (storing files in an online site)
Archives and repositories
Special types of online storage sites
Long-term storage, management, and preservation
Search, download, and analytic functionalities
Institutional archives and repositories
Merritt
http://merritt.cdlib.org/
Data repository management services at UCB
http://ist.berkeley.edu/ds
Public archive and repository
Long-term access, open to the public
GenBank
http://www.ncbi.nlm.nih.gov/genbank/
3rd party cloud storage
Amazon S3
Google Docs
Dropbox
Beware of posting sensitive data/files
Deciding on storage
Consider:
• Permanence
• Oversight
• Security
Save for long-term access
Recommended file formats
• Non-proprietary
• Uncompressed and unencrypted (okay to encrypt sensitive data)
• Common usage by your research community
• Standard representation (e.g., ASCII text, Unicode)
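One way to follow the file format recommendations is to write tabular data as comma-separated plain text in a Unicode encoding. The sketch below is only an illustration; the function name and the example column names are hypothetical and do not come from the seminar.

```python
import csv
from pathlib import Path


def save_as_csv(path: Path, header: list, rows: list) -> None:
    """Write rows as UTF-8 encoded, comma-separated plain text.

    CSV is non-proprietary, uncompressed, and widely readable, which
    matches the recommendations listed above.
    """
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
```

For example, `save_as_csv(Path("readings.csv"), ["trial", "temperature_celsius"], [[1, 75]])` would produce a file that any text editor or spreadsheet program can open.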
1. Original master
2. Local external storage
3. Remote external storage
UC Berkeley IST backup services
3rd party services (Amazon S3, Elephant Drive, Jungle Disk, Mozy, Carbonite Free, Dropbox)
Email a copy to yourself
Backup 3 copies
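The three-copy rule can be scripted so that each backup is checked against the original master. This is a minimal sketch, assuming the local and remote storage locations are reachable as ordinary directories; the function names and any paths passed in are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(master: Path, backup_dirs: list) -> list:
    """Copy the master file into each backup directory and verify each copy."""
    expected = sha256_of(master)
    copies = []
    for directory in backup_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / master.name
        shutil.copy2(master, target)  # copy2 also tries to preserve timestamps
        if sha256_of(target) != expected:
            raise IOError(f"Backup at {target} does not match the master")
        copies.append(target)
    return copies
```

Calling `back_up(master, [local_dir, remote_dir])` with one local and one remote directory yields the original plus two verified copies, three in total.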
Describing and documenting data (metadata)
What countries have a five-pointed star on their national flag?
DOI: 10.1126/science.1207745
“outsourcing” our memory
“we don’t remember information as well, when we expect to find it on a computer later”
If we outsource our memory to computers …
We need good organization structures to
• Find data from the past quickly and completely
• Understand data from the past
It helps to
• Document and describe data
• “Assign metadata”
What do you document?
Descriptive metadata elements
Administrative metadata elements
Structural metadata elements
Title
Creator or contact
Date
Experimental conditions
Methodology
Version
Dictionary or codebook to explain the data variables
Tools and software needed for processing or visualizing the data
File formats
File names
How to record metadata
write metadata
save as readme.txt
store in file folder with data
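The readme.txt approach can be automated so that every data folder receives the same set of fields. The sketch below is one possible layout, not a standard: the "Field: value" format and the function name are assumptions, and the field names are borrowed from the descriptive metadata elements listed earlier.

```python
from pathlib import Path


def write_readme(data_dir: Path, metadata: dict) -> Path:
    """Write metadata as 'Field: value' lines to readme.txt inside data_dir."""
    data_dir.mkdir(parents=True, exist_ok=True)
    readme = data_dir / "readme.txt"
    lines = [f"{field}: {value}" for field, value in metadata.items()]
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme


# Placeholder values for illustration only.
EXAMPLE_METADATA = {
    "Title": "(project title)",
    "Creator or contact": "(name and email)",
    "Date": "(collection date)",
    "Experimental conditions": "(conditions and group)",
    "Methodology": "(how the data were collected)",
    "Version": "(file version)",
}
```

Running `write_readme(Path("my-project/raw"), EXAMPLE_METADATA)` stores the description in the same folder as the data, so the two travel together.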
Option 1
Metadata form/file in an archive/repository
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
Option 2
Annotate
<title>Effect of salt on ice cream production efficiency</title> <temperature>0</temperature>
XML, a popular system for annotating data
http://www.w3schools.com/xml/
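The XML annotations above can be written and read back with Python's standard library. This is a small sketch: the `<dataset>` wrapper element is an assumption added here because an XML document needs a single root element, and it is not part of the original example.

```python
import xml.etree.ElementTree as ET

# Hypothetical <dataset> root wrapping the two annotations from the example.
record = ET.Element("dataset")
ET.SubElement(record, "title").text = "Effect of salt on ice cream production efficiency"
ET.SubElement(record, "temperature").text = "0"

# Serialize the annotated record to a string that can be saved next to the data.
xml_text = ET.tostring(record, encoding="unicode")

# Read the annotations back out by tag name.
parsed = ET.fromstring(xml_text)
title = parsed.findtext("title")
temperature = float(parsed.findtext("temperature"))
```

Because each value is labeled by its tag, a program (or a colleague) can look up `title` or `temperature` without knowing the order in which they were written.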
Option 3
Assign descriptive names
Descriptive file names
Descriptive folder names
Consider these elements:
• project title
• experimental conditions and group
• trial numbers
• file version number indicating data modifications
• date or time stamps
• author initials
data1.csv 75-celsius-trial_control_ver002.csv
Data > 1 > raw >> part A >> 110904 > readings
Project-title > Trial 1 >> Experimental >> Control > Trial 2 > Trial 3
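Descriptive file names are easier to keep consistent when they are generated rather than typed. The sketch below assembles a name in the style of the example `75-celsius-trial_control_ver002.csv`; the function names and the choice of separators are assumptions made for illustration.

```python
import re


def slug(text: str) -> str:
    """Lower-case text and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def descriptive_name(condition: str, group: str, version: int, ext: str = "csv") -> str:
    """Join condition, group, and a zero-padded version number into a file name."""
    return f"{slug(condition)}_{slug(group)}_ver{version:03d}.{ext}"


print(descriptive_name("75 Celsius trial", "Control", 2))
# prints: 75-celsius-trial_control_ver002.csv
```

Zero-padding the version number keeps files in the right order when a folder is sorted by name.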
Australia
Brazil
Cape Verde
Ethiopia
United States of America
Sharing data
Historic data sharing
Anagrams to secure discoveries
Versus the “open science revolution” of journals today
Galileo Newton Huygens Hooke
Open science
Share research data, products, and communications openly
Potential benefits
• Protects unique data that cannot be readily replicated
• Reinforces open scientific inquiry
• Encourages diversity of analysis and opinion
• Promotes new lines of research
• Makes possible the testing of new or alternative hypotheses and methods of analysis
• Supports studies on data collection methods and measurement
• Facilitates the education of new researchers
• Enables the exploration of topics not envisioned by the initial investigators
• Permits the creation of new datasets when data from multiple sources are combined
• Provides content for scientific education
Data sharing examples
Crystal structure of M-PMV retroviral protease
Private sector too!
Cross-sector data sharing for Alzheimer’s research
http://www.adni-info.org
(News story)
Increased citation rate
Funding agency policies
NIH Data Sharing Policy
NSF Data Sharing Policy
Data management plan for grant applications
Journal expectations
Data sharing as a term of publication
How do I share?
Personal sharing
Share-upon-request
Email me for a copy!
Self-archive
Download from my personal website!
Journal publishing
Public archive or repository
The Ancient Agora of Athens
Ideal characteristics
• Popular with national/global coverage
• Specific to your discipline
• Offers long-term preservation
Find an archive/repository
• Ask colleagues
• Search http://databib.lib.purdue.edu/
Public versus institutional archives and repositories
Institutional archives/repositories
• May restrict to a smaller audience
• May offer greater control of your data
Public archives/repositories
• Create comprehensive dataset for a larger research problem space
• Domain-specific archives/repositories may provide better support
Help others find your data
Berkeley: www.berkeley.edu/mystuff/super-data.csv
file moves to
Stanford: www.stanford.edu/mystuff/super-data.csv
old URL is kaput
DOI: Digital object identifier
Resolve DOI by visiting http://dx.doi.org/ followed by DOI
File can move, but DOI remains the same
The DOI record stores location details
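Resolving a DOI amounts to appending it to the resolver address. The sketch below builds that URL for the DOI cited earlier in this talk; the function name is hypothetical, and the code only constructs the address without contacting the resolver.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # resolver address given above


def doi_to_url(doi: str) -> str:
    """Return the resolver URL for a DOI, accepting an optional 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Percent-encode unusual characters but keep the slash that separates
    # the DOI prefix from its suffix.
    return RESOLVER + quote(doi, safe="/")


print(doi_to_url("DOI: 10.1126/science.1207745"))
# prints: http://dx.doi.org/10.1126/science.1207745
```

The printed address stays the same even if the publisher later moves the file, because only the DOI record's stored location changes.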
Try permanent identifiers
Generate permanent identifiers
request your free account, by emailing [email protected]
http://n2t.net/ezid
Subscription through the UCB Library
Final tips for sharing
Be selective
Recognize restrictions (privacy and confidentiality)
Online services for sharing among your team
• Research Hub
• 3rd party services
Data management planning
What is a data management plan?
A plan for organizing, storing, and sharing data
Planning associated with greater self-control for
• exercise
• medical adherence
• self-health exams
• sunscreen use
• schoolwork
• refraining from a negative behavior
Source: Townsend and Liu, 2012
Perhaps planning helps for data management
Why have a plan?
Prepare for efficient and quality data collection that is safe and shareable
NSF and NIH requirements
NSF requirements
Data management plan
• ≤ 2 pages
• describes how data will be managed, disseminated, and shared
Plan undergoes peer review
Writing an NSF data management plan
Specific requirements vary by NSF divisions
In general, describe:
• Types of research data and materials produced
• Standards for data format, content, and metadata
• Policies for access and sharing
• Policies for re-use, re-distribution, and derivatives
• Plans for archiving and preserving
You can explain why data will not be shared
Examples 1 and 2
NIH requirements
Timely data sharing encouraged
If requesting ≥ $500k per year, a plan is required
Describe how data will be shared or why sharing is not possible
In the final progress report, describe data sharing actions taken
Writing an NIH data sharing plan
A brief paragraph
Suggested topics
• Schedule for sharing
• Format of the data
• Documentation of the data
• Analytic tools provided
• Data-sharing agreements (criteria and conditions)
• Mode of data sharing
there was a
beautiful scientist
NIH plan example 1
The proposed research will involve a small sample (less than 20 subjects) recruited from clinical facilities in the New York City area with Williams syndrome. This rare craniofacial disorder is associated with distinguishing facial features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if not impossible to protect the identities of subjects given the physical characteristics of subjects, the type of clinical data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting subjects.
Therefore, we are not planning to share the data.
NIH plan example 2
This application requests support to collect public-use data from a survey of more than 22,000 Americans over the age of 50 every 2 years.
Data products from this study will be made available without cost to researchers and analysts. https://ssl.isr.umich.edu/hrs/
User registration is required in order to access or download files. As part of the registration process, users must agree to the conditions of use governing access to the public release data, including restrictions against attempting to identify study participants, destruction of the data after analyses are completed, reporting responsibilities, restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource.
Registered users will receive user support, as well as information related to errors in the data, future releases, workshops, and publication lists. The information provided to users will not be used for commercial purposes, and will not be redistributed to third parties.
Library guidance
Guides, templates, examples
http://www.lib.berkeley.edu/sciences/data/guide
Online service for building data plans
https://dmp.cdlib.org/
Step-by-step instructions for meeting funding agency requirements
Data ethics
Study by Martinson et al., 2005
Source - doi:10.1038/435737a
Motivated by increasing pressure to publish papers and win grants?
3247 respondents
0.3% admitted to falsification or “cooking” research data
About 1 in 3 confessed to committing at least one of 10 serious misbehaviors
Citing data
Prevent distortions and manipulations
Keep raw original data
Log all changes made
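Logging changes can be as light as appending one line per modification to a text file kept beside the raw data. The sketch below is one possible scheme, assuming a JSON-lines log; the function name and field names are hypothetical. Recording a checksum of the file at each step makes later manipulation easier to detect.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def log_change(log_path: Path, data_path: Path, author: str, note: str) -> dict:
    """Append one JSON line describing the current state of a data file."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "file": data_path.name,
        # Checksum of the file contents at the moment the change is logged.
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "author": author,
        "note": note,
    }
    # Open in append mode so earlier entries are never overwritten.
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

A typical call would be `log_change(Path("changes.log"), Path("readings.csv"), "JL", "removed duplicate rows")`, made each time a working copy is modified while the raw original stays untouched.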
Data licensing
Restrictions on data use, for example
• No for-profit use
• No re-sharing
• Give attribution
Check for license/terms of use
Stay current with data requirements
Review for changes to policies by
• Funding agencies
• University regulations
• Federal and state governments
Haiku summary
Data is precious
Safely store and share widely
Good for your career