Introduction to Data Management
![Page 1: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/1.jpg)
Introduction to Data Management
June 4, 2012
Karen Hanson, MLIS
Knowledge Systems Librarian
Alisa Surkis, PhD, MLS
Translational Science Librarian
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.
![Page 2: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/2.jpg)
Objectives
Understand…
• current climate around data management and data sharing
• best practices in data documentation and description
• principles of storage and long-term preservation of data
• basic elements of a data management plan
2/76
![Page 3: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/3.jpg)
Data management
1. Introduction
2. Incentives
3. Standards for description & documentation
4. Storage, archiving and sharing
5. Data Management Plans
3/76
![Page 4: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/4.jpg)
What is data?
• “Facts and statistics collected together for reference or analysis”
Oxford Dictionaries online
http://oxforddictionaries.com/definition/data?q=data
• “Research data, unlike other types of information, is collected, observed, or created, for purposes of analysis to produce original research results.”
University of Edinburgh, Information Services
http://www.ed.ac.uk/schools-departments/information-services/services/research-support/data-library/research-data-mgmt/data-mgmt/research-data-definition
4/76
![Page 5: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/5.jpg)
And that means…?
• Tables of numbers
• Sequences of bits (10110) or base pairs (GACTTA)
• Samples, specimens, slides
• Sound recordings, video recordings, images
• Laboratory notebooks
• Protocols, methodologies
• Software (code), algorithms, models
• “A myriad of other information objects, none of which may stand alone” - Christine Borgman
5/76
![Page 6: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/6.jpg)
Categories of data
• Observational (real time)
• Experimental (lab)
• Computational (model)
• Derived or Compiled
Source: National Science Board. Long-Lived Digital Data collections, 2005.
6/76
![Page 7: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/7.jpg)
What is data management?
• Not just creation, storage, processing and analysis
• Refers to managing the full lifecycle of data
7/76
![Page 8: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/8.jpg)
Data management lifecycle
Creating data
• design research
• plan data management
• capture/create the data
• document process
Source: UK Data Archive, University of Essex.
http://www.data-archive.ac.uk/create-manage/life-cycle
8/76
![Page 9: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/9.jpg)
Data management lifecycle
Processing data
• enter data, digitize, transcribe, translate
• check, validate, clean data
• anonymize data where necessary
• describe data
• manage and store data
9/76
![Page 10: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/10.jpg)
Data management lifecycle
Analyzing data
• interpret / analyze data
• write publications
10/76
![Page 11: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/11.jpg)
Data management lifecycle
Preserving data
• migrate data to best format / medium
• back-up and store data
• create final metadata and documentation
• archive data
11/76
![Page 12: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/12.jpg)
Data management lifecycle
Giving access to data
• distribute / share data
• control access
12/76
![Page 13: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/13.jpg)
Why would anyone need my data?
Cow concept: Dorothea Salo, “Save the Cows”, 2009.
http://www.slideshare.net/cavlec/save-the-cows-data-curation-for-the-rest-of-us-1533252
Analyze, process
Publish
![Page 14: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/14.jpg)
You don’t need to kill the cow!
14/76
![Page 15: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/15.jpg)
Data management lifecycle
Re-using data
• follow-up research
• new research
• check validity
15/76
![Page 17: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/17.jpg)
Data management
1. Introduction
2. Incentives
• You and your data
• Government mandates
• Publisher requirements
• Lost credibility
• Faster progress, better science
• Citations
• Big data
3. Standards for description & documentation
4. Storage, archiving and sharing
5. Data management plans
17/76
![Page 18: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/18.jpg)
Why worry about data management?
• Bad things can happen if you don’t
• People will make you anyway
• Sharing is win-win
…you can’t share what you can’t find, read, or decipher
18/76
![Page 19: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/19.jpg)
You and your data
• Make research process more efficient
• Comprehensibility
• Security
… what about other people and your data?
19/76
![Page 20: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/20.jpg)
Sharing - sticks!
• Government mandate
o Data sharing
o Data management plans
• Publisher requirements
• Credibility issues
20/76
![Page 21: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/21.jpg)
Sharing - carrots!
Faster progress!
Better science!
Also…
o Data becomes citable
o Data linked to publications
21/76
![Page 22: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/22.jpg)
Government mandates
Timeline
1999: US Office of Management and Budget revised Circular A-110, making data from federally funded research obtainable through Freedom of Information Act requests
2003: NIH adopted a data sharing policy.
(still no teeth, but young yet)
22/76
![Page 23: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/23.jpg)
Government mandates
2008: NIH implements the Public Access Policy
2009: White House issues the Open Government Directive
2011 (Jan): NSF made data management plans a requirement
23/76
![Page 24: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/24.jpg)
Government mandates (bigger sticks on the way?)
• NSTC’s Interagency Working Group on Digital Data
o 11/2011 Request for Information (RFI) on Public Access to Digital Data Resulting from Federally Funded Scientific Research
• NIH Director Working Group on Data and Informatics
o 1/2012 Request for Information for Input into the Deliberations of the Advisory Committee to the NIH Director Working Group on Data and Informatics
24/76
![Page 25: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/25.jpg)
One response to RFI
“The Federal policy framework should move public access to digital data away from the current idiosyncratic environment to a systematic approach that lowers barriers to data access, discovery, sharing and re-use.”
- Sayeed Choudhury
The Sheridan Libraries of Johns Hopkins University
25/76
![Page 26: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/26.jpg)
Postdoc survey: Data management/sharing plans
To what extent have you dealt with NIH data sharing regulations or NSF data management plans?
Responses (first series / second series; chart legend: Nationally, NYULMC):
• Not aware of policies: 38% / 39%
• Aware but no involvement: 48% / 48%
• Had to write data plan: 12% / 11%
• Had to implement data plan: 8% / 10%
26/76
![Page 27: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/27.jpg)
Publisher requirements
Nature:
“After publication, readers who encounter refusal by the authors to comply with these policies should contact the chief editor of the journal... In cases where editors are unable to resolve a complaint, the journal may refer the matter to the authors' funding institution and/or publish a formal statement of correction, attached online to the publication, stating that readers have been unable to obtain necessary materials to replicate the findings.”
http://www.nature.com/authors/policies/availability.html
27/76
![Page 28: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/28.jpg)
Publisher requirements
Science:
“All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science. All computer codes involved in the creation or analysis of data must also be available to any reader of Science. After publication, all reasonable requests for data and materials must be fulfilled.”
http://www.sciencemag.org/site/feature/contribinfo/prep/gen_info.xhtml
28/76
![Page 29: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/29.jpg)
Suspect data: Losing credibility
Comparison of statistical analyses: papers with shared data vs. papers with no sharing
• Papers with unshared data had more errors in the reporting of results
• Evidence in papers with unshared data was weaker
• p values in papers with unshared data were significantly closer to 0.05
NOTE: APA journals require sharing: 57% did not share.
Consequences? No teeth.
Wicherts JM, Bakker M, Molenaar D. Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS One. 2011;6(11):e26828. Epub 2011 Nov 2.
PMID:22073203; PMCID: PMC3206853.
29/76
![Page 30: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/30.jpg)
Retraction: Lost credibility
“There were 60 children in the study. The ages were by accident duplicated between the upper and lower halves of the database. Thus, the ages for the first 30 children in the data set were identical and in the same order with the ages for the second set of 30 children…The files with the original data are not available any more, making it impossible to reconstruct a valid data set for reanalysis.”
http://www.ctajournal.com/content/2/1/6/abstract
Amy Wagers, Harvard stem cell researcher
• 1/2010 Nature article: retracted 10/2010
• 8/2008 Blood article: retracted 12/2011
Shane Mayack claimed “these errors occurred due to mistakes made in data retrieval that were a cause of a poor, but not a unique, data management and archiving system” but stands by results.
http://retractionwatch.wordpress.com/category/by-author/amy-wagers-retractions/
30/76
![Page 31: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/31.jpg)
Faster progress! Better science!
Case studies
• Human Genome Project
• Neuromorpho.org
31/76
![Page 32: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/32.jpg)
Human Genome Project
• NIH’s first foray in big science
• Experiment in data sharing
• Establishment of Bermuda principles
o Automatic release of sequence assemblies
o Immediate publication of sequences
o Entire sequence freely available
• Full genome sequenced ahead of schedule
32/76
![Page 33: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/33.jpg)
Neuromorpho.org
• Detailed morphological reconstructions of neurons
o Time-intensive
o Re-usable in many ways
• > 6k reconstructions deposited since 2006
• > 100k downloads in 2011
33/76
![Page 34: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/34.jpg)
Citing data
• Interoperable data and publications
• Unique Identifiers
o Findable
o Citable
• More citations
34/76
![Page 35: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/35.jpg)
Big data
“..almost everything about science is changing because of the impact of information technology. Experimental, theoretical, and computational science are all being affected by the data deluge, and a fourth, ‘data-intensive’ science paradigm is emerging.”
- Jim Gray, Fourth Paradigm (2009)
March 29, 2012: Federal government announces the Big Data Research and Development Initiative, with $200M+ from 6 agencies.
35/76
![Page 36: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/36.jpg)
Data management
1. Introduction
2. Incentives
3. Standards for description & documentation
• File Names
• Databases
• Versioning
• Metadata
• Quality control
4. Storage, archiving and sharing
5. Data management plans
36/76
![Page 37: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/37.jpg)
Postdoc survey
How do you determine how to structure your data or what information to save about the data in order to be able to effectively access data in the future? (check all that apply)
Responses (first series / second series; chart legend: Nationally, NYULMC):
• No pre-defined standards: 15% / 11%
• Personal standards: 67% / 70%
• Lab-based standards: 46% / 43%
• Discipline standards: 19% / 17%
• Institutional standards: 13% / 13%
• Other: 2% / 1%
37/76
![Page 38: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/38.jpg)
Why to avoid making your own standard, if possible…
Standards. http://xkcd.com/927/
This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License.
38/76
![Page 39: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/39.jpg)
What does your data look like?
• Many files or one file
• Raw format (numeric, images, binary)
• Processed format
• File sizes
39/76
![Page 40: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/40.jpg)
File names
bob_1262011.tif
Bob Smith? Bob Jones?
12 June, 2011? December 6, 2011? January 26, 2011?
Unambiguous dates, the ISO 8601 standard:
• YYYYMMDD or YYYY-MM-DD
o e.g. 20120612 = June 12, 2012
• YYYYMMDDTHHMMSS or YYYY-MM-DDTHH:MM:SS
o e.g. 20120612T140312 = June 12, 2012 2:03:12 pm
40/76
![Page 41: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/41.jpg)
Diagram labels:
• 1 rat heart
• 100s of slices
• 100s of slides
• 100s of huge images
• 1000s of image files (TIF)
• 3 post docs
• 5-7 experiments a week…
41/76
![Page 42: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/42.jpg)
File names should…
• Reflect contents of the file
• Use non-cryptic, intuitive names if possible
• Consider any character restrictions
• Uniquely identify the file
• Avoid special characters (e.g. *, $, &, #)
• Use underscores (“_”) instead of spaces or dashes
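Some of these rules can be checked automatically. The sketch below is one possible way to do that in Python; the rule set, the function names `filename_problems` and `check_unique`, and the choice to compare names case-insensitively are illustrative assumptions rather than an established standard.

```python
import re

# Letters, digits and underscores, then one dot and an extension.
# This pattern is an assumption chosen to match the rules on the slide.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_]+\.[A-Za-z0-9]+$")


def filename_problems(name):
    """Return a list of problems; an empty list means the name passes."""
    problems = []
    if " " in name:
        problems.append("contains a space")
    if "-" in name:
        problems.append("contains a dash")
    # Anything other than letters, digits, underscore, dot, dash or space.
    special = sorted(set(re.findall(r"[^A-Za-z0-9_.\- ]", name)))
    if special:
        problems.append("contains special characters: " + " ".join(special))
    if not problems and not SAFE_NAME.match(name):
        problems.append("does not look like name.extension")
    return problems


def check_unique(names):
    """Return the names that occur more than once, ignoring case."""
    seen, dupes = set(), set()
    for name in names:
        key = name.lower()
        if key in seen:
            dupes.add(key)
        else:
            seen.add(key)
    return sorted(dupes)


print(filename_problems("AtherRat_012_056_mb_0423.tif"))
print(filename_problems("rat heart #3.tif"))
```

Whether a name reflects the file's contents still needs a human judgment; the check only covers the mechanical rules.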
42/76
![Page 43: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/43.jpg)
Example of a good file name
AtherRat_012_056_mb_0423.tif
AtherRat = experiment name
012 = experiment number
056 = sample number
mb = stain used, methylene blue
0423 = coordinates of image (4 across, 23 down)
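A naming convention like this one is also machine-readable. The following Python sketch splits such a name back into its parts and rebuilds it; the class `ImageName`, the function names, and the assumption that the coordinates are always two digits across followed by two digits down are hypothetical choices made for illustration.

```python
import re
from typing import NamedTuple


class ImageName(NamedTuple):
    experiment: str
    experiment_number: int
    sample_number: int
    stain: str
    column: int  # position across
    row: int     # position down


# Mirrors the example name; the fixed field widths are an assumption.
PATTERN = re.compile(
    r"^(?P<experiment>[A-Za-z]+)_(?P<exp_no>\d{3})_(?P<sample>\d{3})"
    r"_(?P<stain>[a-z]{2})_(?P<col>\d{2})(?P<row>\d{2})\.tif$"
)


def parse_image_name(name):
    """Split a file name into its documented components."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    return ImageName(
        experiment=match["experiment"],
        experiment_number=int(match["exp_no"]),
        sample_number=int(match["sample"]),
        stain=match["stain"],
        column=int(match["col"]),
        row=int(match["row"]),
    )


def build_image_name(parts):
    """Rebuild the file name from its components, zero-padding the numbers."""
    return (
        f"{parts.experiment}_{parts.experiment_number:03d}"
        f"_{parts.sample_number:03d}_{parts.stain}"
        f"_{parts.column:02d}{parts.row:02d}.tif"
    )


parts = parse_image_name("AtherRat_012_056_mb_0423.tif")
print(parts)
print(build_image_name(parts))  # AtherRat_012_056_mb_0423.tif
```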
43/76
![Page 44: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/44.jpg)
Spreadsheet
Faircloth BC, McCormack JE, Crawford NG, Harvey MG, Brumfield RT, Glenn TC (2011) Data from: Ultraconserved elements anchor thousands of genetic markers for target enrichment spanning multiple evolutionary timescales. Dryad Digital Repository. doi:10.5061/dryad.64dv0tg1
44/76
![Page 45: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/45.jpg)
Relational database
Andrew Sparkes & Amanda Clare. AutoLabDB: a substantial open source database schema to support a high-throughput automated laboratory. Bioinformatics, first published online March 29, 2012. doi:10.1093/bioinformatics/bts140
45/76
![Page 46: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/46.jpg)
Databases
• Intuitive / meaningful field & table names
• Ensure it will support scope of analysis
• Institutional support for data modeling?
• Check for reusable discipline-based standard
o E.g. EUROCarbDB
46/76
![Page 47: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/47.jpg)
Document your file/field names
• What do file/field names mean?
• What does each file/field contain?
• How do files/data relate to each other?
• Is there a limited set of possible values?
Example entry:
Name | Type | Description | Possible values
Stain | Text | Stain used on cell sample | IO = Iodine; EY = Eosin Y; MB = Methylene blue
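Field documentation of this kind can double as a validation rule. The sketch below stores the Stain entry as a small data dictionary and checks records against it; the dictionary layout and the function name `validate_record` are assumptions for illustration, and only the Stain field and its three codes come from the slide.

```python
# Data dictionary with the single example field from the slide.
DATA_DICTIONARY = {
    "Stain": {
        "type": str,
        "description": "Stain used on cell sample",
        "allowed": {"IO": "Iodine", "EY": "Eosin Y", "MB": "Methylene blue"},
    },
}


def validate_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one record (empty if none)."""
    errors = []
    for field, value in record.items():
        spec = dictionary.get(field)
        if spec is None:
            errors.append(f"{field}: undocumented field")
            continue
        if not isinstance(value, spec["type"]):
            errors.append(f"{field}: expected {spec['type'].__name__}")
        elif spec.get("allowed") and value not in spec["allowed"]:
            allowed = sorted(spec["allowed"])
            errors.append(f"{field}: {value!r} is not one of {allowed}")
    return errors


print(validate_record({"Stain": "MB"}))  # []
print(validate_record({"Stain": "XX"}))
```

Keeping the description and allowed values next to the check means the documentation and the validation cannot drift apart.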
47/76
![Page 48: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/48.jpg)
Version control
• Have a plan
• Be consistent
• Document changes between versions (what, who)
48/76
![Page 49: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/49.jpg)
Metadata definition
Metadata is:
“Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource”
NISO (2004). Understanding Metadata. NISO Press
http://www.niso.org/publications/press/UnderstandingMetadata.pdf
49/76
![Page 50: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/50.jpg)
Why use structured metadata?
• Systematic approach to capturing descriptive information
• Supports 3rd party use by:
o Making data findable (if metadata put online)
o Providing context
o Providing a unique identifier
50/76
![Page 51: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/51.jpg)
Metadata considerations
• What standard should you use?
• Are there discipline standards?
• Which is the best fit?
• Where are you depositing your data?
51/76
![Page 52: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/52.jpg)
Metadata – specialized standard: Neuromorpho
• Neuromorpho ID (UID)
• Neuron Name
• Archive (researcher) name
• Species
• Strain of species
• Age range
• Gender
• Weight range
• Developmental stage
• Primary/Secondary/Tertiary brain regions
• Primary/Secondary/Tertiary Cell classes
• Original data format
• Experiment condition
• Experiment protocol
• Staining method
• Slicing Direction/Thickness
• Tissue Shrinkage
• Objective Type
• Magnification
• Reconstruction Method
• Dates of Deposition/Upload
• Associated publications
• Web URL of archives (if available) with any additional information about the reconstruction
52/76
![Page 55: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/55.jpg)
Metadata – specialized standard: GenBank
• Locus
• Definition
• Accession number (UID)
• Version
• Keywords
• Source organism
• Reference(s)
o Authors
o Title
o Journal
o PubMed ID
• Features
o Source
o RBS (ribosome binding site)
o gene
o CDS (protein coding sequence)
• Terminator
• Modification date
55/76
![Page 56: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/56.jpg)
Metadata – general standard
• Dublin Core
o Designed to be generic/flexible
o Usually stored as XML
e.g. <dc:creator>Hanson, Karen L.</dc:creator>
o 15 fields:
Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, Type
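To show what such a record might look like in practice, the sketch below builds a small Dublin Core XML fragment with Python's standard `xml.etree.ElementTree`. The wrapper element name `metadata` and the function name `dublin_core_record` are assumptions; the namespace URI is the Dublin Core elements 1.1 namespace, and the sample values are taken from the title slide.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The 15 Dublin Core elements, in their lowercase XML form.
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def dublin_core_record(fields):
    """Serialise a dict of Dublin Core fields to an XML string."""
    root = ET.Element("metadata")  # wrapper name is an assumption
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name!r}")
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


record = dublin_core_record({
    "title": "Introduction to Data Management",
    "creator": "Hanson, Karen L.",
    "date": "2012-06-04",
})
print(record)
```

A repository will usually prescribe its own wrapper element and required fields, so this fragment is a starting point rather than a deposit-ready record.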
56/76
![Page 57: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/57.jpg)
Minimum Information for Biological and Biomedical Investigations
• Covers data and metadata
• Standards for diverse bioscience communities
• ~35 guidelines so far
• Recommended by Science magazine
Let’s take a look… http://mibbi.sourceforge.net/portal.shtml
57/76
![Page 58: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/58.jpg)
Quality control
• Assign a person to be responsible for ensuring:
o Naming conventions adhered to
o Good data quality
o Access controls in place
o Version controls followed
58/76
![Page 60: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/60.jpg)
Data management
1. Introduction
2. Incentives
3. Standards for description & documentation
4. Storage, archiving and sharing
• Backups
• Storage
• Security
• Archiving / preservation
• Sharing
5. Data management plans
60/76
![Page 61: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/61.jpg)
Backups
• Make a backup plan
• Multiple copies
• Geographically dispersed
61/76
![Page 62: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/62.jpg)
Storage options
• Ask I.T.
o Enterprise server
o IT managed cloud options
o Data warehouse
o Lab Information Management System (LIMS)
o Other systems?
• Proprietary cloud options (in a pinch)
o Check ownership policies
o Pick >1 provider
62/76
![Page 63: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/63.jpg)
Security considerations
• Reasons to be concerned about security
o Ethical
o Commercial
o Privacy (e.g. HIPAA)
• Work with I.T.
• Other things:
o Add passwords
o Lock unused machines
o Sign use agreements
• Publishing/sharing data? May need to de-identify
63/76
![Page 64: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/64.jpg)
Postdoc survey results
When you have finished analyzing/publishing from a dataset, where do you store it for long-term preservation, management, and/or access?
Responses (first series / second series; chart legend: Nationally, NYULMC):
• Institutional repository: 39% / 45%
• Discipline-specific repository: 19% / 14%
• Other: 31% / 30%
• Do not store for long-term preservation: 22% / 20%
64/76
![Page 65: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/65.jpg)
Digital preservation
• Storage ≠ preservation!
• Digital preservation is…
“a set of activities required to make sure digital objects can be located, rendered, used and understood in the future”
http://www.digitalpreservationeurope.eu/what-is-digital-preservation/
• Protects from
o hardware obsolescence
o software obsolescence
o file integrity issues
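File integrity is commonly monitored by recording a checksum for each file and re-checking it later. The sketch below is a minimal version of that idea using Python's standard `hashlib`; the function names and the choice of SHA-256 are assumptions for illustration, and a repository or IT department may well use different tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    folder = Path(folder)
    return {
        path.relative_to(folder).as_posix(): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(folder, manifest):
    """List files that were added, removed, or altered since the manifest."""
    current = build_manifest(folder)
    return sorted(
        name
        for name in set(manifest) | set(current)
        if manifest.get(name) != current.get(name)
    )
```

A typical use would be to save the manifest alongside the archived data, then run `changed_files` against each backup copy on a schedule; any name it returns points to a copy that no longer matches the original.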
65/76
![Page 66: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/66.jpg)
Digital repositories
• For digital preservation, storage and/or sharing
• Types of repositories:
o Institutional
o Discipline specific (GenBank)
o Cross disciplinary (Dryad)
66/76
![Page 67: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/67.jpg)
Data format
• Collection vs dissemination format
• Software export features
• Open formats e.g. XML, CSV, PDF, TIFF
• No open format? Use common proprietary formats e.g. DOC, SPSS
• Unencrypted
• Uncompressed
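Where software offers no export feature, tabular data can often be written to an open format with a few lines of code. The sketch below writes records to CSV with Python's standard `csv` module; the function name `to_csv` and the sample record are hypothetical.

```python
import csv
import io


def to_csv(records, fieldnames):
    """Serialise a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample record, for illustration only.
text = to_csv([{"sample": "056", "stain": "MB"}], ["sample", "stain"])
print(text)
```

Writing the result to disk with an explicit encoding, for example `open(path, "w", encoding="utf-8", newline="")`, keeps the file readable across systems without compression or encryption.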
67/76
![Page 68: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/68.jpg)
Data ownership
• Can’t assume you own data
• Check for:
o Funder policies on data ownership
o Institution policies on data ownership
68/76
![Page 69: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/69.jpg)
Data management
1. Introduction
2. Incentives
3. Standards for description & documentation
4. Storage, archiving and sharing
5. Data Management Plans
69/76
![Page 70: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/70.jpg)
What should be included in the plan?
• Types of data
• Methods of collection
• Standards that will be applied
• Backup and storage procedures
• Plans for archiving / preservation
• Access policies and provisions for secondary use
• Measures to protect privacy or intellectual property
List adapted from NYU Libraries, Data Management Libguide
http://nyu.libguides.com/data_management
70/76
![Page 71: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/71.jpg)
Data management plans
Where to start?
• Purdue’s Self Assessment Questionnaire http://research.hub.purdue.edu/resources/7
• MIT’s Data Management Check List
• NIH Data Sharing
There are good recipes…
…don’t reinvent the wheel!
71/76
![Page 73: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/73.jpg)
Conclusions
• Plan data management before starting research
• Documentation, documentation, documentation
• Can’t ignore the march toward research data sharing… get ready!
73/76
![Page 74: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/74.jpg)
Resources
http://nyuhsl.libguides.com/data_management
74/76
![Page 76: Introduction to Data Management](https://reader035.fdocuments.us/reader035/viewer/2022071600/613d13a4736caf36b7590978/html5/thumbnails/76.jpg)
http://nyuhsl.libguides.com/data_management
Thank you... Questions?
http://hsl.med.nyu.edu 76/76