Douglas Burns, Research Data and GIS Librarian
University of North Texas Libraries, Willis Library, Room 155
Pamela Andrews, Repository Librarian
University of North Texas Libraries, Willis Library, Room 356
RESEARCH DATA MANAGEMENT AND THE DATA LIFECYCLE
AGENDA
• What are data?
• Understanding data management / lifecycle
• Why is data management important
• Data management plans
• Data management planning tools
• Organizing data & file management
• Formats & documentation
• Storage & security
• Sharing, archiving & metadata
• What UNT has to offer
Based on the Research Data Management and Sharing course (Coursera, 2017)
(Graphic labels: Data Planning, Organizing, UNT)
What are data?
Facts and statistics collected together for reference or analysis
What are data?
“Most of the useful data in the world, from economic data to news content to geographic
information, lives somewhere on the internet…”
From DataCamp e-mail advertising their “Working with Web Data in R” course (received 10/4/2017)
Library Data Sources: http://www.library.unt.edu/
A Not Comprehensive List of Data Sources
(In collaboration with other ASEAN nations)
(Uses Socrata, a licensed platform)
(Commercial subscription)
Even Bigfoot Has Data
Source: bfro.net
Let’s Evaluate a Bigfoot Datapoint…
Source: http://www.bfro.net/gdb/show_report.asp?id=28568
OBSERVED: After a late concert at the square in Canton, Illinois, my friend was bicycling home and turned right into the parking lot, that leads past a shed, fence, and dumpster.
Hearing a shuffling bang noise near the dumpster, she slowed her bike and began looking. Behind the shed was more than a noise. A shadowy sasquatch shape. Its puffy balloon feet (as she described it), held up a stocky, not so tall frame (6 ft.).
Startled, it turned and walked into the bushes, moving branches to get into the woods. This was a thinly wooded area, and on the other side, a corn field.
So freaked out she said, "what is that?", and sped right home and told me. I could tell she was worried, and by her description, this did happen.…But that noise, was it inside the dumpster?
So I slowly snuck around the corner. Only one cover of the four covers was open on the dumpster, so when I was ready, I shown the light through it. A small [raccoon] was trapped. Unable to climb out of an almost empty dumpster. It was out of breath, and sad looking, so I put small cans and an old sign in to give it a ladder out.
I went back later that evening and the [raccoon] was out and to check the shadows around there. We think it heard the [raccoon] also, and may have ben trying to find a way inside. We don’t know if it was hungry, or just trying to help the [raccoon] out.
Crowdsourcing?
North Korea (DPRK) Mission, 1950s topos
Digital Humanitarians
Haiti Earthquake Crisis Response
Belief: To Trust or Not to Trust
Source: http://theoatmeal.com/comics/believe
Be a healthy skeptic.
A Not Comprehensive List of Data Programs
QUESTIONS?
Understanding Data Management
PROCESS WHEREBY DATA IS CONTROLLED TO ACHIEVE A DESIRED GOAL
Data Management: Data Lifecycle
Creating Data
• design research
• plan data management (formats, storage)
• plan consent for sharing
• locate existing data
• collect data (experiment, observe, measure, simulate)
• capture and create metadata
Processing Data
• enter data, digitize, transcribe, translate
• check, validate, clean data
• anonymize data where necessary
• describe data
• manage and store data
Analyzing Data
• interpret data
• derive data
• produce research outputs
• author publications
• prepare data for preservation
Data Management: Data Lifecycle
Preserving Data
• migrate data to best format
• migrate data to suitable medium
• back-up and store data
• create metadata and documentation
• archive data
Giving Access to Data
• distribute data
• share data
• control access
• establish copyright
• promote data
Re-using Data
• follow-up research
• new research
• undertake research reviews
• scrutinize findings
• teach and learn
QUESTIONS?
Why Data Management Is Important
• Think of (potential) stakeholders
• Evolving funding requirements or policies
• Increases transparency
• Reproducibility enhances research quality / authority
• Good practice
• Documentation reduces the “What did we do to get that?”
• Succession planning…
• Facilitates knowledge transfer
• Boosts long-term efficiency
• Encourages accountability
• Supports collaboration
  • Ex: https://www.researchgate.net/
In other words: you don’t want to get caught with your pants down!
QUESTIONS?
Data Management Plan(s)
Why we don’t have DMPs:
• Data can be intimidating…
• But we have lots of other plans… (not another one!)
• Takes time – who has that!?*
• “It’ll work out… somehow, right?”
• “We plan events… but react to data”…?
*time is a function of priorities
Data Management Plan(s)
What it is:
• A document describing planned steps to manage data
• Usually a part of the larger research agenda
• A “contract” for stakeholders
• Protection for when things go awry, because they will
• A way to simplify your research
What it is not:
• Rocket science
• Nuclear physics
• Unsolicited relationship advice
• A laughing matter
Stupid jokes aside, your plan helps set the tone for future success!
Example 2: The Bad
Source: https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year
Additionally:
• ~50% of time spent looking for data
• ~60% of time spent cleaning/organizing data
• ~75% potential inefficiency because of poor data structures
Estimated GDP cost due to bad data? $3 trillion
Example 3: The Ugly
“I’m gonna do the stuff to the things because reasons.”
Questions To Ask About Your Project
• What types of data will be produced? Will they be reproducible? What would happen if they were lost or became unusable later?
• How much data will there be and at what growth rate? How often will the data change?
• Who will use your data now, and later?
• Are there tools or software needed to create/process/visualize the data?
• Are there regulations, copyright, or other licensing concerns related to sharing the data?
• Do the data need to be restricted or embargoed for intellectual property reasons?
• Are there any reasons to not allow re-use?
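The "how much data, and at what growth rate" question can be turned into a rough storage estimate. The sketch below is only an illustration: the function name `projected_storage_gb`, the compound annual growth model, and the default of three copies are assumptions made here, not figures from the presentation.

```python
def projected_storage_gb(initial_gb: float, annual_growth: float,
                         years: int, copies: int = 3) -> float:
    """Estimate total storage needed after `years` of compound growth.

    Assumes the data volume grows by the same fraction every year and
    that every copy (working copy plus backups) is the same size.
    """
    return initial_gb * (1 + annual_growth) ** years * copies


# Hypothetical project: 100 GB today, growing 50% per year, kept in 3 copies.
estimate = projected_storage_gb(initial_gb=100, annual_growth=0.5, years=2)
print(estimate)  # 675.0
```

Even a back-of-the-envelope number like this helps when a plan has to name a storage location or a repository quota.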
QUESTIONS?
Data Management Planning Tool
Source: https://dmptool.org/ and http://www.library.unt.edu/datamanagement/plans
QUESTIONS?
Organizing Data & File Management
STAYING ORGANIZED SAVES TIME
& MAKES LIFE EASIER
• README.txt
• Code book
• Standardized folder structure or file name scheme
  • 20170928 or 09282017?
• Versioning
  • Filename_v2.pdf
  • Ex: GitHub
• Files and formatting
  • *.csv, *.tsv or *.xlsx?
  • Numeric or text
  • *.docx or PDF?
Organizing Data & File Management
Railroad_RailroadCommisionofTexas_September212017
rrrd_RCT_2017
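The "20170928 or 09282017?" question has a practical answer that a short sketch can show: year-first dates sort chronologically as plain text, while month-first dates do not. The helper `dated_name` and its naming scheme (stem, date, two-digit version) are hypothetical, chosen here only to illustrate one consistent convention.

```python
from datetime import date


def dated_name(stem: str, day: date, version: int, ext: str) -> str:
    """Build a file name such as 'rrrd_RCT_20170928_v02.csv' (hypothetical scheme)."""
    return f"{stem}_{day:%Y%m%d}_v{version:02d}.{ext}"


# Three made-up collection dates, deliberately out of order.
days = [date(2017, 9, 28), date(2016, 12, 1), date(2017, 1, 15)]

# Sort the dates as text, the way a file browser sorts file names.
ymd = sorted(f"{d:%Y%m%d}" for d in days)
mdy = sorted(f"{d:%m%d%Y}" for d in days)

print(ymd)  # ['20161201', '20170115', '20170928'] (chronological)
print(mdy)  # ['01152017', '09282017', '12012016'] (2016 file sorts last)
print(dated_name("rrrd_RCT", date(2017, 9, 28), 2, "csv"))
```

Zero-padding the version number (v02 rather than v2) keeps v10 from sorting ahead of v2 for the same reason.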
Organizing Data & File Management
COLLECTION METHODS:
• Each project will vary, but…
• Quantitative vs. Qualitative
• One time? Longitudinal?
• Eventual delivery method
  • Online different than print
• Ease of use/access
• How standardized?
COLLECTION PLATFORMS:
• Qualtrics (UNT has account)
• iForm
• OpenDataKit
• Zoho Forms
• CKAN
• Google Forms*
• SurveyGizmo
• Hand collect / data entry
Note: this is not an endorsement of a particular software or methodology.
Something to think about: “…several issues tend to get conflated into one argument – open-source vs. closed-source, free vs. paid-for, restrictive vs flexible licensing, supported vs. unsupported, code quality…”
Note: while “open” may be preferred, the point here is that you should pick the best tool to get the best results. Remember: there is no such thing as a free lunch. This op-ed piece discusses that reality and is the source of the quote above.
QUESTIONS?
Formats & Documentation
• Accessible format characteristics:
  • Non-proprietary
  • Open, documented standard
  • Common usage by your community
  • Standard representation
    • Unicode / ASCII
  • Unencrypted, uncompressed
• Licensing / copyright?
  • CC-BY or other
  • Fair use
• Sustainable data formats:
  • PDF
  • ASCII
  • TIF or JPEG2000
  • XML or RDF
  • MPEG-4
Source: http://inside.mines.edu/RSS-sustainable-data-formats
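As one illustration of a non-proprietary, plain-text format with a standard character encoding, the sketch below writes a small table to CSV as UTF-8 and reads it back using only Python's standard library. The sample rows, column names, and file name `counts.csv` are made up for this example.

```python
import csv
import tempfile
from pathlib import Path

# Made-up sample records; the accented names exercise the Unicode handling.
rows = [
    {"site_id": "A01", "observer": "Zoë", "count": "12"},
    {"site_id": "B07", "observer": "José", "count": "3"},
]


def write_csv(path: Path, records: list[dict]) -> None:
    """Write records as UTF-8 CSV with a header row."""
    with path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)


def read_csv(path: Path) -> list[dict]:
    """Read a UTF-8 CSV back into a list of dictionaries."""
    with path.open("r", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "counts.csv"
    write_csv(target, rows)
    round_trip = read_csv(target)

print(round_trip == rows)  # True: nothing was lost in the round trip
```

Note that CSV stores every value as text, so the "numeric or text" decision still has to be recorded somewhere, for example in the code book.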
QUESTIONS?
Storage & Security
While convenient… a special note on non-UNT cloud storage options: “…UNT legal agreements enable [UNT] to hold the provider accountable instead of you.”
Sources: https://itservices.cas.unt.edu/services/file/non-unt-storage/understanding-non-unt-storage-cloud and https://itservices.cas.unt.edu/services/file/non-unt-storage/understanding-cloud-storage-information-roles
What does this mean?
The State of Texas defines an information owner as a “person with statutory or operational authority for specified information…”
If data is hacked or lost, YOU are responsible!
Storage & Security
• 3-2-1 principle:
  • 3 different copies of your data
  • on at least 2 different media
  • with 1 at a different location
• External repositories can help
• Encryption? Locked cabinet?
• Security:
  • Keep data safe from corruption
  • Anti-virus software
  • Be mindful of phishing attempts
  • Control who has access
    • Active Directory for log-in
    • Written policy / guidelines
• Remember: even an un-networked computer is still vulnerable
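The 3-2-1 principle and the advice to keep data safe from corruption can both be checked mechanically. The sketch below is a minimal illustration under stated assumptions: the `Copy` record, the function names, and the media and location labels are hypothetical, and SHA-256 is simply one common checksum choice, not something the presentation prescribes.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    medium: str    # e.g. "internal disk", "external drive", "repository"
    location: str  # e.g. "office", "off-site"
    sha256: str    # checksum recorded when the copy was made


def checksum(payload: bytes) -> str:
    """Return the SHA-256 hex digest of the data."""
    return hashlib.sha256(payload).hexdigest()


def meets_3_2_1(copies: list[Copy]) -> bool:
    """At least 3 copies, on at least 2 media, in at least 2 locations."""
    media = {c.medium for c in copies}
    locations = {c.location for c in copies}
    return len(copies) >= 3 and len(media) >= 2 and len(locations) >= 2


def copies_agree(copies: list[Copy]) -> bool:
    """True when every copy carries the same checksum (no silent corruption)."""
    return len({c.sha256 for c in copies}) == 1


data = b"site_id,count\nA01,12\n"  # made-up file contents
digest = checksum(data)
plan = [
    Copy("internal disk", "office", digest),
    Copy("external drive", "office", digest),
    Copy("repository", "off-site", digest),
]

print(meets_3_2_1(plan), copies_agree(plan))  # True True
```

Re-computing the checksum of each copy on a schedule and comparing it with the recorded value is what turns a backup into a verified backup.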
Storage & Security
BACKUP! BACKUP! BACKUP!
BACKUP USING MULTIPLE LOCATIONS
NOT HELPFUL!
Storage & Security
ON THE UPSIDE, THESE FORMATS AREN’T AS LIKELY TO BE HACKED…
Privacy Implications
QUESTIONS?
Sharing, Archiving & Metadata
Benefits:
• Reinforces scientific inquiry
• Verification and replication
• New research / methods
• Encourages diversity
• Provides teaching resources
• Reduces duplication
• Protects against fraud
• Enhances visibility
• Preserves for future use
• Helps others do better research
“A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes.” (Aarts, 2015)
Sharing, Archiving & Metadata
Challenges:
• Requires time & money
• Perceived risks from loss of control
• May be confidential
  • Health data often suppressed
• Unclear ownership / Intellectual Property
• Lack of incentives
“Sharing research data is… a conundrum.” –Christine Borgman
Metadata, Or Data About Data
List of metadata standards
http://iii.library.unt.edu/record=b5522057~S12
Additional reading: https://support.google.com/webmasters/answer/79812?hl=en and http://www.mequoda.com/articles/subscription_websites/understanding-the-role-of-metadata-in-google-visibility-5-best-practices/
NOTE: What is Google using? What about AI?
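To make "data about data" concrete, the sketch below builds a small descriptive record and checks it for gaps before saving it as JSON. It borrows a few element names from Dublin Core (title, creator, date, format, rights) as an assumption; the record itself, the `REQUIRED` list, and the `missing_fields` helper are hypothetical and not taken from the presentation or from any UNT system.

```python
import json

# Elements this sketch treats as mandatory; a real project would follow
# whatever its chosen metadata standard or repository requires.
REQUIRED = ("title", "creator", "date", "format", "rights")

# A made-up record describing a hypothetical dataset.
record = {
    "title": "Example railroad dataset (hypothetical)",
    "creator": ["Example, Researcher"],
    "date": "2017-09-28",
    "format": "text/csv",
    "description": "Illustrative record only; not a real deposit.",
}


def missing_fields(entry: dict, required: tuple = REQUIRED) -> list[str]:
    """List required elements that are absent or empty."""
    return [name for name in required if not entry.get(name)]


serialized = json.dumps(record, indent=2, ensure_ascii=False)

print(missing_fields(record))  # ['rights']
print(serialized)
```

Here the check flags that no rights statement was supplied, which is exactly the kind of omission that is easy to fix at deposit time and hard to fix years later.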
Metadata, Or Data About Data
Screenshots from Portal to Texas History editor
Metadata, Or Data About Data
Screenshots from ArcMap
QUESTIONS?
What does UNT Libraries have for YOU?
UNT Scholarly Works: UNT’s open-access, institutional repository for research, creative, and scholarly output from UNT community members
UNT Data Repository: A central archive for the research data of our UNT scholars. Can be linked to items within UNT Scholarly Works.
Boilerplate text for DMPs: Boilerplate text to insert within a DMP if using the UNT Data Repository
QUESTIONS?
Additional Resources
Additional Resources: Americans’ Views on Open Government Data (benchmark: 2014-2015)
Questions: Longer List
• What types of data will be produced? Will they be reproducible? What would happen if they were lost or became unusable later?
• How much data will there be and at what growth rate? How often will the data change?
• Who will use your data now, and later?
• Who in your research group controls the data (PI, student, lab, Mines, funder)?
• How long will the data be active?
• What directory and file naming convention will be used?
• What project and data identifiers will be assigned?
• What file formats are to be used? Are they long-lived?
• What is your data storage and backup strategy?
• When will you publish the data (research) and where?
• Who might be interested in your data in the future? Who will you share it with?
• Who in your research group will be responsible for data management and archiving?
• Have you identified a repository or archive in which to deposit your data?
• Is there an ontology or other community standard for data sharing/integration?
• How will you prepare the data (if necessary) for archiving?
• Is there good project and data documentation?
• How long should the data be retained or archived (e.g. 3-5 years, 10-20 years, permanently)?
• Are there tools or software needed to create/process/visualize the data?
• Are there special privacy or security requirements (e.g. personal data, high-security data)?
• Are there sharing requirements (e.g. funder data sharing policy)?
• Are there other funder requirements (e.g. data management plan in proposal)?
• Are there regulations, copyright, or other licensing concerns related to sharing the data?
• Do the data need to be restricted or embargoed for intellectual property reasons?
• Are there any reasons to not allow re-use?