2018 Tarboton HydroShare Data Management...

Page 2: 2018 Tarboton HydroShare Data Management Tutorialdata.mekongwater.org/static/files/2019-Mekong... · • Reproducibility • Software installation and configuration • Platform dependencies,

Motivation: Water/hydrology research is a team sport

• requires integration of information from multiple sources
• is data and computationally intensive
• requires collaboration and working as a team/community

Data

Analysis

Models

• Advancing Hydrologic Understanding

CyberInfrastructure Challenges

• The data deluge
  • Large datasets, data heterogeneity, inadequate metadata
• Data organization and model input preparation
• Reproducibility
• Software installation and configuration
  • Platform dependencies, library dependencies, licensing
• Computational resources
  • Memory, disk and processing

Outline

• Data Management 101 (Many slides from Jeff Horsburgh Research Scholar’s presentation)
• HydroShare Overview
• HydroShare Hands on

The Steven Hall Story

1. Steven collected his data in the field and transformed it into a sharable format
2. With a little help, Steven deposited his dataset in the online HydroShare repository
3. Steven verified his data and metadata were correct but kept the data private
4. Steven submitted his paper for publication and responded to reviews
5. Steven published his data in HydroShare and received a DOI
6. Steven published his paper and cited published data in HydroShare

From Jeff Horsburgh

Data Management 101

• How are you managing your data?
• There are simple guidelines to improve data management
• Benefits
  – Improved data organization – facilitates analysis
  – Improved reproducibility
  – Improved capacity for data re-use

Borer, E.T., E.W. Seabloom, M.B. Jones, and M. Schildhauer (2009). Some simple guidelines for effective data management, ESA Bulletin, 90(2):205-214, http://dx.doi.org/10.1890/0012-9623-90.2.205

From Jeff Horsburgh

1. Don’t Mess with the Raw Data

• Always store uncorrected data with all of its “bumps and warts”
• Do not make any corrections to this
  – You could change something that was actually correct
  – You could make mistakes while correcting other mistakes
• Script QA/QC procedures and write results to a new file/copy of the data

From Jeff Horsburgh
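
The guideline above, scripting QA/QC and writing the results to a new file, might look roughly like the following Python sketch. It is not taken from the tutorial: the function name `run_qc`, the column name `value`, and the simple range check are assumptions chosen for illustration.

```python
import csv
from pathlib import Path


def run_qc(raw_path, out_path, low, high, column="value"):
    """Copy rows whose `column` lies in [low, high] from raw_path to out_path.

    The raw file is only ever opened for reading, so it keeps all of its
    "bumps and warts". Returns (rows kept, rows dropped).
    """
    raw_path, out_path = Path(raw_path), Path(out_path)
    # Guard against accidentally pointing the output at the raw file.
    if raw_path.resolve() == out_path.resolve():
        raise ValueError("refusing to overwrite the raw data file")

    kept = dropped = 0
    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if low <= float(row[column]) <= high:
                writer.writerow(row)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

A call such as `run_qc("raw.csv", "qc_level1.csv", 0.0, 14.0)` (hypothetical file names and limits) leaves `raw.csv` untouched and records the corrected series separately, so the correction can be re-run or revised later.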

An Example

From Jeff Horsburgh

An Example

Removal of a calibration shift

From Jeff Horsburgh

An Example

Removal of anomalous, out of range values

From Jeff Horsburgh

An Example

Removal of “bad data” – sensor malfunction

From Jeff Horsburgh
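
The three corrections named on the example slides (calibration shift, out of range values, sensor malfunction) could be scripted along these lines. This is a minimal sketch with made-up numbers, not the procedure used to produce the original figures; the function names, index ranges and limits are all hypothetical. Each step returns a new list, so the raw series is never modified.

```python
def remove_shift(values, start, end, offset):
    """Subtract a constant calibration offset from values[start:end]."""
    return [v - offset if start <= i < end else v for i, v in enumerate(values)]


def drop_out_of_range(values, low, high):
    """Replace values outside [low, high] with None (treated as missing)."""
    return [v if v is not None and low <= v <= high else None for v in values]


def drop_interval(values, start, end):
    """Replace values[start:end] with None, e.g. a known sensor malfunction."""
    return [None if start <= i < end else v for i, v in enumerate(values)]


# Made-up raw series: a +2.0 shift at indices 2-3, a -9999 error code at
# index 5, and a stuck sensor reporting 6.0 at indices 7-8.
raw = [7.0, 7.25, 9.5, 9.25, 7.5, -9999.0, 7.25, 6.0, 6.0, 7.5]

shift_removed = remove_shift(raw, 2, 4, 2.0)
range_checked = drop_out_of_range(shift_removed, 5.0, 9.0)
cleaned = drop_interval(range_checked, 7, 9)
```

Keeping each correction as its own small function means the script itself documents what was changed and in what order.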

2. Use Descriptive File Names

• Use only plain ASCII characters
• Brief, but descriptive of content
• Generally – avoid spaces in file names
• Include a “readme” file when using many files in a directory

From Jeff Horsburgh
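
The file naming guidelines above can be checked automatically. The sketch below is one possible check, not something from the tutorial; the function name `check_file_name` and the 64-character limit used to stand in for "brief" are assumptions.

```python
def check_file_name(name, max_length=64):
    """Return a list of problems found in a file name (empty list means OK)."""
    problems = []
    if not name.isascii():
        problems.append("contains non-ASCII characters")
    if " " in name:
        problems.append("contains spaces")
    # "Brief" is a judgment call; max_length is an arbitrary illustrative limit.
    if len(name) > max_length:
        problems.append(f"longer than {max_length} characters")
    return problems
```

For example, `check_file_name("final data v2.xlsx")` reports the spaces, while a name such as `LoganRiver_Streamflow_2018.csv` (a hypothetical file) passes all three checks. Whether a name is actually descriptive of its content still needs a human reader.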

This might not be the best system…

How could we make this better?

From Jeff Horsburgh

Streamflow Data from USGS

From Jeff Horsburgh

4. Do Not Mix Data Types in Table Columns

• Numeric, strings, date/time, boolean
• Different software packages will handle mixed data types inconsistently
• Can be more difficult to detect errors in the data
• Can cause erroneous results

From Jeff Horsburgh
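
One way to find columns that mix data types is to classify every cell and report columns with more than one type. The sketch below assumes CSV input and a deliberately simple classification (boolean, numeric, ISO 8601 date/time, string); the function names and the rules are illustrative assumptions, not part of the tutorial.

```python
import csv
import io
from datetime import datetime


def classify(text):
    """Classify one cell as 'empty', 'boolean', 'numeric', 'datetime' or 'string'."""
    text = text.strip()
    if text == "":
        return "empty"
    if text.lower() in ("true", "false"):
        return "boolean"
    try:
        float(text)
        return "numeric"
    except ValueError:
        pass
    try:
        datetime.fromisoformat(text)  # accepts e.g. 2018-06-01 or 2018-06-01 12:00
        return "datetime"
    except ValueError:
        return "string"


def column_types(csv_text):
    """Map each column name to the set of cell types found in it."""
    reader = csv.DictReader(io.StringIO(csv_text))
    found = {name: set() for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            found[name].add(classify(row[name] or ""))
    return found


def mixed_columns(csv_text):
    """Return the names of columns holding more than one non-empty type."""
    return sorted(
        name
        for name, kinds in column_types(csv_text).items()
        if len(kinds - {"empty"}) > 1
    )
```

A flow column containing both `12.5` and `n/a` would be reported as mixed, which is the kind of entry that different software packages tend to handle inconsistently.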

5. Archive Data in Non-Proprietary Data Formats

• Microsoft Excel is widely available and used now, but what about in 10 years? 20 years?
• How many other software programs can open your data?
• Will your data disappear if the file format/software become obsolete?

From Jeff Horsburgh

6. Preservation/Backup Media

How are you preserving your data now?

• Does your office look like this?
• What are the potential problems?
• What are some potential solutions?

From Jeff Horsburgh
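
One possible answer to the preservation question is to keep a second copy and confirm that it matches the original. The sketch below uses only the Python standard library; the function names are hypothetical, and a real backup strategy would also need off-site copies and periodic re-verification.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, backup_dir):
    """Copy source into backup_dir and confirm the copy matches byte for byte."""
    source, backup_dir = Path(source), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} does not match the original")
    return target
```

Storing the digest alongside the backup makes it possible to detect silent corruption later by recomputing and comparing.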

Data Loss

• Natural disaster
• Facilities infrastructure failure
• Storage failure
• Server hardware/software failure
• Application software failure
• External dependencies
• Format obsolescence
• Legal encumbrance
• Human error
• Malicious attack by human or automated agents
• Loss of staffing competencies
• Loss of institutional commitment
• Loss of financial stability
• Changes in user expectations and requirements

CC image by SharynMorrow on Flickr
CC image by momboleum on Flickr

Slide courtesy DataONE. From Jeff Horsburgh

To the Cloud!

• Convenience
• Accessibility anywhere
• Cross platform
• Enhanced sharing
• Low cost

• But…
  • Privacy???????
  • Delay (slow or non-existent internet)
  • Storage, but not much else
  • File formats and semantics still matter
  • No community of similar experts

From Jeff Horsburgh

Why store your model on HydroShare (where your data is also located)?

• Model creates reproducible results
• Models/code can be shared by simply giving permission (no need to copy)
• Models can be re-executed at any time

From Jeff Horsburgh

Reproducible Visualization in Python

From Jeff Horsburgh

8. Maintain Metadata (Information about Data)

Borer et al.: “Do not underestimate your ability to forget details about a study!”

– WHO created the data?
– WHAT is the content of the data?
– WHEN were the data created?
– WHERE is it geographically?
– WHY were the data developed?
– HOW were the data developed?

From Jeff Horsburgh
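
The six questions above can be recorded in a small machine-readable file stored next to the data. The sketch below writes a JSON "sidecar" file and refuses to do so if any question is unanswered. The file naming scheme (`<data file>.metadata.json`) and the function name are assumptions for illustration; this is not a HydroShare or community metadata standard.

```python
import json
from pathlib import Path

REQUIRED = ("who", "what", "when", "where", "why", "how")


def write_metadata(data_path, **answers):
    """Write <data_path>.metadata.json with answers to the six questions.

    Raises ValueError if any of who/what/when/where/why/how is missing or blank.
    """
    missing = [key for key in REQUIRED if not str(answers.get(key) or "").strip()]
    if missing:
        raise ValueError("missing metadata: " + ", ".join(missing))
    sidecar = Path(str(data_path) + ".metadata.json")
    record = {key: answers[key] for key in REQUIRED}
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

Filling this in at the time of data collection is much easier than reconstructing the details months later.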

Sharing Data: The Golden Rule

• When you provide data to someone else, what types of information would you want to include with the data?
• When you receive a dataset from an external source, what types of details do you want to know about the data?

From Jeff Horsburgh

Sharing Data

• Providing data:
  – Why were the data created?
  – What limitations do the data have?
  – What do the data mean?
  – How should the data be cited if they are re-used in a new study?
• Receiving data:
  – What are the data gaps?
  – What processes were used for creating the data?
  – Are there any fees associated with the data?
  – In what scale were the data created?
  – What do the values in the tables mean?
  – What software do I need in order to read the data?
  – What projection are the data in?
  – Can I give these data to someone else?

From Jeff Horsburgh

Necessary Meta/data Structure

The degree of metadata format and structure necessary for different levels of projected secondary data utilization. (adapted from Michener et al., 1997).

From Jeff Horsburgh

Summary

1. Don’t mess with the raw data
2. Use descriptive file names
3. Use descriptive file headers
4. Do not mix data types in table columns
5. Archive data in non-proprietary data formats
6. Consider media
7. Ensure reproducibility
8. Maintain metadata

From Jeff Horsburgh

Data and models used by hydrologists are diverse…

• Time series
• Geographic rasters
• Geographic features
• Multidimensional space/time
• Model programs
• Model instances
• …

[Figure: multidimensional space/time data cube with axes X, Y and Time]

http://www.unidata.ucar.edu

http://www.usgs.gov

http://www.esri.com

From Jeff Horsburgh

HydroShare can hold data in a wide variety of formats, and data in any format as “generic”

How do people share other content now

• YouTube
• Facebook
• Instagram
• Dropbox
• Google Drive
• ArcGIS Online
• Hydrologic data?

HydroShare is a platform for sharing Hydrologic Resources and Collaborating

DropBox-ish Functionality (dropbox.com)
• File Storage

Value Added Functionality
• Meta Data Descriptions
• Data Access API
• Web Apps
• Social Functions
• DOI Data Publication

The goal of HydroShare is to advance hydrologic science by enabling the scientific community to more easily and freely share products resulting from their research - not just the scientific publication summarizing a study, but also the data and models used to create the scientific publication.

Collaborative data sharing

Add content to HydroShare to share with your colleagues or formally publish to document result reproducibility

Resources (data and models) in HydroShare are objects of collaboration (social objects)

For each resource you can
- Manage who has access
  - To edit
  - To view
- Comment or rate
- Get unique identifier
- Describe with metadata
- Organize into collections
- Formally publish
- Version
- Open with compatible web app

Resources formally published receive a citable digital object identifier (DOI) and are made immutable to changes

...

Formal data publication

Automatic and natural metadata gathering eases some of the pain of metadata entry

For geographic rasters, WGS 84 coverage information is automatically harvested from the GeoTIFF coordinate system information

For multidimensional netCDF data with CF convention metadata, the HydroShare metadata can be fully and automatically completed
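
To give a feel for what harvesting coverage from georeferencing information involves, the sketch below computes a bounding box from a raster's origin, pixel size and dimensions. It is not HydroShare's implementation. It assumes a north-up raster that is already in WGS 84 degrees (no rotation, no reprojection), and the function name and the dictionary keys are illustrative choices.

```python
def wgs84_box(origin_lon, origin_lat, pixel_width, pixel_height, n_cols, n_rows):
    """Bounding box of a north-up raster whose upper-left corner is the origin.

    All sizes are in decimal degrees; pixel_width and pixel_height are positive.
    """
    if min(pixel_width, pixel_height) <= 0 or min(n_cols, n_rows) <= 0:
        raise ValueError("pixel size and raster dimensions must be positive")

    box = {
        "northlimit": origin_lat,
        "southlimit": origin_lat - n_rows * pixel_height,
        "westlimit": origin_lon,
        "eastlimit": origin_lon + n_cols * pixel_width,
    }
    # Sanity checks: the box must lie within valid WGS 84 latitude/longitude.
    if not (-90 <= box["southlimit"] and box["northlimit"] <= 90):
        raise ValueError("latitude outside the WGS 84 range")
    if not (-180 <= box["westlimit"] and box["eastlimit"] <= 180):
        raise ValueError("longitude outside the WGS 84 range")
    return box
```

Because these values are already stored in the file, the coverage metadata can be filled in without asking the user to type anything, which is the point the slide makes.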

Summary

1. A new, web-based system for advancing model and data sharing
2. Access multiple types of hydrologic data using standards compliant data formats and interfaces
3. Flexible discovery functionality
4. Model sharing and execution
5. Facilitate and ease access to use of high performance computing
6. Social media and collaboration functionality
7. Links to other data and modeling systems
8. Enable more rapid advances in hydrologic understanding through collaborative data sharing, analysis and modeling
9. Much of the functionality has applicability to other geosciences beyond hydrology

Thanks to the Mekong HydroShare team!

http://data.mekongwater.org