NCAR Plan for 2019 Sept 12 Sciences (iCAS) Symposium ... DLB.pdf · limited file metadata (owners,...

NCAR Plan for Science at Scale ...and some digressions J-F de La Beaujardière, PhD Director, NCAR/CISL Information Systems Division [email protected] https://orcid.org/0000-0002-1001-9210 International Computing for the Atmospheric Sciences (iCAS) Symposium 2019 Sept 12

Transcript of NCAR Plan for Science at Scale

Page 1:

NCAR Plan for Science at Scale

...and some digressions

J-F de La Beaujardière, PhD

Director, NCAR/CISL Information Systems Division

[email protected]

https://orcid.org/0000-0002-1001-9210

International Computing for the Atmospheric Sciences (iCAS) Symposium

2019 Sept 12

Page 2:

Jeff de La Beaujardiere <jeffdlb@ucar.edu>

My Focus: Data

Numerical Models

Observing Systems

Big Data

Now what?

Page 3:

The Big Data Problem

Huge Model Outputs

Satellite Imagery

In situ sensors

Manual sampling

Billions of files

Many formats
• NetCDF3
• NetCDF4
• GRIB
• CSV
• XLS
• TXT
• GeoTIFF
• etc

Page 4:

Don't Move Huge Datasets – Move the Computing

Huge Data

Subset #1

Subset #2

User #1 Computer

User #2 Computer

Internet data distribution

Huge Data

Shared Computing

User #2 Code

User #1 Code

Shared Code
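A rough back-of-the-envelope sketch of why moving the computing can beat moving the data. The 500 TB size is the CESM LENS figure quoted later in this talk; the 1 Gbit/s sustained link speed and the function name are illustrative assumptions, not measurements.

```python
def transfer_days(size_tb: float, link_gbit_per_s: float) -> float:
    """Days needed to move `size_tb` decimal terabytes over a link that
    sustains `link_gbit_per_s` gigabits per second (no protocol overhead)."""
    bits = size_tb * 1e12 * 8                 # terabytes -> bits
    seconds = bits / (link_gbit_per_s * 1e9)  # bits / (bits per second)
    return seconds / 86_400                   # seconds -> days


# Assumed scenario: one user downloads 500 TB over a 1 Gbit/s link.
days_needed = transfer_days(500, 1.0)
print(f"{days_needed:.1f} days")
```

Under these assumptions a single full download takes well over a month per user, while a user's analysis code is typically a few kilobytes.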

Page 5:

Current Typical Data Organization

Data Provider A

Dataset #A1
• Data URL
• README file
• Folder hierarchy
• Filename convention
• Standard format (maybe)
• CF conventions, ISO metadata, OPeNDAP, OGC WxS (maybe)

Data Provider B

Dataset #B1
• Data URL
• README file
• Folder hierarchy
• Filename convention
• Standard format (maybe)
• CF conventions, ISO metadata, OPeNDAP, OGC WxS (maybe)

Domain-Specific Catalog offering text-based discovery

Problems:

• Tedious plumbing code for individual datasets and providers

• Difficult to do science or make decisions based on multiple datasets

Page 6:

NCAR Plan for Science at Scale
• Draft document (v0.5.0) attempting to address some of these challenges
• Developed at request of CISL Director & NSF Program Managers
• Proposes enhancements to NCAR infrastructure and activities in support of science by NCAR and external communities
• Also continues/improves existing CISL/ISD data management activities
• Vision, Goals, and Objectives - specific and achievable
• Thanks to numerous people for comments, including:

Anke Kamrath/CISL
Eric Nienhouse/ISD
Steven Worley/ISD
Douglas Schuster/ISD
Seth McGinnis/ISD
Sophie Hou/ISD
Brian Bonnlander/ISD
Irfan Elahi/HSS
Dave Hart/USS
John Clyne/TDD
Kevin Paul/TDD
Elizabeth Chapin/CISL
J-F Lamarque/CGD
Matthew Long/CGD
Joseph Hamman/CGD
Cindy Bruyère/MMM
Caspar Amman/RAL
Tyler McCandless/RAL
Tor Mohling/RAL
Mike Daniels/EOL
Greg Stossmeier/EOL
Ethan Davis/Unidata
Matt Mayernik/NCAR DSET
Subashree Mishra/NSF
Sarah Ruth/NSF

https://bit.ly/2SjnXFL

Page 7:

Science at Scale

The ability to perform scientific analysis on "Big Data" without being constrained by storage capacity, processing power, network bandwidth, unfamiliar data formats, or insufficient software tools.


Page 8:

Goals of the Plan

Data Discovery and Access: Users are able to find NCAR-hosted data at a fine level of granularity; standardized formats and services are available.

Data Analytics: Users both at NCAR and externally can compute directly on large-volume data, using either prebuilt tools or their own code.

Data Management: NCAR scientists can readily comply with requirements for Open Data; CISL can streamline data archiving and curation.

Data Storage: NCAR is able to control data storage costs with appropriate performance for different usage scenarios.

Science and Collaboration: NCAR community science is supported and enhanced by CISL's hardware and software deployments.


Page 9:

High-Level Concept

NCAR

Analysis-Ready Data

NCAR Compute Nodes

Analysis Tools

NCAR/NSF users


Public Cloud

Analysis-Ready Data

Cloud Computing

Analysis Tools

selected data

External users (public, industry, international)

Page 10:

Pangeo for Analysis-Ready Data

Jupyter for interactive access by remote systems

Cloud / HPC System

Xarray provides data structures and intuitive interface for interacting with datasets

Dask allows users to deploy clusters of compute nodes for data processing.

Distributed storage (Zarr)

"Analysis Ready Data" stored on globally-available distributed storage.

Slide credit: Ryan Abernathey, LDEO/Columbia U.
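To make the "analysis-ready" idea concrete: a chunked format such as Zarr splits each array into fixed-size chunks and stores every chunk as its own object, so a reader fetches only the chunks it needs. Below is a minimal stdlib-only sketch of that index-to-chunk mapping. The dot-separated key mirrors the default chunk naming of the Zarr v2 format; the function names and the array and chunk shapes are illustrative assumptions, not taken from the talk.

```python
from math import ceil


def chunk_key(index, chunk_shape):
    """Return the dot-separated key of the chunk that holds element `index`."""
    if len(index) != len(chunk_shape):
        raise ValueError("index and chunk_shape must have the same number of dimensions")
    return ".".join(str(i // c) for i, c in zip(index, chunk_shape))


def chunk_count(array_shape, chunk_shape):
    """Return how many chunks (stored objects) the whole array is split into."""
    total = 1
    for size, chunk in zip(array_shape, chunk_shape):
        total *= ceil(size / chunk)
    return total


# Hypothetical monthly field with dimensions (time, lat, lon),
# chunked along time only: 120 time steps per chunk.
shape = (1032, 192, 288)
chunks = (120, 192, 288)
key = chunk_key((500, 10, 20), chunks)   # chunk holding time step 500
n_chunks = chunk_count(shape, chunks)    # objects needed for the full array
```

Reading a single time step touches one chunk object rather than a whole file, which is what lets many Dask workers read different parts of the same array in parallel.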

Page 11:

Current Data Storage Architecture

GLADE (POSIX f/s): 38 PB

Cheyenne HPC

Casper DAV

Local Network

Campaign (Globus): 20 PB

HPSS tape (HSI): 100 PB

Off-prem tape (Disaster recovery)

User-initiated copy/move

User-initiated copy/move

Automated backup

Authorized Users

Page 12:

Possible Future Data Storage Architecture
(Note: Jeff DLB's opinion, not a decision by NCAR/CISL)

GLADE-2 (POSIX f/s): few PB

NWSC-3 HPC

DAV Cluster

Local Network

Authorized Users

User copy or demand-driven burst

Object Store (S3 API): 5 - 100 PB, 3-geo scale out

User-initiated copy

Automatic Move

Cloud Compute

Public Users

Dedicated Connection

Cloud deep archive (DR only)

Automated backup

Cloud Store (S3 API): PB as needed

Page 13:

Data Commons (on-prem object storage)
• NCAR/CISL acquiring 5 PB object store
  • Western Digital X100
  • Expect to be available for testing in Oct 2019
• Motivation: POSIX filesystem not suitable for billions of files
• Uses:
  • Host archival data holdings
  • Host analytics-optimized versions of some datasets
  • Evaluate lossy compression approaches
  • Research performance, usability, and cost relative to Campaign disk & HPSS tape


Page 14:

Cloud Commons (off-prem object storage)


• Host selected datasets in Cloud for public access
  • NOTE: Not turning off existing data repos & access services
• First dataset: CESM LENS
  • AWS providing free S3 hosting for 100 TB of CESM LENS
  • Includes selected monthly, daily, 6-hour fields; both surface and 3D
• Planning additional NCAR datasets
• Xarray/Zarr format rather than individual NetCDF files
• Python-based analysis tools (in progress)
• Will enable:
  • Broad public access, incl. industry
  • Research on optimization for big data analysis
  • Evaluation of commercial cloud pros/cons

Page 15:

A Word on Costs
• When comparing Cloud vs on-prem costs, need to be honest
  • Include hardware, power, cooling, real estate
  • Include cost of staffing for procurement, operation, security
  • Include opportunity costs:
    • being stuck with same HW for 4 years
    • not benefiting from new managed services
• Goal: Don't build anything that someone else can build
  • We already outsource facility construction
  • We buy electricity from the grid instead of running our own power plant
  • We use existing WANs instead of laying our own fiber optic cables
  • We do not manufacture our own CPUs and disk drives
  • We leverage open-source & commercial software
  • We moved to Google Mail/Calendar/Docs
  • ⇒ Why build & operate our own data centers?
• Don't let outdated business models prevent us from trying to negotiate innovative contracts with cloud infrastructure vendors

Page 16:

CESM LENS
• Community Earth System Model (CESM) Large Ensemble
  • Kay et al. 2015 (doi:10.1175/BAMS-D-13-00255.1)
• Simulates climate from 1920-2005 using 20th century historical data, then 2006-2100 assuming RCP8.5 scenario
  • RCP8.5 = Representative Concentration Pathway 8.5 W/m² radiative forcing by increased greenhouse gas concentration. Worst-case scenario in 5th IPCC Report.
• Complicated dataset
  • 2 spatial grids (land/atmosphere and ocean/ice)
  • 3 vertical axes (surface, 3D atmosphere, 3D ocean)
  • 3 temporal resolutions (monthly, daily, 6hr)
  • Multiple time axes (20C, RCP8.5, 3 diff. control runs)
  • 40 ensemble members (simulations with slightly diff. initial cond.)
• Somewhat difficult to access and use
  • 500 TB divided between on-line disk and near-line tape
  • ~150,000 NetCDF files in ~1,000 directories
  • Download from NCAR Climate Data Gateway
  • Compute in place on NCAR supercomputer if authorized

Page 17:

CESM LENS on AWS
• DOI: https://doi.org/10.26024/wt24-5j82
• 1 Zarr store for each component, frequency, experiment, and variable
  • 175 Zarr stores instead of ~30k files
  • May aggregate further into multi-variable Xarray datasets
• Currently in pre-release – finalizing documentation and Jupyter Notebook for Oct 9 announcement
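A minimal sketch of how "one store per component, frequency, experiment, and variable" collapses many files into few stores: per-file records are grouped on those four attributes, so ensemble members and time segments end up inside the same store. The record fields, file names, and variable names below are hypothetical placeholders, not the actual CESM LENS layout.

```python
from collections import defaultdict


def group_into_stores(files):
    """Group per-file records by (component, frequency, experiment, variable).

    Each resulting key corresponds to one aggregated store; ensemble members
    and time segments become dimensions inside that store.
    """
    stores = defaultdict(list)
    for record in files:
        key = (record["component"], record["frequency"],
               record["experiment"], record["variable"])
        stores[key].append(record["path"])
    return dict(stores)


# Hypothetical inventory: 40 ensemble members x 2 variables, one file each.
inventory = [
    {"component": "atm", "frequency": "monthly", "experiment": "20C",
     "variable": var, "path": f"member{member:03d}.{var}.nc"}
    for member in range(1, 41)
    for var in ("TS", "PRECT")
]
stores = group_into_stores(inventory)
print(len(inventory), "files ->", len(stores), "stores")
```

Here 80 per-member files collapse into 2 stores, which is the same kind of reduction as the slide's ~30k files becoming 175 stores.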


Thanks to:
Anderson Banihirwe
Chi-Fan Shih
Brian Bonnlander
Joseph Hamman
Seth McGinnis
Kevin Paul
Gary Strand
Matthew Long
Douglas Schuster
Eric Nienhouse

Page 18:

Kay et al. 2015, Figure 2

Figure from Jupyter Notebook

Page 19:

Kay et al. 2015, Figure 4

Figure from Jupyter Notebook

Page 20:

Need higher level of abstraction
• Jupyter Notebooks are useful but not sufficient
• Can we enable "Geodata Fabric" of information about the Earth?
• Leverage space & time coordinates as organizing framework
  • Latitude, longitude, named places, watersheds, etc
• Multidimensional virtual data collection
• Specify what data you want, location of interest, other attributes → software automatically makes it visible/available/computable
• Standardize to simplify both human analysis and machine learning applications
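One way to picture "specify what data you want, location of interest, other attributes" is a catalog query keyed on variable, place, and time instead of on file paths. The sketch below is purely illustrative: the `Entry` class, the catalog contents, and the `find` function are hypothetical and do not describe an existing NCAR service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    variable: str
    bbox: tuple   # (west, south, east, north) in degrees
    years: tuple  # (first_year, last_year), inclusive


def covers(entry, lon, lat, year):
    """True if the entry's bounding box and year range contain the point."""
    west, south, east, north = entry.bbox
    return (west <= lon <= east and south <= lat <= north
            and entry.years[0] <= year <= entry.years[1])


def find(catalog, variable, lon, lat, year):
    """Names of all entries offering `variable` at the given place and time."""
    return [e.name for e in catalog
            if e.variable == variable and covers(e, lon, lat, year)]


# Hypothetical catalog spanning a global model and a regional observation set.
catalog = [
    Entry("global-model", "TS", (-180, -90, 180, 90), (1920, 2100)),
    Entry("regional-obs", "TS", (-110, 35, -100, 42), (1990, 2019)),
    Entry("global-precip", "PRECT", (-180, -90, 180, 90), (1920, 2100)),
]
hits_2000 = find(catalog, "TS", lon=-105.27, lat=40.01, year=2000)
hits_2050 = find(catalog, "TS", lon=-105.27, lat=40.01, year=2050)
```

The caller never names a provider, folder, or file format; space and time coordinates do the organizing.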


Page 21:
Page 22:
Page 23:

Data to Decisions:
• Distill huge & complex data to ~1 bit: plant crop? evacuate? build wind farm? go skiing?
• Support non-expert data users: city planners, business analysts, citizens, ...

Some users want answers, not huge datasets (... or 100s of tiny datasets)

Page 24:

Source: "The promise and peril of a digital ecosystem for the planet," Campbell & Jensen, UN Environment Pgm (2019). See also "The Case for a Digital Ecosystem for the Environment," UN Science Policy Business Forum (2019)

Page 25:

Analysis Software Stack Concept

Analysis-Ready Data

Xarray, Dask, Zarr

Canned Visualizations

OGC WMS

GIS Integration

Notebooks

Workflows

Containers

Serverless Functions

Bare-metal HPC/EC2

Education

Decisions

Papers

Research

Page 26:

Desired Outcomes of Science at Scale Plan

• New scientific discoveries

• Increased public use of NCAR data

• Improved capability to analyze big data

• Reduction in cost to maintain data systems

• Employee recognition for good data management

• Improved consistency in data-related proposals to NSF

• Easier compliance with Open Data requirements


Page 27:

Challenges and Opportunities

• Funding

• Work to date has leveraged existing relevant projects

• Very modest additional support FY2020

• Need dedicated software engineers to wire everything together

• Cultural practices

• Some people prefer familiar, less-efficient approaches

• Cadre of early adopters at NCAR and elsewhere

• Growing community and ecosystem of tools

• Pangeo, Python HoloViz, etc

• Many institutions facing similar Big Data problems


Page 28:

NCAR

Data Commons (object storage)

RDA CDG

DASH Repo

DAV Cluster(Casper)

Analysis-Ready Data

Archival Optimized

Intake Catalog

Jupyter

PyViz, GeoCAT

Workflows

NCAR Cloud

Public Cloud

Cloud Commons

Analysis-Ready Data

Jupyter

Workflows

Serverless

selected data

Intake Catalog

PyViz

EOL HAO

Containers

NCAR Plan for Science at Scale: https://bit.ly/2SjnXFL

Page 29:

Questions?

NCAR Plan for Science at Scale: https://bit.ly/2SjnXFL

(draft v0.5.0)

Page 30:

POSIX Filesystem vs Object Storage


| POSIX | Object |
|---|---|
| HPC systems, desktop/laptop | AWS, Google Docs, Facebook, Netflix, etc |
| Hierarchical directory structure | Object ID + user-defined label (can simulate hierarchy) |
| open(), read(), seek(), close() semantics (Campaign store: Globus interface) | HTTP GET, PUT, DELETE, HEAD (+ optional POSIX emulation) |
| limited file metadata (owners, permissions, size, dates, etc) | arbitrary additional key/value metadata pairs |
| stateful (system keeps track of every file's open/close state) | stateless |
| strong consistency (ensure no other process can read until write finished) | read-after-write consistency |
| resize partitions to scale up | arbitrary scale up/scale out |
| RAID data protection | Erasure coding |
| Immediately replace failed disk | Fail-in-place approach |
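The object-storage side of this comparison can be sketched as a toy in-memory store: a flat key namespace, whole-object PUT/GET/HEAD/DELETE, arbitrary key/value metadata, and a "simulated hierarchy" that is nothing more than a key-prefix filter. This is an illustration of the access model only; the class and method names are hypothetical and it is not the S3 API or any vendor's client library.

```python
class ObjectStore:
    """Toy in-memory object store with a flat key namespace."""

    def __init__(self):
        self._objects = {}  # key -> (data bytes, metadata dict)

    def put(self, key, data, **metadata):
        """Store a whole object; user-defined metadata travels with it."""
        self._objects[key] = (bytes(data), dict(metadata))

    def get(self, key):
        """Return the whole object body (there is no open/seek/close state)."""
        return self._objects[key][0]

    def head(self, key):
        """Return metadata only, without transferring the body."""
        data, metadata = self._objects[key]
        return {**metadata, "size": len(data)}

    def delete(self, key):
        del self._objects[key]

    def list(self, prefix=""):
        """Simulate a directory listing by filtering keys on a prefix."""
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ObjectStore()
store.put("atm/monthly/TS/0.0.0", b"\x00" * 16, units="K", experiment="20C")
store.put("atm/monthly/PRECT/0.0.0", b"\x00" * 8, units="m/s")
store.put("ocn/monthly/SST/0.0.0", b"\x00" * 4, units="degC")
atm_keys = store.list("atm/")
ts_info = store.head("atm/monthly/TS/0.0.0")
```

Because every request names a complete key and carries no session state, requests can be spread across any number of servers, which is the basis for the "arbitrary scale up/scale out" row.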

Page 31:

Key Objectives for Data Discovery & Access Goal

Object Inventory & Query

Data Objects

Icon credit: SimpleIcon from Flaticon, https://www.flaticon.com/authors/simpleicon (CC BY 3.0)

Data Access Services

S3 API


Dataset Search

Page 32:

Key Objectives for Data Management Goal

EOL

Archival Repositories

RDA CDG DASH HAO

Data Stewardship

ISO 19115 Metadata <XML/>

DM Plan Support


Page 33:

Observing Sources
• Sensors
• Field Campaigns
• External Providers


Data Science Workflow

Data Ingest

Data Cleaning

Metadata Creation

Observational Data

Model Outputs

Earth System Models

Data Assimilation

ML-based Parameterization

Data Storage

Obs/Model Intercomparison

Data Services
• Product Generation
• ML Training & Analyses
• Workflow Tools
• Analysis & Visualization Tools
• Data Optimization
• Discovery, Access, Subset
• Short-term Working Copies
• Long-term Archival Curation

Page 34:

Storage
• S3/S3 One Zone-IA/Glacier/Deep Archive
• $21/TB/mo - $1/TB/mo
• 99.999999999% (11 9s) durability

• Hardware
• Power, cooling, real estate
• Staff
• Remote backup hardware/facility

Compute
• Rich & evolving CPU/GPU choices
• $0 Free tier -
• 99.999999999% (11 9s) durability

• Hardware
• Power, cooling, real estate
• Staff
• Remote backup hardware/facility

Egress
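A small worked sketch of the storage numbers on this slide. It applies the quoted $21/TB/mo and $1/TB/mo price points to the 100 TB CESM LENS subset mentioned earlier in the talk, and reads "11 9s durability" as a per-object, per-year survival probability; that reading, the function names, and the object count are assumptions made for illustration.

```python
from fractions import Fraction


def monthly_cost(size_tb, dollars_per_tb_month):
    """Monthly storage bill in dollars for `size_tb` terabytes."""
    return size_tb * dollars_per_tb_month


def expected_annual_losses(num_objects, nines=11):
    """Expected number of objects lost per year, assuming each object
    independently survives a year with probability 1 - 10**-nines."""
    return Fraction(num_objects, 10 ** nines)


# 100 TB at the two price points quoted on the slide.
cost_high = monthly_cost(100, 21)   # most expensive tier
cost_low = monthly_cost(100, 1)     # cheapest (deep archive) tier

# Assumed inventory of 10 million objects at 11 nines of durability.
losses = expected_annual_losses(10_000_000)
print(cost_high, cost_low, losses)
```

Under these assumptions the 100 TB subset costs between $100 and $2,100 per month depending on tier, and an inventory of 10 million objects would be expected to lose one object roughly every 10,000 years.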