dCache: An Overview

Paul Millar, on behalf of the dCache team
Nordic Data Management Workshop, Oslo, Norway; 2019-02-27
https://indico.cern.ch/event/779913/

eXtreme DataCloud is co-funded by the Horizon2020 Framework Program – Grant Agreement 777367
Copyright © Members of the XDC Collaboration, 2017-2020


Page 2: Scientific data challenges

● Volume
● Fast ingest
● Chaotic Access
● Sharing data
● Access Control
● Persistence & long-term archival
● Immutability

Page 3:

Fast Analysis (NFS 4.1/pNFS)
High Speed Data Ingest
Wide Area Transfers (Globus Online, FTS) by GridFTP, HTTP
Interactive analysis & Sharing
Data management & workflow control (Rucio, Kafka, SSE)

Page 4:

● HERA
● Tevatron
● WLCG
● Belle II
● LOFAR
● CTA
● IceCube
● EU-XFEL
● Petra3
● DUNE
● And many more ...

Page 5: Flexibility that works …

● Supports many authentication schemes: username+password, X.509, Kerberos and OpenID-Connect:
  ● Integrates with existing infrastructure + pluggable for flexibility,
  ● Users have same rights, irrespective of how they authenticate.
● Supports delegated authorisation, using Macaroons.
● Multiple protocols: (Grid)FTP, HTTP/WebDAV, SRM, xrootd, NFS v4.1/pNFS and dcap.
  ● Using different protocols, users will see the same data.

Page 6: dCache innovations: Storage Events

Page 7: Storage events: the problems

(Diagram: request → response exchanges with dCache)
Upload a file → OK
Delete a file → OK
Catalogue: Rucio/LFC/…
Are these files on disk? → no, no, no, …
Stage files from tape → Request queued
Are these files on disk? → no, no, no, …
Are these files on disk? → no, YES, no, …

Page 8:

9000 stats per second!


Page 10: A new approach: storage events

Subscribe to events → OK
Something happened #1
Something happened #2
Something happened #3

Page 11: New solutions to old problems

(Diagram: request → response exchanges with dCache)
Upload → OK
Delete → OK
Rucio: Subscribe … → OK, then "File uploaded", "File deleted"
Stage files → Request queued
Subscribe … → OK, then "File #16 on disk"

● User- and internally triggered events:
  ● Data uploaded
  ● Data deleted/renamed/moved
  ● Tape flush/stage operations
● Uses: update catalogue, metadata extraction, data normalisation, build derived data, …
● Two event systems:
  ● Site integration (Kafka)
  ● Per user events (SSE/inotify)

(DEMO :-)
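The difference between polling and subscribing can be sketched with a toy in-memory model. This is only an illustration of the pattern: `ToyStorage`, its methods, and the event names `uploaded`/`deleted` are hypothetical and are not dCache's API. The point is that the catalogue stays correct without issuing a single "is this file on disk?" query.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple


class ToyStorage:
    """In-memory stand-in for a storage system (hypothetical, not dCache's API)."""

    def __init__(self) -> None:
        self.files: Set[str] = set()
        self.stat_calls = 0  # number of polling-style existence checks
        self._subscribers: Dict[str, List[Callable[[str], object]]] = defaultdict(list)

    def subscribe(self, event: str, callback: Callable[[str], object]) -> None:
        """Register a callback that is invoked with the path on each event."""
        self._subscribers[event].append(callback)

    def _emit(self, event: str, path: str) -> None:
        for callback in self._subscribers[event]:
            callback(path)

    def upload(self, path: str) -> None:
        self.files.add(path)
        self._emit("uploaded", path)

    def delete(self, path: str) -> None:
        self.files.discard(path)
        self._emit("deleted", path)

    def exists(self, path: str) -> bool:
        """What a polling client would have to call repeatedly."""
        self.stat_calls += 1
        return path in self.files


def run_demo() -> Tuple[Set[str], int]:
    storage = ToyStorage()
    catalogue: Set[str] = set()
    # The catalogue subscribes once and is then told about every change.
    storage.subscribe("uploaded", catalogue.add)
    storage.subscribe("deleted", catalogue.discard)
    storage.upload("/data/run1.raw")
    storage.upload("/data/run2.raw")
    storage.delete("/data/run1.raw")
    return catalogue, storage.stat_calls


catalogue, stat_calls = run_demo()
print(sorted(catalogue), stat_calls)
```

In a real deployment the callbacks would be replaced by a Kafka consumer or an SSE client; the bookkeeping on the catalogue side stays the same.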

Page 12: dCache innovations: Distributed storage & Data Lakes

Page 13: Data Lakes: distributed resources

● dCache has over a decade of production use as a data lake:
  ● NDGF is a distributed dCache, spread over five countries.
  ● AGLT2 is a distributed dCache, spread over two campuses.
● dCache can already provide protocol-based QoS; e.g., cache data for NFS access, read remotely for HTTP/GridFTP.
● Currently building a new testbed to demonstrate existing solutions and improve upon them:
  Hamburg → Zeuthen (RTT: ~5 ms); Hamburg → Moscow (RTT: ~70 ms)
● Adding ability to provide cached data when detached:
  A “satellite” can offer data if disconnected from the rest of dCache.

Page 14: Data Lakes: cloud bursting

● dCache stores data either in a local filesystem or as objects within a Ceph cluster.
● Two new developments:
  ● Storing data within an S3 endpoint
  ● Dynamic pools: just start a dCache pool and that capacity becomes usable.
● Together, these support the cloud bursting use-case:
  ● As cloud capacity “comes online”, either due to load (cloud burst) or due to resources being cheap (Amazon grants), start a dCache pool.
  ● Jobs can run “in the cloud” with dCache taking care of any data movement.

Page 15: dCache innovations: Delegated Authorisation with Macaroons

Page 16: Macaroons: delegated authorisation

Photo by Alan Cleaver (CC-BY)

Page 17: Example use: community portals / BOINC

1. Request data (GET)
2. Request a macaroon
3. Request data directly from dCache (GET, following the 307 redirect)

Diagram labels: User Database, dCache
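The redirect in this flow can be sketched as follows. The sketch assumes the macaroon is passed to dCache in an `authz` query parameter, which is one way a bearer token can travel inside a URL; the host name, path, and token value are placeholders, and `redirect_location` is a hypothetical helper, not part of any portal or of dCache.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def redirect_location(dcache_url: str, macaroon: str) -> str:
    """Build the Location value a portal could return with its 307 response.

    Assumption: the storage endpoint accepts the macaroon as an `authz`
    query parameter. Any query string already present is preserved.
    """
    parts = urlsplit(dcache_url)
    query = urlencode({"authz": macaroon})
    if parts.query:
        query = parts.query + "&" + query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


# Placeholder endpoint and token, for illustration only.
location = redirect_location("https://dcache.example.org/data/run2.raw", "MDAxY2xvY2F0aW9u")
print(location)
```

The client never sees the portal's own credentials: it only receives a URL that carries the macaroon, and fetches the data from dCache directly.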

Page 18: Example use: ad-hoc sharing

1. Request a macaroon
2. Send to colleague (e.g. via email)
3. Use macaroon (GET/PUT/DELETE)

Diagram label: dCache

Page 19: dCache Workshop: 2019-05-21 to 2019-05-22

● Located in Madrid, Spain.
● Learn more about latest developments in dCache
● Opportunity to discuss issues directly with dCache developers
● Share stories with dCache admins
● Help shape the future direction of dCache.

https://indico.desy.de/indico/event/22170/

Page 20: The take-home message

● dCache is advanced storage software for data-intensive science.
● dCache:
  ● has decades of production use throughout the world,
  ● provides scalable resources, used by many scientific disciplines,
  ● offers innovative solutions that help drive the next generation of scientific discovery.

Page 21: Backup slides

Page 22: dCache 101: Motivation

● Data never fits into a single server
● Multiple servers
● Off-load to tape
● Growing number of client hosts
● Mainframe vs Linux cluster
● Control over hardware/OS selection
● Better tender offers
● Use and enhance local expertise

Page 23: dCache 101: Design

● Single-rooted namespace, distributed data
● Client talks to namespace for metadata operations only
● Bandwidth and performance grow with number of data servers
● Standard clients (OS native or experiment)
● Some data can be offloaded to tape

Page 24: What are macaroons good for?

Processing data without user credentials / BOINC

2. Request a macaroon
3. Add caveats
4. Request data directly from dCache

Diagram labels: GET, 307, GET
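The "add caveats" step works because a macaroon's signature is a chain of HMACs: each caveat is folded into the signature using the previous signature as the key, so anyone holding a macaroon can restrict it further, but nobody can remove a caveat without the root key. Below is a simplified sketch of that chaining idea. It is not dCache's implementation or the real macaroon wire format, the caveat strings are illustrative, and `verify` only checks integrity (a real service must also check that each caveat is satisfied by the request).

```python
import hashlib
import hmac
from typing import List, Tuple

Token = Tuple[str, List[str], bytes]  # (identifier, caveats, signature)


def _mac(key: bytes, message: bytes) -> bytes:
    return hmac.new(key, message, hashlib.sha256).digest()


def mint(root_key: bytes, identifier: str) -> Token:
    """Service side: create an unrestricted token."""
    return identifier, [], _mac(root_key, identifier.encode())


def add_caveat(token: Token, caveat: str) -> Token:
    """Holder side: attenuate a token; no root key is needed."""
    identifier, caveats, signature = token
    return identifier, caveats + [caveat], _mac(signature, caveat.encode())


def verify(root_key: bytes, token: Token) -> bool:
    """Service side: recompute the HMAC chain and compare signatures."""
    identifier, caveats, signature = token
    expected = _mac(root_key, identifier.encode())
    for caveat in caveats:
        expected = _mac(expected, caveat.encode())
    return hmac.compare_digest(expected, signature)


root_key = b"secret known only to the service"  # placeholder key
token = mint(root_key, "user=alice")
restricted = add_caveat(token, "activity:DOWNLOAD")  # illustrative caveat
restricted = add_caveat(restricted, "path:/data/run2.raw")  # illustrative caveat

# Dropping a caveat while keeping the signature must fail verification.
tampered = (restricted[0], restricted[1][:1], restricted[2])
print(verify(root_key, restricted), verify(root_key, tampered))
```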

Page 25: What are macaroons good for?

HTTP 3rd party copies

1. Request copy
2. Request a macaroon
3. Add caveats
4. COPY with embedded macaroon
5. GET with macaroon

Diagram label: FTS
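Steps 4 and 5 can be sketched as plain request descriptions, without any network access. The sketch assumes a pull-mode copy, where the COPY is sent to the destination and names the source in a `Source` header, and assumes the convention that headers prefixed with `TransferHeader` are forwarded to the source with the prefix stripped. The slide does not spell out these header names, so treat them as assumptions; the URLs and token values are placeholders, and `Request`, `pull_copy_request`, and `forwarded_get` are hypothetical helpers.

```python
from typing import Dict, NamedTuple


class Request(NamedTuple):
    method: str
    url: str
    headers: Dict[str, str]


def pull_copy_request(destination_url: str, source_url: str,
                      destination_token: str, source_macaroon: str) -> Request:
    """Step 4: the COPY that the transfer service sends to the destination."""
    return Request("COPY", destination_url, {
        "Source": source_url,
        # Authorises the COPY itself at the destination.
        "Authorization": f"Bearer {destination_token}",
        # Embedded macaroon, to be forwarded to the source (assumed convention).
        "TransferHeaderAuthorization": f"Bearer {source_macaroon}",
    })


def forwarded_get(copy: Request) -> Request:
    """Step 5: the GET the destination would issue against the source."""
    prefix = "TransferHeader"
    headers = {
        name[len(prefix):]: value
        for name, value in copy.headers.items()
        if name.startswith(prefix)
    }
    return Request("GET", copy.headers["Source"], headers)


copy = pull_copy_request(
    "https://dest.example.org/data/run2.raw",
    "https://source.example.org/data/run2.raw",
    "DEST-TOKEN",
    "SOURCE-MACAROON",
)
get = forwarded_get(copy)
print(get.method, get.url, get.headers)
```

Because the macaroon can carry caveats (for example, read-only access to one path), the destination receives only the minimum authority it needs to fetch the file.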

Page 26: What are macaroons good for?

Enforcing catalogue permissions

1. Request access to data
2. Request a macaroon
3. Add caveats
4. Access data

Diagram label: Rucio

Page 27: Comparison: it’s what industry is doing…

Page 28: Comparison: it’s what Open-Source is doing…

Page 29: dCache Storage Events: Kafka

Diagram labels: billing, Log, created, staged

Page 30: dCache Server-Sent Events (SSE)

● Based on HTTP v1.1
● HTML 5 standard
  Support for many languages and web-browsers
● Initially adding support for inotify events
  (it’s how Linux does namespace notification)
● Plan to add:
  ● Locality change notification: flush, stage, …
  ● Transfer-related events
  ● QoS changes
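Because SSE is a plain-text format carried over HTTP, a client needs very little code to consume it. The following minimal parser handles the `event`, `data`, and `id` fields of the `text/event-stream` format and yields one dictionary per event. The sample stream is made up for illustration: the `inotify` event name and the JSON payload shape are assumptions, not dCache's documented output.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Parse lines of a text/event-stream body into events.

    A blank line ends an event, lines starting with ':' are comments,
    repeated 'data' fields are joined with newlines, and the last seen
    'id' persists across events.
    """
    event_type = "message"
    data: List[str] = []
    last_id = ""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:  # events without data are not dispatched
                yield {"event": event_type, "data": "\n".join(data), "id": last_id}
            event_type, data = "message", []
            continue
        if line.startswith(":"):
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event_type = value or "message"
        elif field == "data":
            data.append(value)
        elif field == "id":
            last_id = value


# Made-up sample stream; the payload shape is illustrative only.
stream = [
    ": keep-alive comment\n",
    "event: inotify\n",
    "id: 1\n",
    'data: {"mask": ["IN_CREATE"], "name": "run2.raw"}\n',
    "\n",
    "data: first line\n",
    "data: second line\n",
    "\n",
]
events = list(parse_sse(stream))
for event in events:
    print(event)
```

In practice the lines would come from a streaming HTTP response rather than a list, and the `id` of the last event received lets a reconnecting client ask the server to resume from where it left off.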

Page 31: Cheat sheet: Kafka vs SSE

                          Kafka                    SSE
Standard …                Component                Protocol
What events does it see?  dCache internal events   Controlled
Main benefit              Easy integration         Built-in security
“Catch-up” storage        Memory & disk            Memory-only (currently)
Target audience           Site-level integration   Events for users

Page 32: EOSC-Pilot demonstrator: EU-XFEL data ingest

Diagram labels: new RAW data; extract metadata; create derived data; store derived file; metadata catalog; update catalog

Page 33: Rucio demonstrator: automated replication with SSE

Diagram labels: New data; SSE; Rucio; Upload; Third-party copy

Page 34: Future directions

● Complete SSE inotify support in dCache.
● Add additional events, based on initial feedback.
● Further explore automated data workflow (EU-XFEL use case).
● Work with Rucio team to explore SSE integration.
● Work with dCache sites to deploy storage events in production.