Thoughts on Data Management Nicholas Schwarz Software Services Group Advanced Engineering Support (AES) Division Advanced Photon Source (APS) 25 June 2013

Thoughts on Data Management

Nicholas Schwarz
Software Services Group
Advanced Engineering Support (AES) Division
Advanced Photon Source (APS)

25 June 2013

Thoughts on Data Management - SSG - 14 June 2013

What is Data Management?

Data Management is the development and execution of architectures, practices and procedures, and policies that properly manage our data lifecycle needs.

Architecture

The architecture is the unambiguous definition of data, and the data storage and distribution infrastructure, i.e. hardware and software.

Data Examples
– Data are files on disk
– Data are a list of names and telephone numbers
– Data are a tuple of real numbers
– Data are …

Hardware and Software Examples
– Each sector has a dserv with storage
– There is central storage
– There is one internal and one external GlobusOnline endpoint
– A web-based system is used to set ownership permissions

Practices and Procedures

Standard practices and procedures are required so that data can be handled properly. These practices and procedures must be embedded in regular operations processes.

Examples
– All measurement data must be saved to the local sector’s dserv every 24 hours
– Selected measurement data must be transferred to central storage
– Data on central storage must be saved in /data/managed/esaf123456
– Data to be archived indefinitely must be flagged for archival within 7 days of the end of the experiment period

Policies

Data policies dictate what is done with data so that data management helps meet the organization’s goals and operates within its requirements.

Examples
– All systems must comply with requirements in ANL-593
– Only members of an ESAF can access data collected with that ESAF
– APS firewalls must not change
– APS must not lose data when the outside network connection is lost
– Data management at one sector must not interfere with data collection at another sector
– All measurement data must be kept for 90 days
– All metadata should be kept indefinitely
– Old metadata must be accessible within 48 hours of a request
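The two retention examples above can be expressed as a small lookup. This is a minimal sketch, assuming the 90-day and "indefinitely" figures from the example policies; the function name and the "measurement" / "metadata" labels are hypothetical and do not describe an existing APS tool.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: 90 days is the example retention period for measurement data.
MEASUREMENT_RETENTION = timedelta(days=90)


def earliest_deletion_date(kind: str, collected: date) -> Optional[date]:
    """Return the first date on which data may be deleted, or None if never.

    kind is a hypothetical label: "measurement" or "metadata".
    """
    if kind == "metadata":
        # Example policy: all metadata should be kept indefinitely.
        return None
    if kind == "measurement":
        # Example policy: all measurement data must be kept for 90 days.
        return collected + MEASUREMENT_RETENTION
    raise ValueError(f"unknown data kind: {kind!r}")
```

Returning None for metadata keeps "never delete" distinct from any real date, so a cleanup job cannot mistake it for a deadline.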

Interdependency

Data policies, practices and procedures, and architecture drive each other.

Examples

Policy: Data management at one sector must not interfere with data collection at another sector
Architecture: Distributed server (dserv) for each sector

Architecture: The only commonality of APS data is that it is stored in files
Architecture: Data ownership enforcement mechanism is based on file system permissions

Policy: APS must not lose data when the outside network connection is lost
Practices and procedures: Data is stored internal to the APS

Thoughts / Questions / Tasks

Define what data management is to the APS.

Perspectives

Data management depends on your perspective…

User / Scientist
– Do science
– Output measured primarily by publications (patents)

Facility
– Produce x-rays (maximize uptime)
– Maximize data collection

User / Scientist Perspective

Publication
– Multiple figures
– Different types of data

[Figure labels: Laboratory Microscope Data; Synchrotron Derived Data]

User / Scientist Perspective

[Figure label: Synchrotron Derived Data]

Even a single figure with synchrotron data may have data from multiple facilities.

User / Scientist Perspective

[Figure labels: Normalize Intensity; Cell Finding Algorithm; Data Fusion; Synchrotron Derived Data]

Process of analyzing data generates new knowledge and data (and metadata).

Facility Perspective

Sources | Type | Example
N | Administrative Data | PI, User; Dates; Description; ESAF, BTR, GUP; …
N | Experiment / Measurement Data | Sample and sample conditions; Area detector images; Point detector scalars; Motor positions; Energy (Undulator, Monochromator); …
N | Beamline / Sector Data | BL 1-XX, BL 2-XX, …, BL 35-XX; Sector 1, Sector 2, …, Sector 35; Energy (Undulator, Monochromator); …
1 | Accelerator Data | Machine Data; Status; Orbit, Power Supply; …

Publication

[Diagram labels: Data Source 1, Data Source 2, Data Source N; Synchrotron 1 Data, Synchrotron 2 Data, Synchrotron N Data; Administrative Data; Sample / Experiment / Measurement Metadata; Accelerator Data; Analysis; Measured Data; Facility; User / Scientist]

Thoughts / Questions / Tasks

What’s the perspective of the APS?

APS is one of many scientific instruments

As a facility, what can the APS do to enable science without knowing what goes on outside the facility, and with little control of what goes on outside the facility?
– Every facility agrees and does the exact same thing?
   – Data formats, equipment, passwords, etc.
– Help facilitate transition of data from facility to user?

Data Management at the APS

1. What is/are our architecture (data, hardware, software), practices and procedures, and policies for data management?

2. As a facility, what can the APS do to enable science without knowing what goes on outside the facility, and with little control of what goes on outside the facility?

3. What are our limitations?

4. What do we hope to be?
– Streamlined facility so the user can realize their perspective

APS Architecture - Data

Many types of data at the APS
– Administrative Data: well defined
– Accelerator Data: well defined
– Beamline Data: varies
– Measurement/Experiment Data: defined based on technique/beamline/user
   – Great variability: commonality is files on disk
   – Database entries for protein crystallography

One experiment has data from all of these categories

APS Policies

Goal: Streamlined facility so users can realize their science perspective

Policies
– Maximize data collection
– ANL-593
– Operate without outside network
– Firewalls cannot change
– Data ownership (only data owners can see their data)
– Data should be deleted after some set amount of time
– Many, many more to follow…

Implications
– No cloud-only solution
– Critical services work internally
– User access is tied to APS computer access

Data Management Roles

Roles: Data Administrator, Group Manager, User

Experiment (or Project) Directory
– Data Administrator: rw. Data administrator owns all group directories (enforced at creation time)
– Group Manager: r. Group manager is in experiment group; experiment directory is rx for group
– User: r. User is in experiment group; experiment directory is rx for group

Data in Experiment (or Project) Directory
– Data Administrator: rw. Data administrator owns all files and subdirectories (enforced with inotify script)
– Group Manager: rw. Group manager is in experiment group; experiment directory is rwx for group
– User: rw. User is in experiment group; experiment directory is rwx for group

Experiment (or Project) Group
– Data Administrator: create group, modify group members
– Group Manager: modify group members. Group manager uid has additional group owner attribute in schema
– User: none. User cannot modify group
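The role assignments above amount to a small access matrix. The sketch below transcribes it into a lookup table for illustration only; the dictionary keys and function names are hypothetical, and the real enforcement described on the slide relies on file system ownership, group permissions, and an inotify script rather than on code like this.

```python
# Access matrix transcribed from the roles above: "r" = read, "w" = write.
# All identifiers here are hypothetical names chosen for the example.
ACCESS = {
    "experiment_directory": {
        "data_administrator": "rw",
        "group_manager": "r",
        "user": "r",
    },
    "data_in_experiment_directory": {
        "data_administrator": "rw",
        "group_manager": "rw",
        "user": "rw",
    },
}

# Actions each role may perform on the experiment (or project) group.
GROUP_ACTIONS = {
    "data_administrator": {"create group", "modify group members"},
    "group_manager": {"modify group members"},
    "user": set(),
}


def is_allowed(role: str, resource: str, access: str) -> bool:
    """Return True if the role has the given access ('r' or 'w') on the resource."""
    return access in ACCESS[resource][role]


def can_perform_group_action(role: str, action: str) -> bool:
    """Return True if the role may perform the action on the experiment group."""
    return action in GROUP_ACTIONS[role]
```

For example, `is_allowed("user", "experiment_directory", "w")` is false while `is_allowed("user", "data_in_experiment_directory", "w")` is true, matching the r and rw entries listed for the User role.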

APS Architecture - Hardware

[Diagram labels: Beamline Acquisition Computer and dserv (shown three times); lustre; gridFTP Server; Internal gridFTP Server; External GO Endpoint; Globus; APS Firewall]

APS Architecture – Software

Internal Transfer & Tracking
– Storage Resource Broker (SRB) (SDSC)
– SPADE (ALS-LBL)
– Modify our internal workflow pipeline (APS-ANL)
– SLAC has an internal system
– XRootD
– SSG is investigating which to adopt

User Accounts
– Integrate user badges with APS LDAP

Management
– Develop web site for modifying ownership and access permissions

APS Architecture – Software

External Transfer & Access
– GlobusOnline provides access to APS data from the outside
– Users authenticate using their APS badge number and password
– Users can only see their data
– Users can integrate with other Globus tools

APS Practices and Procedures: Data Storage Workflow

– Data should be transferred from the acquisition computer to the local dserv
– Data on the dserv is transferred to lustre storage at one of the following intervals:
   – Immediately
   – Daily (at a designated time)
   – Every Tuesday @ 8AM
   – At the end of an experiment
   – At the end of a run
– Data on lustre is automatically deleted at a time determined by APS policy
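Two of the transfer intervals above are clock-based (daily at a designated time, and every Tuesday at 8 AM), so the next transfer time can be computed directly. The following is a minimal sketch under that reading; the function names are hypothetical, the designated daily time is left as a parameter because the slide does not state it, and the event-driven intervals (immediately, end of experiment, end of run) are not modeled.

```python
from datetime import datetime, time, timedelta

TUESDAY = 1  # datetime.weekday() convention: Monday is 0


def next_daily_transfer(now: datetime, at: time) -> datetime:
    """Next occurrence of the designated daily time strictly after now."""
    candidate = datetime.combine(now.date(), at)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def next_weekly_transfer(now: datetime, weekday: int = TUESDAY,
                         at: time = time(8, 0)) -> datetime:
    """Next occurrence of the given weekday and time strictly after now."""
    days_ahead = (weekday - now.weekday()) % 7
    candidate = datetime.combine(now.date() + timedelta(days=days_ahead), at)
    if candidate <= now:
        # Today is the right weekday but the slot has already passed.
        candidate += timedelta(days=7)
    return candidate
```

In practice a scheduler such as cron would more likely drive these transfers; the functions only make the timing rule explicit.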

APS Practices and Procedures: Data Storage Organization

Experiment Data
– Experiment data must be stored in a directory named e[ESAFNumber]_[PILastName], e.g. e123456_Smith
– Experiment data directories must be located in /data/managed/experiments/r[RunNumber], e.g. /data/managed/experiments/r2013-2
– Example: /data/managed/experiments/r2013-2/e123456_Smith

Project Data
– Project data must be stored in a directory named p[ProjectID]_[ProjectName], e.g. p000001_MyProject
– Project data directories must be located in /data/managed/projects
– Example: /data/managed/projects/p000001_MyProject
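The naming rules above can be sketched as a small path builder. This is an illustration under stated assumptions, not an APS utility: the function names are hypothetical, and zero-padding the project ID to six digits is inferred only from the p000001 example.

```python
from pathlib import PurePosixPath

# Root taken from the example paths above.
MANAGED_ROOT = PurePosixPath("/data/managed")


def experiment_dir(run: str, esaf_number: int, pi_last_name: str) -> PurePosixPath:
    """Build /data/managed/experiments/r[RunNumber]/e[ESAFNumber]_[PILastName]."""
    return MANAGED_ROOT / "experiments" / f"r{run}" / f"e{esaf_number}_{pi_last_name}"


def project_dir(project_id: int, project_name: str) -> PurePosixPath:
    """Build /data/managed/projects/p[ProjectID]_[ProjectName].

    Assumption: the project ID is zero-padded to six digits, as in p000001.
    """
    return MANAGED_ROOT / "projects" / f"p{project_id:06d}_{project_name}"


print(experiment_dir("2013-2", 123456, "Smith"))
print(project_dir(1, "MyProject"))
```

The two print calls reproduce the example paths given above for the experiment and project directories.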
