Data Centric Issues: Particle Physics and Grid Data Management
Tony Doyle, University of Glasgow

Transcript of the presentation (25 pages).

Page 1: title slide.

Page 2:

Outline: Data to Metadata to Data

Introduction

Yesterday “.. all my troubles seemed so far away”
- (non-Grid) Database Access
- Data Hierarchy

Today “.. is the greatest day I’ve ever known”
- Grids and Metadata Management
- File Replication
- Replica Optimisation

Tomorrow “.. never knows”
- Event Replication
- Query Optimisation

Page 3:

GRID Services: Context

Applications: Chemistry, Biology, Cosmology, High Energy Physics, Environment

Application Toolkits: distributed computing toolkit; data-intensive applications toolkit; collaborative applications toolkit; remote visualisation applications toolkit; problem-solving applications toolkit; remote instrumentation applications toolkit

Grid Services (Middleware): resource-independent and application-independent services. E.g. authentication, authorisation, resource location, resource allocation, events, accounting, remote data access, information, policy, fault detection

Grid Fabric (Resources): resource-specific implementations of basic services. E.g. transport protocols, name servers, differentiated services, CPU schedulers, public key infrastructure, site accounting, directory service, OS bypass

Page 4:

Online Data Rate vs Size

[Scatter plot: Level-1 trigger rate (Hz), 10^2 to 10^7, versus event size (bytes), 10^4 to 10^6, for LHCb, KLOE, HERA-B, CDF II, CDF, H1, ZEUS, UA1, LEP, NA49, ALICE, ATLAS and CMS. Regimes: high Level-1 trigger rate (1 MHz); high number of channels and high bandwidth (500 Gbit/s); high data archive (PetaByte).]

“How can this data reach the end user?” It doesn’t: a factor of O(1000) online data reduction is applied via trigger selection.

Page 5:

Offline Data Hierarchy: “RAW, ESD, AOD, TAG”

RAW (~1 MB/event): triggered events recorded by DAQ; detector digitisation.

ESD (~100 kB/event): reconstructed information; pseudo-physical information: clusters, track candidates (electrons, muons), etc.

AOD (~10 kB/event): selected information; physical information: transverse momentum, association of particles, jets, (best) id of particles, physical info for relevant “objects”.

TAG (~1 kB/event): analysis information; relevant information for fast event selection.
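The per-event sizes above translate directly into yearly storage volumes. A minimal sketch (the helper function is hypothetical, not from the talk), assuming 10^9 events per year as on the later "future experiment" slide:

```python
# Back-of-envelope yearly volumes implied by the RAW/ESD/AOD/TAG hierarchy.
EVENT_SIZES = {            # bytes/event, from the slide
    "RAW": 1_000_000,      # ~1 MB/event
    "ESD": 100_000,        # ~100 kB/event
    "AOD": 10_000,         # ~10 kB/event
    "TAG": 1_000,          # ~1 kB/event
}

def yearly_volume(events_per_year: int) -> dict:
    """Return total bytes per tier for one year of data-taking."""
    return {tier: size * events_per_year for tier, size in EVENT_SIZES.items()}

volumes = yearly_volume(10**9)  # 1000 million events/yr (assumed)
for tier, nbytes in volumes.items():
    print(f"{tier}: {nbytes / 1e12:.0f} TB")
```

RAW comes out at the PB scale and TAG at the TB scale, consistent (to within the slides' own factors) with the "10 (1) PByte of RAW (ESD) ... 1 TByte of TAG" figures quoted later.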

Page 6:

Physics Analysis

ESD: Data or Monte Carlo

[Analysis-flow diagram: Event Tags drive Event Selection; Analysis Object Data (AOD), together with Calibration Data and Raw Data, feeds Analysis and Skims, producing Physics Objects for Physics Analysis. The chain is labelled “INCREASING DATA FLOW”.]

- Tier 0, 1: collaboration-wide (Raw Data, ESD)
- Tier 2: analysis groups (AOD, analysis skims)
- Tier 3, 4: individual physicists (Physics Objects, physics analysis)

Page 7:

Data Structure

REAL and SIMULATED data required; central and distributed production.

Real data chain: Trigger System → Data Acquisition (Level 3 trigger, producing Trigger Tags) → Raw Data → Reconstruction → Event Summary Data (ESD) and Event Tags, using Calibration Data and Run Conditions.

Simulated data chain: Physics Models → Monte Carlo Truth Data → Detector Simulation → MC Raw Data → Reconstruction → MC Event Summary Data and MC Event Tags.

Page 8:

A running (non-Grid) experiment

Three steps to select an event today:
1. Remote access to O(100) TBytes of ESD data
2. Via remote access to 100 GBytes of TAG data
3. Using offline selection, e.g. ZeusIO-Variable (Ee>20.0)and(Ntrks>4)

- Access to remote store via batch job
- 1% database event-finding overhead
- O(1M) lines of reconstruction code
- No middleware
- 20k lines of C++ “glue” from Objectivity (TAG) to ADAMO (ESD) database

100 million selected events from 5 years’ data; TAG selection via 250 variables/event.
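The TAG-to-ESD selection step can be sketched as follows. The record layout and names here are illustrative only, not the ZEUS Objectivity/ADAMO schema: a TAG record is a compact tuple of selection variables per event, and a cut such as (Ee>20.0)and(Ntrks>4) scans only the small TAG store to decide which ESD records to fetch.

```python
from dataclasses import dataclass

@dataclass
class TagRecord:
    event_id: int   # pointer back to the full ESD record
    Ee: float       # e.g. scattered-electron energy (GeV); hypothetical field
    Ntrks: int      # e.g. number of tracks; hypothetical field

def select(tags, cut):
    """Return event ids passing the cut; only these ESD records need fetching."""
    return [t.event_id for t in tags if cut(t)]

tags = [TagRecord(1, 25.3, 6), TagRecord(2, 12.1, 9), TagRecord(3, 21.0, 3)]
passing = select(tags, lambda t: t.Ee > 20.0 and t.Ntrks > 4)
print(passing)  # [1]
```

Scanning ~1 kB TAG records instead of ~100 kB ESD records is what makes the "1% database event-finding overhead" plausible.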

Page 9:

A future (Grid) experiment

Three steps to (analysis) heaven:
1. 10 (1) PByte of RAW (ESD) data/yr
2. 1 TByte of TAG data (local access)/yr
3. Offline selection, e.g. ATLASIO-Variable (Mee>100.0)and(Njets>4)

- Interactive access to local TAG store
- Automated batch jobs to distributed Tier-0, -1, -2 centres
- O(1M) lines of reconstruction code
- O(1M) lines of middleware… NEW
- O(20k) lines of Java/C++ provide “glue” from TAG to ESD database
- All working? Efficiently?

1000 million events from 1 year’s data-taking; TAG selection via 250 variables.

[Cartoon: “DataBase Solutions Inc.”]

Page 10:

Grid Data Management: Requirements

1. “Robust” - software development infrastructure
2. “Secure” - via Grid certificates
3. “Scalable” - non-centralised
4. “Efficient” - optimised replication

Examples: GDMP, Spitfire, Reptor, Optor

Page 11:

1. Robust? Development Infrastructure

- CVS Repository: management of DataGrid source code; all code available (some mirrored)
- Bugzilla
- Package Repository: public access to packaged DataGrid code
- Development of Management Tools: statistics concerning DataGrid code; auto-building of DataGrid RPMs; publishing of generated API documentation
- Latest build = Release 1.2 (August 2002)

[Pie chart: testbed 1 source code lines by language — java, cpp, ansic, python, perl, sh, csh, sed, sql, makefile.]

140,506 lines of code, 10 languages (Release 1.0).

Page 12:

1. Robust? Software Evaluation

[Status matrix: each component assessed against the categories ETT, UT, IT, NI, NFF, MB, SD.]

Components evaluated: Resource Broker, Job Desc. Lang., Info. Index, User Interface, Log. & Book. Svc., Job Sub. Svc., Broker Info. API, SpitFire, GDMP, Rep. Cat. API, Globus Rep. Cat., Schema, FTree, R-GMA, Archiver Module, GRM/PROVE, LCFG, CCM, Image Install., PBS Info. Prov., LSF Info. Prov., SE Info. Prov., File Elem. Script, Info. Prov. Config., RFIO, MSS Staging, Mkgridmap & daemon, CRL update & daemon, Security RPMs, EDG Globus Config., PingER, UDPMon, IPerf, Globus2 Toolkit.

Legend: ETT = Extensively Tested in Testbed; UT = Unit Testing; IT = Integrated Testing; NI = Not Installed; NFF = Some Non-Functioning Features; MB = Some Minor Bugs; SD = Successfully Deployed.

Page 13:

1. Robust? Middleware Testbed(s)

Testing activities (B. Jones, July 2002):
- WPs add unit-tested code to the CVS repository
- The nightly build and automated tests run; any errors are fixed
- Code is installed on the certification testbed and backward-compatibility tests run; any errors are fixed
- A candidate beta release goes to the applications for testing; any errors are fixed
- A candidate public release is issued for use by the applications

Testbeds: “Development”, “Certification”, “Application” and “WP specific”; support ranges from office hours to 24x7 (I Team, TSTG, ATG, Apps, WPs).

Validation/Maintenance => Testbed(s). EU-wide development.

Page 14:

1. Robust? Code Development Issues

- Reverse engineering (C++ code analysis and restructuring; coding standards) => abstraction of existing code to UML architecture diagrams
- Language choice (currently 10 used in DataGrid): Java = C++ -- “features” (global variables, pointer manipulation, goto statements, etc.)
- Constraints (performance, libraries, legacy code)
- Testing (automation, object-oriented testing)
- Industrial strength? OGSA-compliant? O(20 year) future-proof??

[Sequence diagram: PRODUCTION (Simulation), spanning the production team, experiment-specific modules, the Grid and the physics application.]

- Login; if the actor is proxy-certified, get LFNs for database access and allocate output LFNs (else stop); experiment-wide database selection via the VO metadata (data description) catalog
- Define execution criteria (CE, priority, ...) and storage preferences for output files (SE, MSS, closest, ...); display available resources/JDL
- Write the submission job (JDL?) and submit it to the Grid; the VO job submission bookkeeping service records job parameters (JDL, input, ...) and allocates a Job Id
- Match the job to resources using the VO replica catalog and the VO metadata configuration catalog; optimise the CE choice per VO; submit the job to the CE and on to a working node
- Prepare the execution environment and associate PFN-LFN; execute the physics application
- VO database access via standard POSIX calls (open(LFN), read/write, close) or a grid wrapper to POSIX calls; the application is never recompiled or relinked to run on the Grid (???????)
- Manage output files and update the file catalog (LFN-PFN); file management and PFN selection, e.g. automatic file replication, or file transfer plus file catalog update; register/update attributes (LFN) in the VO metadata catalog
- Record execution info and publish job-related information; management of job-related information; job execution accounting service
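The "grid wrapper to POSIX calls" step above can be sketched as follows. This is a hypothetical illustration, not an EDG API: the application asks for a logical file name (LFN), the wrapper resolves it to a physical file name (PFN) via the replica catalog and hands the result to the ordinary POSIX open, which is why the application never needs to be recompiled or relinked.

```python
# Toy LFN -> PFN map; in reality this would be a query to the VO replica catalog.
REPLICA_CATALOG = {"lfn://run123/hits.dat": "/data/replica/hits.dat"}

def resolve(name: str) -> str:
    """Map an LFN to a local PFN; pass physical paths through unchanged."""
    if name.startswith("lfn://"):
        return REPLICA_CATALOG.get(name, name)
    return name

def grid_open(name, mode="r"):
    """POSIX open() on the resolved physical name."""
    return open(resolve(name), mode)

print(resolve("lfn://run123/hits.dat"))  # /data/replica/hits.dat
```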


Page 15:

Data Management on the Grid

“Data in particle physics is centred on events stored in a database… Groups of events are collected in (typically GByte) files… In order to utilise additional resources and minimise data analysis time, Grid replication mechanisms are currently being used at the file level.”

- Access to a database via Grid certificates (Spitfire/OGSA-DAI)
- Replication of files on the Grid (GDMP/Giggle)
- Replication and optimisation simulation (Reptor/Optor)

Page 16:

2. Spitfire

“Secure?” At the level required in particle physics.

[Flow diagram: an HTTP + SSL request with a client certificate reaches the Servlet Container (SSLServletSocketFactory, TrustManager). The Security Servlet checks the certificate against the trusted CAs (is it signed by a trusted CA?) and the revoked-certificates repository (has it been revoked?). The Authorization Module checks whether the user specifies a role (otherwise a default is found), validates the role against the role repository, and maps the role to a connection id via the connection mappings. The Translator Servlet then requests a connection ID from the Connection Pool and issues the query to the RDBMS.]
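The decision chain in the diagram can be sketched in a few lines. All names and data structures here are hypothetical stand-ins, not the Spitfire implementation: verify the certificate, resolve the role, and map the role to a database connection id.

```python
# Toy stand-ins for the trusted-CA list, revocation list, role repository
# and connection mappings shown in the diagram.
TRUSTED_CAS = {"CERN CA"}
REVOKED = {"cert-666"}
ROLE_REPO = {"alice": {"analysis", "admin"}}
CONNECTION_MAPPINGS = {"analysis": "conn-ro", "admin": "conn-rw"}
DEFAULT_ROLE = "analysis"

def authorize(user, cert_id, issuer, role=None):
    """Return a connection id, or raise if any check in the chain fails."""
    if issuer not in TRUSTED_CAS:        # is the certificate signed by a trusted CA?
        raise PermissionError("untrusted CA")
    if cert_id in REVOKED:               # has the certificate been revoked?
        raise PermissionError("certificate revoked")
    role = role or DEFAULT_ROLE          # find default if the user specifies no role
    if role not in ROLE_REPO.get(user, set()):  # role ok?
        raise PermissionError("role not permitted")
    return CONNECTION_MAPPINGS[role]     # map role to connection id

print(authorize("alice", "cert-1", "CERN CA"))           # conn-ro
print(authorize("alice", "cert-1", "CERN CA", "admin"))  # conn-rw
```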

Page 17:

2. Database client API

- A database client API has been defined
- Implemented as a grid service using standard web service technologies
- Ongoing development with OGSA-DAI

Talk: “Project Spitfire - Towards Grid Web Service Databases”

Page 18:

3. GDMP and the Replica Catalogue

Replica Catalogue TODAY: the Globus 2.0 Replica Catalogue (LDAP), centralised and LDAP-based, links StorageElement1, StorageElement2 and StorageElement3.

GDMP 3.0 = file mirroring/replication tool. Originally for replicating CMS Objectivity files for High Level Trigger studies; now used widely in HEP.

Page 19:

3. Giggle: “Hierarchical P2P”

[Diagram: Local Replica Catalogs (LRCs), each attached to Storage Elements, are indexed by Replica Location Indices (RLIs); hierarchical indexing — the higher-level RLI contains pointers to lower-level RLIs or LRCs.]

RLI = Replica Location Index; LRC = Local Replica Catalog

“Scalable?” Trade-off: consistency versus efficiency.
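A minimal sketch of the two-level idea (the data structures are hypothetical simplifications, not the Giggle implementation): each LRC maps logical file names to the physical replicas at one site, while the RLI only knows which catalogs to ask, trading consistency (its view may be stale) for scalability.

```python
class LRC:
    """Local Replica Catalog: authoritative LFN -> PFN map for one site."""
    def __init__(self, site):
        self.site = site
        self.replicas = {}                      # lfn -> list of pfns at this site

    def register(self, lfn, pfn):
        self.replicas.setdefault(lfn, []).append(pfn)

class RLI:
    """Replica Location Index: points at catalogs that may hold a name."""
    def __init__(self, catalogs):
        self.catalogs = catalogs

    def locate(self, lfn):
        """Gather all known physical replicas from the indexed catalogs."""
        hits = []
        for lrc in self.catalogs:               # in practice this index is soft state
            hits.extend(lrc.replicas.get(lfn, []))
        return hits

cern, ral = LRC("cern"), LRC("ral")
cern.register("lfn://run123/esd.root", "pfn://cern/disk1/esd.root")
ral.register("lfn://run123/esd.root", "pfn://ral/tape7/esd.root")
index = RLI([cern, ral])
print(index.locate("lfn://run123/esd.root"))
```

Because the RLI holds only pointers (kept up to date by periodic summaries rather than synchronous updates), it scales to many sites at the cost of occasionally answering from stale information.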

Page 20:

4. Reptor/Optor: File Replication / Simulation

Reptor: replica architecture. Optor: tests file replication strategies, e.g. an economic model.

[Architecture diagram: at each site, a Replica Manager (with Core, Optimisation, Processing and Pre-/Post-processing APIs) coordinates the Storage Element, Computing Element and Optimiser, and consults the Local Replica Catalogue, the Replica Location Index and the Replica Metadata Catalogue; a Resource Broker and User Interface sit above the sites.]

Demo and poster: “Studying Dynamic Grid Optimisation Algorithms for File Replication”

“Efficient?” Requires simulation studies…
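A toy sketch of the kind of economic-model replication decision Optor simulates (the value function and all numbers here are hypothetical): a site whose Storage Element is full replicates a requested file only if its predicted worth exceeds that of the cheapest file it would have to evict.

```python
def file_value(access_history, lfn):
    """Predict a file's worth; here simply its recent access count (assumed model)."""
    return access_history.get(lfn, 0)

def should_replicate(requested_lfn, stored_lfns, access_history):
    """Return (replicate?, lfn_to_evict_or_None) for a full storage element."""
    cheapest = min(stored_lfns, key=lambda f: file_value(access_history, f))
    if file_value(access_history, requested_lfn) > file_value(access_history, cheapest):
        return True, cheapest      # the new replica is worth more than the eviction
    return False, None             # keep what we have; read the file remotely

history = {"hot.root": 12, "warm.root": 5, "cold.root": 1, "new.root": 3}
print(should_replicate("new.root", ["hot.root", "warm.root", "cold.root"], history))
# (True, 'cold.root')
```

Simulation is needed precisely because the pay-off of such local decisions depends on global access patterns and network costs that no single site observes.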

Page 21:

Application Requirements

“The current EMBL production database is 150 GB, which takes over four hours to download at full bandwidth capability at the EBI. The EBI's data repositories receive 100,000 to 250,000 hits per day with 20% from UK sites; 563 unique UK domains with 27 sites have more than 50 hits per day.” (MyGrid proposal)

Suggests:
- Less emphasis on efficient data access and data hierarchy aspects (application specific)
- Large gains in biological applications from efficient file replication
- Larger gains from application-specific replication?
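As a quick sanity check on the quoted figures (an illustrative calculation, not from the talk), downloading 150 GB in about four hours implies a sustained rate in the tens of Mbit/s:

```python
def implied_mbit_per_s(size_gb: float, hours: float) -> float:
    """Sustained bandwidth (Mbit/s) needed to move size_gb GB in the given hours."""
    return size_gb * 8 * 1000 / (hours * 3600)   # GB -> Mbit, hours -> seconds

print(f"{implied_mbit_per_s(150, 4):.0f} Mbit/s")  # ~83 Mbit/s
```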

Page 22:

Events.. to Files.. to Events

[Diagram: an “Interesting Events List” points into RAW, ESD, AOD and TAG data files replicated across Tier-0 (International), Tier-1 (National), Tier-2 (Regional) and Tier-3 (Local); events 1, 2 and 3 are scattered across the replicated files.]

- Not all pre-filtered events are interesting…
- Non pre-filtered events may be…
- File replication overhead.
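The overhead the slide points at can be made concrete (all numbers here are hypothetical): when replication works on whole files but an analysis wants individual events, moving a GByte file to obtain a handful of interesting events wastes most of the transfer.

```python
def transfer_overhead(file_size_bytes, event_size_bytes, interesting_events):
    """Fraction of transferred bytes that the analysis never uses."""
    useful = interesting_events * event_size_bytes
    return 1.0 - useful / file_size_bytes

# A 1 GB file of ~100 kB ESD events (10,000 events), of which 50 pass the cut:
waste = transfer_overhead(10**9, 10**5, 50)
print(f"{waste:.1%} of the transfer is overhead")  # 99.5% of the transfer is overhead
```

This is the motivation for the next slide's move from file-level to event-level replication.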

Page 23:

Events.. to Events: Event Replication and Query Optimisation

[Diagram: the same RAW, ESD, AOD and TAG hierarchy across Tier-0 (International), Tier-1 (National), Tier-2 (Regional) and Tier-3 (Local), with the “Interesting Events List” and events 1, 2 and 3 now served from a distributed (replicated) database; knowledge as “Stars in Stripes”.]

Page 24:

Data Grid for the Scientist

[Cartoon: the scientist (“@#%&*!”) works through the Grid Middleware to get from E = mc2 back to the real (or simulated) data.]

An incremental process… At what level is the metadata? File?… event?… sub-event?…

Page 25:

Summary

Yesterday’s data access issues are still here; they just got bigger (by a factor of 100). A data hierarchy is required to access more data more efficiently… insufficient.

Today’s Grid tools are developing rapidly:
- Enable replicated file access across the grid
- File replication standard (lfn://, pfn://)
- Emerging standards for Grid data access..

Tomorrow “.. never knows”:
- Replicated “Events” on the Grid?..
- Distributed databases?.. or did that diagram look a little too monolithic?