
WP8 Status – Stephen Burke – 30th January 2003

WP8 Status

Stephen Burke (RAL)

(with thanks to Frank Harris)

WP8 Status – Stephen Burke – 30th January 2003 – n° 2/17

Outline

Overview of objectives for the 2nd project year, and the corresponding achievements

Ongoing work on use cases

Evaluations by Loose Cannons

Data Challenge work with Atlas and CMS

Comments on the key points of work in the other experiments

The organisation for D8.3 ‘Testbed assessment for HEP applications’

The planning for the 3rd project year, and some associated issues

WP8 Status – Stephen Burke – 30th January 2003 – n° 3/17

Objectives for the 2nd project year, and the corresponding achievements

OBJECTIVES

Use and exploitation of Testbed 1

Validation of releases + feedback

Participation in the ATF, and the elaboration of use cases

Design of a common middleware layer for WP8 experiments

Use of EDG middleware in experiment Data Challenges (DCs)

ACHIEVEMENTS

All experiments have used the applications testbed. Babar and D0 have joined the 4 LHC experiments, and NA48 will soon join.

Both the LCs (Loose Cannons) and the experiments have given continual feedback on the middleware from both generic and experiment-specific evaluations.

The ATF is very active and conducts regular ‘scenario playing’ reviews. Use case documents have been produced and will be developed further in the context of EDG/LCG.

This design work has moved into the LCG project.

Atlas and then CMS have done significant pioneering work in the use of EDG middleware for DCs, and have produced detailed evaluations.

WP8 Status – Stephen Burke – 30th January 2003 – n° 4/17

Ongoing work on use cases

‘Common Use Cases for a HEP Common Application Layer’ (HEPCAL)

(Document produced for LCG by a WG chaired and largely manned by WP8 people)

General (authorisation, login, browse resources): 4 use cases

Data Management (metadata and data operations): 19 use cases

Job Management (submission, control, monitoring, errors, resource estimation, job splitting, …): 16 use cases

VO Management (resource reservation, user rights, software publishing, …): 4 use cases

EDG 1.4.3 satisfies the use cases for a basic system (authorisation/authentication, data handling, job submission); see the sketch after this list.

EDG 2 will satisfy more advanced requirements, e.g. data handling (metadata) and HEP data transformation.

There are other areas for discussion, e.g. virtual data and experiment s/w publishing.

This work is to continue within EDG and LCG.

In the ATF, regular scenario playing of the use cases checks the existing and future design.
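As a concrete illustration of the basic-system use cases, here is a minimal sketch driving the EDG 1.x command-line tools from Python: proxy creation for authorisation/authentication, a small JDL job description, and submission through the Resource Broker. It assumes the EDG 1.x CLI names (grid-proxy-init from Globus, dg-job-submit from WP1); the JDL content is illustrative, not taken from the talk.

```python
# Minimal sketch of the 'basic system' use cases: authentication,
# job description and job submission. Assumes the EDG 1.x CLI
# (grid-proxy-init, dg-job-submit) is installed on the UI.
import subprocess

# 1. Authorisation/authentication: create a Grid proxy certificate.
subprocess.run(["grid-proxy-init"], check=True)

# 2. Describe the job in JDL, the ClassAd-style language used by the
#    EDG workload management system (content is illustrative).
jdl = '''\
Executable    = "/bin/echo";
Arguments     = "Hello Grid";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
'''
with open("hello.jdl", "w") as f:
    f.write(jdl)

# 3. Job submission through the Resource Broker.
subprocess.run(["dg-job-submit", "hello.jdl"], check=True)
```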

WP8 Status – Stephen Burke – 30th January 2003 – n° 5/17

Evaluations by Loose Cannons

The Loose Cannons have been involved in:

Functionality and stress testing

Middleware debugging campaigns

Configuration and testing of Storage Elements and Virtual Organisations

Data Challenges of the ATLAS and CMS experiments

Integration Team and Architectural Task Force

WP8 Status – Stephen Burke – 30th January 2003 – n° 6/17

Data Challenge work with Atlas

Purpose of the evaluation

Verify the use of EDG middleware for Atlas Data Challenges (DCs)

Verify the portability of Atlas simulation code to a grid environment

Specific goals

Compare results with those obtained without the Grid

Make a prioritised list of recommendations to EDG for bug fixes and future developments in an evaluation report

Organisation

Joint Atlas/EDG/LCG effort

Resources used (and functions)

Sites: CERN, RAL, Lyon, Nikhef, CNAF + Karlsruhe

Several UIs: Milan, CERN, Cambridge

RB: CERN

RC: originally shared with CMS; later a separate one at CNAF

WP8 Status – Stephen Burke – 30th January 2003 – n° 7/17

Atlas evaluations (August and Dec/Jan) (detailed paper in preparation)

RESULTS

Atlas software was used in the EDG Grid environment

Several hundred simulation jobs of 4-24 hours in length were executed, and data was replicated using grid tools

Results of the simulation agreed with ‘non-Grid’ runs

OBSERVATIONS

Good interaction with the EDG middleware providers and with WP6/WP8

With a very big effort it was possible to run the jobs

Showed up bugs and performance limitations (fixed or to be fixed in Testbed 2)

WP1: many long jobs failed (now much better)

WP2: replication tools were difficult to use and not reliable

WP3: the Information Service based on MDS gave poor performance (affected WP1)

WP4: we need to separate application and system software installations

We need the Testbed 2 release for use in large-scale data challenges

RECOMMENDATIONS (see the combined ATLAS/CMS recommendations…)

WP8 Status – Stephen Burke – 30th January 2003 – n° 8/17

Data Challenge work with CMS

Purpose of the “stress test”

Verify the use of EDG middleware for CMS Production

Verify the portability of the CMS Production environment to a grid environment

Specific goals

Aim for as many simulated events as possible for physics, with thousands of ‘short’ event generation and ‘long’ detector simulation jobs using the full production system

Measure performance, efficiencies and the reasons for job failures

Aim for a stable system by bug fixing and the reconfiguration of components

Organisation

A joint effort involving CMS, EDG, EDT and LCG people

Resources used (and functions)

Sites: CERN, RAL, Lyon, Nikhef, CNAF + Legnaro, Padova, Ecole Polytechnique, IC

UIs: CNAF, Padova, Ecole Polytechnique, IC

RBs: CNAF (CMS), CNAF (shared), CERN (CMS), IC (CMS+Babar)

RC: originally shared with Atlas; later a separate one at CNAF

WP8 Status – Stephen Burke – 30th January 2003 – n° 9/17

CMS production components interfaced to EDG middleware

[Figure: architecture diagram. RefDB supplies production parameters to IMPALA/BOSS on the UI (the CMS production tools for job creation, job submission and monitoring); jobs are described in JDL and passed to the EDG Workload Management System, which dispatches them to CEs/WNs where the CMS software (rpm-based) is installed; the Replica Manager handles data registration and input data location on the SEs; job output filtering and runtime monitoring feed the BOSS DB. Arrows in the original distinguish pushed data/info from pulled info.]
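To make the “job output filtering” and “runtime monitoring” paths in the figure concrete, below is a minimal sketch of a BOSS-style job wrapper. It is not the actual IMPALA/BOSS code; the progress-marker format (“EVT …”) and the logging target are assumptions for illustration only.

```python
# Illustrative sketch of a BOSS-style job wrapper: run the real job,
# filter its output for progress markers, and push runtime monitoring
# records. Not the actual IMPALA/BOSS code.
import subprocess
import time

def push_monitoring_record(job_id, key, value):
    # Hypothetical stand-in for an update to the BOSS DB;
    # here we simply append to a local log file.
    with open("boss_monitor.log", "a") as log:
        log.write(f"{time.time():.0f} {job_id} {key}={value}\n")

def run_wrapped(job_id, cmd):
    push_monitoring_record(job_id, "status", "RUNNING")
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:          # filter job output as it arrives
        if line.startswith("EVT"):    # assumed progress-marker format
            push_monitoring_record(job_id, "events", line.split()[1])
    proc.wait()
    push_monitoring_record(job_id, "status",
                           "DONE" if proc.returncode == 0 else "FAILED")
    return proc.returncode

# Example: wrap a toy executable in place of the CMS simulation binary.
run_wrapped("job001", ["/bin/echo", "EVT 100"])
```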

WP8 Status – Stephen Burke – 30th January 2003 – n° 10/17

CMS use of the EDG TB (some statistics)

[Figure: plots of the number of CEs and SEs, and of the number of events produced, as functions of time.]

Event production within EDG was part of the official CMS production:

http://cmsdoc.cern.ch/cms/production/www/html/general/index.html

WP8 Status – Stephen Burke – 30th January 2003 – n° 11/17

CMS/EDG Summary of Stress Test

CMKIN jobs (short jobs):

Status                  EDG evaluation   CMS evaluation   EDG ver 1.4.3 (after stress test, Jan 03)
Finished correctly      5518             4601             604
Crashed or bad status   818              1099             65
Total number of jobs    6336             5700             669
Efficiency              0.87             0.81             0.90

CMSIM jobs (long jobs):

Status                  EDG evaluation   CMS evaluation   EDG ver 1.4.3 (after stress test, Jan 03)
Finished correctly      1678             2147             394
Crashed or bad status   2662             934              104
Total number of jobs    4340             3081             498
Efficiency              0.39             0.70             0.79
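For reference, the efficiency rows are simply finished/total; this short check reproduces the figures in the tables above from the raw job counts:

```python
# Check of the efficiency figures above: efficiency = finished / total.
tables = {
    "CMKIN (short jobs)": {"EDG evaluation": (5518, 6336),
                           "CMS evaluation": (4601, 5700),
                           "EDG ver 1.4.3":  (604, 669)},
    "CMSIM (long jobs)":  {"EDG evaluation": (1678, 4340),
                           "CMS evaluation": (2147, 3081),
                           "EDG ver 1.4.3":  (394, 498)},
}
for job_type, cols in tables.items():
    for col, (finished, total) in cols.items():
        print(f"{job_type:20s} {col:15s} {finished/total:.2f}")
```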

WP8 Status – Stephen Burke – 30th January 2003 – n° 12/17

Main results, observations and recommendations from CMS work (detailed doc in preparation)

RESULTS

Could distribute and run CMS s/w in the EDG environment

Generated ~250K events for physics with ~10000 jobs in a 3-week period

OBSERVATIONS

Were able to add new sites quickly to provide extra resources

Fast turnaround in bug fixing and installing new software

Job efficiency has grown from ~60% to currently more than 80% (much better for short jobs (secs) than for long jobs (hours))

The test was labour intensive (since the software was developing and the overall system was fragile)

WP1: at the start there were serious problems with long jobs; recently improved

WP2: replication tools were difficult to use and not reliable, and the performance of the Replica Catalogue was unsatisfactory

WP3: limitations in the Information System based on MDS: it performed poorly with increasing query rate

The system is sensitive to hardware faults and site/system misconfiguration

User tools for fault diagnosis are limited

Testbed 2 should fix the major problems, providing a system suitable for full integration in distributed production

WP8 Status – Stephen Burke – 30th January 2003 – n° 13/17

Joint recommendations from Atlas/CMS work

There are essential developments needed in:

Data Management (robustness and functionality)

Information Systems (robustness and scalability)

Workload Management (scalability for high rates, batch submissions, output file specification)

Mass Storage Support (gridified support due in Version 2)

We must maintain and strengthen joint Experiment/EDG work in the evaluation of system components AND the architecture (both will need to evolve; Grid developments are R&D).

Once the middleware providers have done their ‘unit tests’, the applications must work with them in the areas of:

Performance evaluation for the user with increasing rates of job submission and data handling, and an expanding TB configuration

Streamlining procedures for feedback to middleware providers

EDG should provide site validation and monitoring procedures.

EDG should provide good user tools for fault detection and diagnosis (what is the job status? why did it fail? …); a sketch of such a tool follows below.
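As an example of the kind of user-side tool being requested, here is a minimal sketch of a status-polling helper that keeps the raw broker output for post-mortem diagnosis. It assumes the EDG 1.x dg-job-status command; the status keywords and the string matching are assumptions for illustration.

```python
# Minimal sketch of a user-side diagnosis helper: poll a job's status
# until it finishes, keeping the raw output for post-mortem analysis.
# Assumes the EDG 1.x CLI (dg-job-status); status keywords are assumed.
import subprocess
import time

def wait_for_job(job_id, poll_seconds=60):
    while True:
        out = subprocess.run(["dg-job-status", job_id],
                             capture_output=True, text=True).stdout
        with open("job_status_history.log", "a") as log:
            log.write(out + "\n")        # keep raw output for diagnosis
        if "Done" in out or "Aborted" in out:   # assumed status keywords
            return out
        time.sleep(poll_seconds)
```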

WP8 Status – Stephen Burke – 30th January 2003 – n° 14/17

Some key points of work in the other experiments

ALICE

Developed scripts for the installation of ALICE software on EDG CEs

Developed a web interface to submit jobs to the testbed automatically and evaluate its "efficiency" (currently in use)

Current development of the AliEn/EDG interface:

Able to send jobs to EDG via AliEn

Completing the tests for registering/accessing data on/from both catalogues (AliEn and EDG), which is required for interoperability

LHCb

Consolidation of the basic job submission capability (demonstrated at the EU review and at the opening of the National e-Science Centre, Edinburgh)

Made RPMs for the LHCb environment

Included DataGrid in the new LHCb distributed production system (DIRAC) and demonstrated that short DataGrid jobs can be submitted and managed via DIRAC

WP8 Status – Stephen Burke – 30th January 2003 – n° 15/17

Babar

Deployment of the BaBar VO: VO and RC at Manchester, RB at IC; CE/SE/WN at SLAC, IN2P3, RAL and Ferrara

Deployment and adaptation of EDG software at SLAC (the EDG scripts had to be modified for the WNs inside the Internet Free Zone)

Successfully tested BaBar analysis and simulation jobs within the EDG framework

The next step is to run a real full-scale analysis on the Grid

D0

A D0 Replica Catalogue and VO server have been set up at Nikhef

A 124-CPU farm at Nikhef has been successfully used with EDG s/w

D0 support was added to the official EDG release (several sites now support D0 jobs and have installed the RPMs)

Will try the newer releases (and true Grid production) when RH 7.2 support appears

WP8 Status – Stephen Burke – 30th January 2003 – n° 16/17

The key content for D8.3 ‘Testbed assessment for HEP applications’

‘DataGrid as an HEP production environment’

Detailed evaluations by the Atlas and CMS Task Forces

Evaluations by other LHC experiments (Alice, LHCb)

Evaluations from non-LHC experiments (Babar, D0)

Mapping of the evaluations to the ‘common use cases’

General use cases

Data management

Job Management

VO management

A summary of lessons learned for future EDG development, and a statement of priorities for the experiments

WP8 Status – Stephen Burke – 30th January 2003 – n° 17/17

Planning for the 3rd project year, and associated issues

PLANNING

Continue work with the experiments using the Task Force model for Data Challenges

Complete D8.3 by end March 2003 (based on EDG 1.4.3)

Continue architecture work in the ATF, and participate in LCG use case/architecture activities

Evaluate the Testbed 2 software, and port it to the experiment software environments for use in the data challenges

Complete D8.4 by Dec 2003 (based on Testbed 2)

SOME IMPORTANT ISSUES

WP8 will work increasingly with the experiments rather than doing generic testing, which will be taken up by the WP6 Testing Group

We must relate EDG/WP8 work to the experiments' use of the forthcoming LCG Prototype, in terms of software, hardware and user support

We must organise detailed test sessions involving the experiments and the middleware providers for information systems, data management and mass storage handling, in the context of moving to Testbed 2

We look for improved diagnostic information from the middleware in case of problems