Page 1: Title

ATLAS Data Challenges on EDG

Oxana Smirnova, LCG/ATLAS/Lund
oxana.smirnova@cern.ch
4th NorduGrid Workshop, November 11, 2002, Uppsala

Page 2: EU DataGrid project

- Started on January 1, 2001; to deliver by the end of 2003.
- Aim: to develop Grid middleware suitable for High Energy Physics, Earth Observation and biology applications.
- Initial development is based on existing tools, e.g., Globus, LCFG, GDMP etc.
- The core testbed consists of the central site at CERN and a few facilities across Western Europe; many more sites are foreseen to join soon:
  - Italy and the UK come with several sites each; Spain, Germany and others join via CrossGrid.
  - ATLAS-affiliated sites: Canada, Taiwan etc.
- By now the testbed has reached a stability level sufficient to test submission of production-style tasks.

Page 3: EDG Testbed

- EDG is committed to creating a stable testbed to be used by applications for real tasks.
- This started to materialize in mid-August... and coincided with the ATLAS DC1; ATLAS asked for, and was given, first priority.
- Most sites are installed from scratch using the EDG tools (RedHat 6.2 based):
  - NIKHEF: EDG installation and configuration only.
  - Lyon: installation on top of an existing farm.
  - A lightweight EDG installation is available.
- Central element: the Resource Broker (RB), which distributes jobs between the resources.
  - Currently, only one RB (at CERN) is available for applications.
  - In the future, there may be an RB per Virtual Organization (VO).

Page 4: EDG functionality as of today

[Diagram: EDG job flow between the User Interface (UI), the Resource Broker (RB), Computing Elements (CE), the Replica Catalog (RC) and CASTOR, on the nodes testbed010.cern.ch, lxshare0393.cern.ch, lxshare033.cern.ch and lxshare0399.cern.ch. A job described in JDL goes from the UI to the RB and is dispatched to a CE as RSL; input is staged from CASTOR with rfcp and replicated with GDMP or the Replica Manager (RM); output is stored back via GDMP or RM; LDAP and NFS link the services. Chart borrowed from Guido Negri's slides.]
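From the user's side, the chart boils down to a submit/status/get-output cycle at the UI. A minimal sketch, assuming the EDG 1.x WP1 command-line tools of the era (the exact command names and the JDL file name are illustrative):

    # Submit a JDL-wrapped job to the Resource Broker from the UI
    # (prints a URL-like job identifier).
    dg-job-submit ds2000.jdl

    # Follow the job through the RB/CE states.
    dg-job-status <job-id>

    # When done, fetch the files listed in the job's OutputSandbox.
    dg-job-get-output <job-id>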

Page 5: ATLAS-EDG Task Force

- ATLAS is eager to use Grid tools for the Data Challenges.
- ATLAS Data Challenges are already on the Grid (NorduGrid, VDT).
- The DC1/phase2 (starting now) is expected to be done using the Grid tools to a larger extent.
- The ATLAS-EDG Task Force was put together in August with the aims:
  - to assess the usability of the EDG testbed for the immediate production tasks;
  - to introduce Grid awareness to the ATLAS collaboration.
- The Task Force has representatives from both ATLAS and EDG: 40+ members (!) on the mailing list, ca. 10 of them working nearly full-time.
- The initial task: to process 5 input partitions of Dataset 2000 on the EDG Testbed plus one non-EDG site (Karlsruhe); if this works, continue with other datasets.

Page 6: Execution of jobs

- It was expected that we could make full use of the Resource Broker functionality: data-driven job steering, and otherwise the best available resources.
- Input files are pre-staged once (copied from CASTOR and replicated elsewhere).
- A job consists of the standard DC1 shell script, very much the way it is done on a conventional cluster.
- A Job Definition Language (JDL) is used to wrap up the job, specifying (see the sketch after this list):
  - the executable file (script);
  - input data;
  - files to be retrieved manually by the user;
  - optionally, other attributes (MaxCPU, Rank etc.).
- Storage and registration of output files is part of the job script: i.e., the application manages output data the way it needs.
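As an illustration, such a wrapper might look as follows in EDG-style JDL. The attribute names follow the EDG job description language; the file names, partition number and the Rank expression are hypothetical, not taken from the actual production scripts:

    # A minimal, hypothetical DC1 job description in EDG-style JDL;
    # all file names and values are illustrative.
    Executable    = "ds2000.sh";                # the standard DC1 shell script
    Arguments     = "0042";                     # e.g. the partition number
    StdOutput     = "dc1.0042.log";
    StdError      = "dc1.0042.err";
    InputSandbox  = {"ds2000.sh"};              # shipped from the UI with the job
    OutputSandbox = {"dc1.0042.log", "dc1.0042.err"};  # retrieved manually by the user
    InputData     = {"LF:ds2000.0042.zebra"};   # logical file; lets the RB steer the job to the data
    Rank          = other.FreeCPUs;             # optional: prefer sites with free CPUs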

Page 7: Hurdles

- EDG cannot replicate files directly from CASTOR and cannot register them in the Replica Catalog.
  - Replication was done via the CERN SE; EDG is working on a better (though temporary) solution. The CASTOR team is writing a GridFTP interface, which will help a lot.
- Big file transfers are interrupted after 21 minutes.
  - A known Globus GridFTP server problem, temporarily worked around by using multi-threaded GridFTP instead of the EDG tools (see the sketch after this list).
- Jobs were "lost" by the system after 20 minutes of execution.
  - A known problem of the Globus software (the GASS Cache mechanism), temporarily fixed at the expense of frequent job submission.
- Static information system: if a site goes down, it has to be removed manually from the index.
  - Attempts are under way to switch to the dynamic hierarchical MDS; not yet stable due to Globus bugs.
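The "multi-threaded GridFTP" workaround is essentially a parallel-stream globus-url-copy invoked by hand; the host and file names below are made up for illustration:

    # Copy a large input file with 8 parallel GridFTP streams,
    # side-stepping the single-stream transfer that times out at ~21 minutes.
    globus-url-copy -p 8 \
        gsiftp://se01.cern.ch/data/atlas/ds2000.0042.zebra \
        file:///scratch/atlas/ds2000.0042.zebra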

Page 8: Other minor problems

- Installation of ATLAS software:
  - cyclic dependencies;
  - external dependencies, especially on system software.
- Authentication & authorization, users and services:
  - EDG cannot instantly accept a dozen new national Certificate Authorities;
  - the default proxy lives only 12 hours; users keep forgetting to request longer-lived ones to accommodate long jobs (see the sketch after this list).
- Documentation:
  - is abundant but not very user-oriented;
  - things are improving as more users arrive.
- Information system:
  - faulty information providers, affecting brokering;
  - very difficult to browse/search and retrieve relevant information.
- Data management:
  - information about existing file collections is not easy to find;
  - management of output data is mostly manual (cannot be done via JDL).
- General instability of most EDG services.
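Requesting a longer-lived proxy before submitting long jobs is a one-liner at the UI; the 48-hour value is an arbitrary example:

    # Create a proxy valid for 48 hours instead of the 12-hour default,
    # so long DC1 jobs do not outlive their credentials.
    grid-proxy-init -hours 48

    # Check the remaining lifetime of the current proxy (in seconds).
    grid-proxy-info -timeleft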

Page 9: Achievements

- A team of hard-working people across Europe (the ATLAS VO is 45 members strong as of today).
- ATLAS software (starting from release 3.2.1) is packed into relocatable RPMs, distributed and validated elsewhere (see the sketch after this list).
- The DC1 production script is "gridified"; a submission script has been produced.
- A user-friendly testbed status monitor and an ATLAS VO information page are deployed.
- 5 Dataset 2000 input files are replicated to 5 sites (2 at each).
- Two production-style tests completed:
  - the first 100 partitions of Dataset 2000 were processed;
  - other (smaller) datasets: 4 input files (ca. 400 MB each) replicated to 4 sites; 250 jobs submitted, adjusted to run ca. 4 hours each. The jobs were distributed across the whole testbed by the Resource Broker.
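As an illustration of what "relocatable" buys: a site can install the same RPM under any prefix of its choosing; the package name below is hypothetical:

    # Install a relocatable ATLAS release RPM under a site-chosen prefix
    # instead of the path baked in at build time (package name is made up).
    rpm -ivh --prefix /opt/atlas atlas-release-3.2.1-1.i386.rpm

    # List the installed files to verify the relocation.
    rpm -ql atlas-release-3.2.1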

Page 10: Summary

Environment                                                          | Job execution (success/failure)                                                                                     | Data management (success/failure)
Testbed 1.2.0                                                        | GASS Cache problems, 100% failure                                                                                   | Big file replication fails (GridFTP timeout); no CASTOR support
Testbed 1.2(.1): only the CERN site available, GASS Cache "unfixed"  | Half of the Dataset 2000 jobs executed, 100% success                                                                | Not applicable (only one site is used)
Testbed 1.2.2: all the core sites have GASS Cache "unfixed"          | 400 short jobs executed across the testbed; the rest of the Dataset 2000 jobs proceeded with a >50% re-submission rate | Short files replicated everywhere; longer files copied manually (GridFTP not fixed)
Testbed 1.3: a.k.a. "The Showstopper" release                        | To be tested (GASS Cache is expected to be fixed)                                                                   | To be tested (GridFTP is expected to be fixed)

Page 11: What next

- Testbed 1.3 is available for testing (not on the production sites yet) from today.
- A precise quantification of the failure/success rate, using Dataset 2000 partitions, is to be done on Testbed 1.3.
- ATLAS DC1, pile-up: the runtime environment is ready, scripts are prepared.
  - Testbed feature: the "old" runtime environment (3.2.1) has to be replaced with a new one (4.0.1).
- The CASTOR-EDG interface has to be tested; a GridFTP server on CASTOR is expected to arrive soon.
- Some ATLAS production sites may join the EDG Testbed soon.

Page 12:

Ingo Augustin, Vandy Berten, Jean-Jacques Blaising, Frederic Brochu, Stephen Burke, Serban Constantinescu, Francois Etienne, Michael Gardner, Luc Goossens, Marcus Hardt, Frank Harris, Fabio Hernandez, Bob Jones, Roger Jones, Christos Kanellopoulos, Andrey Kiryanov, Peter Kunszt, Emanuele Leonardi, Cal Loomis, Fairouz Malek-Ohlsson, Gonzalo Merino, Armin Nairz, Guido Negri, Steve O'Neale, Laura Perini, Gilbert Poulard, Alois Putzer, Di Qing, Mario Reale, David Rebatto, Zhongliang Ren, Silvia Resconi, Alessandro De Salvo, Markus Schulz, Massimo Sgaravatto, Oxana Smirnova, Chun Lik Tan, Jeff Templon, Stan Thompson, Luca Vaccarossa, Peter Watkins

No animals were harmed in the production tests

MMII