10th GridPP Meeting – 4 June 2004 - 1
LCG Deployment

Ian Bird
IT Department, CERN

10th GridPP Meeting, CERN
4th June 2004
10th GridPP Meeting – 4 June 2004 - 2
Overview

• Deployment area organisation
• Some history – where we are now
• Data challenges – experiences
• Evolution – service challenges
• Transition to EGEE
• Interoperability
• Summary
10th GridPP Meeting – 4 June 2004 - 3
LCG Deployment Organisation and Collaborations

[Organisation diagram: the LCG Deployment Area, led by the Deployment Area Manager and the Grid Deployment Board, comprises the Certification Team, Deployment Team, Experiment Integration Team, Testing group, Security group, Storage group and GDB task forces, together with Operations Centres (RAL) and Call Centres (FZK). The GDB advises, informs and sets policy; the LHC experiments and Regional Centres set requirements and participate; collaborative activities link the area to the JTB, HEPiX, GGF and grid projects such as EDG, Trillium and Grid3/OSG.]
10th GridPP Meeting – 4 June 2004 - 4
Communication

• Weekly GDA meetings (Monday 14:00, VRVS, phone)
  – Mail list – [email protected]
  – Open to all – need experiments, regional centres, etc.
  – Technical discussions, understand what the priorities are
  – Policy issues referred back to PEB or GDB
  – Experience so far:
    • Experiments join, regional centres don't
    • NEED participation of system managers and admins – we need a rounded view of the issues
• Weekly core site phone conference
  – Address specific issues with deployment
• Also at CERN: weekly DC coordination meetings with each experiment
• GDB meetings monthly
  – Make sure your GDB rep keeps you informed
• Open to ways to improve communication!
10th GridPP Meeting – 4 June 2004 - 5
Some history – 2003/2004

Recall goals:
• July: introduce the initial publicly available LCG-1 global grid service
• November: expanded LCG-1 service with resources and functionality sufficient for the 2004 Computing Data Challenges

• LCG-0: pilot service was deployed in Feb/March
  – Was used by CMS in Italy very successfully for productions
• LCG-1: based on VDT & EDG 2.0, was deployed in September
  – Not heavily used by experiments – but was successfully used by CMS for production over Christmas, and (US-)ATLAS demonstrated interoperability with Grid2003
  – Lacked a real (managed) SE and integration with MSS
• LCG-2: based on VDT & EDG 2.1, was ready by end 2003
  – Data management tools integrated with SRM; intended to package dCache as a managed disk SE
  – Deployed in Jan/Feb 2004 – many updates – used by the experiments in the 2004 data challenges
10th GridPP Meeting – 4 June 2004 - 6
Sites in LCG-2/EGEE-0: June 4 2004

Austria: U-Innsbruck
Canada: Triumf, Alberta, Carleton, Montreal, Toronto
Czech Republic: Prague-FZU, Prague-CESNET
France: CC-IN2P3, Clermont-Ferrand
Germany: FZK, Aachen, DESY, Wuppertal
Greece: HellasGrid
Hungary: Budapest
India: TIFR
Israel: Tel-Aviv, Weizmann
Italy: CNAF, Frascati, Legnaro, Milano, Napoli, Roma, Torino
Japan: Tokyo
Netherlands: NIKHEF
Pakistan: NCP
Poland: Krakow
Portugal: LIP
Russia: SINP-Moscow, JINR-Dubna
Spain: PIC, UAM, USC, UB-Barcelona, IFCA, CIEMAT, IFIC
Switzerland: CERN, CSCS
Taiwan: ASCC, NCU
UK: RAL, Birmingham, Cavendish, Glasgow, Imperial, Lancaster, Manchester, QMUL, RAL-PP, Sheffield, UCL
US: BNL, FNAL
HP: Puerto-Rico

• 22 countries, 58 sites (45 Europe, 2 US, 5 Canada, 5 Asia, 1 HP)
• Coming: New Zealand, China, other HP (Brazil, Singapore)
• 3800 CPUs
10th GridPP Meeting – 4 June 2004 - 7
Experience: Data challenges

• ALICE has been running since March
• CMS DC04
• LHCb now starting seriously
• ATLAS starting now
See talks from June 2
10th GridPP Meeting – 4 June 2004 - 8
Data challenges – so far

Resources:
• CPU available – ALICE could not fully utilise it because of storage limitations
• Disk available – mostly very small amounts
  – Need: plan space vs CPU at a site; ensure that commitments are provided
  – To some extent not requested – delay in the dCache SE – sites were asked not to commit everything to classic SEs, as migration was expected/worried about
• ALICE and CMS:
  – Number and (small) size of files: limitations of the existing Castor system, also problems in Enstore/dCache
• CPU is mostly in the core sites
  – At the moment (most of) the other sites have relatively few CPUs assigned
10th GridPP Meeting – 4 June 2004 - 9
Data challenges – 2

Services:
• LCG-2 services (RB, BDII, CE, SE, etc.) have been extremely reliable and stable
  – Even RLS was stable (other issues)
• BDII has been extremely reliable
  – Provided to the experiments – allowed them to define a view of the system
• Software deployment system works
  – Needs some improvement – especially for sites with no shared filesystem
• Information system
  – Schema does not match batch system functionality
  – Information published (job slots, ETT, etc.) does not reflect the batch system
  – Solve with a CE per VO; need to improve/adapt the schema (?) – see the query sketch below
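The published job-slot and ETT numbers come from the GLUE attributes in the BDII; below is a minimal sketch of how a site or experiment might cross-check them against the local batch system. It assumes a Python environment with the standard ldapsearch client available; the BDII host name is an illustrative assumption and the LDIF parsing is deliberately simplified (it ignores wrapped lines).

# Hedged sketch: query a BDII for the GLUE CE attributes mentioned above
# (free job slots, estimated response time) so they can be compared with what
# the local batch system really reports.
import subprocess

BDII = "ldap://lcg-bdii.cern.ch:2170"      # assumed endpoint; 2170 is the usual BDII port
BASE = "mds-vo-name=local,o=grid"          # conventional GLUE base DN

cmd = [
    "ldapsearch", "-x", "-LLL", "-H", BDII, "-b", BASE,
    "(objectClass=GlueCE)",
    "GlueCEUniqueID", "GlueCEStateFreeCPUs", "GlueCEStateEstimatedResponseTime",
]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# One line per CE: unique ID, free job slots, estimated response time (ETT).
for block in out.split("\n\n"):
    attrs = dict(line.split(": ", 1) for line in block.splitlines() if ": " in line)
    if "GlueCEUniqueID" in attrs:
        print(attrs["GlueCEUniqueID"],
              "free CPUs:", attrs.get("GlueCEStateFreeCPUs", "?"),
              "ETT:", attrs.get("GlueCEStateEstimatedResponseTime", "?"))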
10th GridPP Meeting – 4 June 2004 - 10
RLS issues

• RLS performance was the biggest problem
• Many fixes were made during the challenge:
  – CLI tools based on the C++ API in place of the Java tools
  – Added support for non-SE entries
  – Additional tools (register with an existing GUID)
  – Case sensitivity
  – Performance analysis – usage of metadata queries
  – Lack of bulk operations
  – No support for transactions
• Still an unresolved service performance issue (degradation seen) – seems to be server related
• No data loss or extended service downtime
• Replication tests with CNAF
• Not really tested by CMS
10th GridPP Meeting – 4 June 2004 - 11
RLS – cont.

• Many of the above issues are addressed in the version currently being tested
• Preparing a note describing proposed improvements for discussion, e.g.:
  – Combine the RMC and LRC into a single database, to allow the database to optimise and join (see the sketch below)
  – Resolve issues found in the data challenges
  – Model for replicated/distributed catalogues?
• Is the model of metadata appropriate? Experiment vs POOL vs RLS
• With the DB group, continue to investigate Oracle replication
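To illustrate why merging the two catalogues helps, here is a minimal sketch (not the actual RLS schema; table and column names are invented for illustration): with the GUID-to-LFN mapping (the RMC role) and the GUID-to-PFN mapping (the LRC role) in one database, an LFN-to-replicas lookup becomes a single join instead of two round trips to separate services.

# Hedged sketch of the proposed RMC+LRC merge using an in-memory SQLite database.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE rmc (guid TEXT, lfn TEXT);   -- logical file names (RMC role)
    CREATE TABLE lrc (guid TEXT, pfn TEXT);   -- physical replicas (LRC role)
    INSERT INTO rmc VALUES ('guid-0001', 'lfn:/grid/cms/run123/file.root');
    INSERT INTO lrc VALUES ('guid-0001', 'srm://castorsrm.cern.ch/castor/cern.ch/grid/file.root');
    INSERT INTO lrc VALUES ('guid-0001', 'srm://srm.example-t1.org/data/file.root');
""")

# One join replaces the two-step lookup (LFN -> GUID via the RMC, then GUID -> PFNs via the LRC).
for (pfn,) in db.execute(
        "SELECT lrc.pfn FROM rmc JOIN lrc ON rmc.guid = lrc.guid WHERE rmc.lfn = ?",
        ("lfn:/grid/cms/run123/file.root",)):
    print(pfn)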
10th GridPP Meeting – 4 June 2004 - 12
Evolution: missing features

• A full storage element
  – dCache has had many problems
  – Packaged – to be deployed
  – Is dCache sufficient / the only solution?
  – Demonstrated integration of Tier 1 MSSs
• Full data management tools

• Nice features of SRM gave users a lot of convenience, e.g. automatic directory creation
• We were able to continue improving our setup during DC04:
  – The biggest performance gain: Michael and his team at DESY developed a new module that reduces the delegated proxy's modulus size in SRM and speeds up the interaction between the SRM client and server by a factor of 3.5
  (From the CMS FNAL team, based on work done by the deployment group)
10th GridPP Meeting – 4 June 2004 - 13
Evolution: missing features

Functionally:
• Port to other RH-derived Linux distributions
  – This is now becoming urgent – new hardware, security patches, …
• VOMS
  – At least the basic part
• R-GMA
  – For monitoring
• Replace OpenPBS as the default batch system

Operationally:
• Assumption of real operational management by the GOCs
  – A lot of work on the basics has been done – but we need problem management
• User call centre
  – Lack of take-up
  – Propose that the FZK/GOC team come to CERN for 1–2 days to really sort this out
• Accounting
  – Critical – we have no information about what has been used during the DCs – important for us and for the experiments
• Monitoring
  – Grid: lack of consistency in what is presented for each site
  – Experiments: we must put R-GMA in place (at least)
10th GridPP Meeting – 4 June 2004 - 14
Evolution: Service Challenges

Purpose:
• Understand what it takes to operate a real grid service – run for days/weeks at a time (outside of experiment Data Challenges)
• Trigger/encourage the Tier 1 planning – move towards real resource planning for Phase 2 – based on realistic usage patterns
  – How does a Tier 1 decide what capacity to provide?
  – What planning is needed to achieve that?
  – Where are we in this process?
• Get the essential grid services ramped up to the needed levels – and demonstrate that they work
• Set out the milestones needed to achieve the goals during the service challenges

NB: this is focussed on Tier 0 – Tier 1 / large Tier 2
• Data management, batch production and analysis
• By end 2004 – have in place a robust and reliable data management service and support infrastructure, and robust batch job submission
10th GridPP Meeting – 4 June 2004 - 15
Service challenges – examples

• Data management
  – Networking, file transfer, data management
  – Storage management and interoperability
  – Fully functional storage element (SE)
• Continuous job probes
  – Understand limits
• Operations centres
  – Accounting, assumption of levels of service responsibility, etc.
  – Hand-off of responsibility (RAL – Taipei – US/Canada)
• "Security incident"
  – Detection, incident response, dissemination and resolution
• IP connectivity
  – Milestones to remove the (implementation) need for outbound connections from WNs
• User support
  – Assumption of responsibility, demonstrate staff in place, etc.
• VO management
  – Robust and flexible registration, management interfaces, etc.
• Etc.
10th GridPP Meeting – 4 June 2004 - 16
Data Management – example

Data management builds on a stack of underlying services:
• Network
• Robust file transfer
• Storage interfaces and functionality
• Replica location service
• Data management tools
10th GridPP Meeting – 4 June 2004 - 17
Data management – 2

Network layer:
• Proposed set of network milestones already in draft
  – Network and fabric groups at CERN – collaborate with (initially) the "official" Tier 1s
  – Dedicated private networks for Tier 0 to Tier 1 "online" raw data transfers

File transfer service layer:
• Move a file from A to B, with good performance and reliability
• This service would normally only be visible via the data movement service
  – The only application that can access/schedule/control this network
• Examples of this layer are gridftp, bbftp, etc.
• Reliability – the service must detect failure, retry, etc. (see the retry sketch below)
• Interfaces to storage systems (SRM)
• The US-CMS/CERN "Edge Computing" project might be an instance of this layer (network + file transfer)
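A minimal sketch of the detect-failure-and-retry behaviour expected of this layer is shown below, wrapped around globus-url-copy (the gridftp client). The endpoints, retry count and back-off are illustrative assumptions, not part of the talk.

# Hedged sketch of a retrying file transfer using globus-url-copy.
import subprocess, time

def transfer(src, dst, attempts=3, backoff=30):
    """Copy src -> dst, retrying on failure with a fixed back-off."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(["globus-url-copy", src, dst])
        if result.returncode == 0:
            return True
        print("attempt %d failed (rc=%d), retrying" % (attempt, result.returncode))
        time.sleep(backoff)
    return False

# Example with hypothetical endpoints:
# transfer("gsiftp://castorgrid.cern.ch/castor/cern.ch/grid/file.root",
#          "gsiftp://gridftp.example-t1.org/data/file.root")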
10th GridPP Meeting – 4 June 2004 - 18
Data management – 3

Data movement service layer:
• Builds on top of the file transfer and network layers
• Provides an absolutely reliable and dependable service with good performance
• Implements queuing, priorities, etc.
• Initiates file transfers using the file transfer service
• Acts on the application's behalf – a file handed to the service is guaranteed to arrive
(see the queue sketch below)

Replica Location Service:
• Makes use of data movement
• Should be distributed:
  – Distributed/replicated databases (Oracle) with export/import to XML or other databases?
  – RLI model?
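The queue sketch below illustrates the data movement layer described above: a priority queue of transfer requests that re-queues failures until each file has arrived. Class and function names are illustrative; transfer() is the retrying wrapper sketched on the previous slide.

# Hedged sketch of a prioritised, retrying data movement service.
import heapq

class DataMover:
    def __init__(self):
        self._queue = []   # entries: (priority, sequence, src, dst); lower value = higher priority
        self._seq = 0

    def submit(self, src, dst, priority=10):
        heapq.heappush(self._queue, (priority, self._seq, src, dst))
        self._seq += 1

    def run(self, transfer):
        """Drain the queue; a failed transfer is re-queued so the file
        eventually arrives (the 'guaranteed to arrive' behaviour)."""
        while self._queue:
            priority, _, src, dst = heapq.heappop(self._queue)
            if not transfer(src, dst):
                # Naive retry; a real service would back off and eventually alarm.
                self.submit(src, dst, priority)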
10th GridPP Meeting – 4 June 2004 - 19
Job probes – example

• Continuous flood of jobs
  – Fill all resources
  – Use them as probes – test whether they can use the resources (data access, CPU, etc.)
  – Understand the limitations and bottlenecks of the system (baseline measurement, find limits, build and improve)
  – (See the submission sketch below)
• This might be a function of the GOC
  – Overseen by a RAL–Taipei(+) collaboration?
• A challenge might run for a week
  – Outside of experiment data challenges
  – In parallel with (or as part of) data management or other challenges
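A minimal sketch of such a probe flood is shown below, using the LCG-2 workload management UI (edg-job-submit). The VO name, probe payload, submission rate and job count are illustrative assumptions.

# Hedged sketch of a continuous job probe: write a trivial JDL and keep submitting it.
import subprocess, time

PROBE_JDL = """\
Executable    = "/bin/sh";
Arguments     = "-c 'hostname; date'";
StdOutput     = "probe.out";
StdError      = "probe.err";
OutputSandbox = {"probe.out", "probe.err"};
"""

def submit_probes(n=100, pause=10):
    with open("probe.jdl", "w") as f:
        f.write(PROBE_JDL)
    for _ in range(n):
        # "-o" appends the job id to a file so the results can be harvested later
        subprocess.run(["edg-job-submit", "--vo", "dteam", "-o", "probe-jobs.txt", "probe.jdl"])
        time.sleep(pause)

# submit_probes()   # afterwards, collect results with edg-job-status / edg-job-get-output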
10th GridPP Meeting – 4 June 2004 - 20
Transition to EGEE

Clarify the terms:
• LCG project
• LCG applications
• LCG middleware release
• LCG infrastructure

• EGEE (i.e. LCG) infrastructure
  – The LCG-2 infrastructure IS the EGEE infrastructure
  – Can be used now by other applications
• Expect to run the LCG-2 based infrastructure for about 1 year
  – New middleware has to be better than what this becomes
• EGEE-developed middleware runs on the pre-production service
  – It moves to production when it is more functional/stable/reliable/… than the production middleware and infrastructure
10th GridPP Meeting – 4 June 2004 - 21
Some remarks

• Existing LCG-2 sites already support many VOs
  – Not only LCG
  – Front-line support for all VOs is via the ROCs
• Process to introduce a new VO
  – Well defined
  – Some tools needed to make the mechanics simpler
• Evaluation of new middleware by applications, and preparation for deployment in EGEE-1
  – This is what the pre-production service is for
• Resource allocation/negotiation
  – OMC/ROC managers/NA4 – negotiate with RCs and applications
10th GridPP Meeting – 4 June 2004 - 22
Joining EGEE – overview of the process

• Application nominates a VO manager
• Find a CIC to operate the VO server
• The VO is added to the registration procedure
• Determine the access policy:
  – Proposed discussion body: NA4 + the ROC manager group
    • Which sites will accept to run the application (funding, political constraints)?
    • Is a test VO needed?
• Modify site configurations to allow the VO access (see the hedged example below)
• Negotiate which CICs run the VO-specific services:
  – VO server (see above)
  – RLS service, if required
  – Resource Brokers (some general ones at a CIC, others owned by the application), UIs – general at the CIC/ROC, or on the application's machines, etc.
  – Potentially (if needed) a BDII to define the application's view of resources
• Application software installation
  – Understand the application environment, and how it is installed at sites
• Many of these issues can be negotiated by NA4/SA1 in a short discussion with the new applications community
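As a purely illustrative example of the "modify site configs" step: on an LCG-2 site the grid-mapfile was typically generated by edg-mkgridmap from entries that map a VO's membership server to local pool accounts. The sketch below assumes that mechanism; the VO name, server URL, account prefix and file path are hypothetical, and a real site would follow its own installation guide.

# Hedged sketch: append an entry for a hypothetical new VO to edg-mkgridmap.conf.
CONF = "/opt/edg/etc/edg-mkgridmap.conf"   # common LCG-2 location (assumption)

NEW_VO_ENTRY = (
    "# myexpt VO (hypothetical)\n"
    "group ldap://grid-vo.example.org/ou=lcg1,o=myexpt,dc=eu-datagrid,dc=org .myexpt\n"
)

def allow_vo(conf_path=CONF, entry=NEW_VO_ENTRY):
    """Add the VO entry if it is not already present."""
    with open(conf_path, "r+") as f:
        if "o=myexpt" not in f.read():
            f.write("\n" + entry)

# After editing, the site would re-run edg-mkgridmap and create the matching
# .myexpt pool accounts so that jobs from the new VO can be mapped locally.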
10th GridPP Meeting – 4 June 2004 - 23
Resource Negotiation Policy

• The EGEE infrastructure is intended to support and provide resources to many virtual organisations
  – Initially HEP (the 4 LHC experiments) + biomedical
  – Each RC supports many VOs and several application domains – the situation now for centres in LCG
• Initially we must balance the resources contributed by the application domains against those that they consume
  – Resource centres may have specific allocation policies
    • E.g. due to funding agency attribution by science or by project
  – Expect a level of peer review within application domains to inform the allocation process
• New VOs and resource centres should satisfy minimum requirements
  – Commit to bring a level of additional resources consistent with their requirements
• Requirement on JRA1 to provide mechanisms to implement/enforce quotas, etc.
• Selection of new VOs/RCs via NA4
10th GridPP Meeting – 4 June 2004 - 24
New Resource Centres

• The procedure for new sites to join LCG-2/EGEE is well defined and documented
• Sites can join now
• Coordination for this is via the ROCs
  – Which will support the installations, set-up, and operation
10th GridPP Meeting – 4 June 2004 - 25
Certification, Testing and Release Cycle

[Diagram: middleware from JRA1 development & integration and unit & functional testing (DevTag) enters the certification and testing services – integration, basic functionality tests, the C&T and site test suites, and the certification matrix – producing a release candidate tag and then a certified release tag. Application integration (HEP experiments, biomedical, other TBD application software) and the pre-production service follow; deployment preparation and installation produce the deployment release tag, which SA1 deploys to production as the production tag.]
10th GridPP Meeting – 4 June 2004 - 26
Interoperability

• Several grid infrastructures serve the LHC experiments: LCG-2/EGEE, Grid2003/OSG, NorduGrid, other national grids
• LCG/EGEE has explicit goals to interoperate
  – One of the LCG service challenges
  – Joint projects on storage elements, file catalogues, VO management, etc.
• Most are VDT (or at least Globus) based
  – Grid3 & LCG use the GLUE schema
• The issues are:
  – File catalogues, information schema, etc. at the technical level
  – Policy and semantic issues
10th GridPP Meeting – 4 June 2004 - 27
Deployment – GridPP support

GridPP contributions to deployment have been crucial:
• 5 of the CERN deployment team are funded by PPARC
  – Essential to bringing the current release to such stability and reliability – and it's not that hard to install – 58 sites so far
• Grid Operations Centre at RAL
• Security team – very active
10th GridPP Meeting – 4 June 2004 - 28
Summary

• A huge amount of work has been done in the last year to produce a robust set of middleware
  – These lessons must be applied to new developments
• LCG is being successfully used in the experiment data challenges
  – Many problems found and addressed (tools, bugs, etc.)
  – Other fundamental problems are the subject of development
  – Services are now very reliable
• Plans for service challenges to help move forward
• Must ensure that there are only single developments – coordinate EGEE/LCG/OSG/etc.
  – Push for interoperability at all levels – the experiments have a big role to play in insisting on single solutions
• Emphasis is now on strengthening the operational infrastructure
  – EGEE investment helps here
• PPARC/GridPP support has been essential