10th GridPP Meeting – 4 June 2004 - 1
LCG Deployment

Ian Bird
IT Department, CERN

10th GridPP Meeting, CERN
4th June 2004
10th GridPP Meeting – 4 June 2004 - 2
Overview

• Deployment area organisation
• Some history – where we are now
• Data challenges – experiences
• Evolution – service challenges
• Transition to EGEE
• Interoperability
• Summary
10th GridPP Meeting – 4 June 2004 - 3
LCG Deployment Organisation and Collaborations

[Organisation diagram: the LCG Deployment Area, led by the Deployment Area Manager and the Grid Deployment Board, comprises the Certification Team, Deployment Team, Experiment Integration Team, Testing group, Security group, Storage group and GDB task forces, together with Operations Centres (RAL) and Call Centres (FZK). The GDB advises, informs and sets policy; the LHC experiments and Regional Centres set requirements and participate; collaborative activities link the area to the JTB, HEPiX, GGF and grid projects such as EDG, Trillium and Grid3/OSG.]
10th GridPP Meeting – 4 June 2004 - 4
Communication

• Weekly GDA meetings (Monday 14:00, VRVS, phone)
  – Mail list – [email protected]
  – Open to all – need experiments, regional centres, etc.
  – Technical discussions, understand what the priorities are
  – Policy issues referred back to PEB or GDB
  – Experience so far:
    • Experiments join, regional centres don't
    • NEED participation of system managers and admins – we need a rounded view of the issues
• Weekly core site phone conference
  – Address specific issues with deployment
• Also at CERN: weekly DC coordination meetings with each experiment
• GDB meetings monthly
  – Make sure your GDB rep keeps you informed
• Open to ways to improve communication!
10th GridPP Meeting – 4 June 2004 - 5
Some history – 2003/2004

Recall goals:
• July: introduce the initial publicly available LCG-1 global grid service
• November: expanded LCG-1 service with resources and functionality sufficient for the 2004 Computing Data Challenges

• LCG-0: pilot service was deployed in Feb/March
  – Was used by CMS in Italy very successfully for productions
• LCG-1: based on VDT & EDG 2.0, was deployed in September
  – Not heavily used by experiments – but was successfully used by CMS for production over Christmas, and (US-)ATLAS demonstrated interoperability with Grid2003
  – Lacked a real (managed) SE and integration with MSS
• LCG-2: based on VDT & EDG 2.1, was ready by end 2003
  – Data management tools integrated with SRM; intended to package dCache as a managed disk SE
  – Deployed in Jan/Feb 2004 – many updates – used by the experiments in the 2004 data challenges
10th GridPP Meeting – 4 June 2004 - 6
Sites in LCG-2/EGEE-0: June 4 2004

Austria: U-Innsbruck
Canada: Triumf, Alberta, Carleton, Montreal, Toronto
Czech Republic: Prague-FZU, Prague-CESNET
France: CC-IN2P3, Clermont-Ferrand
Germany: FZK, Aachen, DESY, Wuppertal
Greece: HellasGrid
Hungary: Budapest
India: TIFR
Israel: Tel-Aviv, Weizmann
Italy: CNAF, Frascati, Legnaro, Milano, Napoli, Roma, Torino
Japan: Tokyo
Netherlands: NIKHEF
Pakistan: NCP
Poland: Krakow
Portugal: LIP
Russia: SINP-Moscow, JINR-Dubna
Spain: PIC, UAM, USC, UB-Barcelona, IFCA, CIEMAT, IFIC
Switzerland: CERN, CSCS
Taiwan: ASCC, NCU
UK: RAL, Birmingham, Cavendish, Glasgow, Imperial, Lancaster, Manchester, QMUL, RAL-PP, Sheffield, UCL
US: BNL, FNAL
HP: Puerto-Rico

• 22 countries, 58 sites (45 Europe, 2 US, 5 Canada, 5 Asia, 1 HP)
• Coming: New Zealand, China, other HP (Brazil, Singapore)
• 3800 CPUs
10th GridPP Meeting – 4 June 2004 - 7
Experience: Data challenges

• ALICE has been running since March
• CMS DC04
• LHCb now starting seriously
• ATLAS starting now
See talks from June 2
10th GridPP Meeting – 4 June 2004 - 8
Data challenges – so far

Resources:
• CPU available – ALICE could not fully utilise it because of storage limitations
• Disk available – mostly very small amounts
  – Need: plan space vs CPU at a site; ensure that commitments are provided
  – To some extent not requested – delay in the dCache SE – sites were asked not to commit everything to classic SEs, as migration was expected/worried about
• ALICE and CMS:
  – Number and (small) size of files: limitations of the existing Castor system, also problems in Enstore/dCache
• CPU is mostly in the core sites
  – At the moment (most of) the other sites have relatively few CPUs assigned
10th GridPP Meeting – 4 June 2004 - 9
Data challenges – 2

Services:
• LCG-2 services (RB, BDII, CE, SE, etc.) have been extremely reliable and stable
  – Even RLS was stable (other issues)
• BDII has been extremely reliable
  – Provided to the experiments – allowed them to define a view of the system
• Software deployment system works
  – Needs some improvement – especially for sites with no shared filesystem
• Information system
  – Schema does not match batch system functionality
  – Information published (job slots, ETT, etc.) does not reflect the batch system
  – Solve with a CE per VO; need to improve/adapt the schema (?) – see the query sketch below
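The published job-slot and ETT numbers come from the GLUE attributes in the BDII; below is a minimal sketch of how a site or experiment might cross-check them against the local batch system. It assumes a Python environment with the standard ldapsearch client available; the BDII host name is an illustrative assumption and the LDIF parsing is deliberately simplified (it ignores wrapped lines).

# Hedged sketch: query a BDII for the GLUE CE attributes mentioned above
# (free job slots, estimated response time) so they can be compared with what
# the local batch system really reports.
import subprocess

BDII = "ldap://lcg-bdii.cern.ch:2170"      # assumed endpoint; 2170 is the usual BDII port
BASE = "mds-vo-name=local,o=grid"          # conventional GLUE base DN

cmd = [
    "ldapsearch", "-x", "-LLL", "-H", BDII, "-b", BASE,
    "(objectClass=GlueCE)",
    "GlueCEUniqueID", "GlueCEStateFreeCPUs", "GlueCEStateEstimatedResponseTime",
]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# One line per CE: unique ID, free job slots, estimated response time (ETT).
for block in out.split("\n\n"):
    attrs = dict(line.split(": ", 1) for line in block.splitlines() if ": " in line)
    if "GlueCEUniqueID" in attrs:
        print(attrs["GlueCEUniqueID"],
              "free CPUs:", attrs.get("GlueCEStateFreeCPUs", "?"),
              "ETT:", attrs.get("GlueCEStateEstimatedResponseTime", "?"))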
10th GridPP Meeting – 4 June 2004 - 10
RLS issues

• RLS performance was the biggest problem
• Many fixes were made during the challenge:
  – CLI tools based on the C++ API in place of the Java tools
  – Added support for non-SE entries
  – Additional tools (register with an existing GUID)
  – Case sensitivity
  – Performance analysis – usage of metadata queries
  – Lack of bulk operations
  – No support for transactions
• Still an unresolved service performance issue (degradation seen) – seems to be server related
• No data loss or extended service downtime
• Replication tests with CNAF
• Not really tested by CMS
10th GridPP Meeting – 4 June 2004 - 11
RLS – cont.

• Many of the above issues are addressed in the version currently being tested
• Preparing a note describing proposed improvements for discussion, e.g.:
  – Combine the RMC and LRC into a single database, to allow the database to optimise and join (see the sketch below)
  – Resolve issues found in the data challenges
  – Model for replicated/distributed catalogues?
• Is the model of metadata appropriate? Experiment vs POOL vs RLS
• With the DB group, continue to investigate Oracle replication
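To illustrate why merging the two catalogues helps, here is a minimal sketch (not the actual RLS schema; table and column names are invented for illustration): with the GUID-to-LFN mapping (the RMC role) and the GUID-to-PFN mapping (the LRC role) in one database, an LFN-to-replicas lookup becomes a single join instead of two round trips to separate services.

# Hedged sketch of the proposed RMC+LRC merge using an in-memory SQLite database.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE rmc (guid TEXT, lfn TEXT);   -- logical file names (RMC role)
    CREATE TABLE lrc (guid TEXT, pfn TEXT);   -- physical replicas (LRC role)
    INSERT INTO rmc VALUES ('guid-0001', 'lfn:/grid/cms/run123/file.root');
    INSERT INTO lrc VALUES ('guid-0001', 'srm://castorsrm.cern.ch/castor/cern.ch/grid/file.root');
    INSERT INTO lrc VALUES ('guid-0001', 'srm://srm.example-t1.org/data/file.root');
""")

# One join replaces the two-step lookup (LFN -> GUID via the RMC, then GUID -> PFNs via the LRC).
for (pfn,) in db.execute(
        "SELECT lrc.pfn FROM rmc JOIN lrc ON rmc.guid = lrc.guid WHERE rmc.lfn = ?",
        ("lfn:/grid/cms/run123/file.root",)):
    print(pfn)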
10th GridPP Meeting – 4 June 2004 - 12
Evolution: missing features

• A full storage element
  – dCache has had many problems
  – Packaged – to be deployed
  – Is dCache sufficient / the only solution?
  – Demonstrated integration of Tier 1 MSSs
• Full data management tools

• Nice features of SRM gave users a lot of convenience, e.g. automatic directory creation
• We were able to continue improving our setup during DC04:
  – The biggest performance gain: Michael and his team at DESY developed a new module that reduces the delegated proxy's modulus size in SRM and speeds up the interaction between the SRM client and server by a factor of 3.5
  (From the CMS FNAL team, based on work done by the deployment group)
10th GridPP Meeting – 4 June 2004 - 13
Evolution: missing features

Functionally:
• Port to other RH-derived Linux distributions
  – This is now becoming urgent – new hardware, security patches, …
• VOMS
  – At least the basic part
• R-GMA
  – For monitoring
• Replace OpenPBS as the default batch system

Operationally:
• Assumption of real operational management by the GOCs
  – A lot of work on the basics has been done – but we need problem management
• User call centre
  – Lack of take-up
  – Propose that the FZK/GOC team come to CERN for 1–2 days to really sort this out
• Accounting
  – Critical – we have no information about what has been used during the DCs – important for us and for the experiments
• Monitoring
  – Grid: lack of consistency in what is presented for each site
  – Experiments: we must put R-GMA in place (at least)
10th GridPP Meeting – 4 June 2004 - 14
Evolution: Service Challenges

Purpose:
• Understand what it takes to operate a real grid service – run for days/weeks at a time (outside of experiment Data Challenges)
• Trigger/encourage the Tier 1 planning – move towards real resource planning for Phase 2 – based on realistic usage patterns
  – How does a Tier 1 decide what capacity to provide?
  – What planning is needed to achieve that?
  – Where are we in this process?
• Get the essential grid services ramped up to the needed levels – and demonstrate that they work
• Set out the milestones needed to achieve the goals during the service challenges

NB: this is focussed on Tier 0 – Tier 1 / large Tier 2
• Data management, batch production and analysis
• By end 2004 – have in place a robust and reliable data management service and support infrastructure, and robust batch job submission
10th GridPP Meeting – 4 June 2004 - 15
Service challenges – examples

• Data management
  – Networking, file transfer, data management
  – Storage management and interoperability
  – Fully functional storage element (SE)
• Continuous job probes
  – Understand limits
• Operations centres
  – Accounting, assumption of levels of service responsibility, etc.
  – Hand-off of responsibility (RAL – Taipei – US/Canada)
• "Security incident"
  – Detection, incident response, dissemination and resolution
• IP connectivity
  – Milestones to remove the (implementation) need for outbound connections from WNs
• User support
  – Assumption of responsibility, demonstrate staff in place, etc.
• VO management
  – Robust and flexible registration, management interfaces, etc.
• Etc.
10th GridPP Meeting – 4 June 2004 - 16
Data Management – example

Data management builds on a stack of underlying services:
• Network
• Robust file transfer
• Storage interfaces and functionality
• Replica location service
• Data management tools
10th GridPP Meeting – 4 June 2004 - 17
Data management – 2

Network layer:
• Proposed set of network milestones already in draft
  – Network and fabric groups at CERN – collaborate with (initially) the "official" Tier 1s
  – Dedicated private networks for Tier 0 to Tier 1 "online" raw data transfers

File transfer service layer:
• Move a file from A to B, with good performance and reliability
• This service would normally only be visible via the data movement service
  – The only application that can access/schedule/control this network
• Examples of this layer are gridftp, bbftp, etc.
• Reliability – the service must detect failure, retry, etc. (see the retry sketch below)
• Interfaces to storage systems (SRM)
• The US-CMS/CERN "Edge Computing" project might be an instance of this layer (network + file transfer)
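A minimal sketch of the detect-failure-and-retry behaviour expected of this layer is shown below, wrapped around globus-url-copy (the gridftp client). The endpoints, retry count and back-off are illustrative assumptions, not part of the talk.

# Hedged sketch of a retrying file transfer using globus-url-copy.
import subprocess, time

def transfer(src, dst, attempts=3, backoff=30):
    """Copy src -> dst, retrying on failure with a fixed back-off."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(["globus-url-copy", src, dst])
        if result.returncode == 0:
            return True
        print("attempt %d failed (rc=%d), retrying" % (attempt, result.returncode))
        time.sleep(backoff)
    return False

# Example with hypothetical endpoints:
# transfer("gsiftp://castorgrid.cern.ch/castor/cern.ch/grid/file.root",
#          "gsiftp://gridftp.example-t1.org/data/file.root")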
10th GridPP Meeting – 4 June 2004 - 18
Data management – 3

Data movement service layer:
• Builds on top of the file transfer and network layers
• Provides an absolutely reliable and dependable service with good performance
• Implements queuing, priorities, etc.
• Initiates file transfers using the file transfer service
• Acts on the application's behalf – a file handed to the service is guaranteed to arrive
(see the queue sketch below)

Replica Location Service:
• Makes use of data movement
• Should be distributed:
  – Distributed/replicated databases (Oracle) with export/import to XML or other databases?
  – RLI model?
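The queue sketch below illustrates the data movement layer described above: a priority queue of transfer requests that re-queues failures until each file has arrived. Class and function names are illustrative; transfer() is the retrying wrapper sketched on the previous slide.

# Hedged sketch of a prioritised, retrying data movement service.
import heapq

class DataMover:
    def __init__(self):
        self._queue = []   # entries: (priority, sequence, src, dst); lower value = higher priority
        self._seq = 0

    def submit(self, src, dst, priority=10):
        heapq.heappush(self._queue, (priority, self._seq, src, dst))
        self._seq += 1

    def run(self, transfer):
        """Drain the queue; a failed transfer is re-queued so the file
        eventually arrives (the 'guaranteed to arrive' behaviour)."""
        while self._queue:
            priority, _, src, dst = heapq.heappop(self._queue)
            if not transfer(src, dst):
                # Naive retry; a real service would back off and eventually alarm.
                self.submit(src, dst, priority)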
10th GridPP Meeting – 4 June 2004 - 19
Job probes – example

• Continuous flood of jobs
  – Fill all resources
  – Use them as probes – test whether they can use the resources (data access, CPU, etc.)
  – Understand the limitations and bottlenecks of the system (baseline measurement, find limits, build and improve)
  – (See the submission sketch below)
• This might be a function of the GOC
  – Overseen by a RAL–Taipei(+) collaboration?
• A challenge might run for a week
  – Outside of experiment data challenges
  – In parallel with (or as part of) data management or other challenges
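A minimal sketch of such a probe flood is shown below, using the LCG-2 workload management UI (edg-job-submit). The VO name, probe payload, submission rate and job count are illustrative assumptions.

# Hedged sketch of a continuous job probe: write a trivial JDL and keep submitting it.
import subprocess, time

PROBE_JDL = """\
Executable    = "/bin/sh";
Arguments     = "-c 'hostname; date'";
StdOutput     = "probe.out";
StdError      = "probe.err";
OutputSandbox = {"probe.out", "probe.err"};
"""

def submit_probes(n=100, pause=10):
    with open("probe.jdl", "w") as f:
        f.write(PROBE_JDL)
    for _ in range(n):
        # "-o" appends the job id to a file so the results can be harvested later
        subprocess.run(["edg-job-submit", "--vo", "dteam", "-o", "probe-jobs.txt", "probe.jdl"])
        time.sleep(pause)

# submit_probes()   # afterwards, collect results with edg-job-status / edg-job-get-output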
10th GridPP Meeting – 4 June 2004 - 20
Transition to EGEE

Clarify the terms:
• LCG project
• LCG applications
• LCG middleware release
• LCG infrastructure

• EGEE (i.e. LCG) infrastructure
  – The LCG-2 infrastructure IS the EGEE infrastructure
  – Can be used now by other applications
• Expect to run the LCG-2 based infrastructure for about 1 year
  – New middleware has to be better than what this becomes
• EGEE-developed middleware runs on the pre-production service
  – It moves to production when it is more functional/stable/reliable/… than the production middleware and infrastructure
10th GridPP Meeting – 4 June 2004 - 21
Some remarks

• Existing LCG-2 sites already support many VOs
  – Not only LCG
  – Front-line support for all VOs is via the ROCs
• Process to introduce a new VO
  – Well defined
  – Some tools needed to make the mechanics simpler
• Evaluation of new middleware by applications, and preparation for deployment in EGEE-1
  – This is what the pre-production service is for
• Resource allocation/negotiation
  – OMC/ROC managers/NA4 – negotiate with RCs and applications
10th GridPP Meeting – 4 June 2004 - 22
Joining EGEE – overview of the process

• Application nominates a VO manager
• Find a CIC to operate the VO server
• The VO is added to the registration procedure
• Determine the access policy:
  – Proposed discussion body: NA4 + the ROC manager group
    • Which sites will accept to run the application (funding, political constraints)?
    • Is a test VO needed?
• Modify site configurations to allow the VO access (see the hedged example below)
• Negotiate which CICs run the VO-specific services:
  – VO server (see above)
  – RLS service, if required
  – Resource Brokers (some general ones at a CIC, others owned by the application), UIs – general at the CIC/ROC, or on the application's machines, etc.
  – Potentially (if needed) a BDII to define the application's view of resources
• Application software installation
  – Understand the application environment, and how it is installed at sites
• Many of these issues can be negotiated by NA4/SA1 in a short discussion with the new applications community
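As a purely illustrative example of the "modify site configs" step: on an LCG-2 site the grid-mapfile was typically generated by edg-mkgridmap from entries that map a VO's membership server to local pool accounts. The sketch below assumes that mechanism; the VO name, server URL, account prefix and file path are hypothetical, and a real site would follow its own installation guide.

# Hedged sketch: append an entry for a hypothetical new VO to edg-mkgridmap.conf.
CONF = "/opt/edg/etc/edg-mkgridmap.conf"   # common LCG-2 location (assumption)

NEW_VO_ENTRY = (
    "# myexpt VO (hypothetical)\n"
    "group ldap://grid-vo.example.org/ou=lcg1,o=myexpt,dc=eu-datagrid,dc=org .myexpt\n"
)

def allow_vo(conf_path=CONF, entry=NEW_VO_ENTRY):
    """Add the VO entry if it is not already present."""
    with open(conf_path, "r+") as f:
        if "o=myexpt" not in f.read():
            f.write("\n" + entry)

# After editing, the site would re-run edg-mkgridmap and create the matching
# .myexpt pool accounts so that jobs from the new VO can be mapped locally.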
10th GridPP Meeting – 4 June 2004 - 23
Resource Negotiation Policy

• The EGEE infrastructure is intended to support and provide resources to many virtual organisations
  – Initially HEP (the 4 LHC experiments) + biomedical
  – Each RC supports many VOs and several application domains – the situation now for centres in LCG
• Initially we must balance the resources contributed by the application domains against those that they consume
  – Resource centres may have specific allocation policies
    • E.g. due to funding agency attribution by science or by project
  – Expect a level of peer review within application domains to inform the allocation process
• New VOs and resource centres should satisfy minimum requirements
  – Commit to bring a level of additional resources consistent with their requirements
• Requirement on JRA1 to provide mechanisms to implement/enforce quotas, etc.
• Selection of new VOs/RCs via NA4
10th GridPP Meeting – 4 June 2004 - 24
New Resource Centres

• The procedure for new sites to join LCG-2/EGEE is well defined and documented
• Sites can join now
• Coordination for this is via the ROCs
  – Which will support the installations, set-up, and operation
10th GridPP Meeting – 4 June 2004 - 25
Certification, Testing and Release Cycle

[Diagram: middleware from JRA1 development & integration and unit & functional testing (DevTag) enters the certification and testing services – integration, basic functionality tests, the C&T and site test suites, and the certification matrix – producing a release candidate tag and then a certified release tag. Application integration (HEP experiments, biomedical, other TBD application software) and the pre-production service follow; deployment preparation and installation produce the deployment release tag, which SA1 deploys to production as the production tag.]
10th GridPP Meeting – 4 June 2004 - 26
Interoperability

• Several grid infrastructures serve the LHC experiments: LCG-2/EGEE, Grid2003/OSG, NorduGrid, other national grids
• LCG/EGEE has explicit goals to interoperate
  – One of the LCG service challenges
  – Joint projects on storage elements, file catalogues, VO management, etc.
• Most are VDT (or at least Globus) based
  – Grid3 & LCG use the GLUE schema
• The issues are:
  – File catalogues, information schema, etc. at the technical level
  – Policy and semantic issues
10th GridPP Meeting – 4 June 2004 - 27
Deployment – GridPP support

GridPP contributions to deployment have been crucial:
• 5 of the CERN deployment team are funded by PPARC
  – Essential to bringing the current release to such stability and reliability – and it's not that hard to install – 58 sites so far
• Grid Operations Centre at RAL
• Security team – very active
10th GridPP Meeting – 4 June 2004 - 28
Summary

• A huge amount of work has been done in the last year to produce a robust set of middleware
  – These lessons must be applied to new developments
• LCG is being successfully used in the experiment data challenges
  – Many problems found and addressed (tools, bugs, etc.)
  – Other fundamental problems are the subject of development
  – Services are now very reliable
• Plans for service challenges to help move forward
• Must ensure that there are only single developments – coordinate EGEE/LCG/OSG/etc.
  – Push for interoperability at all levels – the experiments have a big role to play in insisting on single solutions
• Emphasis is now on strengthening the operational infrastructure
  – EGEE investment helps here
• PPARC/GridPP support has been essential