RLS Production Services
Transcript of RLS Production Services
RLS Production Services
Maria Girone
PPARC-LCG, CERN
LCG-POOL and IT-DB Physics Services
10th GridPP Meeting, CERN, 3rd June 2004
- What is the RLS
- RLS and POOL
- Service Overview
- Experience in Data Challenges
- Towards a Distributed RLS
- Summary
GridPP Meeting, 3rd June 2004
Maria Girone
Database and Application Services
What is the RLS
• The LCG Replica Location Service (LCG-RLS) is the central Grid File Catalog, responsible for maintaining a consistent list of accessible files (physical and logical names) together with their relevant file metadata attributes
• The RLS (and POOL) refers to files via a unique and immutable file identifier (FileID), generated at creation time
• Stable inter-file reference
[Diagram: a central FileID links multiple logical names (LFN1 … LFNn) and physical names (PFN1 … PFNn), plus file metadata (jobid, owner, …)]
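The FileID-centric model above can be sketched as a small data structure. This is an illustrative sketch only, not the actual RLS API: the class and method names are invented for clarity.

```python
# Sketch of a FileID-centric catalog: one immutable GUID anchors
# all logical names, physical replicas, and metadata of a file.
# Names (FileCatalog, register, add_replica) are hypothetical.
import uuid

class FileCatalog:
    def __init__(self):
        self.lfns = {}      # FileID -> set of logical file names
        self.pfns = {}      # FileID -> set of physical file names
        self.metadata = {}  # FileID -> file metadata (jobid, owner, ...)

    def register(self, lfn, pfn, **meta):
        """Generate the unique, immutable FileID at creation time."""
        file_id = str(uuid.uuid4())
        self.lfns[file_id] = {lfn}
        self.pfns[file_id] = {pfn}
        self.metadata[file_id] = meta
        return file_id

    def add_replica(self, file_id, pfn):
        # New replicas attach to the same FileID, so inter-file
        # references via the FileID remain stable.
        self.pfns[file_id].add(pfn)

cat = FileCatalog()
fid = cat.register("lfn:/cms/dc04/file1", "srm://cern.ch/data/file1",
                   jobid=42, owner="cms")
cat.add_replica(fid, "srm://cnaf.infn.it/data/file1")
```

The point of the sketch is the indirection: renaming an LFN or adding a replica never touches the FileID that other files reference.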
POOL and the LCG-RLS
• POOL is the LCG Persistency Framework
  • See talk from Radovan Chytracek
• The LCG-RLS is one of the three POOL File Catalog implementations
  • XML based local file catalog
  • MySQL based shared catalog
  • RLS based Grid-aware file catalog
• A complete production chain deploys several of these
  • Cascading changes from isolated worker nodes (XML catalog) up to the RLS service
  • DC04 used MySQL catalog at Tier1, RLS at Tier0
• RLS deployment at Tier1 sites
  • See talk from James Casey
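The cascading-catalog chain described above can be sketched as follows. This is a toy model of the idea, not the POOL API: the `Catalog` class and `publish` function are hypothetical names.

```python
# Toy model of the DC04 catalog cascade: a job writes to an isolated
# local XML catalog, whose entries are later published upstream to
# the shared Tier1 catalog and finally to the Tier0 RLS.
class Catalog:
    def __init__(self, name):
        self.name = name
        self.entries = {}   # GUID -> PFN

    def insert(self, guid, pfn):
        self.entries[guid] = pfn

def publish(source, target):
    """Propagate entries from a lower-tier catalog to a higher one."""
    for guid, pfn in source.entries.items():
        target.insert(guid, pfn)

worker_xml = Catalog("XML (worker node)")
tier1_mysql = Catalog("MySQL (Tier1)")
tier0_rls = Catalog("RLS (Tier0)")

worker_xml.insert("guid-001", "file:/scratch/out.root")
publish(worker_xml, tier1_mysql)   # worker node -> Tier1, as in DC04
publish(tier1_mysql, tier0_rls)    # Tier1 -> Tier0 RLS
```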
RLS Service Goals
• RLS is a critical service for the correct operation of the Grid!
• Minimal downtime for both scheduled and unscheduled interruptions
  • Good level of availability at iAS and DB level
• Meet requirements of Data Challenges
  • In terms of performance (look-up / insert rate) and capacity (total number of GUID-PFN mappings and file-level meta-data entries)
  • Currently, the performance is not limited by the service itself
• Prepare for future needs and increase reliability/manageability
RLS Service Overview
• Currently deploys LRC and RMC middleware components from EDG
  • Distributed Replica Location Index not deployed in LCG-2
• For now, a central service deployed at CERN
• RLS uses Oracle Application Server (iAS) and Database (DB)
  • Dedicated farm node (iAS) per VO
  • Shared disk server (DB) for production VOs
• Similar set-up is used for testing and software certification
[Diagram: RLS AppServers (production, certification, test) for ALICE, ATLAS, CMS, LHCb and DTEAM, in front of RLS DBs (production, certification, test) plus a spare]
Handling Interventions
• High level – ‘run like an experiment’:
  • On-call team; primary responsible and backup
  • Documented procedures, training for on-call personnel, daily meetings
  • List of experts to call in case standard actions do not work
  • Planning of interventions
• Most frequent: security patches
• iAS: can transparently switch to new box using DNS alias change
  • Used for both scheduled and unscheduled interruptions
• DB: short interruption to move to ‘stand-by’ DB
• Total up-time achieved: 99.91%
• Looking at standard Oracle solutions for High Availability:
  • iAS clusters and DB clusters
  • Data Guard (for data protection)
Experience in Data Challenges
• The RLS was used for the first time in production during the CMS Data Challenge DC04 (3M PFNs and file metadata stored)
  • ATLAS and LHCb ramping up
• The service was stable throughout DC04
  • Looking up file information by GUID seems sufficiently fast
• Clear problems with respect to the performance of the RLS
  • Partially due to the normal “learning curve” on all sides in using a new system
  • Bulk operations were missing in the deployed RLS version
  • Also, cross-catalog queries are not efficient by RLS design
• Several solutions produced ‘in flight’
  • EDG-based tools, POOL workarounds
• Support for bulk operations now addressed by IT-GD (in edg-rls v2.2.7); POOL will support it in the next release (POOL V1.7)
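A back-of-the-envelope model shows why missing bulk operations hurt at DC04 scale. The 50 ms round-trip latency and the batch size below are illustrative assumptions, not measured RLS figures.

```python
# Why bulk operations matter: with per-file inserts, one client-server
# round trip per file dominates total catalog insert time.
RTT = 0.05  # assumed 50 ms round trip (illustrative only)

def insert_one_by_one(n_files, rtt=RTT):
    return n_files * rtt            # one round trip per file

def insert_bulk(n_files, batch=1000, rtt=RTT):
    batches = -(-n_files // batch)  # ceiling division
    return batches * rtt            # one round trip per batch

n = 3_000_000  # DC04 registered ~3M PFNs
print(f"per-file inserts: {insert_one_by_one(n):,.0f} s")
print(f"bulk inserts:     {insert_bulk(n):,.0f} s")
```

Under these assumptions the per-file approach takes roughly a thousand times longer than batched inserts, which is why bulk support in edg-rls and POOL was the key fix.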
Towards a Distributed RLS
• RLS in LCG-2 still lacks consistent replication between multiple catalog servers
• EDG RLI component has not been deployed as part of LCG
• Central single catalog expected to result in scalability and availability problems
• Joint evaluation with CMS of Oracle asynchronous database replication as part of DC04 (in parallel to production)
  • Tested a minimal (two-node) multi-master system between CERN and CNAF
  • Catalog inserts/updates propagated in both directions
• First results
  • RLS application could be deployed with only minor changes
  • No stability or performance problems observed so far
  • Network problems and temporary server unavailability were handled gracefully
  • Unfortunately, the setup could not be tested in full production mode in DC04 due to lack of time/resources
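The two-node multi-master setup can be modelled in a few lines. This is a toy model of asynchronous multi-master replication in general, not of Oracle's implementation; all names are invented for illustration.

```python
# Toy model of two-node multi-master replication: each site applies
# inserts locally and queues them for the peer, so both catalogs
# converge once queued changes have been propagated asynchronously.
class CatalogNode:
    def __init__(self, name):
        self.name = name
        self.entries = {}   # GUID -> PFN
        self.outbox = []    # local changes awaiting propagation

    def insert(self, guid, pfn):
        self.entries[guid] = pfn
        self.outbox.append((guid, pfn))

def propagate(src, dst):
    """Apply src's queued changes on dst (replicated changes are
    applied directly, so they are not re-queued and echoed back)."""
    for guid, pfn in src.outbox:
        dst.entries[guid] = pfn
    src.outbox.clear()

cern, cnaf = CatalogNode("CERN"), CatalogNode("CNAF")
cern.insert("guid-A", "srm://cern.ch/a")
cnaf.insert("guid-B", "srm://cnaf.infn.it/b")
propagate(cern, cnaf)   # inserts flow in both directions,
propagate(cnaf, cern)   # as in the CERN-CNAF test
```

Note that a real system must also resolve conflicting concurrent updates to the same GUID, which is exactly the open issue flagged for the next-generation RLS.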
Next Generation RLS
• The LCG Grid Deployment group is currently working with the experiments to gather requirements for the next generation RLS
  • Taking into account the experience from DC04
• Build on DC04 work: move to replicated rather than distributed catalogs?
• Still need to prove
  • Stability and performance with production access patterns
  • Scaling to a sufficient number of replicas (4-6 Tier1 sites?)
  • Automated resolution of catalog conflicts that may arise as a consequence of asynchronous replication
• Propose to continue evaluation, possibly using Oracle Streams
  • In the context of the Distributed Database Deployment activity, in the LCG deployment area
Summary
• The Replica Location Service is a central part of the LCG infrastructure
• Strong requirements in terms of reliability of the service
  • Significant contribution from GridPP funded people
• The LCG-RLS middleware and service have passed their first production test
  • Good service stability was achieved
• Experience in the Data Challenges has proven essential for improving the performance and scalability of the RLS middleware
• Oracle replication tests are expected to provide important input to the definition of a replicated RLS and the handling of distributed metadata in general
The RLS Supported Configuration
• A “Local Replica Catalogue” (LRC)
  • Contains GUID <-> PFN mapping for all local files
• A “Replica Metadata Catalogue” (RMC)
  • Contains GUID <-> LFN mapping for all local files and all file metadata information
• A “Replica Location Index” (RLI) <-- Not deployed in LCG-2
  • Allows files at other sites to be found
• All LRCs are configured to publish to all remote RLIs
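The three-catalog split can be sketched as a lookup path. The component names follow the slide; the code itself is an illustrative sketch, not the EDG interfaces, and `Site`, `lookup_replicas` and the index layout are invented for clarity.

```python
# Sketch of the LRC/RMC/RLI split: the RMC resolves an LFN to a GUID
# (plus metadata), the local LRC resolves the GUID to local PFNs, and
# an RLI would index which remote sites hold a GUID (not in LCG-2).
class Site:
    def __init__(self, name):
        self.name = name
        self.lrc = {}        # GUID -> [PFN, ...]  (Local Replica Catalogue)
        self.rmc_lfn = {}    # LFN  -> GUID        (Replica Metadata Catalogue)
        self.rmc_meta = {}   # GUID -> file metadata

def lookup_replicas(lfn, local, rli_index):
    """Resolve an LFN to physical replicas, locally or via the RLI."""
    guid = local.rmc_lfn.get(lfn)
    if guid and guid in local.lrc:
        return local.lrc[guid]
    # rli_index: GUID -> sites that published the GUID to the RLI
    for site in rli_index.get(guid, []):
        if guid in site.lrc:
            return site.lrc[guid]
    return []

cern = Site("CERN")
cern.rmc_lfn["lfn:/cms/file1"] = "guid-1"
cern.lrc["guid-1"] = ["srm://cern.ch/f1"]
print(lookup_replicas("lfn:/cms/file1", cern, {}))
```

The cross-catalog join (LFN via RMC, then PFN via LRC) is visible even in this toy version, which hints at why cross-catalog queries were inefficient in the deployed RLS.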