RLS Production Services


Transcript of RLS Production Services

Page 1: RLS Production Services

RLS Production ServicesMaria Girone

PPARC-LCG, CERNLCG-POOL and IT-DB Physics Services

10th GridPP Meeting, CERN, 3rd June 2004

- What is the RLS

- RLS and POOL

- Service Overview

- Experience in Data Challenges

- Towards a Distributed RLS

- Summary

Page 2: RLS Production Services

GridPP Meeting, 3rd June 2004

Maria Girone

Database and Application Services

What is the RLS

• The LCG Replica Location Service (LCG-RLS) is the central Grid File Catalog, responsible for maintaining a consistent list of accessible files (physical and logical names) together with their relevant file metadata attributes

• The RLS (and POOL) refers to files via a unique and immutable file identifier (FileID), generated at creation time

• Stable inter-file reference

[Diagram: a FileID maps to logical file names LFN1 … LFNn and physical file names PFN1 … PFNn, together with file metadata (jobid, owner, …)]
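The FileID-centred data model described above can be sketched in a few lines. This is a hypothetical in-memory illustration, not the RLS API: the class and method names are invented, but the structure mirrors the slide, with one immutable FileID anchoring any number of LFNs, PFNs, and metadata attributes.

```python
# Illustrative sketch of the RLS data model: a stable FileID (GUID) maps to
# logical file names (LFNs), physical file names (PFNs), and file metadata.
# Names are invented for illustration; this is not the RLS client API.

import uuid


class ReplicaCatalog:
    def __init__(self):
        self.lfns = {}      # FileID -> set of logical file names
        self.pfns = {}      # FileID -> set of physical replica locations
        self.metadata = {}  # FileID -> {attribute: value}

    def register_file(self, lfn, pfn, **attrs):
        """Create a new entry; the FileID is generated once, at creation time."""
        file_id = str(uuid.uuid4())
        self.lfns[file_id] = {lfn}
        self.pfns[file_id] = {pfn}
        self.metadata[file_id] = dict(attrs)
        return file_id

    def add_replica(self, file_id, pfn):
        self.pfns[file_id].add(pfn)

    def lookup(self, file_id):
        """Look up by the immutable FileID: inter-file references stay stable
        even as LFNs and PFNs are added or renamed later."""
        return sorted(self.lfns[file_id]), sorted(self.pfns[file_id])


catalog = ReplicaCatalog()
fid = catalog.register_file("lfn:/cms/dc04/evt001", "srm://cern.ch/data/evt001",
                            jobid=42, owner="cms-prod")
catalog.add_replica(fid, "srm://cnaf.it/data/evt001")
lfns, pfns = catalog.lookup(fid)
```

Because callers hold the FileID rather than a name, adding the second replica above does not invalidate any existing reference to the file.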

Page 3: RLS Production Services


POOL and the LCG-RLS

• POOL is the LCG Persistency Framework
  - See talk from Radovan Chytracek

• The LCG-RLS is one of the three POOL File Catalog implementations:
  - XML-based local file catalog
  - MySQL-based shared catalog
  - RLS-based Grid-aware file catalog

• A complete production chain deploys several of these
  - Cascading changes from isolated worker nodes (XML catalog) up to the RLS service
  - DC04 used the MySQL catalog at Tier1 and the RLS at Tier0

• RLS deployment at Tier1 sites
  - See talk from James Casey
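The cascading chain of catalogs can be sketched as follows. This is an illustration of the publication flow only, under invented names; it is not POOL's actual catalog interface.

```python
# Illustrative sketch of the cascading catalog chain: entries produced in an
# isolated worker-node XML catalog are published upward to a site-level
# catalog (MySQL at Tier1 in DC04) and on to the central RLS at Tier0.
# Class and method names are invented for illustration.

class FileCatalog:
    def __init__(self, name):
        self.name = name
        self.entries = {}  # GUID -> (LFN, PFN)

    def register(self, guid, lfn, pfn):
        self.entries[guid] = (lfn, pfn)

    def publish_to(self, upstream):
        """Cascade all local entries to the next catalog in the chain."""
        for guid, (lfn, pfn) in self.entries.items():
            upstream.register(guid, lfn, pfn)


xml_catalog = FileCatalog("worker-node XML")
mysql_catalog = FileCatalog("Tier1 MySQL")
rls_catalog = FileCatalog("Tier0 RLS")

xml_catalog.register("guid-001", "lfn:/dc04/file1", "pfn:/t1/file1")
xml_catalog.publish_to(mysql_catalog)   # worker node -> site catalog
mysql_catalog.publish_to(rls_catalog)   # site catalog -> central RLS
```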

Page 4: RLS Production Services


RLS Service Goals

• RLS is a critical service for the correct operation of the Grid!

• Minimal downtime for both scheduled and unscheduled interruptions
  - Good level of availability at iAS and DB level

• Meet the requirements of the Data Challenges
  - In terms of performance (look-up / insert rate) and capacity (total number of GUID-PFN mappings and file-level metadata entries)

• Currently, performance is not limited by the service itself

• Prepare for future needs and increase reliability and manageability

Page 5: RLS Production Services


RLS Service Overview

• Currently deploys the LRC and RMC middleware components from EDG
  - The Distributed Replica Location Index is not deployed in LCG-2

• For now, a central service deployed at CERN

• RLS uses Oracle Application Server (iAS) and Database (DB)

• Dedicated farm node (iAS) per VO

• Shared disk server (DB) for production VOs

• A similar set-up is used for testing and software certification

[Diagram: per-VO production RLS application servers (ALICE, ATLAS, CMS, LHCb, DTEAM, plus a spare) backed by a shared production RLS DB, with separate RLS application servers and DBs for certification and test]

Page 6: RLS Production Services


Handling Interventions

• High level – 'run like an experiment':
  - On-call team; primary responsible and backup
  - Documented procedures, training for on-call personnel, daily meetings
  - List of experts to call in case standard actions do not work
  - Planning of interventions

• Most frequent intervention: security patches

• iAS: can transparently switch to a new box using a DNS alias change
  - Used for both scheduled and unscheduled interruptions

• DB: short interruption to move to a 'stand-by' DB

• Total up-time achieved: 99.91%

• Looking at standard Oracle solutions for High Availability:
  - iAS clusters and DB clusters
  - Data Guard (for data protection)

Page 7: RLS Production Services


Experience in Data Challenges

• The RLS was used for the first time in production during the CMS Data Challenge DC04 (3M PFNs and file metadata stored)
  - ATLAS and LHCb are ramping up

• The service was stable throughout DC04
  - Looking up file information by GUID seems sufficiently fast

• Clear problems with respect to the performance of the RLS
  - Partially due to the normal "learning curve" on all sides in using a new system
  - Bulk operations were missing in the deployed RLS version
  - Cross-catalog queries are also inefficient by RLS design

• Several solutions produced 'in flight'
  - EDG-based tools, POOL workarounds

• Support for bulk operations is now addressed by IT-GD (in edg-rls v2.2.7); POOL will support it in the next release (POOL V1.7)
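Why bulk operations matter can be made concrete with a toy round-trip count: registering N entries one at a time costs N client-server round trips, while a bulk call costs one per batch. The interface below is invented for illustration; it is not the edg-rls or POOL API.

```python
# Toy model of per-entry vs. bulk catalog inserts, counting round trips.
# The server interface is hypothetical, invented only to illustrate the cost.

import math


class CatalogServer:
    def __init__(self):
        self.round_trips = 0
        self.mappings = {}

    def insert(self, guid, pfn):          # one round trip per entry
        self.round_trips += 1
        self.mappings[guid] = pfn

    def bulk_insert(self, batch):         # one round trip per batch
        self.round_trips += 1
        self.mappings.update(batch)


entries = {f"guid-{i}": f"pfn-{i}" for i in range(1000)}

per_entry = CatalogServer()
for guid, pfn in entries.items():
    per_entry.insert(guid, pfn)

bulk = CatalogServer()
batch_size = 100
items = list(entries.items())
for i in range(0, len(items), batch_size):
    bulk.bulk_insert(dict(items[i:i + batch_size]))
```

With 1000 entries and batches of 100, the per-entry client pays 1000 round trips and the bulk client only 10, while both servers end up with identical mappings.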

Page 8: RLS Production Services


Towards a Distributed RLS

• RLS in LCG-2 still lacks consistent replication between multiple catalog servers

• EDG RLI component has not been deployed as part of LCG

• Central single catalog expected to result in scalability and availability problems

• Joint evaluation with CMS of Oracle asynchronous database replication as part of DC04 (in parallel to production)

• Tested a minimal (two node) multi-master system between CERN and CNAF

• Catalog inserts/update propagated in both directions

• First results:
  - The RLS application could be deployed with only minor changes
  - No stability or performance problems observed so far
  - Network problems and temporary server unavailability were handled gracefully
  - Unfortunately, the setup could not be tested in full production mode during DC04 due to lack of time and resources

Page 9: RLS Production Services


Next Generation RLS

• The LCG Grid Deployment group is currently working with the experiments to gather requirements for the next-generation RLS
  - Taking into account the experience from DC04

• Build on the DC04 work: move to replicated rather than distributed catalogs?

• Still need to prove:
  - Stability and performance with production access patterns
  - Scaling to a sufficient number of replicas (4-6 Tier1 sites?)
  - Automated resolution of catalog conflicts that may arise as a consequence of asynchronous replication

• Propose to continue the evaluation, possibly using Oracle Streams
  - In the context of the Distributed Database Deployment activity, in the LCG deployment area
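One possible automated conflict-resolution policy for asynchronously replicated catalogs is timestamp-based "last writer wins", sketched below. This is only an illustration of the open problem on the slide, under invented names; it is not what Oracle asynchronous replication or Streams actually implements.

```python
# Sketch of "last writer wins" conflict resolution between two catalog
# masters that accepted conflicting updates while replicating asynchronously.
# Purely illustrative; not the Oracle replication conflict handler.

def merge_catalogs(local, remote):
    """Each catalog maps GUID -> (pfn, timestamp). On conflicting updates
    to the same GUID at two masters, keep the most recent write."""
    merged = dict(local)
    for guid, (pfn, ts) in remote.items():
        if guid not in merged or ts > merged[guid][1]:
            merged[guid] = (pfn, ts)
    return merged


# Both masters updated guid-1; CERN's write is newer and wins.
cern = {"guid-1": ("pfn-cern-v2", 200), "guid-2": ("pfn-cern", 50)}
cnaf = {"guid-1": ("pfn-cnaf-v1", 100), "guid-3": ("pfn-cnaf", 80)}

merged = merge_catalogs(cern, cnaf)
```

Even this simple policy shows why the slide flags conflict resolution as something still to prove: it silently discards the losing write, which may not be acceptable for catalog data.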

Page 10: RLS Production Services


Summary

• The Replica Location Service is a central part of the LCG infrastructure

• Strong requirements in terms of reliability of the service
  - Significant contribution from GridPP-funded people

• The LCG-RLS middleware and service have passed their first production test
  - Good service stability was achieved

• Experience in the Data Challenges has proved essential for improving the performance and scalability of the RLS middleware

• Oracle replication tests are expected to provide important input to define replicated RLS and handling of distributed metadata in general

Page 11: RLS Production Services


The RLS Supported Configuration

• A “Local Replica Catalogue” (LRC)
  - Contains the GUID <-> PFN mapping for all local files

• A “Replica Metadata Catalogue” (RMC)
  - Contains the GUID <-> LFN mapping for all local files, plus all file metadata information

• A “Replica Location Index” (RLI) <-- Not deployed in LCG-2
  - Allows files at other sites to be found
  - All LRCs are configured to publish to all remote RLIs
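The LRC/RLI interplay described above can be sketched as follows: each site's LRC holds GUID -> PFN mappings for its local files and publishes its GUID list to the RLIs, and a remote lookup first consults an RLI to learn which site's LRC to query. Class and method names are invented for illustration; this is not the EDG API.

```python
# Illustrative sketch of the supported configuration: LRCs hold local
# GUID -> PFN mappings and publish their GUID lists to remote RLIs, which
# index only which site holds each GUID. Names are invented.

class LRC:
    """Local Replica Catalogue: GUID -> PFN for files at one site."""
    def __init__(self, site):
        self.site = site
        self.pfns = {}  # GUID -> local PFN


class RLI:
    """Replica Location Index: GUID -> set of sites holding replicas."""
    def __init__(self):
        self.index = {}

    def publish(self, lrc):
        for guid in lrc.pfns:
            self.index.setdefault(guid, set()).add(lrc.site)


cern_lrc, cnaf_lrc = LRC("CERN"), LRC("CNAF")
cern_lrc.pfns["guid-1"] = "pfn://cern/file1"
cnaf_lrc.pfns["guid-2"] = "pfn://cnaf/file2"

rli = RLI()
for lrc in (cern_lrc, cnaf_lrc):   # all LRCs publish to all remote RLIs
    rli.publish(lrc)

# Finding a remote file: ask the RLI which site has it, then query that LRC.
sites = rli.index["guid-2"]
```

Note that the RLI stores no PFNs itself, only site membership; the PFN lookup always ends at the owning site's LRC.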