Page 1: Title slide

Data access and Storage

And more, from the xrootd and Scalla perspective

Fabrizio Furano, CERN IT/GS

July 2008
South African National Compute Grid Training, Deployment and Strategy Meeting
University of Cape Town

http://savannah.cern.ch/projects/xrootd
http://xrootd.slac.stanford.edu

Page 2: The historical Problem: data access


Physics experiments rely on rare events and statistics

◦ A huge amount of data is needed to get a significant number of events
◦ The typical data store can reach 5-10 PB… now
◦ Millions of files, thousands of concurrent clients

Each one opening many files (about 100-150 in ALICE, up to 1000 in GLAST)

Each one keeping many open files

◦ The transaction rate is very high: O(10^3) file opens/sec per cluster is not uncommon

Average, not peak

Traffic sources: local GRID site, local batch system, WAN

Need scalable, high-performance data access
◦ No imposed limits on performance, size, or connectivity


Page 3: What is Scalla?


The evolution of the BaBar-initiated xrootd project
◦ Data access designed with HEP requirements in mind
◦ But nevertheless a fully generic platform

Scalla: Structured Cluster Architecture for Low Latency Access
◦ Low-latency access to data via xrootd servers
POSIX-style byte-level random access (see the sketch below)
By default, arbitrary data organized as files
Hierarchical directory-like name space
The protocol includes high-performance features
Exponentially scalable and self-organizing
◦ Tools and methods to cluster, harmonize, connect, …
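To make "POSIX-style byte-level random access" concrete, here is a minimal sketch of a client program. It assumes the xrootd POSIX preload library (libXrdPosixPreload) is LD_PRELOADed so that plain POSIX calls accept root:// URLs; the redirector host and file path are hypothetical.

```cpp
// Plain POSIX I/O against an xrootd cluster, via the POSIX preload library.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Hypothetical URL: any root://host[:port]//path works the same way
    int fd = open("root://redirector.example:1094//store/run1/data.root", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    // Byte-level random access: read 4 KB starting at a 1 MB offset
    char buf[4096];
    ssize_t n = pread(fd, buf, sizeof buf, 1024 * 1024);
    if (n < 0) std::perror("pread");
    else std::printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}
```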


Page 4: xrootd Plugin Architecture

[Diagram: the xrootd plugin stack. A Protocol Driver (XRD) hosts one of n protocols (xrootd); the protocol layer sits on a File System layer (ofs, sfs, alice, etc.), which plugs in authentication (gsi, krb5, etc.), authorization (name based), clustering (cmsd), lfn2pfn prefix encoding, and a Storage System back-end (oss, drm/srm, etc.)]

Page 5: Different usages

Default set of plugins:
◦ Scalable file server functionality

Its primary historical function

◦ To be used in common data management schemes

The ROOT framework bundles it as it is
◦ And provides one more plugin: XrdProofdProtocol

◦Plus several other ROOT-side classes

◦ The heart of PROOF: the Parallel ROOT Facility
A completely different task, obtained by loading a different plugin
Massive low-latency parallel computing of independent items (events, in physics)
Using the characteristics of the xrootd framework (see the sketch below)
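As an illustration of the PROOF usage, here is a minimal sketch of a ROOT session driving a parallel analysis; the master host, tree name, file URL, and selector are hypothetical, and the exact calls should be checked against the ROOT version in use.

```cpp
// ROOT macro: parallel processing with PROOF (illustrative sketch)
{
   // Connect to a PROOF master, i.e. an xrootd loading XrdProofdProtocol
   TProof *proof = TProof::Open("proofmaster.example");
   if (!proof) return;

   // Describe the data set: a TTree named "T" in the listed files
   TDSet *dset = new TDSet("TTree", "T");
   dset->Add("root://redirector.example//store/run1/data.root");

   // Workers process independent slices of events in parallel
   dset->Process("MySelector.cxx+");
}
```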


Page 6: Most famous basic features


No weird configuration requirements
◦ Setup complexity scales with the requirements' complexity

Fault tolerance

High, scalable transaction rate

Open many files per second. Double the system and double the rate.

NO DBs! Would you put one in front of your laptop’s file system?

No known limitations in size and global throughput for the repo

Very low CPU usage
◦ Happy with many clients per server
Thousands. But check their bandwidth consumption against the disk/network performance!

WAN friendly (client + protocol + server)
◦ Enables efficient remote POSIX-like data access

WAN friendly (server clusters)
◦ WAN-wide repositories can be set up by aggregating remote clusters


Page 7: Basic working principle



[Diagram: a client contacting a small 2-level cluster of cmsd+xrootd pairs, P2P-like. A small 2-level cluster can hold up to 64 servers]

Page 8: Simple LAN clusters


[Diagram: a simple cluster, several cmsd+xrootd data-server pairs behind a cmsd+xrootd redirector]
Simple cluster: up to 64 data servers, 1-2 manager redirectors

[Diagram: a larger tree of cmsd+xrootd pairs with an intermediate manager level]
Advanced cluster: up to 4096 (2 levels) or 262K (3 levels) data servers
Everything can have hot spares

Page 9: Single point performance

Very carefully crafted, heavily multithreaded
◦ Server side: promotes speed and scalability
High level of internal parallelism + stateless
Exploits OS features (e.g. async I/O, polling, selecting)
Many speed- and scalability-oriented features
Supports thousands of client connections per server
No interactions with complicated things to do simple tasks
◦ Client: handles the state of the communication
Reconstructs everything to present it as a simple interface
Fast data path
Network pipeline coordination + latency hiding
Supports connection multiplexing + intelligent server cluster crawling
Server and client exploit multi-core CPUs natively


Page 10: Fault tolerance

Server side
◦ If servers go, the overall functionality can be fully preserved
Redundancy, MSS staging of replicas, …
◦ "Can" means that weird deployments can give it up

E.g. storing in a DB the physical endpoint addresses for each file. Generally a bad idea.

Client side (+ protocol)
◦ The client crawls the server metacluster looking for data
◦ The application never notices errors
Totally transparent, until they become fatal

i.e. when it becomes really impossible to get to a working endpoint to resume the activity

◦ Typical tests (try it!)
Disconnect/reconnect network cables
Kill/restart servers


Page 11: Available auth protocols

Password-based (pwd)
◦ Either system or dedicated password file
◦ User account not needed

GSI (gsi)
◦ Handles GSI proxy certificates
◦ VOMS support should be OK now (Andreas, Gerri)
◦ No need for Globus libraries (and super-fast!)

Kerberos IV, V (krb4, krb5)
◦ Ticket forwarding supported for krb5

Fast ID (unix, host), to be used with authorization

ALICE security tokens
◦ Emphasis on ease of setup and performance

Courtesy of Gerardo Ganis (CERN PH-SFT)

Page 12: The "many" paradigm

Creating big clusters scales linearly
◦ In throughput and size, while keeping latency very low

We like the idea of a disk-based cache
◦ The bigger (and faster), the better

So, why not use the disk of every WN in a dedicated farm?
◦ 500 GB × 1000 WNs ≈ 500 TB
◦ The additional CPU usage is quite low anyway

Can be used to set up a huge cache in front of an MSS
◦ No need to buy a bigger MSS, just lower the miss rate!

Adopted at BNL for STAR (up to 6-7 PB online)
◦ See Pavel Jakl's (excellent) thesis work

They also optimize MSS access to nearly double the staging performance

Quite similar to the PROOF approach to storage
◦ Only storage: PROOF is very different on the computing side


Page 13: WAN direct access – Motivation

We want to make WAN data analysis convenient
◦ A process does not always read every byte in a file
◦ Even if it does… no problem
◦ The typical way in which HEP data is processed is (or can be) often known in advance

TTreeCache in ROOT does an amazing job for this (see the sketch below)

◦ xrootd: fast and scalable server side
Makes things run quite smoothly

Gives room for improvement on the client side
◦ About WHEN to transfer the data
There might be better moments to trigger a chunk transfer than the moment it is needed
The app does not have to wait while receiving data… it proceeds in parallel
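A minimal sketch of what enabling TTreeCache looks like in practice; the file URL, tree name, and cache size are illustrative.

```cpp
// ROOT macro: remote tree read with TTreeCache enabled (sketch)
{
   TFile *f = TFile::Open("root://redirector.example//store/run1/data.root");
   if (!f) return;
   TTree *t = (TTree*)f->Get("T");

   // The cache learns the branch access pattern and then fetches data
   // in a few large reads instead of many small synchronous ones
   t->SetCacheSize(30 * 1024 * 1024);   // 30 MB

   for (Long64_t i = 0; i < t->GetEntries(); ++i)
      t->GetEntry(i);                   // served mostly from the cache
}
```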


Page 14: WAN direct access – hiding latency


[Diagram comparing three data access strategies along a timeline of processing vs data access:
◦ Pre-transfer data "locally": overhead, need for potentially useless replicas, and a huge bookkeeping burden!
◦ Remote access: latency, wasted CPU cycles, but easy to understand
◦ Remote access + latency hiding (data arrives in parallel with processing): interesting! Efficient and practical]

Page 15: Dumb WAN Access*


Setup: client at CERN, data at SLAC
◦ 164 ms RTT, available bandwidth < 100 Mb/s

Smart features switched OFF
◦ Test 1: read a large ROOT tree (~300 MB, 200k interactions)
Expected time: 38000 s (latency) + 750 s (data) + CPU ≈ 10 hrs!
No time to waste measuring this precisely!

◦ Test 2: draw a histogram from that tree data (~6k interactions)
Measured time: 20 min

Using xrootd with WAN optimizations disabled
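A back-of-envelope check of where those hours go, assuming one synchronous round trip per read interaction:

    T_latency ≈ 200000 requests × 0.164 s RTT ≈ 32800 s ≈ 9 hours

which matches the order of the 38000 s latency figure quoted above; the data payload itself is a minor term by comparison.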


*Federico Carminati, The ALICE Computing Status and Readiness, LHCC, November 2007

Page 16: Smart WAN Access*


Smart features switched ON
◦ ROOT TTreeCache + XrdClient async mode + 15× multistreaming (see the sketch below)
◦ Test 1 actual time: 60-70 seconds
Compared to 30 seconds using a Gb LAN
Very favorable for sparsely used files
… in the end, even much better than certain always-overloaded SEs…
◦ Test 2 actual time: 7-8 seconds
Comparable to LAN performance (5-6 s)
100× improvement over dumb WAN access (was 20 minutes)
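A sketch of how a ROOT client might switch these features on; the XNet resource names follow ROOT's TXNetFile configuration, but names, defaults, and values should be verified against the ROOT/xrootd versions in use.

```cpp
// ROOT macro: illustrative WAN tuning for xrootd access (sketch)
{
   // Several TCP streams per physical connection (cf. 15x multistreaming)
   gEnv->SetValue("XNet.ParStreamsPerPhyConn", 15);

   // Generous read-ahead and client-side cache to hide the 164 ms RTT
   gEnv->SetValue("XNet.ReadAheadSize", 1024 * 1024);        // bytes
   gEnv->SetValue("XNet.ReadCacheSize", 50 * 1024 * 1024);   // bytes

   // Then open the remote file as usual (hypothetical URL)
   TFile *f = TFile::Open("root://redirector.example//store/run1/data.root");
}
```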


*Federico Carminati, The ALICE Computing Status and Readiness, LHCC, November 2007

Page 17: Cluster globalization


Up to now, xrootd clusters could be populated
◦ With xrdcp from an external machine
◦ By writing to the backend store (e.g. CASTOR/DPM/HPSS etc.)

E.g. FTD in ALICE now uses the first. It "works"…
◦ Load and resource problems
◦ All the external traffic of the site goes through one machine close to the destination cluster

If a file is missing or lost
◦ Through a disk and/or catalog screwup
◦ Job failure… manual intervention needed
With 10^7 online files, finding the source of a problem can be VERY tricky


Page 18: Virtual MSS


Purpose:
◦ A request for a missing file comes to cluster X
◦ X assumes that the file ought to be there
And tries to get it from the collaborating clusters, starting with the fastest one

Note that X itself is part of the game
◦ And it is composed of many servers

The idea:
◦ Each cluster considers the set of ALL the others as a very big online MSS
◦ This is much easier than it seems

Slowly going into production for ALICE


Page 19: Cluster Globalization… an example


[Diagram: xrootd+cmsd clusters at CERN, GSI, Prague, NIHAM, … any other, all subscribing to the ALICE global redirector]

ALICE global redirector (alirdr):
all.role meta manager
all.manager meta alirdr.cern.ch:1312

root://alirdr.cern.ch/ includes the CERN, GSI, and other xroot clusters

Each subscribing cluster:
all.role manager
all.manager meta alirdr.cern.ch:1312

Meta managers can be geographically replicated
◦ Several can run in different places, for region-aware load balancing

Page 20: Many pieces

The global redirector acts as a WAN xrootd meta-manager
Local clusters subscribe to it
◦ And declare the path prefixes they export
◦ Local clusters (without a local MSS) treat the globality as a very big MSS
◦ Coordinated by the global redirector
Load balancing, negligible load
Priority to files which are online somewhere
Priority to fast, least-loaded sites
Fast file location

True, robust, realtime collaboration between storage elements!

◦Very attractive for tier-2s


Page 21: The Virtual MSS Realized

[Diagram: the same clusters (CERN, GSI, Prague, NIHAM, … any other), each configured with all.role manager and all.manager meta alirdr.cern.ch:1312, subscribed to the ALICE global redirector (all.role meta manager)]

Local clients work normally
But missing a file?
◦ Ask the global meta-manager
◦ Get it from any other collaborating cluster

Page 22: Virtual MSS – The vision

A powerful mechanism to increase reliability
◦ The data replication load is widely distributed
◦ Multiple sites are available for recovery

Allows virtually unattended operation
◦ Automatic restore after a server failure
Missing files in one cluster are fetched from another
Typically the fastest one which really has the file online
No costly out-of-date (and out-of-sync!) DB lookups
◦ Practically no need to track file locations
But this does not remove the need for metadata repositories


Page 23: Virtual MSS

The mechanism is there, fully "boxed"
◦ The new setup does almost everything that is needed

A (good) side effect:
◦ Pointing an app at the "area" global redirector gives a complete, load-balanced, low-latency view of the whole repository
◦ An app using the "smart" WAN mode can just run
Probably a full-scale production/analysis won't, for now
But what about a small interactive analysis on a laptop?

After all, HEP sometimes just copies everything, useful and not

I cannot say that in some years we will not have a more powerful WAN infrastructure

And using it to copy more useless data looks just ugly

If a web browser can do it, why not a HEP app? Looks just a little more difficult.

Better if used with a clear design in mind


Page 24: Data System vs File System

Scalla is a data access system
◦ Some users/applications want file system semantics

More transparent but much less scalable (transactional namespace)

For years users have asked…
◦ Can Scalla create a file system experience?

The answer is…
◦ It can, to a degree that may be good enough

We relied on FUSE to show how
Users must decide for themselves
◦ If they actually need a huge unique multi-PB filesystem… probably something else is "strange"


Page 25: What is FUSE

Filesystem in Userspace
◦ Used to implement a file system in a user-space program
◦ Linux 2.4 and 2.6 only
◦ Refer to http://fuse.sourceforge.net/

FUSE can be used to provide xrootd access
◦ Looks like a mounted file system

Several people have xrootd-based versions of this
◦ Wei Yang at SLAC
Tested and fully functional (used to provide SRM access for ATLAS)


Page 26: XrootdFS (Linux/FUSE/Xrootd)


[Diagram: on the client host, an application issues POSIX calls (opendir, create, mkdir, mv, rm, rmdir, …) through the kernel to FUSE; in user space the FUSE/Xroot interface uses the xrootd POSIX client to reach the redirector host, which runs the redirector xrootd on port 1094 and a name-space xrootd on port 2094]

Should run cnsd on the servers to capture non-FUSE events
◦ And keep the FS namespace consistent!
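Once XrootdFS is mounted, ordinary programs just use POSIX calls on it. A minimal sketch, assuming a hypothetical mount point /mnt/xrootdfs:

```cpp
// Plain POSIX operations on an XrootdFS mount point (sketch)
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Namespace operations are routed through FUSE to the name-space xrootd
    mkdir("/mnt/xrootdfs/store/test", 0755);

    int fd = open("/mnt/xrootdfs/store/test/hello.txt", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { std::perror("open"); return 1; }

    write(fd, "hello\n", 6);  // data flows to a data server behind the redirector
    close(fd);

    // Metadata operation on the mounted namespace
    std::rename("/mnt/xrootdfs/store/test/hello.txt",
                "/mnt/xrootdfs/store/test/hello2.txt");
    return 0;
}
```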

Page 27: Why XrootdFS?

Makes some things much simpler
◦ Most SRM implementations run transparently
◦ Avoids pre-load library worries

But it impacts other things
◦ Performance is limited
Kernel-FUSE interactions are not cheap
The implementation is OK but quite simple-minded
Rapid file creation (e.g., tar) is limited
Remember that the comparison is with a plain xrootd cluster, which is much faster
◦ FUSE must be administratively installed to be used
Difficult if it involves many machines (e.g., batch workers)
Easier if it involves an SE node (i.e., an SRM gateway)

So, it's good for the SRM side of a repo
◦ But not so much for the job side


Page 28: Conclusion

Many new ideas are reality or coming, typically dealing with:
◦ True realtime data storage distribution
◦ Interoperability (Grid, SRMs, file systems, WANs…)
◦ Enabling interactivity (and storage is not the only part of it)

The setup encapsulation + vMSS is ready
◦ In production at CERN for ALICE::CERN::SE

Trying to avoid common mistakes
◦ Both manual and automated setups are honourable and to be honoured!


Page 29: Acknowledgements


Old and new software collaborators:
◦ Andy Hanushevsky, Fabrizio Furano (client side), Alvise Dorigo
◦ ROOT: Fons Rademakers, Gerri Ganis (security), Bertrand Bellenot (Windows porting)
◦ ALICE: Derek Feichtinger, Andreas Peters, Guenter Kickinger
◦ STAR/BNL: Pavel Jakl, Jerome Lauret

◦GSI: Kilian Schwartz

◦Cornell: Gregory Sharp

◦SLAC: Jacek Becla, Tofigh Azemoon, Wilko Kroeger, Bill Weeks

◦ Peter Elmer

Operational collaborators:
◦ BNL, CERN, CNAF, FZK, INFN, IN2P3, RAL, SLAC


Page 30: Single Level Switch


[Diagram: the client sends "open file X" to the redirector (head node); the redirector asks data servers A, B, C "Who has file X?"; C answers "I have"; the redirector tells the client "go to C", and the client opens file X directly on server C. A second open of X is redirected to C immediately: redirectors cache file locations]

The client sees all servers as xrootd data servers