Data access and Storage



Data access and Storage, and more
From the xrootd and Scalla perspective

Fabrizio Furano, CERN IT/GS
July 2008
South African National Compute Grid Training, Deployment and Strategy Meeting
University of Cape Town

http://savannah.cern.ch/projects/xrootd
http://xrootd.slac.stanford.edu

The historical problem: data access
- Physics experiments rely on rare events and statistics
- A huge amount of data is needed to collect a significant number of events
  - The typical data store can reach 5-10 PB now
  - Millions of files, thousands of concurrent clients
  - Each client opens many files (about 100-150 in ALICE, up to 1000 in GLAST) and keeps many of them open
- The transaction rate is very high
  - O(10^3) file opens/sec per cluster is not uncommon (average, not peak)
  - Traffic sources: local Grid site, local batch system, WAN
- Need scalable, high-performance data access
  - No imposed limits on performance, size or connectivity

What is Scalla?
- The evolution of the BaBar-initiated xrootd project
- Data access designed with HEP requirements in mind, but a fully generic platform nonetheless
- Structured Cluster Architecture for Low Latency Access
  - Low-latency access to data via xrootd servers
    - POSIX-style byte-level random access
    - By default, arbitrary data organized as files in a hierarchical, directory-like name space
    - The protocol includes high-performance features
  - Exponentially scalable and self-organizing
  - Tools and methods to cluster, harmonize, connect, ...

xrootd Plugin Architecture

[Diagram: the xrootd plugin architecture; the Protocol Driver (XRD) hosts the xrootd protocol (one of n protocols), with plugins for the file system (ofs, sfs, alice, etc.), authentication (gsi, krb5, etc.), authorization (name based), lfn2pfn prefix encoding, the storage system (oss, drm/srm, etc.) and clustering (cmsd)]

Different usages
- Default set of plugins: scalable file server functionality
  - Its primary, historical function, used in common data management schemes
- The ROOT framework bundles it as-is and provides one more plugin, XrdProofdProtocol, plus several other ROOT-side classes
  - This is the heart of PROOF, the Parallel ROOT Facility
  - A completely different task, obtained by loading a different plugin
  - Massive, low-latency parallel computing of independent items (events, in physics), using the characteristics of the xrootd framework

Most famous basic features
- No weird configuration requirements: setup complexity scales with the complexity of the requirements
- Fault tolerance
- High, scalable transaction rate
  - Open many files per second; double the system and you double the rate
  - NO databases! Would you put one in front of your laptop's file system?
  - No known limitations in size and global throughput for the repository
- Very low CPU usage
- Happy with many clients per server
  - Thousands, but check their bandwidth consumption against the disk/network performance!
- WAN friendly (client + protocol + server): enables efficient remote POSIX-like data access
- WAN friendly (server clusters): WAN-wide repositories can be set up by aggregating remote clusters

Basic working principle
[Diagram: a client and a small two-level, P2P-like cluster of cmsd/xrootd server pairs; such a cluster can hold up to 64 servers]

Simple LAN clusters
[Diagram: a simple cluster with up to 64 data servers and 1-2 manager redirectors]
[Diagram: an advanced, multi-level cluster of cmsd/xrootd pairs, with up to 4096 data servers over 2 levels or 262K over 3 levels; everything can have hot spares]

Single point performance
- Very carefully crafted, heavily multithreaded code
- Server side: promotes speed and scalability
  - High level of internal parallelism + stateless design
  - Exploits OS features (e.g. async I/O, polling, selecting)
  - Many, many speed- and scalability-oriented features
  - Supports thousands of client connections per server
  - No interactions with complicated things in order to do simple tasks
- Client side: handles the state of the communication
  - Reconstructs everything and presents it as a simple interface
  - Fast data path: network pipeline coordination + latency hiding
  - Supports connection multiplexing + intelligent crawling of the server cluster
- Server and client exploit multi-core CPUs natively

Fault tolerance
- Server side: if servers go down, the overall functionality can be fully preserved
  - Through redundancy, MSS staging of replicas, ...
  - "Can" means that weird deployments can give it up, e.g. storing in a DB the physical endpoint address of each file; generally a bad idea
- Client side (+ protocol): the client crawls the server metacluster looking for the data
  - The application never notices errors; they are totally transparent until they become fatal, i.e. until it becomes really impossible to reach a working endpoint and resume the activity
- Typical tests (try it!): disconnect/reconnect network cables, kill/restart servers

Available auth protocols
- Password-based (pwd): either the system password file or a dedicated one; a user account is not needed
- GSI (gsi): handles GSI proxy certificates; VOMS support should be OK now (Andreas, Gerri); no need for the Globus libraries (and super-fast!)
- Kerberos IV, V (krb4, krb5): ticket forwarding supported for krb5
- Fast ID (unix, host), to be used together with authorization
- ALICE security tokens
- Emphasis on ease of setup and performance
(Courtesy of Gerardo Ganis, CERN PH-SFT)

The "many" paradigm
- Creating big clusters scales linearly: throughput and size grow while latency stays very low
- We like the idea of a disk-based cache: the bigger (and faster), the better
  - So why not use the disk of every worker node? In a dedicated farm, 500 GB * 1000 WNs = 500 TB
  - The additional CPU usage is quite low anyway
- Can be used to set up a huge cache in front of an MSS
  - No need to buy a bigger MSS, just lower the miss rate!
  - Adopted at BNL for STAR (up to 6-7 PB online); see Pavel Jakl's (excellent) thesis work
  - They also optimize MSS access, nearly doubling the staging performance
- Quite similar to the PROOF approach to storage, but only for the storage part:
  PROOF is very different for the computing part.

WAN direct access: motivation
- We want to make WAN data analysis convenient
- A process does not always read every byte in a file (and even if it does, no problem)
- The typical way in which HEP data is processed is (or can be) often known in advance
  - TTreeCache in ROOT does an amazing job of exploiting this
- xrootd: a fast and scalable server side makes things run quite smoothly
- That leaves room for improvement on the client side, about WHEN the data is transferred
  - There might be better moments to trigger a chunk transfer than the moment the data is needed
  - The application then does not have to wait, because it receives the data in parallel

WAN direct access: hiding latency
[Diagram comparing three strategies: pre-transferring the data locally (data-access overhead, need for potentially useless replicas and a huge bookkeeping effort, but easy to understand); plain remote access (latency, wasted CPU cycles); and remote access overlapped with data processing (interesting, efficient, practical)]

Dumb WAN access*
- Setup: client at CERN, data at SLAC; 164 ms RTT, available bandwidth < 100 Mb/s; smart features switched OFF
- Test 1: read a large ROOT tree (~300 MB, ~200k interactions)
  - Expected time: 38000 s (latency) + 750 s (data) + CPU, about 10 hours! No time to waste measuring this precisely.
- Test 2: draw a histogram from that tree's data (~6k interactions)
  - Measured time: 20 minutes, using xrootd with the WAN optimizations disabled

Smart WAN access*
- Smart features switched ON: ROOT TTreeCache + XrdClient async mode + multistreaming over 15 streams (a minimal usage sketch follows below)
- Test 1 actual time: 60-70 seconds, compared to 30 seconds over a Gb LAN
  - Very favorable for sparsely used files; in the end even much better than certain always-overloaded SEs
- Test 2 actual time: 7-8 seconds, comparable to LAN performance (5-6 s)
  - A 100x improvement over dumb WAN access (which took 20 minutes)

* Federico Carminati, "The ALICE Computing Status and Readiness", LHCC, November 2007
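
As a rough illustration of the smart access mode above, here is a minimal ROOT sketch; the server host, file path, tree name and cache size are hypothetical, and TTreeCache tuning details vary between ROOT versions. It opens a file over the WAN through xrootd and enables TTreeCache before looping over the entries.

    // wan_read.C -- minimal sketch of "smart" WAN access with ROOT + xrootd.
    // Host, path, tree name and cache size are illustrative, not from the talk.
    #include "TFile.h"
    #include "TTree.h"

    void wan_read()
    {
       // Open the file directly over the WAN through an xrootd server or
       // redirector; redirections and reconnections are handled by the client.
       TFile *f = TFile::Open("root://xrdserver.slac.stanford.edu//store/bigtree.root");
       if (!f || f->IsZombie()) return;

       TTree *t = 0;
       f->GetObject("events", t);            // hypothetical tree name
       if (!t) return;

       // Enable TTreeCache: after a short learning phase ROOT prefetches the
       // baskets of the used branches in a few large requests, hiding the
       // round-trip latency instead of paying it on every small read.
       t->SetCacheSize(30 * 1024 * 1024);    // 30 MB, illustrative value

       for (Long64_t i = 0; i < t->GetEntries(); ++i)
          t->GetEntry(i);                    // reads are served from the cache

       f->Close();
    }

The point is that the application code is identical to the local-access case; only the URL and the cache hint change.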

Cluster globalization
- Up to now, xrootd clusters could be populated:
  - with xrdcp from an external machine, or
  - by writing to the backend store (e.g. CASTOR/DPM/HPSS etc.)
  - FTD in ALICE, for example, currently uses the first method. It works, but it has load and resource problems: all the external traffic of the site goes through one machine close to the destination cluster.
- If a file is missing or lost (because of a disk and/or catalogue screwup), the job fails and manual intervention is needed
  - With 10^7 files online, finding the source of a problem can be VERY tricky

Virtual MSS
- Purpose: when a request for a missing file arrives at cluster X, X assumes that the file ought to be there and tries to fetch it from the collaborating clusters, starting from the fastest one
  - Note that X itself is part of the game, and it is composed of many servers
- The idea: each cluster considers the set of ALL the others as a very big online MSS
- This is much easier than it seems; slowly going into production for ALICE

Cluster globalization: an example
[Diagram: the ALICE global redirector (alirdr), configured with "all.role meta manager" and "all.manager meta alirdr.cern.ch:1312", federates xrootd/cmsd clusters at CERN, GSI, Prague, NIHAM and others; each site redirector runs with "all.role manager" and "all.manager meta alirdr.cern.ch:1312"]
- The federation is reachable as root://alirdr.cern.ch/ and includes the CERN, GSI and other xroot clusters
- Meta managers can be geographically replicated: several can run in different places for region-aware load balancing

Many pieces
- The global redirector acts as a WAN xrootd meta-manager
- Local clusters subscribe to it and declare the path prefixes they export
- Local clusters (without a local MSS) treat the globality as a very big MSS, coordinated by the global redirector
- Load balancing, with negligible load
  - Priority to files which are already online somewhere
  - Priority to fast, least-loaded sites
  - Fast file location
- True, robust, real-time collaboration between storage elements! Very attractive for Tier-2s

The Virtual MSS realized
[Diagram: the same federation as above; the CERN, GSI, Prague, NIHAM and other xrootd/cmsd clusters run with "all.role manager" and "all.manager meta alirdr.cern.ch:1312" and subscribe to the ALICE global redirector ("all.role meta manager"); local clients work normally, but when a file is missing the cluster asks the global meta-manager and fetches the file from any other collaborating cluster]
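
As a small sketch of what "local clients work normally" looks like in practice: an application can simply point at the global redirector and let the federation locate the data. The ROOT call below is illustrative and the file path is hypothetical; only the redirector host alirdr.cern.ch comes from the example above.

    // global_view.C -- sketch: open a file through the ALICE global redirector.
    // The file path is hypothetical.
    #include "TFile.h"

    void global_view()
    {
       // The meta-manager locates the file in whichever collaborating cluster
       // has it online and transparently redirects the client there.
       TFile *f = TFile::Open("root://alirdr.cern.ch//alice/data/example.root");
       if (f && !f->IsZombie())
          f->ls();          // browse the remote file as if it were local
       delete f;
    }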

Virtual MSS: the vision
- A powerful mechanism to increase reliability
  - The data replication load is widely distributed, and multiple sites are available for recovery
- Allows virtually unattended operation
  - Automatic restore after a server failure: missing files in one cluster are fetched from another, typically the fastest one which really has the file online
  - No costly out-of-time (and out-of-sync!) DB lookups
- Practically no need to track file locations
  - But it does not remove the need for metadata repositories

Virtual MSS
- The mechanism is there, fully boxed; the new setup does almost everything that is needed
- A (good) side effect: pointing an application to the area's global redirector gives a complete, load-balanced, low-latency view of the whole repository
  - An application using the smart WAN mode can just run
  - Probably a full-scale production/analysis won't do that today, but what about a small interactive analysis on a laptop?
- After all, HEP sometimes just copies everything, useful or not
  - I cannot say that in a few years we will not have a more powerful WAN infrastructure, but using it to copy more useless data just looks ugly
  - If a web browser can do it, why not a HEP application? It only looks a little more difficult.
- Better if used with a clear design in mind

Data System vs File System
- Scalla is a data access system
- Some users/applications want file system semantics
  - More transparent, but much less scalable (a transactional namespace)
- For years users have asked: can Scalla provide a file system experience?
- The answer is: it can, to a degree that may be good enough
  - We relied on FUSE to show how
  - Users should decide for themselves whether they actually need a huge, multi-PB unique file system; if they do, probably something else is strange

What is FUSE
- Filesystem in Userspace: used to implement a file system in a user-space program
  - Linux 2.4 and 2.6 only; see http://fuse.sourceforge.net/
- FUSE can be used to provide xrootd access that looks like a mounted file system
- Several people have xrootd-based versions of this
  - Wei Yang at SLAC: tested and fully functional (used to provide SRM access for ATLAS)

XrootdFS (Linux/FUSE/Xrootd)
[Diagram: an application on the client host goes through the POSIX file system interface and the kernel's FUSE module to a user-space FUSE/Xroot interface built on the xrootd POSIX client; opendir, create, mkdir, mv, rm and rmdir go to a dedicated name-space xrootd (port 2094) while data access goes to the redirector xrootd (port 1094); a cnsd should run on the servers to capture non-FUSE events and keep the file-system namespace consistent]

Why XrootdFS?
- It makes some things much simpler
  - Most SRM implementations run transparently
  - Avoids pre-load library worries
- But it impacts other things
  - Performance is limited: kernel-FUSE interactions are not cheap
  - The implementation is OK but quite simple-minded; rapid file creation (e.g. tar) is limited
  - Remember that the comparison is with a plain xrootd cluster, which is much faster
- FUSE must be administratively installed to be used
  - Difficult if it involves many machines (e.g. batch workers), easier if it involves one SE node (i.e. an SRM gateway)
- So it is good for the SRM side of a repository, but not so much for the job side
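
To make the file-system experience concrete, here is a minimal sketch; the /xrootdfs mount point and the file path are invented for illustration. Once XrootdFS is mounted through FUSE, an unmodified program can use ordinary POSIX calls.

    // posix_read.cc -- reading a file through a hypothetical XrootdFS mount.
    // The mount point and path are illustrative only.
    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        // Looks like any local file; FUSE forwards the calls to xrootd.
        const char *path = "/xrootdfs/atlas/data/example.dat";

        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        ssize_t n;
        long long total = 0;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            total += n;               // ordinary byte-level access

        close(fd);
        std::printf("read %lld bytes\n", total);
        return 0;
    }

This is why SRM implementations and other POSIX-only tools can run on top of it unchanged, at the cost of the kernel-FUSE overhead noted above.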
Conclusion
- Many new ideas are reality or coming, typically dealing with:
  - True real-time data storage distribution
  - Interoperability (Grid, SRMs, file systems, WANs)
  - Enabling interactivity (and storage is not the only part of it)
- The setup encapsulation + vMSS is ready and in production at CERN for ALICE::CERN::SE
  - It tries to avoid common mistakes
- Both manual and automated setups are honourable, and to be honoured!

Acknowledgements
- Old and new software collaborators
  - Andy Hanushevsky, Fabrizio Furano (client side), Alvise Dorigo
  - ROOT: Fons Rademakers, Gerri Ganis (security), Bertrand Bellenot (Windows porting)
  - ALICE: Derek Feichtinger, Andreas Peters, Guenter Kickinger
  - STAR/BNL: Pavel Jakl, Jerome Lauret
  - GSI: Kilian Schwartz
  - Cornell: Gregory Sharp
  - SLAC: Jacek Becla, Tofigh Azemoon, Wilko Kroeger, Bill Weeks
  - Peter Elmer
- Operational collaborators
  - BNL, CERN, CNAF, FZK, INFN, IN2P3, RAL, SLAC

Single Level Switch
[Diagram: a client asks the redirector (head node) to open file X; the redirector asks the data servers A, B and C "Who has file X?", server C answers "I have", and the client is told "go to C"; the client sees all servers as xrootd data servers, and on a second open of X the redirector, which caches file locations, sends the client to C immediately]