PROOF: the Parallel ROOT Facility
Scheduling and Load-balancing

ACAT 2007

Jan Iwaszkiewicz ¹ ², Gerardo Ganis ¹, Fons Rademakers ¹

¹ CERN PH/SFT
² University of Warsaw

ACAT 23 - 27th of April 2007

Jan Iwaszkiewicz, CERN PH/SFT

Outline

• Introduction to the Parallel ROOT Facility
• Packetizer – load balancing
• Resource Scheduling

Analysis of the Large Hadron Collider data

• Necessity of distributed analysis
• ROOT – popular particle physics data analysis framework
• PROOF (ROOT's extension) – automatically parallelizes processing on computing clusters or multicore machines

Who is using PROOF

• PHOBOS
  – MIT, dedicated cluster, interfaced with Condor
  – Real data analysis, in production
• ALICE
  – CERN Analysis Facility (CAF)
• CMS
  – Santander group, dedicated cluster
  – Physics TDR analysis

Very positive experience
• functionality, large speedup, efficient

But not really the LHC scenario
• Usage limited to a few experienced users

Using PROOF: example

• PROOF is designed for analysis of independent objects, e.g. ROOT Trees (basic data format in particle physics)
• Example of processing a set of ROOT trees:

Local ROOT:

    // Create a chain of trees
    root[0] TChain *c = CreateMyChain();
    // MySelec is a TSelector
    root[1] c->Process("MySelec.C+");

PROOF:

    // Create a chain of trees
    root[0] TChain *c = CreateMyChain();
    // Start PROOF and tell the chain to use it
    root[1] TProof::Open("masterURL");
    root[2] c->SetProof()
    // Process goes via PROOF
    root[3] c->Process("MySelec.C+");

Classic batch processing

[Diagram. Labels: catalog, query, data file splitting, myAna.C, submit, batch farm (manager, queues), jobs, storage, files, outputs, merging, final analysis]

• static use of resources
• jobs frozen: 1 job / worker node
• external splitting, merging

PROOF processing

[Diagram. Labels: catalog, query, storage, files, PROOF farm (scheduler, MASTER), PROOF job: data file list, myAna.C, feedbacks (merged), final outputs (merged)]

• farm perceived as extension of local PC
• same syntax as in local session
• more dynamic use of resources
• real time feedback
• automated splitting and merging

Challenges for PROOF

• Remain efficient under heavy load
• 100% exploitation of resources
• Reliability

Levels of scheduling

• The packetizer
  – Load balancing on the level of a job
• Resource scheduling (assigning resources to different jobs)
  – Introducing a central scheduler
  – Priority based scheduling on worker nodes

Packetizer's role

• Lookup – check locations of all files and initiate staging, if needed
• Workers contact packetizer and ask for new packets (pull architecture)
• A Packet has info on
  – which file to open
  – which part of file to process
• Packetizer keeps assigning packets until the dataset is processed
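
The pull model can be sketched in a few lines of C++. This is a minimal illustration under stated assumptions, not PROOF's actual packetizer: `Packet`, `FileInfo`, `SimplePacketizer` and `NextPacket` are hypothetical names, packets have a fixed size, and file lookup and staging are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical packet: which file to open and which part of it to process.
struct Packet {
    std::string file;
    long long firstEntry;
    long long numEntries;
};

// Hypothetical description of one file in the dataset.
struct FileInfo {
    std::string name;
    long long entries;
};

// Illustrative pull-style packetizer (an assumption, not the real TPacketizer).
class SimplePacketizer {
public:
    SimplePacketizer(std::vector<FileInfo> files, long long packetSize)
        : files_(std::move(files)), packetSize_(packetSize) {
        assert(packetSize_ > 0);
    }

    // A worker calls this whenever it is ready for more work.
    // Returns false once the whole dataset has been handed out.
    bool NextPacket(Packet& out) {
        // Skip files that are exhausted (or empty).
        while (fileIdx_ < files_.size() && offset_ >= files_[fileIdx_].entries) {
            ++fileIdx_;
            offset_ = 0;
        }
        if (fileIdx_ >= files_.size()) return false;

        const FileInfo& f = files_[fileIdx_];
        const long long n = std::min(packetSize_, f.entries - offset_);
        out = Packet{f.name, offset_, n};
        offset_ += n;
        return true;
    }

private:
    std::vector<FileInfo> files_;
    long long packetSize_;
    std::size_t fileIdx_ = 0;  // file currently being split into packets
    long long offset_ = 0;     // first unassigned entry in that file
};
```

Because workers ask for packets only when they are idle, a faster worker simply calls `NextPacket` more often; the packetizer does not need to know worker speeds in this simple form.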

PROOF dynamic load balancing

• Pull architecture guarantees scalability
• Adapts to variations in performance

[Diagram: packets exchanged over time between the Master and Worker 1 ... Worker N; packet: unit of work distribution]

TPacketizer: the original packetizer

• Strategy
  – Each worker processes its local files and then processes remaining remote files
  – Fixed size packets
  – Avoid overloading data server by allowing max 4 remote files to be served
• Problems with the TPacketizer
  – Long tails with some I/O bound jobs

Performance tests with ALICE

• 35 PCs, dual Xeon 2.8 GHz, ~200 GB disk
  – Standard CERN hardware for LHC
• Machine pools managed by xrootd
  – Data of Physics Data Challenge '06 distributed (~ 1 M events)
• Tests performed
  – Speedup (scalability) tests
  – System response when running a combination of job types for increasing # of concurrent users

Example of problems with some I/O bound jobs

[Plot: processing rate during a query]

[Plot: resource utilization]

How to improve

• Focus on I/O based jobs
  – Limited by hard drive or network bandwidth
• Predict which data servers can become bottlenecks
• Make sure that other workers help analyze data from those servers
• Use time-based packet sizes
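
One way to read "time-based packet sizes" is that a packet should take roughly the same wall-clock time on every worker, so faster workers get larger packets. The helper below is a hedged sketch of that idea only; the function name `TimeBasedPacketSize`, its parameters and the clamping rule are assumptions made for illustration and are not taken from the PROOF code.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: pick a packet size (in entries) so that the packet
// is expected to take about `targetSeconds` on a worker that has been
// measured at `entriesPerSecond`.
long long TimeBasedPacketSize(double entriesPerSecond, double targetSeconds,
                              long long minEntries, long long remaining) {
    assert(minEntries > 0 && targetSeconds > 0.0 && remaining >= 0);
    long long size = static_cast<long long>(entriesPerSecond * targetSeconds);
    size = std::max(size, minEntries);   // avoid tiny packets on slow workers
    return std::min(size, remaining);    // never exceed what is left to process
}
```

For example, with a 2-second target a worker measured at 500 entries/s would get 1000 entries per packet, while a worker at 10 entries/s would fall back to the minimum size.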

TAdaptivePacketizer

• Strategy
  – Predicting the processing time of local files for each worker
  – For the workers that are expected to finish faster, keep assigning remote files from the beginning of the job.
  – Assign remote files from the most heavily loaded file servers
  – Variable packet size
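
The strategy can be illustrated with a toy decision rule. The sketch below is not the TAdaptivePacketizer algorithm; it is a simplified heuristic written for this transcript, and `Node`, `PredictedLocalTime` and `ChooseSource` are hypothetical names. It assumes one worker per file server, predicts how long each worker needs for its local data, and sends workers that are expected to finish earlier than average to the most heavily loaded server.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical view of one machine that is both a file server and a worker.
struct Node {
    std::string host;
    long long localEntriesLeft;  // unprocessed entries stored on this host
    double entriesPerSecond;     // measured processing rate of its worker
};

// Predicted time (seconds) for the worker to finish its local data.
double PredictedLocalTime(const Node& n) {
    assert(n.entriesPerSecond > 0.0);
    return static_cast<double>(n.localEntriesLeft) / n.entriesPerSecond;
}

// Returns the index of the server the given worker should read from next.
// Assumed rule: a worker predicted to finish earlier than the cluster
// average helps the server with the longest predicted time; other workers
// keep processing their local files.
std::size_t ChooseSource(const std::vector<Node>& nodes, std::size_t worker) {
    assert(worker < nodes.size());
    double sum = 0.0;
    std::size_t slowest = 0;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        const double t = PredictedLocalTime(nodes[i]);
        sum += t;
        if (t > PredictedLocalTime(nodes[slowest])) slowest = i;
    }
    const double average = sum / static_cast<double>(nodes.size());
    if (PredictedLocalTime(nodes[worker]) < average) return slowest;
    return worker;
}
```

Applying the rule from the first packet onward, rather than only after local data runs out, is what spreads the load of a potential bottleneck server over the whole job.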

Improvement by up to 30%

[Plots: TPacketizer, TAdaptivePacketizer]

Scaling comparison for randomly distributed data set

Resource scheduling

• Motivation
• Central scheduler
  – Model
  – Interface
• Priority based scheduling on worker nodes

Why scheduling?

• Controlling resources and how they are used
• Improving efficiency
  – assigning to a job those nodes that have data which needs to be analyzed.
• Implementing different scheduling policies
  – e.g. fair share, group priorities & quotas
• Efficient use even in case of congestion

PROOF specific requirements

• Interactive system
  – Jobs should be processed as soon as submitted.
  – However, when max. system throughput is reached some jobs have to be postponed
• I/O bound jobs use more resources at the start and less at the end (file distribution)
• Try to process data locally
• User defines a dataset, not the # of workers
• Possibility to remove/add workers during a job

Starting a query with a central scheduler (planned)

[Diagram. Labels: Client, Master, Dataset Lookup, External Scheduler, Cluster status, User priority, history, job, packetizer, Start workers]

Plans

• Interface for scheduling "per job"
  – Special functionality will allow changing the set of nodes during a session without losing user libraries and other settings
• Removing workers during a job
• Integration with a scheduler
  – Maui, LSF?

Priority based scheduling on nodes

• Priority-based worker level load balancing
  – Simple and solid implementation, no central unit
  – Group priorities defined in the configuration file
• Performed on each worker node independently
• Lower priority processes slow down
  – sleep before next packet request
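
The "sleep before next packet request" idea can be sketched as follows. The slides do not give the formula PROOF uses, so the rule here is an assumption chosen for illustration: a process whose group priority is a fraction p of the highest priority sleeps long enough that it is busy only that fraction of the time. `SleepBeforeNextRequest` and its parameters are hypothetical names.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical rule: after spending `lastPacketTime` on a packet, a process
// with priority `priority` (0 < priority <= maxPriority) sleeps so that its
// busy fraction is roughly priority / maxPriority.
std::chrono::milliseconds SleepBeforeNextRequest(
        std::chrono::milliseconds lastPacketTime,
        double priority, double maxPriority) {
    assert(priority > 0.0 && priority <= maxPriority);
    const double factor = maxPriority / priority - 1.0;  // 0 for top priority
    const double sleepMs = static_cast<double>(lastPacketTime.count()) * factor;
    return std::chrono::milliseconds(static_cast<long long>(sleepMs));
}
```

A worker loop would call `std::this_thread::sleep_for` with the returned duration before asking the packetizer for its next packet. Since each worker computes this locally from the configured group priority, no central unit is involved.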

Summary

• The adaptive packetizer is working very well in the current environment. It will be further tuned after introducing the scheduler.
• Advanced work on the PROOF interface to the scheduler.
• Priority-based scheduling on nodes is being tested.

The PROOF Team

• Maarten Ballintijn
• Bertrand Bellenot
• Rene Brun
• Gerardo Ganis
• Jan Iwaszkiewicz
• Andreas Peters
• Fons Rademakers

http://root.cern.ch