Super Scaling PROOF to very large clusters Maarten Ballintijn, Kris Gulbrandsen, Gunther Roland /...

Super Scaling PROOF to very large clusters

Maarten Ballintijn, Kris Gulbrandsen, Gunther Roland / MIT

Rene Brun, Fons Rademakers / CERN

Philippe Canal / FNAL

CHEP 2004

September 2004

Outline
- PROOF Overview
- Benchmark Package
- Benchmark results
- Other developments
- Future plans


PROOF – Parallel ROOT Facility

- Interactive analysis of very large sets of ROOT data files on a cluster of computers
- Employ inherent parallelism in event data
- The main design goals are:
  - Transparency, scalability, adaptability
- On the GRID, extended from local cluster to wide area virtual cluster or cluster of clusters
- Collaboration between ROOT group at CERN and MIT Heavy Ion Group


PROOF, continued

- Multi Tier architecture
- Optimize for Data Locality
- WAN Ready and GRID compatible

[Diagram labels: User, Internet, Master, four Slaves]


PROOF - Architecture

- Data Access Strategies
  - Local data first, also rootd, rfio, SAN/NAS
- Transparency
  - Input objects copied from client
  - Output objects merged, returned to client
- Scalability and Adaptability
  - Vary packet size (specific workload, slave performance, dynamic load)
  - Heterogeneous Servers
  - Migrate to multi site configurations
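
The idea of varying the packet size with slave performance can be illustrated with a minimal sketch. It assumes a hypothetical rule: each slave gets a packet sized to take roughly a fixed target time at the rate measured on its earlier packets. The function name, its parameters, and the clamping bounds are illustrative assumptions, not the actual PROOF packetizer algorithm.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: number of events to put in the next packet for one slave.
//   eventsPerSecond : rate measured on this slave's previous packets (assumed input)
//   targetSeconds   : desired processing time per packet (assumed tuning knob)
//   remaining       : events not yet assigned to any slave
//   minSize/maxSize : illustrative bounds on the packet size
std::int64_t NextPacketSize(double eventsPerSecond, double targetSeconds,
                            std::int64_t remaining,
                            std::int64_t minSize = 100,
                            std::int64_t maxSize = 10000)
{
    if (remaining <= 0) return 0;
    // Fast slaves get larger packets, slow slaves smaller ones.
    std::int64_t size = static_cast<std::int64_t>(eventsPerSecond * targetSeconds);
    size = std::max(minSize, std::min(size, maxSize));
    // Never hand out more events than are left.
    return std::min(size, remaining);
}
```

With these assumed defaults, a slave measured at 1000 events/s and a 2 s target would receive 2000 events per packet, while a much slower slave would fall back to the minimum size.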


Outline
- PROOF Overview
- Benchmark Package
  - Dataset generation
  - Benchmark TSelector
  - Statistics and Event Trace
- Benchmark results
- Other developments
- Future plans


Dataset generation

- Use the ROOT "Event" example class
- Script for creating PAR file is provided
- Generate data on all nodes with slaves
- Slaves generate data files in parallel
- Specify location, size and number of files

    % make_event_par.sh
    % root
    root[0] gROOT->Proof()
    root[1] .X make_event_trees.C("/tmp/data",100000,4)
    root[2] .L make_tdset.C
    root[3] TDSet *d = make_tdset()


Benchmark TSelector

Three selectors are used:

- EventTree_NoProc.C – Empty Process() function, reads no data
- EventTree_Proc.C – Reads all data and fills histogram (actually only 35% read in this test)
- EventTree_ProcOpt.C – Reads a fraction of the data (20%) and fills histogram


Statistics and Event Trace

- Global Histograms to monitor master
  - Number of packets, number of events, processing time, get packet latency; per slave
  - Can be viewed using standard feedback
- Trace Tree, detailed log of events during query
  - Master only or Master and Slave
  - Detailed List of recorded events follows
- Implemented using standard ROOT classes and PROOF facilities


Events recorded in Trace

Each event contains a timestamp and the recording slave or master

- Begin and End of Query
- Begin and End of File
- Packet details and processing time
- File Open statistics (slaves)
- File Read statistics (slaves)
- Easy to add new events
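
To make the shape of such a trace concrete, here is a minimal sketch of one trace record and a query over it. The struct layout, the enum names, and the helper `TotalPacketTime` are assumptions made for illustration; the real trace is stored in a ROOT tree whose layout is not given on the slides.

```cpp
#include <string>
#include <vector>

// Hypothetical event kinds, mirroring the list of recorded events.
enum class TraceKind {
    QueryBegin, QueryEnd, FileBegin, FileEnd, Packet, FileOpen, FileRead
};

// Hypothetical trace record: every event carries a timestamp and its recorder.
struct TraceEvent {
    double      timestamp;  // seconds since start of query (assumed unit)
    std::string node;       // name of the recording slave or master
    TraceKind   kind;       // what happened
    double      procTime;   // processing time in seconds, meaningful for Packet only
};

// Sum the packet processing time recorded by one node.
double TotalPacketTime(const std::vector<TraceEvent>& trace, const std::string& node)
{
    double total = 0.0;
    for (const TraceEvent& ev : trace) {
        if (ev.kind == TraceKind::Packet && ev.node == node) {
            total += ev.procTime;
        }
    }
    return total;
}
```

Adding a new event type in this sketch means adding one enumerator, which reflects the "easy to add new events" point above.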


Outline

- PROOF Overview
- Benchmark Package
- Benchmark results
- Other developments
- Future plans


Benchmark Results

- CDF cluster at Fermilab
  - 160 nodes, initial tests
- Pharm, Phobos private cluster, 24 nodes
  - 6, 730 MHz P3 dual
  - 6, 930 MHz P3 dual
  - 12, 1.8 GHz P4 dual
- Dataset: 1 file per slave, 60000 events, 100 MB


Results on Pharm


Results on Pharm, continued


Local and remote File open

[Figure panel labels: Local; local, remote]


Slave I/O Performance


Benchmark Results

- Phobos-RCF, central facility at BNL, 370 nodes total
  - 75, 3.05 GHz P4 dual, IDE
  - 99, 2.4 GHz P4 dual, IDE
  - 18, 1.4 GHz P3 dual, IDE
- Dataset: 1 file per slave, 60000 events, 100 MB


PHOBOS RCF LAN Layout


Results on Phobos-RCF


Looking at the problem


Processing time distributions


Processing time, detailed


Request packet from Master


Benchmark Conclusions

- The benchmark and measurement facility has proven to be a very useful tool
- Don't use NFS based home directories
- LAN topology is important
- LAN speed is important
- More testing is required to pinpoint sporadic long latency


Other developments

- Packetizer fixes and new dev version
- PROOF Parallel startup
- TDrawFeedback
- TParameter utility class
- TCondor improvements
- Authentication improvements
- Long64_t introduction


Future plans

- Understand and Solve LAN latency problem
- In prototype stage
  - TProof::Draw()
  - Multi level master configuration
- Documentation
  - HowTo
  - Benchmarking
- PEAC PROOF Grid scheduler


The End

Questions?


Parallel Script Execution

[Diagram labels: Local PC running root with ana.C; Remote PROOF Cluster; proof = master server; proof = slave server on node1, node2, node3 and node4; *.root files on each node; TFile; TNetFile; stdout/obj returned to the Local PC]

Local script execution:

    $ root
    root [0] .x ana.C

Local tree processing, then the same selector on the remote cluster:

    $ root
    root [0] tree->Process("ana.C")
    root [1] gROOT->Proof("remote")
    root [2] dset->Process("ana.C")

Cluster configuration:

    #proof.conf
    slave node1
    slave node2
    slave node3
    slave node4

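Simplified message flow

Participants: Client, Master, Slave(s)

    Client   -> Master   : SendFile
    Master   -> Slave(s) : SendFile
    Client   -> Master   : Process(dset,sel,inp,num,first)
    Master   -> Slave(s) : GetEntries
    Master   -> Slave(s) : Process(dset,sel,inp,num,first)
    Slave(s) -> Master   : GetPacket
    Slave(s) -> Master   : ReturnResults(out,log)
    Master   -> Client   : ReturnResults(out,log)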
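
The GetPacket step is a pull model: an idle slave asks the master for work until none is left. The following is a minimal single-process sketch of that loop. The `Packet`, `Master`, and `RunSlave` names, the fixed packet size, and the absence of any networking are all simplifying assumptions; it does not reproduce the actual PROOF classes or protocol.

```cpp
#include <algorithm>
#include <cstdint>

// A range of entries handed to a slave; num == 0 means "no more work".
struct Packet {
    std::int64_t first;
    std::int64_t num;
};

// Hypothetical master-side state for one query over [first, first + num).
// packetSize is assumed to be positive.
class Master {
public:
    Master(std::int64_t first, std::int64_t num, std::int64_t packetSize)
        : fNext(first), fEnd(first + num), fPacketSize(packetSize) {}

    // Called by a slave whenever it is idle (the GetPacket step).
    Packet GetPacket() {
        std::int64_t n = std::min(fPacketSize, fEnd - fNext);
        Packet p{fNext, n};
        fNext += n;
        return p;
    }

private:
    std::int64_t fNext;        // first entry not yet handed out
    std::int64_t fEnd;         // one past the last entry of the query
    std::int64_t fPacketSize;  // fixed packet size (a simplification)
};

// One slave's loop: request packets until the master has none left.
// Returns the number of entries this slave processed.
std::int64_t RunSlave(Master& master) {
    std::int64_t processed = 0;
    for (Packet p = master.GetPacket(); p.num > 0; p = master.GetPacket()) {
        processed += p.num;  // stand-in for running the selector on each entry
    }
    return processed;
}
```

Because slaves pull work rather than receive a fixed share up front, a faster slave simply comes back for more packets, which is what lets the scheme tolerate heterogeneous servers.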


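TSelector control flow

Participants: TProof (client-side TSelector), Slave(s) (slave-side TSelector)

    TProof   : Begin()
    TProof   -> Slave(s) : Send Input Objects
    Slave(s) : SlaveBegin()
    Slave(s) : Process()
    Slave(s) : Process() ...
    Slave(s) : SlaveTerminate()
    Slave(s) -> TProof   : Return Output Objects
    TProof   : Terminate()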
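
The call order above, together with the merging of output objects, can be sketched in plain C++. `ToySelector` and `RunQuery` are hypothetical stand-ins that only mimic the method names and ordering; this is not ROOT's TSelector class, the slaves are simulated one after another rather than in parallel, and the 4-bin histogram is an arbitrary example output.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a selector: same method names and call order, nothing else.
struct ToySelector {
    std::vector<int>          hist;  // output object: a 4-bin histogram
    std::vector<std::string>* log;   // records the order of lifecycle calls

    void Begin()          { log->push_back("Begin"); }
    void SlaveBegin()     { hist.assign(4, 0); log->push_back("SlaveBegin"); }
    void Process(long entry) { ++hist[static_cast<std::size_t>(entry % 4)]; }
    void SlaveTerminate() { log->push_back("SlaveTerminate"); }
    void Terminate()      { log->push_back("Terminate"); }
};

// Run one query on nSlaves simulated slaves and return the merged histogram.
std::vector<int> RunQuery(int nSlaves, long entriesPerSlave,
                          std::vector<std::string>& log)
{
    ToySelector client{{}, &log};
    client.Begin();                          // runs on the client

    std::vector<int> merged(4, 0);
    for (int s = 0; s < nSlaves; ++s) {
        ToySelector slave{{}, &log};         // input objects would be sent here
        slave.SlaveBegin();
        for (long e = 0; e < entriesPerSlave; ++e) {
            slave.Process(e);                // once per entry
        }
        slave.SlaveTerminate();
        // Output objects are returned and merged bin by bin.
        for (std::size_t b = 0; b < merged.size(); ++b) {
            merged[b] += slave.hist[b];
        }
    }

    client.Terminate();                      // runs on the client, after merging
    return merged;
}
```

Histograms merge by simple addition, which is why filling them independently on each slave and combining them at the end gives the same result as a single sequential pass.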


PEAC System Overview


Active Files during Query


Pharm Slave I/O


Active Files during Query
