Summary Session I
René Brun
ACAT05, 27 May 2005
Outline
• Data Analysis, Data Acquisition and Tools: 6
• GRID Deployment: 4
• Applications on the GRID: 5
• High Speed Computing: 4
19 presentations in total
Data Analysis, Acquisition, Tools
• Evolution of the BaBar configuration database design
• DAQ software for the SND detector
• Interactive analysis environment of Unified Accelerator Libraries
• DaqProVis, a toolkit for acquisition, analysis, visualisation
• The Graphics Editor in ROOT
• Parallel interactive and batch HEP data analysis with PROOF
Evolution of the Configuration Database Design
Andrei Salnikov, SLAC, for the BaBar Computing Group
ACAT05 – DESY, Zeuthen
BaBar database migration
• BaBar was using the Objectivity/DB ODBMS for many of its databases
• The migration of the event store from Objectivity to ROOT started about two years ago and proved a success and an improvement
• No reason to keep pricey Objectivity only because of “secondary” databases
• Migration effort started in 2004 for conditions, configuration, prompt reconstruction, and ambient databases
Configuration database API
• Main problem of the old database: the API exposed too much of the implementation technology (persistent objects, handles, class names, etc.)
• The API has to change, but we don't want to make the same mistakes again (new mistakes are more interesting)
• A pure transient-level abstract API, independent of any specific implementation technology
• Always make abstract APIs to avoid problems in the future (this may be hard and need a few iterations)
• Client code should be free of any specific database implementation details
• Early prototyping could answer a lot of questions, but five years of experience count too
• Use different implementations for clients with different requirements
• The implementation would benefit from features currently missing in C++: reflection, introspection (or from a completely new language)
A minimal sketch of such an abstract API is given below.
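As an illustration only (class and method names are invented for this summary, not taken from the BaBar code), a transient-level abstract API of the kind described above might look like this in C++:

// Hypothetical sketch of a transient-level configuration API; names are
// illustrative, not the actual BaBar interface. Clients see only plain
// transient objects; the storage technology stays hidden behind the interface.
#include <memory>
#include <string>
#include <utility>

class ConfigData {                         // plain transient value object
public:
  explicit ConfigData(std::string payload) : payload_(std::move(payload)) {}
  const std::string& payload() const { return payload_; }
private:
  std::string payload_;
};

class ConfigDb {                           // abstract API: no persistent types exposed
public:
  virtual ~ConfigDb() {}
  virtual std::unique_ptr<ConfigData> get(const std::string& key) const = 0;
  virtual void put(const std::string& key, const ConfigData& data) = 0;
};

// A ROOT-file or relational implementation would derive from ConfigDb;
// client code is written only against the abstract interface.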
DAQ software for the SND detector
M. Achasov, A. Bogdanchikov, A. Kim, A. Korol (Budker Institute of Nuclear Physics, Novosibirsk)
Main data flow
[Data-flow diagram: readout and event building (1 kHz, 4 KB events) -> event packing (1 kHz, 4 KB -> 1 KB) -> event filtering (1 kHz -> 100 Hz) -> storage of filtered events (100 Hz, 1 KB)]
Expected rates:
• Event fragments: 4 MB/s are read from the IO processors over Ethernet
• Event building: 4 MB/s
• Event packing: 1 MB/s
• Event filtering (90% screening): 100 KB/s
A quick consistency check of these numbers is given below.
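As a quick consistency check (assuming the event sizes and rates quoted above), the throughput figures follow directly from rate times event size:
  1 kHz x 4 KB  = 4 MB/s    (readout and event building)
  1 kHz x 1 KB  = 1 MB/s    (after packing)
  100 Hz x 1 KB = 100 KB/s  (after 90% filtering: 1 kHz -> 100 Hz)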
DAQ architecture
[Architecture diagram: detector -> front-end electronics (KLUKVA and CAMAC crates, x12 / x16) -> readout & event building -> buffer -> TLT computers -> storage of filtered events and backup; supporting components: calibration process, database, visualization, off-line processing, system support]
Interactive Analysis Environment of Unified Accelerator Libraries
V. Fine, N. Malitsky, R. Talman
Abstract
Unified Accelerator Libraries (UAL, http://www.ual.bnl.gov) software is an open accelerator simulation environment addressing a broad spectrum of accelerator tasks ranging from online-oriented efficient models to full-scale realistic beam dynamics studies. The paper introduces a new package integrating UAL simulation algorithms with the Qt-based Graphical User Interface and an open collection of analysis and visualization components. The primary user application is implemented as an interactive and configurable Accelerator Physics Player whose extensibility is provided by plug-in architecture. Its interface to data analysis and visualization modules is based on the Qt layer (http://root.bnl.gov) developed and supported by the STAR experiment. The present version embodies the ROOT (http://root.cern.ch) data analysis framework and Coin 3D (http://www.coin3d.org) graphics library.
Accelerator Physics Player
• An open collection of algorithms
• An open collection of viewers
UAL::USPAS::BasicPlayer* player = new UAL::USPAS::BasicPlayer();
player->setShell(&shell);
qApp.setMainWidget(player);
player->show();
qApp.exec();
Examples of the Accelerator-Specific Viewers
• Bunch 2D distributions (based on ROOT TH2F)
• Turn-by-turn BPM data (based on ROOT TH2F or TGraph)
• Twiss plots (based on ROOT TGraph)
• Bunch 3D distributions (based on Coin 3D)
Parallel Interactive and Batch HEP-Data Analysis
with PROOF
Maarten Ballintijn*, Marek Biskup**, Rene Brun**, Philippe Canal***,
Derek Feichtinger****, Gerardo Ganis**, Guenter Kickinger**, Andreas Peters**,
Fons Rademakers**
* - MIT ** - CERN *** - FNAL **** - PSI
ROOT Analysis Model
• Standard model: files are analyzed on a local computer
• Remote data are accessed via a remote file server (rootd/xrootd) or mass-storage front-ends (dCache, Castor, RFIO, Chirp)
To the analysis code, local and remote files look the same, as in the sketch below.
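A minimal ROOT macro sketch of this point (the server name, path and tree name below are placeholders, not taken from the talk):

// Illustrative only: TFile::Open returns the appropriate TFile subclass for the URL.
TFile *local  = TFile::Open("events.root");
TFile *remote = TFile::Open("root://myserver.example.org//data/events.root");
if (remote && !remote->IsZombie()) {
   TTree *tree = (TTree *) remote->Get("T");   // assumes a tree named "T" in the file
   // ... analyze the tree exactly as in the local case ...
}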
PROOF Basic Architecture
[Diagram: the client sends commands and scripts to the master; the master distributes the work and the files among the slaves; histograms and plots flow back to the client]
• Single-cluster mode: the master divides the work among the slaves
• After the processing finishes, the master merges the results (histograms, scatter plots) and returns them to the client
PROOF and Selectors
• The entry loop is not under the user's control!
• Many trees are processed
• Each slave is initialized
• The code is shipped to each slave, and SlaveBegin(), Init(), Process(), SlaveTerminate() are executed there
• The same code also works without PROOF
A minimal selector skeleton is sketched below.
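A minimal selector skeleton along these lines (class, histogram and branch details are illustrative, not taken from a specific analysis); for use with PROOF the class also needs a dictionary (ClassDef):

#include <TSelector.h>
#include <TTree.h>
#include <TH1F.h>

class MySelector : public TSelector {
public:
   TTree *fChain = nullptr;   // current tree
   TH1F  *fHist  = nullptr;   // filled on each slave, merged by the master

   Int_t  Version() const override { return 2; }   // use the Process(entry) interface
   void   Begin(TTree *) override {}                // runs on the client
   void   SlaveBegin(TTree *) override {            // runs on each slave
      fHist = new TH1F("h", "x", 100, 0., 10.);
      fOutput->Add(fHist);                          // register for merging
   }
   void   Init(TTree *tree) override { fChain = tree; /* set branch addresses here */ }
   Bool_t Process(Long64_t entry) override {        // called for every entry
      // fChain->GetEntry(entry); fHist->Fill(...);
      return kTRUE;
   }
   void   SlaveTerminate() override {}
   void   Terminate() override { /* draw or save the merged fHist on the client */ }
};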
Analysis session snapshot
What we are implementing:
AQ1: a 1 s query produces a local histogram
AQ2: a 10 min query submitted to PROOF1
AQ3->AQ7: short queries
AQ8: a 10 h query submitted to PROOF2
BQ1: browse results of AQ2
BQ2: browse temporary results of AQ8
BQ3->BQ6: submit four 10 min queries to PROOF1
CQ1: browse results of AQ8 and BQ3->BQ6
Monday at 10h15: ROOT session on my laptop
Monday at 16h25: ROOT session on my laptop
Wednesday at 8h40: session on any web browser
ROOT Graphics Editor, by Ilka Antcheva
The ROOT graphics editor can be:
• Embedded: connected only with the canvas in the application window
• Global: has its own application window and can be connected to any canvas created in a ROOT session
A minimal sketch of opening the embedded editor is given below.
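A minimal sketch, assuming the TCanvas::ToggleEditor() method available in ROOT versions that ship this editor (the editor can also be opened from the canvas View menu):

// ROOT macro: bring up the embedded editor pane attached to a canvas.
TCanvas *c = new TCanvas("c", "editor demo", 800, 600);
c->ToggleEditor();   // show/hide the editor for this canvas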
Focus on Users
• Novices (for a short time): theoretical understanding, no practical experience with ROOT; impatient with learning concepts, patient with performing tasks
• Advanced beginners (many people remain at this level): focus on a few tasks and learn more on a need-to-do basis; perform several given tasks well
• Competent performers (fewer than the previous class): know and perform complex tasks that require coordinated actions; interested in solving problems and tracking down errors
• Experts (identified by others): able to find solutions in complex functionality; interested in the theories behind the design; interested in interacting with other expert systems
DaqProVis, by M. Morhac
• DaqProVis: a toolkit for acquisition, interactive analysis, processing and visualization of multidimensional data
• Basic features:
  • DaqProVis is well suited for interactive analysis of multiparameter data from small and medium-sized experiments in nuclear physics
  • The data acquisition part of the system allows one to acquire multiparameter events either directly from the experiment or from a list file, i.e., the system can work in either on-line or off-line acquisition mode
  • In on-line acquisition mode, events can be taken directly from CAMAC crates or from a VME system that cooperates with DaqProVis in client-server mode
  • In off-line acquisition mode the system can analyze event data even from big experiments, e.g. from Gammasphere
  • Event data can also be read from another DaqProVis system; the capability of DaqProVis to work simultaneously in both client and server mode makes it possible to build remote as well as distributed nuclear data acquisition, processing and visualization systems, and thus to create multilevel configurations
DaqProVis (Visualisation)
DaqProVis (continued)
• The DaqProVis and ROOT teams are already cooperating
• Agreement during the workshop to extend this cooperation
GRID deployment
• Towards the operation of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge
• GRID technology in production at DESY
• Grid middleware configuration at the KIPT CMS Linux cluster
• Storage resources management and access at Tier-1 CNAF
Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge
D. Bonacorsi (on behalf of the INFN-CNAF Tier-1 staff and the CMS experiment)
ACAT 2005, X International Workshop on Advanced Computing & Analysis Techniques in Physics Research, May 22nd-27th, 2005, DESY, Zeuthen, Germany
DC04 outcome (grand summary + focus on INFN T1)
• Reconstruction/data-transfer/analysis may run at 25 Hz
• Automatic registration and distribution of data, key role of the TMDB
  • this was the embryonic PhEDEx!
• Support a (reasonable) variety of different data-transfer tools and set-ups
  • Tier-1s: different performances, related to operational choices
  • SRB, LCG Replica Manager and SRM investigated: see the CHEP04 talk
  • INFN T1: good performance of the LCG-2 chain (PIC T1 also)
• Register all data and metadata (POOL) in a world-readable catalogue
  • RLS: good as a global file catalogue, bad as a global metadata catalogue
• Analyze the reconstructed data at the Tier-1s as the data arrive
  • LCG components: dedicated bdII+RB; UIs, CEs+WNs at CNAF and PIC
  • real-time analysis at Tier-2s was demonstrated to be possible
  • ~15k jobs submitted
  • the time window between reco data availability and the start of analysis jobs can be reasonably low (i.e. 20 min)
• Reduce the number of files (i.e. increase <#events>/<#files>) for more efficient use of bandwidth
• Reduce the overhead of commands
• Address the scalability of MSS systems (!)
Learn from DC04 lessons…
• Some general considerations may apply:
  • although a DC is experiment-specific, maybe its conclusions are not
  • an "experiment-specific" problem is better addressed if conceived as a "shared" one in a shared Tier-1
  • an experiment DC just provides hints, real work gives insight
• Crucial role of the experiments at the Tier-1:
  • find weaknesses of the CASTOR MSS system in particular operating conditions
  • stress-test the new LSF farm with official CMS production jobs
  • test DNS-based load balancing by serving data for production and/or analysis from CMS disk servers
  • test new components, newly installed/upgraded Grid tools, etc.
  • find bottlenecks and scalability problems in DB services
  • give feedback on monitoring and accounting activities
  • …
PhEDEx at INFN
• INFN-CNAF is a T1 'node' in PhEDEx
  • CMS DC04 experience was crucial to start up PhEDEx in INFN
  • CNAF node operational since the beginning
• First phase (Q3/4 2004):
  • agent code development + focus on operations: T0->T1 transfers
  • >1 TB/day T0->T1 demonstrated feasible
  • … but the aim is not to achieve peaks, but to sustain them in normal operations
• Second phase (Q1 2005): PhEDEx deployment in INFN to Tier-n, n>1
  • "distributed" topology scenario
  • Tier-n agents run at the remote sites, not at the T1: know-how required, T1 support
  • already operational at Legnaro, Pisa, Bari, Bologna
• Third phase (Q>1 2005): many issues, e.g. stability of service, dynamic routing, coupling PhEDEx to the CMS official production system, PhEDEx involvement in SC3 phase II, etc.
• An example of data flow to T2s in daily operations (here: a test with ~2000 files, 90 GB, with no optimization): ~450 Mbps CNAF T1 -> LNL-T2, ~205 Mbps CNAF T1 -> Pisa-T2
Storage resources management and access at TIER1 CNAF
ACAT 2005, May 22-27 2005, DESY Zeuthen, Germany
Ricci Pier Paolo, Lore Giuseppe, Vagnoni Vincenzo on behalf of INFN TIER1 Staff
TIER1 INFN CNAF Storage
[Storage layout diagram. Clients: Linux SL 3.0 (100-1000 nodes) over WAN or TIER1 LAN, served by disk servers (Qlogic FC HBA 2340) via NFS, RFIO, GridFTP and others.
Tape: STK180 with 100 LTO-1 (10 TB native); STK L5500 robot (5500 slots) with 6 IBM LTO-2 and 2 (4) STK 9940B drives, serving the CASTOR HSM servers (HSM ~400 TB); W2003 server with LEGATO Networker for backup.
NAS (~20 TB): PROCOM 3600 FC NAS2 (9000 GB), PROCOM 3600 FC NAS3 (4700 GB), NAS1/NAS4 3ware IDE SAS (1800+3200 GB), accessed via NFS/RFIO.
SAN 1 (~200 TB): IBM FastT900 (DS 4500), 3/4 x 50000 GB, 4 FC interfaces; Infortrend 5 x 6400 GB SATA A16F-R1211-M2 + JBOD; 2 Brocade Silkworm 3900 32-port FC switches.
SAN 2 (~40 TB): AXUS BROWIE (~2200 GB, 2 FC interfaces); STK BladeStore (~25000 GB, 4 FC interfaces); Infortrend 4 x 3200 GB SATA A16F-R1A2-M1; 2 Gadzoox Slingshot 4218 18-port FC switches.
High-availability (H.A.) configuration.]
CASTOR HSM
[CASTOR HSM layout:
• STK L5500 library, 2000+3500 mixed slots; 6 LTO-2 drives (20-30 MB/s), 2 9940B drives (25-30 MB/s); 1300 LTO-2 cartridges (200 GB native), 650 9940B cartridges (200 GB native)
• Sun Blade v100 with 2 internal IDE disks in software RAID-0, running ACSLS 7.0 on Solaris 9.0
• 1 CASTOR (CERN) central services server, RH AS 3.0
• 8 tape servers, Linux RH AS 3.0, Qlogic 2300 HBA, point-to-point FC 2 Gb/s connections
• 6 stagers with disk servers, RH AS 3.0, 15 TB local staging area
• 1 ORACLE 9i rel. 2 DB server, RH AS 3.0
• 8 or more rfio disk servers, RH AS 3.0, min. 20 TB staging area
• SAN 1 and SAN 2 with full redundancy on the FC 2 Gb/s connections (dual-controller hardware and Qlogic SANsurfer path-failover software), connected to the WAN or TIER1 LAN]

Staging areas and tape pools per experiment:
EXPERIMENT           Staging area (TB)   Tape pool (TB native)
ALICE                        8                   12
ATLAS                        6                   20
CMS                          2                   15
LHCb                        18                   30
BABAR, AMS + others          2                    4
DISK access (2)
We have different protocols in production for accessing the disk storage. On our disk servers and Grid SE front-ends we currently have:
1. NFS on a local filesystem. Advantages: easy client implementation, good compatibility, and the possibility of failover (RH 3.0). Disadvantages: poor performance scalability for a high number of accesses (1 client 30 MB/s, 100 clients 15 MB/s throughput).
2. RFIO on a local filesystem. Advantages: good performance, compatibility with Grid tools, and the possibility of failover. Disadvantages: no scalability of front-ends for a single filesystem, no possibility of load balancing.
3. Grid SE GridFTP/rfio over GPFS (CMS, CDF). Advantages: separation of the GPFS servers (accessing the disks) from the SE GPFS clients; load balancing and HA on the GPFS servers, and the possibility to implement the same on the Grid SE services (see next slide). Disadvantages: the GPFS layer imposes requirements on the OS and on certified hardware for support.
4. Xrootd (BaBar). Advantages: good performance. Disadvantages: no possibility of load balancing for the single filesystem backends, not Grid compliant (at present...).
NOTE: IBM GPFS 2.2 is a CLUSTERED FILESYSTEM, so many front-ends (i.e. gridftp or rfio servers) can access the SAME filesystem simultaneously. It can also use bigger filesystem sizes (we use 8-12 TB).
Generic Benchmark (here shown for 1 GB files)

                          WRITE (MB/s)                  READ (MB/s)
# of simultaneous
client processes       1    5   10   50  120       1    5   10   50  120
GPFS 2.3.0-1 native  114  160  151  147  147      85  301  301  305  305
GPFS 2.3.0-1 NFS     102  171  171  159  158     114  320  366  322  292
GPFS 2.3.0-1 RFIO     79  171  158  166  166      79  320  301  320  321
Lustre 1.4.1 native  102  512  512  488  478      73  366  640  453  403
Lustre 1.4.1 RFIO     93  301  320  284  281      68  269  269  314  349

• Numbers are reproducible with small fluctuations
• Lustre tests with NFS export not yet performed
Grid Technology in Production at DESY
Andreas Gellrich*
DESY
ACAT 2005
24 May 2005
*http://www.desy.de/~gellrich/
Grid @ DESY
• With the HERA-II luminosity upgrade, the demand for MC production rapidly increased while the outside collaborators moved their computing resources towards LCG
• The ILC group plans to use Grids for its computing needs
• The LQCD group develops a Data Grid to exchange data
• DESY considers a participation in LHC experiments
• EGEE and D-GRID
• dCache is a DESY / FNAL development
• An LCG-2 Grid infrastructure has been in operation since spring 2004
Grid Infrastructure @ DESY …
• DESY installed (SL3.04, Quattor, yaim) and operates a complete, independent Grid infrastructure which provides generic (non-experiment-specific) Grid services to all experiments and groups
• The DESY Production Grid is based on LCG-2_4_0 and includes:
  • Resource Broker (RB), Information Index (BDII), Proxy (PXY), Replica Location Services (RLS)
  • in total 24 + 17 WNs (48 + 34 = 82 CPUs)
  • a dCache-based SE with access to the entire DESY data space
• VO management for the HERA experiments ('hone', 'herab', 'hermes', 'szeu'), LQCD ('ildg'), ILC ('ilc', 'calice'), astroparticle physics ('baikal', 'icecube')
• Certification services for DESY users in cooperation with GridKa
Grid Middleware Configuration at the KIPT CMS Linux Cluster
S. Zub, L. Levchuk, P. Sorokin, D. Soroka
Kharkov Institute of Physics & Technology, 61108 Kharkov, Ukraine
http://www.kipt.kharkov.ua/, [email protected]
What is our specificity?
• Small PC farm (KCC)
• Small scientific group of 4 physicists, combining their research work with system administration
• Orientation towards CMS tasks
• No commercial software installed
• Security provided by the group itself
• Narrow-bandwidth communication channel
• Limited traffic
Summary
• The enormous data flow expected in the LHC experiments forces the HEP community to resort to Grid technology
• The KCC is a specialized PC farm constructed at the NSC KIPT for computer simulations within the CMS physics program and for preparation of the CMS data analysis
• Further development of the KCC is planned, with a considerable increase of its capacity and deeper integration into the LHC Grid (LCG) structures
• Configuration of the LCG middleware can be troublesome (especially at small farms with poor internet connections), since this software is neither universal nor "complete", and one has to resort to special tips
• Scripts have been developed that facilitate the installation procedure at a small PC farm with narrow internet bandwidth
Applications on the Grid
• The CMS analysis chain in a distributed environment
• Monte Carlo mass production for ZEUS on the Grid
• Metadata services on the Grid
• Performance comparison of the LCG2 and gLite File Catalogues
• Data Grids for Lattice QCD
The CMS analysis chain in a distributed environment
ACAT 2005, DESY, Zeuthen, Germany, 22nd-27th May, 2005
on behalf of the CMS collaboration
Nicola De Filippis
The CMS analysis tools
Overview:
• Data management
  • Data transfer service: PhEDEx
  • Data validation: ValidationTools
  • Data publication service: RefDB/PubDB
• Analysis strategy
  • Distributed software installation: XCMSI
  • Analysis job submission tool: CRAB
• Job monitoring
  • System monitoring: BOSS
  • Application job monitoring: JAM
The end-user analysis workflow
[Workflow diagram: from the UI, the user provides the dataset (runs, #events, ...) and private code to CRAB, the job submission tool; CRAB queries the dataset catalogue (PubDB/RefDB) and submits jobs to the Resource Broker (RB) / Workload Management System, which dispatches them to Computing Elements and worker nodes (with the CMS software installed via XCMSI) close to the Storage Elements hosting the data]
• CRAB discovers the data and the sites hosting them by querying RefDB/PubDB
• CRAB prepares, splits and submits jobs to the Resource Broker
• The RB sends jobs to sites hosting the data, provided the CMS software is installed there
• CRAB automatically retrieves the output files of the job
Conclusions
• The first CMS working prototype for distributed user analysis is available and used by real users
• PhEDEx, PubDB, ValidationTools, XCMSI, CRAB, BOSS and JAM are under development and deployment, and in production at many sites
• CMS is using the Grid infrastructure for physics analyses and Monte Carlo production
  • tens of users, 10 million analysed events, 10000 jobs submitted
• CMS is designing a new architecture for the analysis workflow
Metadata Services on the GRID
Nuno Santos
ACAT’05 May 25th, 2005
Metadata on the GRID
• Metadata is data about data
• Metadata on the GRID:
  • mainly information about files
  • other information necessary for running jobs
  • usually lives in databases
• A simple interface for metadata access is needed
• Advantages:
  • easier to use by clients: no SQL, only metadata concepts
  • common interface: clients don't have to reinvent the wheel
• Must be integrated in the File Catalogue
• Also suitable for storing information about other resources
A hypothetical sketch of such an interface is given below.
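As a hypothetical sketch (this is not the actual ARDA/gLite API; names and the query syntax are invented to illustrate the "no SQL, only metadata concepts" idea):

#include <map>
#include <string>
#include <vector>

class MetadataCatalogue {
public:
  virtual ~MetadataCatalogue() {}
  // attach key/value attributes to a logical file name
  virtual void setAttributes(const std::string& lfn,
                             const std::map<std::string, std::string>& attrs) = 0;
  // return the logical file names whose attributes satisfy a simple condition,
  // e.g. "run > 1000 and quality == 'good'"
  virtual std::vector<std::string> query(const std::string& condition) const = 0;
};

// Oracle, PostgreSQL or SQLite backends would implement this interface;
// clients never see the underlying SQL.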
ARDA Implementation
• Backends
  • currently: Oracle, PostgreSQL, SQLite
• Two frontends
  • TCP streaming: chosen for performance
  • SOAP: formal requirement of EGEE; compare SOAP with TCP streaming
• Also implemented as a standalone Python library
  • data stored on the filesystem
[Architecture diagram: clients (including a Python interpreter using the Metadata Python API on a local filesystem) talk to the Metadata Server (MDServer) over SOAP or TCP streaming; the server uses an Oracle, PostgreSQL or SQLite backend]
SOAP toolkits performance
• Test of communication performance
  • no work done on the backend
  • switched 100 Mbit/s LAN
• Language comparison
  • TCP streaming shows similar performance in all languages
  • SOAP performance varies strongly with the toolkit
• Protocol comparison
  • keepalive improves performance significantly
  • in Java and Python, SOAP is several times slower than TCP streaming
A generic sketch of such a round-trip timing loop is given after the plot summary below.
[Plot: execution time in seconds (0-25 s) for 1000 pings, comparing C++ (gSOAP), Java (Axis) and Python (ZSI), each with TCP streaming and gSOAP/SOAP, with and without keepalive (KA)]
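As a generic illustration of this kind of measurement (not the ARDA test code; host, port and message are placeholders), a bare TCP round-trip timing loop against an already-running echo server could look like this in C++:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <chrono>
#include <iostream>

int main() {
  const char *host = "127.0.0.1";   // placeholder echo-server address
  const int   port = 9000;          // placeholder port
  const int   n    = 1000;          // number of pings, as in the test above

  int fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port   = htons(port);
  inet_pton(AF_INET, host, &addr.sin_addr);
  if (connect(fd, (sockaddr *)&addr, sizeof(addr)) != 0) {
    std::cerr << "connect failed\n";
    return 1;
  }

  char msg[] = "ping", buf[16];
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < n; ++i) {
    write(fd, msg, sizeof(msg));    // one small request ...
    read(fd, buf, sizeof(buf));     // ... then wait for the echoed reply
  }
  auto stop = std::chrono::steady_clock::now();
  std::cout << n << " pings took "
            << std::chrono::duration<double>(stop - start).count() << " s\n";
  close(fd);
  return 0;
}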
High Speed Computing
• InfiniBand
• Analysis of SCTP- and TCP-based communication in high-speed clusters
• The apeNEXT project
• Optimisation of Lattice QCD codes for the Opteron processor
A. Heiss, U. Schwickerath
InfiniBand: Experiences at Forschungszentrum Karlsruhe
Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft
Credits: Inge Bischoff-Gauss, Marc García Martí, Bruno Hoeft, Carsten Urbach
Outline: InfiniBand overview; hardware setup at IWR; HPC applications (MPI performance, lattice QCD, LM); HTC applications (rfio, xrootd)
Lattice QCD Benchmark: GE vs. InfiniBand
• Memory- and communication-intensive application
• Benchmark by C. Urbach; see also the CHEP04 talk given by A. Heiss
• Significant speedup by using InfiniBand
• Thanks to Carsten Urbach, FU Berlin and DESY Zeuthen
RFIO/IB point-to-point file transfers (64-bit)
• RFIO/IB: see ACAT03, NIM A 534 (2004) 130-134
• Notes: PCI-X and PCI-Express throughput; solid lines: file transfers cache -> /dev/null; dashed lines: network + protocol only
• Best results with PCI-Express: > 800 MB/s raw transfer speed, > 400 MB/s file transfer speed
• Disclaimer on PPC64: not an official IBM product, technology prototype (see also slides 5 and 6)
Xrootd and InfiniBand
First preliminary results. Notes:
• IPoIB: dual Opteron V20z, Mellanox Gold drivers, SM on InfiniCon 9100, same nodes as for GE
• Native IB: proof-of-concept version based on Mellanox VAPI using IB_SEND, dedicated send/recv buffers, same nodes as above
• 10GE: IBM xSeries 345 nodes, 32-bit Xeon, single CPU, 1 and 2 GB RAM, 2.66 GHz clock speed, Intel PRO/10GbE LR cards, used for long-distance tests
TCP vs. SCTP in a high-speed cluster environment
Miklos Kozlovszky, Budapest University of Technology and Economics (BUTE)
TCP vs. SCTP

TCP                                      SCTP
Byte-stream oriented                     Message oriented
3-way handshake connection init         4-way handshake connection init (cookie)
Old (more than 20 years)                 Quite new (2000-)
                                         Multihoming
                                         Path-MTU discovery

Both:
• IPv4 & IPv6 compatible
• Reliable
• Connection oriented
• Offer acknowledged, error-free, non-duplicated transfer
• Almost the same flow and congestion control

A minimal sketch of creating TCP and SCTP sockets is given below.
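A minimal sketch of the API difference, assuming a Linux host with kernel SCTP support (lksctp); in the one-to-one style only the protocol argument of socket() changes:

#include <netinet/in.h>
#include <sys/socket.h>

int main() {
  int tcp_fd  = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);   // classic TCP socket
  int sctp_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);  // SCTP, one-to-one style
  // bind()/connect()/send()/recv() work on both; SCTP additionally offers
  // message boundaries and multihoming (configured via setsockopt()).
  return (tcp_fd >= 0 && sctp_fd >= 0) ? 0 : 1;
}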
Summary
• SCTP inherited all the "good features" of TCP
• SCTP aims to behave like a next-generation TCP
• It is more secure than TCP and has many attractive features (e.g. multihoming)
• Theoretically it can work better than TCP, but TCP is faster for now (SCTP implementations are still "poor")
• Well standardized, and can be useful for clusters
My Impressions
Concerns
• Only a small fraction of the Session I talks correspond to the original spirit of the AIHEP/ACAT Session I
• In particular, many of the GRID talks about deployment and infrastructure should be given at CHEP, not here
• The large LHC collaborations have their own "ACAT" a few times per year
• The huge experiment software frameworks do not encourage cross-experiment discussions or tools
• For the next ACAT, the key people involved in the big experiments should work together to encourage more talks or reviews
Positive aspects
• ACAT continues to be a good opportunity to meet other cultures; innovation may come from small groups or from non-HENP fields
• Contacts (even sporadic) with Session III or the plenary talks are very beneficial, in particular for young people
The Captain of Köpenick
• Question to the audience: is Friedrich Wilhelm Voigt (the Captain of Köpenick) an ancestor of Voigt, the father of the Voigt function?