Running the multi-platform, multi-experiment cluster at CCIN2P3


Transcript of Running the multi-platform, multi-experiment cluster at CCIN2P3

Page 1: Running the multi-platform, multi-experiment cluster at CCIN2P3

Running the multi-platform, multi-experiment cluster at CCIN2P3

Wojciech A. Wojcik

IN2P3 Computing Center

e-mail: [email protected] URL: http://webcc.in2p3.fr

Page 2: Running the multi-platform, multi-experiment cluster at CCIN2P3

IN2P3 Computer Center

Provides the computing and data services for the French high energy and nuclear physicists:
• IN2P3 – 18 physics labs (in all big towns in France)
• CEA/DAPNIA

French groups are involved in 35 experiments at CERN, SLAC, FNAL, BNL, DESY and other sites (also astrophysics).

Specific situation: our CC is not directly attached to an experimental facility such as CERN, FNAL, SLAC, DESY or BNL.

Page 3: Running the multi-platform, multi-experiment cluster at CCIN2P3

General rules

All groups/experiments share the same interactive and batch (BQS) clusters and the other types of services (disk servers, tapes, HPSS and networking). Some exceptions later…

/usr/bin and /usr/lib (OS and compilers) are local.
/usr/local/* is on AFS, specific to each platform.
/scratch is local temporary disk space.

System, group and user profiles define the proper environment.

Page 4: Running the multi-platform, multi-experiment cluster at CCIN2P3

General rules

Each user has an AFS account with access to the following AFS disk spaces:
• HOME – backed up by CC
• THRONG_DIR (up to 2 GB) – backed up by CC
• GROUP_DIR (n * 2 GB) – no backup

Data are on: disks (GROUP_DIR, Objectivity), tapes (xtage system) or in HPSS.

Data exchange on the following media:
• DLT, 9480
• Network (bbftp)

ssh/ssf is the recommended access to/from external domains.

Page 5: Running the multi-platform, multi-experiment cluster at CCIN2P3

Supported platforms

1. Linux (RedHat 6.1, kernel 2.2.17-14smp) with different egcs/gcc compilers (gcc 2.91.66, gcc 2.91.66 with a patch for Objy 5.2, gcc 2.95.2 – installed under /usr/local), as requested by different experiments

2. Solaris 2.6, 2.7 soon

3. AIX 4.3.2

4. HP-UX 10.20 – end of this service already announced

Page 6: Running the multi-platform, multi-experiment cluster at CCIN2P3

Support for experiments

About 35 different High Energy, Astrophysics and Nuclear Physics experiments.

LHC experiments: CMS, Atlas, Alice and LHCb.

Big non-CERN experiments: BaBar, D0, STAR, PHENIX, AUGER, EROS II.

Page 7: Running the multi-platform, multi-experiment cluster at CCIN2P3

(Figure slide – no transcribed text.)

Page 8: Running the multi-platform, multi-experiment cluster at CCIN2P3

(Figure slide – no transcribed text.)

Page 9: Running the multi-platform, multi-experiment cluster at CCIN2P3

Disk space

Need to make the disk storage independent of the operating system.

Disk servers based on:
• A3500 from Sun with 3.4 TB

• VSS from IBM with 2.2 TB

• ESS from IBM with 7.2 TB

• 9960 from Hitachi with 21.0 TB
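In total this amounts to roughly 3.4 + 2.2 + 7.2 + 21.0 ≈ 33.8 TB of disk behind the servers listed above.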

Page 10: Running the multi-platform, multi-experiment cluster at CCIN2P3

Mass storage

Supported media (all in the STK robots):
• 3490

• DLT4000/7000

• 9840 (Eagles)

• Limited support for Redwood

HPSS – local developments:
• Interface with RFIO (a minimal C sketch follows after this list):
  – API: C, Fortran (via cfio from CERNLIB)
  – API: C++ (iostream)
• bbftp – secure parallel ftp using the RFIO interface
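To make the RFIO C API concrete, here is a minimal sketch of reading a remote file from C. It is only an illustration: the header name varies between RFIO releases, and the host and file names are hypothetical.

```c
/* Minimal sketch of reading a file through the RFIO C API.
 * Assumes the RFIO client library is installed; the header below may be
 * shipped as <shift.h> on older installations, and the path
 * "hpsshost:/hpss/in2p3.fr/group/file.dat" is purely hypothetical. */
#include <stdio.h>
#include <fcntl.h>
#include "rfio_api.h"

int main(void)
{
    char buf[8192];
    int  n;

    /* rfio_open accepts the "host:path" remote syntax as well as local paths */
    int fd = rfio_open("hpsshost:/hpss/in2p3.fr/group/file.dat", O_RDONLY, 0);
    if (fd < 0) {
        rfio_perror("rfio_open");   /* prints the RFIO error text */
        return 1;
    }
    while ((n = rfio_read(fd, buf, sizeof(buf))) > 0) {
        /* process n bytes here */
    }
    rfio_close(fd);
    return 0;
}
```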

Page 11: Running the multi-platform, multi-experiment cluster at CCIN2P3

Mass storage

HPSS – test and production services:
• $HPSS_TEST_SERVER:/hpsstest/in2p3.fr/…
• $HPSS_SERVER:/hpss/in2p3.fr/… (see the path-building sketch at the end of this page)

HPSS – usage:

• BaBar - usage via ams/oofs and RFIO

• EROS II – already 1.6 TB in HPSS

• AUGER, D0, ATLAS, LHCb

• Other experiments in test: SNovae, DELPHI, ALICE, PHENIX, CMS
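To show how a job could address the test or the production name space using the two variables above, here is a hedged sketch; it assumes HPSS_SERVER and HPSS_TEST_SERVER are environment variables holding the server names, and the group and file names are hypothetical.

```c
/* Sketch: build an HPSS/RFIO-style "server:path" string from the environment,
 * falling back to the test service when the production variable is unset.
 * HPSS_SERVER and HPSS_TEST_SERVER come from the slide; "mygroup/run01.dat"
 * is a hypothetical file name. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *server = getenv("HPSS_SERVER");       /* production name space */
    const char *root   = "/hpss/in2p3.fr";
    if (server == NULL) {
        server = getenv("HPSS_TEST_SERVER");          /* fall back to the test service */
        root   = "/hpsstest/in2p3.fr";
    }
    if (server == NULL) {
        fprintf(stderr, "no HPSS server configured\n");
        return 1;
    }

    char path[512];
    snprintf(path, sizeof(path), "%s:%s/mygroup/run01.dat", server, root);
    printf("would open %s through RFIO\n", path);
    return 0;
}
```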

Page 12: Running the multi-platform, multi-experiment cluster at CCIN2P3

Networking - LAN

Fast Ethernet (100 Mb full duplex) --> to interactive and batch services
Gigabit Ethernet (1 Gb full duplex) --> to disk servers and the Objectivity/DB server

Page 13: Running the multi-platform, multi-experiment cluster at CCIN2P3

Networking - WAN

Academic public network “Renater 2”, based on virtual networking (ATM) with guaranteed bandwidth (VPN on ATM).
• Lyon – CERN at 34 Mb (155 Mb in June 2001; see the rough transfer-time estimate below)
• Lyon – US traffic goes through CERN
• Lyon – ESnet (via STAR TAP), 30-40 Mb, reserved for the traffic to/from ESnet, except FNAL
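For scale (an illustrative estimate, assuming full link utilization and ignoring protocol overhead): moving 1 TB over the 34 Mb/s Lyon–CERN link takes about 8×10^12 b / 3.4×10^7 b/s ≈ 2.4×10^5 s, i.e. roughly 65 hours; at 155 Mb/s this drops to about 14 hours.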

Page 14: Running the multi-platform, multi-experiment cluster at CCIN2P3

BAHIA - interactive front-end

Based on multi-processor machines:
Linux (RedHat 6.1) -> 10 PentiumII 450 + 12 PentiumIII 1 GHz (2 processors)
Solaris 2.6 -> 4 Ultra-4/E450
Solaris 2.7 -> 2 Ultra-4/E450
AIX 4.3.2 -> 6 F40
HP-UX 10.20 -> 7 HP9000/780/J282

Page 15: Running the multi-platform, multi-experiment cluster at CCIN2P3

Batch system - BQS

Batch is based on BQS (a CCIN2P3 product), in constant development and in use for 7 years. It is POSIX compliant and platform independent (portable).

Possibility to define the resources for the job (the class of the job is calculated by the scheduler as a function of):

• CPU time, memory

• CPU bound or I/O bound

• Platform(s)

• System resources: local scratch disk, stdin/out size

• User resources (switches, counters)

Page 16: Running the multi-platform, multi-experiment cluster at CCIN2P3

Batch system - BQS

The scheduler takes into account:
• Targets for groups (declared twice a year for the big production runs)
• Consumption of CPU time over the last periods (month, week, day) per user and group
• Proper aging and interleaving in the class queues

Possibility to open a worker for any combination of classes. (An illustrative fair-share sketch follows below.)
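To illustrate how targets, recent consumption and aging can combine into a job ranking, here is a small fair-share style sketch. It is not the BQS algorithm: the weights, field names and formula are invented for illustration only.

```c
/* Illustrative fair-share style ranking, NOT the actual BQS scheduler.
 * Priority grows with a group's unused target share and with waiting time
 * (aging), and shrinks with recent CPU consumption. All weights and field
 * names are hypothetical. */
#include <stdio.h>

struct job {
    const char *group;
    double target_share;   /* fraction of the farm promised to the group (0..1) */
    double recent_usage;   /* fraction of CPU the group used recently (0..1)    */
    double hours_waiting;  /* time the job has spent in the queue               */
};

static double priority(const struct job *j)
{
    double deficit = j->target_share - j->recent_usage; /* under-served groups rank higher  */
    double aging   = 0.01 * j->hours_waiting;           /* slow, steady boost while waiting */
    return deficit + aging;
}

int main(void)
{
    struct job a = { "babar", 0.30, 0.35, 2.0 };  /* over its share, waited 2 h  */
    struct job b = { "auger", 0.10, 0.02, 8.0 };  /* under its share, waited 8 h */

    printf("%s priority: %.3f\n", a.group, priority(&a));
    printf("%s priority: %.3f\n", b.group, priority(&b));
    return 0;
}
```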

Page 17: Running the multi-platform, multi-experiment cluster at CCIN2P3

Batch system - configuration

Linux (RedHat 6.1) -> 96 dual PIII 750 MHz + 110 dual PIII 1 GHz
Solaris 2.6 -> 25 * Ultra60
Solaris 2.7 -> 2 * Ultra60 (test service)
AIX 4.3.2 -> 29 * RS390 + 20 * 43P-B50
HP-UX 10.20 -> 52 * HP9000/780

Page 18: Running the multi-platform, multi-experiment cluster at CCIN2P3

Batch system – cpu usage
(Chart slide – no transcribed text.)

Page 19: Running the multi-platform, multi-experiment cluster at CCIN2P3

Batch system – Linux cluster
(Figure slide – no transcribed text.)

Page 20: Running the multi-platform, multi-experiment cluster at CCIN2P3

Regional Center for:
• EROS II (Expérience de Recherches d’Objets Sombres par effet de lentilles gravitationnelles)
• BaBar
• Auger (PAO)
• D0

Page 21: Running the multi-platform, multi-experiment cluster at CCIN2P3

EROS II

Raw data (from the ESO site in Chile) arrive on DLTs (tar format).

Restructuring of the data from DLT to 3490 or 9480, creation of metadata in an Oracle DB.

Data server (under development) – currently 7 TB of data, 20 TB at the end of the experiment – using HPSS + a WEB server.

Page 22: Running the multi-platform, multi-experiment cluster at CCIN2P3

BaBar

AIX and HP-UX are not supported by BaBar; Solaris 2.6 with Workshop 4.2 and Linux (RedHat 6.1) are used. Solaris 2.7 is in preparation.

Data are stored in Objectivity/DB; import/export of data is done using bbftp. Import/export on tapes has been abandoned.

Objectivity (ams/oofs) servers (dedicated only to BaBar) have been installed (10 servers).

Usage of HPSS for staging the ObjectivityDB files.

Page 23: Running the multi-platform, multi-experiment cluster at CCIN2P3

Experiment PAO
(Figure slide – no transcribed text.)

Page 24: Running the multi-platform, multi-experiment cluster at CCIN2P3

PAO - sites
(Figure slide – no transcribed text.)

Page 25: Running the multi-platform, multi-experiment cluster at CCIN2P3

PAO - AUGER

CCIN2P3 is acting as AECC (AUGER European CC).

Access granted to all AUGER users (AFS accounts provided).

A CVS repository for the AUGER software has been installed at CCIN2P3, with access from AFS (from the local and non-local cells) and from non-AFS environments using ssh.

Linux is the preferred platform. The simulation software is based on Fortran programs.

Page 26: Running the multi-platform, multi-experiment cluster at CCIN2P3

D0

Linux is one of the D0 supported platforms and is available at CCIN2P3.

D0 software uses the KAI C++ compiler.

Import/export of D0 data (using the internal Enstore format) is complicated work. We will try to use bbftp as the file transfer program.

Page 27: Running the multi-platform, multi-experiment cluster at CCIN2P3

Import/export

(Diagram slide: data import/export between the CCIN2P3 HPSS and the mass-storage systems of the other labs – CERN (CASTOR), SLAC (HPSS), FNAL (ENSTORE/SAM), BNL (HPSS) – plus further, still undetermined sites marked “?”.)

Page 28: Running the multi-platform, multi-experiment cluster at CCIN2P3

Problems

Adding new Objy servers (for other experiments) is very complicated: it needs new separate machines, with modified port numbers in /etc/services. Under development for CMS. (A purely hypothetical /etc/services fragment is sketched below.)
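To illustrate why each additional Objectivity server currently implies editing /etc/services on a dedicated machine, here is a purely hypothetical fragment; the service names and port numbers are invented for illustration and are not the real CCIN2P3 or Objectivity assignments.

```
# Hypothetical /etc/services entries, one server entry per experiment.
# Names and ports are invented for illustration only.
ams_babar    6780/tcp    # Objectivity data server for BaBar
ams_cms      6781/tcp    # Objectivity data server for CMS (separate machine, separate port)
lock_babar   6790/tcp    # lock server for the BaBar federation
lock_cms     6791/tcp    # lock server for the CMS federation
```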

• The OS versions and levels
• The compiler versions (mainly for Objy) for different experiments

Solutions?

Page 29: Running the multi-platform, multi-experiment cluster at CCIN2P3

Conclusions

Data exchange should be done using standards (e.g. files or tapes) and common access interfaces (bbftp and RFIO are good examples).

There is a need for better coordination and similar requirements on supported system and compiler levels between experiments.

The choice of CASE technology is out of the control of our CC acting as a Regional Computer Center.

GRID will require more uniform configuration of the distributed elements.

Who can help? HEPCCC? HEPiX? GRID?