Site Report: The Linux Farm at the RCF HEPIX-HEPNT October 22-25, 2002 Ofer Rind RHIC Computing Facility Brookhaven National Laboratory

Site Report: The Linux Farm at the RCF

HEPIX-HEPNT

October 22-25, 2002

Ofer Rind

RHIC Computing Facility

Brookhaven National Laboratory

Ofer Rind - RHIC Computing Facility Site Report

RCF - Overview

Provide computing facilities for RHIC users:

➔General computing environment

●General interactive tasks (email, document processing, web)

➔Data analysis facility

●Computing infrastructure for RHIC experiments

➢Code development, repository & distribution

➢Raw data recording & reconstruction

➢Data analysis

ACF: US Atlas Tier 1 Computing Facility

➔ Shared infrastructure and synergy with RCF

Support staff: 25 FTEs (4 dedicated to Linux Farm)

RCF - Structure

RCF - Component Summary

Mass Storage Subsystem

➔StorageTek library managed by HPSS

●4 Silos, 1.2PB capacity (expanding to 4.5PB)

●In Run-2, raw data recorded at a common rate of 70MB/sec for a total of 170TB

●Total data store ~300TB

Disk Storage

➔Fibre channel SAN served by NFS

●~110TB RAID5

●14 Sun 450, Solaris 8 [2-02] (5 Sun 480 coming online)

●IBM AFS servers (AIX)

Linux Server Farm

Linux Farm Hardware

➔840 1U and 2U servers (pre-'99 towers have been retired)

➔69 kSPECint95, expanding to 100 kSPECint95 (2+ TFLOPS)

➔Most have 1GB memory (all have at least 500MB)

➔Local SCSI disks up to 140GB/node

➔Allocated by experiment

➔Further allocated for Raw Data Reconstruction (CRS) and Reconstructed Data Analysis (CAS)

Vendor     CPU    Clock     Nodes   Date
VA Linux   PIII    450MHz   148     Jun 99
VA Linux   PIII    700MHz    48     Aug 00
VA Linux   PIII    800MHz   168     Nov 00
IBM        PIII   1000MHz   316     Aug 01
IBM        PIII   1400MHz   160     Oct 02

Linux Farm Software Configuration

● RedHat 7.2 upgraded to 2.4.9-31 kernel

● Image(s) installed via Kickstart server and customized for RCF environment via rpm

● NFS + AFS home directory and file access

● Interactive login allowed on selected nodes

● Job management:

(CAS) LSF 4.2 - slightly re-architected for robustness. Peak throughput before summer conferences was >150K jobs/week.

(CRS) Locally produced Perl-based batch system (AIX needed for HPSS API). Approx. 670K jobs processed for Run-2.

● Expanding use of distributed disk models (rootd, ??)

● Atlas Grid testbed

Tracking LSF Usage

[Figure: STAR queues weekly job statistics (week of Oct. 10). Panels: job starts/hr, avg runtime/hr, runtime.]

Security and Monitoring

Security:

●RCF firewall within BNL site firewall

●SSH2-only access through gateway bastion nodes (Solaris x86)

●User access restricted to a subset of systems (CAS only)

Monitoring:

●24 hr. on-call staff for critical systems during RHIC operation

●Cluster mgmt. software:

➔VACM (VA Linux)

➔xCAT (IBM, http://www.x-cat.org)

●Cron scripts to "clean" nodes and head off possible problems (memory leaks, full disks, etc.)

●CTS system for problem reports
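
The slides do not show the cron scripts themselves. As a rough illustration of the "full disks" check mentioned above, a script run from cron might look like the sketch below. The function names, the 90% threshold, and the list of paths are all assumptions made for this example, not RCF settings.

```python
import shutil


def over_threshold(used, total, threshold=0.90):
    """Return True if used/total exceeds the threshold (hypothetical 90% default)."""
    if total <= 0:
        return False
    return used / total > threshold


def full_filesystems(paths, threshold=0.90):
    """Return the subset of paths whose filesystem is fuller than the threshold."""
    flagged = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        if over_threshold(usage.used, usage.total, threshold):
            flagged.append(path)
    return flagged


if __name__ == "__main__":
    # Paths are placeholders; a real node would list its own local scratch areas.
    for path in full_filesystems(["/"]):
        print(f"WARNING: filesystem holding {path} is nearly full")
```

A script like this only reports; what "cleaning" means (removing stale scratch files, restarting a leaking daemon) would depend on site policy that the slides do not describe.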

Farm Alert System

Web-monitoring (user-accessible) plus paging/email alerts

Python scripts running locally transferring node status information to a MySQL database.

Notification of problems with NFS/AFS (e.g. stale file handles), LSF daemons, high load, etc.
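
The flow described above (a local script gathers node status and writes it to a database that the web page and alerting read from) can be sketched as follows. This is only an illustration under stated assumptions: the table layout, function names, and load limit are hypothetical, and sqlite3 from the Python standard library stands in for the MySQL database so that the example is self-contained.

```python
import os
import socket
import sqlite3
import time


def collect_status(load_limit=10.0):
    """Gather a minimal status record for the local node (Unix only)."""
    load1, _, _ = os.getloadavg()  # 1-, 5-, 15-minute load averages
    return {
        "host": socket.gethostname(),
        "ts": int(time.time()),
        "load1": load1,
        "high_load": int(load1 > load_limit),  # load_limit is an assumed value
    }


def store_status(conn, status):
    """Insert one status record; sqlite3 stands in for the MySQL server here."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node_status "
        "(host TEXT, ts INTEGER, load1 REAL, high_load INTEGER)"
    )
    conn.execute(
        "INSERT INTO node_status VALUES (:host, :ts, :load1, :high_load)",
        status,
    )
    conn.commit()


def nodes_needing_alert(conn):
    """Return hosts that have reported a high-load condition."""
    rows = conn.execute(
        "SELECT DISTINCT host FROM node_status WHERE high_load = 1 ORDER BY host"
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_status(conn, collect_status())
    print(nodes_needing_alert(conn))
```

Checks for stale NFS/AFS handles or LSF daemon state would add further columns in the same way; they are left out because the slides give no detail on how they are detected.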

Network Operation Status

Perl scripts monitor network service connectivity for all nodes (ssh, yp, etc.)
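
The RCF scripts are written in Perl and are not reproduced in the slides. The idea of a per-node connectivity check can be sketched in Python as below, assuming a plain TCP connection test against the ssh port (22). The function names and timeout are hypothetical, and RPC-based services such as yp would need a different probe than this one.

```python
import socket


def service_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_nodes(nodes, port=22, timeout=2.0):
    """Map each node name to the result of the connectivity test."""
    return {node: service_reachable(node, port, timeout) for node in nodes}


if __name__ == "__main__":
    # "localhost" is a placeholder; a farm-wide check would iterate over all nodes.
    for node, ok in check_nodes(["localhost"]).items():
        print(f"{node}: ssh {'up' if ok else 'DOWN'}")
```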

Load Monitoring and History

MySQL database for usage history

History available back to Sept. '01 via web interface.

CPU Load averaged over (98) Phenix machines during the month of September.
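
A plot of this kind reduces per-node load samples to one average per time bin. A minimal sketch of that reduction is shown below, assuming the history is available as (day, host, load) records. The record layout and function name are assumptions, and the sample values are invented purely to show the calculation.

```python
from collections import defaultdict


def average_load_by_day(samples):
    """Average load per day over all samples from all hosts.

    samples: iterable of (day, host, load) tuples (assumed layout).
    """
    totals = defaultdict(lambda: [0.0, 0])  # day -> [sum of loads, sample count]
    for day, _host, load in samples:
        totals[day][0] += load
        totals[day][1] += 1
    return {day: total / count for day, (total, count) in totals.items()}


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    samples = [
        ("2002-09-01", "nodeA", 1.0),
        ("2002-09-01", "nodeB", 3.0),
        ("2002-09-02", "nodeA", 2.0),
    ]
    print(average_load_by_day(samples))
```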

Plans for the Near Future

● 160 newly delivered IBM nodes to be brought online

● Expect purchase bid to go out for ~220 more nodes at beginning of FY03 (pending funding approval)

● Scaling up data storage capacity and throughput for Run-3 (up to 10X data increase over Run-2, starting in December)

● Evaluation of LSF 5 and Condor ongoing, with an eye towards distributed disk services

● Expanding Atlas GRID services
