Northgrid
Alessandra Forti, M. Doidge, S. Jones, A. McNab, E. Korolkova
Gridpp26 Brighton
30 April 2011
Pledges
Site        Job slots   HEPSPEC06   TB      Pledged HS06   Pledged TB
Lancaster   2,368       28,627      1,116   3,395          515
Liverpool   572         8,318       406     2,996          554
Manchester  2,770       22,264      687     7,284          224
Sheffield   400         4,800       297     1,198          252
Usage
[Chart] NorthGrid normalised CPU time (HEPSPEC06) by site and VO, top 10 VOs (and other VOs), September 2010 – February 2011.
Successful jobs rate
ANALY_MANC (192046/52963)
ANALY_LANCS (155233/61368)
ANALY_SHEF (161043/25537)
ANALY_LIV (146994/21563)
UKI-NORTHGRID-MAN-HEP (494559/35496)
UKI-NORTHGRID-LANCS-HEP (252864/33889)
UKI-NORTHGRID-LIV-HEP (227395/15185)
UKI-NORTHGRID-SHEF-HEP (140804/8525)
Lancaster – keeping things smooth
Our main strategy for efficient running at Lancaster involves comprehensive monitoring and configuration management.
Effective monitoring allows us to jump on incidents and spot problems before they bite us on the backside, as well as enabling us to better understand, and therefore tune, our systems.
Cfengine on our nodes, and Kusu on the HEC machines, enables us to pre-empt misconfiguration issues on individual nodes, quickly rectify errors and ensure swift, homogeneous rollout of configs and changes.
Whatever the monitoring, e-mail alerts keep us in the know. Among the many tools and tactics we use to keep on top of things are: syslog (with Logwatch mails), Ganglia, Nagios (with e-mail alerts), ATLAS Panda monitoring, Steve’s pages, on-board monitoring and e-mail alerts for our Areca RAID arrays, Cacti for our network (and the HEC nodes), plus a whole bunch of hacky scripts and bash one-liners!
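As a flavour of the hacky-script end of that list, here is a minimal Python sketch (the sample log lines and match pattern are illustrative, not Lancaster's actual setup) of scanning syslog for Areca RAID errors so that a cron job can mail the output:

```python
import re

def raid_errors(syslog_lines, pattern=r"areca.*error"):
    """Return the syslog lines that look like Areca RAID errors."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [line for line in syslog_lines if rx.search(line)]

# Toy input; a cron job would read these lines from the real syslog file
# and rely on cron's mail-on-output behaviour to alert the admins.
log = [
    "Jan 1 12:00 wn01 kernel: eth0 link up",
    "Jan 1 12:05 wn01 areca: ERROR volume set degraded",
]
for hit in raid_errors(log):
    print("RAID ALERT:", hit)
```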
Lancaster – TODO list
We’ll probably never stop finding things to polish, but some things at the top of the wishlist (in that we wish we could get time to implement them!) are:
• A site dashboard (a huge, beautiful site dashboard)
• More Ganglia metrics!
• More in-depth Nagios tests, particularly for batch-system monitoring and RAID monitoring (recent storage purchases have 3ware and Adaptec RAIDs)
• Intelligent syslog monitoring as the number of nodes at our site grows
• Increased network and job monitoring: the more detailed a picture we have of what’s going on, the better we can tune things
Other ideas for increasing our efficiency include SMS alerts, internal ticket management and introducing a more formalised on-call system.
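One shape the wished-for batch-system Nagios tests could take, sketched in Python with made-up thresholds (a real plugin would `sys.exit()` with the returned code):

```python
def check_queue(running, queued, warn_ratio=5.0, crit_ratio=20.0):
    """Nagios-style batch check: returns (0, msg) OK, (1, msg) WARNING,
    (2, msg) CRITICAL.

    Flags a farm where far more jobs are queued than running, which can
    indicate stuck or starved queues. Thresholds here are illustrative.
    """
    ratio = queued / max(running, 1)
    if ratio >= crit_ratio:
        return 2, f"CRITICAL: {queued} queued vs {running} running"
    if ratio >= warn_ratio:
        return 1, f"WARNING: {queued} queued vs {running} running"
    return 0, f"OK: {queued} queued vs {running} running"

code, msg = check_queue(running=200, queued=150)
print(msg)  # OK: 150 queued vs 200 running
```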
Planning, design and testing:
– Storage and node specifications
– Network design, e.g. minimise contention, bonding
– Extensive HW and SW soak testing, experimentation and tuning
– Adjustments and refinement
– UPS coverage
Liverpool hardware measures
Builds and maintenance -
dhcp, kickstart, yum, puppet, yaim, standards
Monitoring -
nagios (local and gridpp), ganglia, cacti/weathermap, log monitoring, tickets and mail lists.
testnodes – local software that checks worker-nodes to isolate potential “blackhole” conditions.
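The blackhole idea can be sketched as follows; this is not the actual testnodes code, and all names and thresholds are invented. The heuristic: a worker node that burns through many short, failing jobs is probably sick and silently eating work, so flag it for draining.

```python
from collections import defaultdict

def find_blackholes(job_records, min_jobs=10, max_avg_runtime=60.0,
                    min_fail_rate=0.9):
    """Flag hosts that fail almost all their jobs in suspiciously short times.

    job_records: iterable of (hostname, runtime_seconds, succeeded) tuples.
    """
    stats = defaultdict(lambda: [0, 0.0, 0])  # jobs, total runtime, failures
    for host, runtime, ok in job_records:
        s = stats[host]
        s[0] += 1
        s[1] += runtime
        if not ok:
            s[2] += 1
    return sorted(
        host for host, (n, runtime, failed) in stats.items()
        if n >= min_jobs and runtime / n < max_avg_runtime
        and failed / n > min_fail_rate)

# node02 churns through short failing jobs, so it is flagged.
records = [("node01", 3600, True)] * 12 + [("node02", 5, False)] * 50
print(find_blackholes(records))  # ['node02']
```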
Liverpool Building and monitoring
Manchester: install, config & monitor
Have to look after ~550 machines.
Install: DHCP, Kickstart, YAIM, Yum, Cfengine
Monitor: Nagios (Ganglia), Cfengine, weathermap, RAID card monitoring, custom scripts to parse log files, OS tools
Each machine has a profile for each tool, which makes it difficult to keep changes consistent; with reduced manpower we can't afford such poor tracking.
Manchester: Integration with RT
• Use Nagios for monitoring nodes and services
– Both external tests (e.g. ssh to a port)
– And internal tests (via the node's nrpe daemon)
• Use RT (“Request Tracker”) for tickets
– Includes Asset Tracker, which has a powerful web interface and links to tickets
Manchester: Integration with RT (2)
• Previously maintained lists of hosts and group membership in Nagios cfg files
– Now generate these from the AT MySQL DB
• Obvious advantages in monitoring services only where cfengine has installed them
• Automatic cross-link between AT and Nagios
• Future extensions to other lists such as dhcp, cfengine, online and offline nodes
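Generating Nagios config from the asset DB could look something like this hedged sketch. The real setup queries the Asset Tracker MySQL DB; sqlite3 stands in here so the example is self-contained, and the table and column names are invented for illustration.

```python
import sqlite3

def nagios_hosts(conn):
    """Emit one Nagios 'define host' block per online asset in the DB."""
    blocks = []
    for name, addr in conn.execute(
            "SELECT hostname, ip FROM assets WHERE status = 'online' "
            "ORDER BY hostname"):
        blocks.append(
            "define host {\n"
            f"    host_name  {name}\n"
            f"    address    {addr}\n"
            "    use        generic-host\n"
            "}\n")
    return "".join(blocks)

# Toy asset table: wn002 is offline, so it gets no host definition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assets (hostname TEXT, ip TEXT, status TEXT)")
conn.executemany("INSERT INTO assets VALUES (?, ?, ?)",
                 [("wn001", "10.0.0.1", "online"),
                  ("wn002", "10.0.0.2", "offline")])
print(nagios_hosts(conn))
```

Driving both Nagios and (eventually) dhcp/cfengine lists from the one DB means a host is monitored exactly where it is deployed, with no hand-maintained copies to drift.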
Sheffield: efficiency
2 clusters; jobs requiring better network bandwidth are directed to WNs with a better backbone.
• Storage
• 90 TB (9 disk pools, SW RAID5, without RAID controllers)
• Absence of RAID controllers increases site efficiency:
– No common failure modes related to RAID controllers: unavailable disk servers and data loss
• 2 TB Seagate Barracuda disks, fast and robust
• 5 x 16-bay units with 2 fs, 4 x 24-bay units with 2 fs
• Cold spare unit on standby in each server
• A simple cluster structure makes it easy to maintain high efficiency and to upgrade to new experiment requirements
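Since the pools are plain Linux software RAID, array health is visible in /proc/mdstat without any controller tooling. A hedged sketch of a degraded-array check (the sample excerpt and parsing are illustrative):

```python
import re

# Illustrative /proc/mdstat excerpt: md1 has a failed member ([UU_]).
sample = """\
md0 : active raid5 sdb1[0] sdc1[1] sdd1[2]
      1000 blocks level 5, 64k chunk [3/3] [UUU]
md1 : active raid5 sde1[0] sdf1[1] sdg1[2](F)
      1000 blocks level 5, 64k chunk [3/2] [UU_]
"""

def degraded_arrays(mdstat_text):
    """Return md device names whose status string contains '_' (failed disk)."""
    bad, current = [], None
    for line in mdstat_text.splitlines():
        m = re.match(r"(md\d+)\s*:", line)
        if m:
            current = m.group(1)
        else:
            s = re.search(r"\[([U_]+)\]\s*$", line)
            if current and s and "_" in s.group(1):
                bad.append(current)
    return bad

print(degraded_arrays(sample))  # ['md1']
```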
Sheffield: efficiency
• Monitoring (checks are on a regular basis several times a day)
– Ganglia: general check of the cluster health
– Regional Nagios (warnings sent via e-mail)
– Logwatch/syslog check
– GRIDMAP
– All ATLAS monitoring tools:
– ATLAS SAM test page
– ATLAS (and LHCb) Site Status Board
– DDM dashboard
– PANDA monitor
– Detailed check of atlas performance (check the reason for a particular failure of production and analysis jobs)
Sheffield: efficiency
• Installation
– Use PXE boot
– Redhat kickstart install
– Using many cron jobs for monitoring
– Bash post-install (includes yaim)
• Cron jobs
– Monitor the temperature in the cluster room (in case of a temperature rise, only some of the worker nodes shut down automatically)
– Generate a web page of queues and jobs for both grid and local
– Check and restart of vital services if they are down (bdii, srm)
– Generate a warning email in case of disk failure (in any server)
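The check-and-restart cron job could be structured like this minimal Python sketch; the real job presumably shells out to the bdii and srm init scripts, which are stubbed out here so the logic is self-contained:

```python
def ensure_running(services, is_running, restart):
    """Restart every service that is not running; return the restarted ones."""
    restarted = []
    for svc in services:
        if not is_running(svc):
            restart(svc)
            restarted.append(svc)
    return restarted

# Toy run: pretend srm has died while bdii is fine. In the real cron job,
# is_running would check the process and restart would call the init script.
up = {"bdii"}
restarted = ensure_running(["bdii", "srm"], up.__contains__, lambda s: None)
print(restarted)  # ['srm']
```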