Northgrid
Alessandra Forti, M. Doidge, S. Jones, A. McNab, E. Korolkova
Gridpp26 Brighton
30 April 2011
Pledges
Site        Job slots   HEPSPEC06   TB      Pledged HS06   Pledged TB
Lancaster   2,368       28,627      1,116   3,395          515
Liverpool   572         8,318       406     2,996          554
Manchester  2,770       22,264      687     7,284          224
Sheffield   400         4,800       297     1,198          252
Usage
[Chart] NorthGrid normalised CPU time (HEPSPEC06) by site and VO, top 10 VOs (and other VOs), September 2010 – February 2011.
Successful jobs rate
ANALY_MANC (192046/52963)
ANALY_LANCS (155233/61368)
ANALY_SHEF (161043/25537)
ANALY_LIV (146994/21563)
UKI-NORTHGRID-MAN-HEP (494559/35496)
UKI-NORTHGRID-LANCS-HEP (252864/33889)
UKI-NORTHGRID-LIV-HEP (227395/15185)
UKI-NORTHGRID-SHEF-HEP (140804/8525)
Lancaster – keeping things smooth
Our main strategy for efficient running at Lancaster involves comprehensive monitoring and configuration management.
Effective monitoring allows us to jump on incidents and spot problems before they bite us on the backside, as well as enabling us to better understand, and therefore tune, our systems.
Cfengine on our nodes, and Kusu on the HEC machines, enables us to pre-empt misconfiguration issues on individual nodes, quickly rectify errors and ensure swift, homogeneous rollout of configs and changes.
Whatever the monitoring, e-mail alerts keep us in the know. Among the many tools and tactics we use to keep on top of things are: syslog (with Logwatch mails), Ganglia, Nagios (with e-mail alerts), ATLAS Panda monitoring, Steve’s pages, on-board monitoring and e-mail alerts for our Areca RAID arrays, Cacti for our network (and the HEC nodes), plus a whole bunch of hacky scripts and bash one-liners!
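As a flavour of the hacky-script end of that list, here is a minimal Python sketch (the sample log lines and match pattern are illustrative, not Lancaster's actual setup) of scanning syslog for Areca RAID errors so that a cron job can mail the output:

```python
import re

def raid_errors(syslog_lines, pattern=r"areca.*error"):
    """Return the syslog lines that look like Areca RAID errors."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [line for line in syslog_lines if rx.search(line)]

# Toy input; a cron job would read these lines from the real syslog file
# and rely on cron's mail-on-output behaviour to alert the admins.
log = [
    "Jan 1 12:00 wn01 kernel: eth0 link up",
    "Jan 1 12:05 wn01 areca: ERROR volume set degraded",
]
for hit in raid_errors(log):
    print("RAID ALERT:", hit)
```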
Lancaster – TODO list
We’ll probably never stop finding things to polish, but some things at the top of the wishlist (in that we wish we could get time to implement them!) are:
• A site dashboard (a huge, beautiful site dashboard)
• More Ganglia metrics!
• More in-depth Nagios tests, particularly for batch-system monitoring and RAID monitoring (recent storage purchases have 3ware and Adaptec RAIDs)
• Intelligent syslog monitoring as the number of nodes at our site grows
• Increased network and job monitoring: the more detailed a picture we have of what’s going on, the better we can tune things
Other ideas for increasing our efficiency include SMS alerts, internal ticket management and introducing a more formalised on-call system.
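One shape the wished-for batch-system Nagios tests could take, sketched in Python with made-up thresholds (a real plugin would `sys.exit()` with the returned code):

```python
def check_queue(running, queued, warn_ratio=5.0, crit_ratio=20.0):
    """Nagios-style batch check: returns (0, msg) OK, (1, msg) WARNING,
    (2, msg) CRITICAL.

    Flags a farm where far more jobs are queued than running, which can
    indicate stuck or starved queues. Thresholds here are illustrative.
    """
    ratio = queued / max(running, 1)
    if ratio >= crit_ratio:
        return 2, f"CRITICAL: {queued} queued vs {running} running"
    if ratio >= warn_ratio:
        return 1, f"WARNING: {queued} queued vs {running} running"
    return 0, f"OK: {queued} queued vs {running} running"

code, msg = check_queue(running=200, queued=150)
print(msg)  # OK: 150 queued vs 200 running
```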
Planning, design and testing:
– Storage and node specifications
– Network design, e.g. minimise contention, bonding
– Extensive HW and SW soak testing, experimentation and tuning
– Adjustments and refinement
– UPS coverage
Liverpool hardware measures
Builds and maintenance -
dhcp, kickstart, yum, puppet, yaim, standards
Monitoring -
nagios (local and gridpp), ganglia, cacti/weathermap, log monitoring, tickets and mail lists.
testnodes – local software that checks worker-nodes to isolate potential “blackhole” conditions.
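The blackhole idea can be sketched as follows; this is not the actual testnodes code, and all names and thresholds are invented. The heuristic: a worker node that burns through many short, failing jobs is probably sick and silently eating work, so flag it for draining.

```python
from collections import defaultdict

def find_blackholes(job_records, min_jobs=10, max_avg_runtime=60.0,
                    min_fail_rate=0.9):
    """Flag hosts that fail almost all their jobs in suspiciously short times.

    job_records: iterable of (hostname, runtime_seconds, succeeded) tuples.
    """
    stats = defaultdict(lambda: [0, 0.0, 0])  # jobs, total runtime, failures
    for host, runtime, ok in job_records:
        s = stats[host]
        s[0] += 1
        s[1] += runtime
        if not ok:
            s[2] += 1
    return sorted(
        host for host, (n, runtime, failed) in stats.items()
        if n >= min_jobs and runtime / n < max_avg_runtime
        and failed / n > min_fail_rate)

# node02 churns through short failing jobs, so it is flagged.
records = [("node01", 3600, True)] * 12 + [("node02", 5, False)] * 50
print(find_blackholes(records))  # ['node02']
```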
Liverpool Building and monitoring
Manchester: install, config & monitor
Have to look after ~550 machines.
Install: DHCP, Kickstart, YAIM, Yum, Cfengine
Monitor: Nagios (Ganglia), Cfengine, weathermap, RAID card monitoring, custom scripts to parse log files, OS tools
Each machine has a profile for each tool, which makes it difficult to keep changes consistent; with reduced manpower we can't afford such poor tracking.
Manchester: Integration with RT
• Use Nagios for monitoring nodes and services
– Both external tests (e.g. ssh to a port)
– And internal tests (via the node's nrpe daemon)
• Use RT (“Request Tracker”) for tickets
– Includes Asset Tracker, which has a powerful web interface and links to tickets
Manchester: Integration with RT (2)
• Previously maintained lists of hosts and group membership in Nagios cfg files
– Now generate these from the AT MySQL DB
• Obvious advantages in monitoring services only where cfengine has installed them
• Automatic cross-link between AT and Nagios
• Future extensions to other lists such as dhcp, cfengine, online and offline nodes
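Generating Nagios config from the asset DB could look something like this hedged sketch. The real setup queries the Asset Tracker MySQL DB; sqlite3 stands in here so the example is self-contained, and the table and column names are invented for illustration.

```python
import sqlite3

def nagios_hosts(conn):
    """Emit one Nagios 'define host' block per online asset in the DB."""
    blocks = []
    for name, addr in conn.execute(
            "SELECT hostname, ip FROM assets WHERE status = 'online' "
            "ORDER BY hostname"):
        blocks.append(
            "define host {\n"
            f"    host_name  {name}\n"
            f"    address    {addr}\n"
            "    use        generic-host\n"
            "}\n")
    return "".join(blocks)

# Toy asset table: wn002 is offline, so it gets no host definition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assets (hostname TEXT, ip TEXT, status TEXT)")
conn.executemany("INSERT INTO assets VALUES (?, ?, ?)",
                 [("wn001", "10.0.0.1", "online"),
                  ("wn002", "10.0.0.2", "offline")])
print(nagios_hosts(conn))
```

Driving both Nagios and (eventually) dhcp/cfengine lists from the one DB means a host is monitored exactly where it is deployed, with no hand-maintained copies to drift.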
Sheffield: efficiency
2 clusters; jobs requiring better network bandwidth are directed to WNs with a better backbone.
• Storage
• 90 TB (9 disk pools, SW RAID5, without RAID controllers)
• Absence of RAID controllers increases site efficiency:
– No common failure modes related to RAID controllers: unavailable disk servers and data loss
• 2 TB Seagate Barracuda disks, fast and robust
• 5 x 16-bay units with 2 fs, 4 x 24-bay units with 2 fs
• Cold spare unit on standby in each server
• A simple cluster structure makes it easy to maintain high efficiency and to upgrade to new experiment requirements
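Since the pools are plain Linux software RAID, array health is visible in /proc/mdstat without any controller tooling. A hedged sketch of a degraded-array check (the sample excerpt and parsing are illustrative):

```python
import re

# Illustrative /proc/mdstat excerpt: md1 has a failed member ([UU_]).
sample = """\
md0 : active raid5 sdb1[0] sdc1[1] sdd1[2]
      1000 blocks level 5, 64k chunk [3/3] [UUU]
md1 : active raid5 sde1[0] sdf1[1] sdg1[2](F)
      1000 blocks level 5, 64k chunk [3/2] [UU_]
"""

def degraded_arrays(mdstat_text):
    """Return md device names whose status string contains '_' (failed disk)."""
    bad, current = [], None
    for line in mdstat_text.splitlines():
        m = re.match(r"(md\d+)\s*:", line)
        if m:
            current = m.group(1)
        else:
            s = re.search(r"\[([U_]+)\]\s*$", line)
            if current and s and "_" in s.group(1):
                bad.append(current)
    return bad

print(degraded_arrays(sample))  # ['md1']
```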
Sheffield: efficiency
• Monitoring (checks are on a regular basis several times a day)
– Ganglia: general check of the cluster health
– Regional Nagios (warnings sent via e-mail)
– Logwatch/syslog check
– GRIDMAP
– All ATLAS monitoring tools:
– ATLAS SAM test page
– ATLAS (and LHCb) Site Status Board
– DDM dashboard
– PANDA monitor
– Detailed check of atlas performance (check the reason for a particular failure of production and analysis jobs)
Sheffield: efficiency
• Installation
– Use PXE boot
– Redhat kickstart install
– Using many cron jobs for monitoring
– Bash post-install (includes yaim)
• Cron jobs
– Monitor the temperature in the cluster room (in case of a temperature rise, only some of the worker nodes shut down automatically)
– Generate a web page of queues and jobs for both grid and local
– Check and restart of vital services if they are down (bdii, srm)
– Generate a warning email in case of disk failure (in any server)
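The check-and-restart cron job could be structured like this minimal Python sketch; the real job presumably shells out to the bdii and srm init scripts, which are stubbed out here so the logic is self-contained:

```python
def ensure_running(services, is_running, restart):
    """Restart every service that is not running; return the restarted ones."""
    restarted = []
    for svc in services:
        if not is_running(svc):
            restart(svc)
            restarted.append(svc)
    return restarted

# Toy run: pretend srm has died while bdii is fine. In the real cron job,
# is_running would check the process and restart would call the init script.
up = {"bdii"}
restarted = ensure_running(["bdii", "srm"], up.__contains__, lambda s: None)
print(restarted)  # ['srm']
```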