Transcript of "Northgrid Status", Alessandra Forti, Gridpp22, UCL, 2 April 2009.

Page 1

Northgrid Status

Alessandra Forti

Gridpp22 UCL

2 April 2009

Page 2

Outline

• Resilience

• Hardware resilience

• Software changes resilience

• Manpower resilience

• Communication

• Site resilience status

• General status

• Conclusions

Page 3

Resilience

• Definition:
  1. The power or ability to return to the original form, position, etc., after being bent, compressed, or stretched; elasticity.
  2. Ability to recover readily from illness, depression, adversity, or the like; buoyancy.

• Translation:
  – Hardware resilience: redundancy and capacity.
  – Manpower resilience: continuity.
  – Software resilience: simplicity and ease of maintenance.
  – Communication resilience: effectiveness.

Page 4

Hardware resilience

• The system has to be redundant and have enough capacity to take the load.

• There are many levels of redundancy and capacity, with increasing cost:
  – Single-machine components: disks, memory, CPUs
  – Full redundancy: replication of services in the same room
  – Full redundancy, paranoid: replication of services in different places

• Clearly there is a trade-off between how important a service is and how much money a site has for the replication.

Page 5

Manpower resilience

• The manpower has to ensure continuity of service. This continuity is lost when people change.
  – It takes many months to train a new system administrator.
  – It takes even longer in the grid environment, where there are no well-defined guidelines, the documentation is dispersed and most of the knowledge is passed on by word of mouth.

• Protocols and procedures for almost every action should be written down to ensure continuity (see the sketch after this list):
  – How to shut down a service for maintenance
  – What to do in case of a security breach
  – Who to call if the main link to JANET goes down
  – What to do to update the software
  – What to do to reinsert a node in the batch system after a memory replacement
  – ......
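Purely as an illustration of keeping such procedures in a written, versionable form (this is a minimal sketch, not something the NorthGrid sites describe using; the procedure names and steps below are hypothetical examples):

```python
# Illustrative only: operational procedures kept as machine-readable checklists
# instead of word-of-mouth knowledge. Names and steps are hypothetical.
PROCEDURES = {
    "shutdown-service-for-maintenance": [
        "Announce the downtime to users",
        "Drain the service / stop accepting new requests",
        "Stop the service and confirm it shows as down in the monitoring",
        "Record the action in the site ticketing system",
    ],
    "reinsert-node-after-memory-replacement": [
        "Run a memory test on the repaired node",
        "Re-run the configuration management on the node",
        "Re-enable the node in the batch system",
        "Close the ticket with a note on what was replaced",
    ],
}

def print_checklist(name: str) -> None:
    """Print a numbered checklist for the named procedure."""
    for i, step in enumerate(PROCEDURES[name], start=1):
        print(f"{i}. {step}")

if __name__ == "__main__":
    print_checklist("reinsert-node-after-memory-replacement")
```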

Page 6

Software resilience

• Simplicity and ease of maintenance are key to at least two things:
  – Service recovery in case disaster strikes
  – A less steep learning curve for new people

• The grid software is neither simple nor easy to maintain. It is complicated, ill-documented and, to say the least, changes continuously.
  – dCache is a flagship example of this, and this is why it is being abandoned by many sites.
  – But there is also a problem with continuous changes in the software itself: lcg-CE, glite-CE, cream-CE, 4 or 5 storage systems that are almost incompatible with each other, RB or WMS or the experiments' pilot frameworks, SRM yes, no SRM is dead...

Page 7

Communication

• Communication has to be effective. If one means of communication is not effective, it should be replaced with a more effective one.
  – I was always missing SA1 ACL requests for the SVN repository, so I redirected them to the Manchester helpdesk. Now I respond within 2 hours during working hours.

  – System admins in Manchester weren't listening to each other during meetings; now there is a rule to write EVERYTHING in the tickets.

  – ATLAS putting sites offline was a problem because the action was only written in the ATLAS shifter elogs. Now they'll write it in the ticket, so the site is immediately made aware of what is happening.

Page 8

Lancaster

• Twin CEs
• New kit has dual PSUs
• All systems in cfengine
• Daily backup of databases
• Current machine room has new redundant air con
• Temperature sensors with Nagios alarms have been installed
• 2nd machine room with modern chilled racks
  – Available in July
• Only one fibre uplink to JANET

Page 9

Liverpool

Strong points:
• Reviewed and fixed single points of failure 2 years ago.
• High-spec servers with RAID1 and dual PSUs.
• UPS on critical servers, RAIDs and switches.
• Distributed software servers with a high level of redundancy.
• Active rack monitoring with Nagios, Ganglia and custom scripts.
• RAID6 on SE data servers.
• WAN connection has redundancy and automatic failover recovery.
• Spares for long-lead-time items.
• Capability of maintaining our own hardware.

Page 10

Liverpool (cont.)

Weak points:

• BDII and MON nodes are old hardware.
• The single CE is a single point of failure.
• Only 0.75 FTE over 3 years dedicated to grid admin.
• Air-con is ageing and in need of constant maintenance.
• The university has agreed to install new water-cooled racks for future new hardware.

Page 11

Manchester

• Machine room: 2 generators + 3 UPS + 3 air-conditioning units
  – University staff dedicated to the maintenance
• Two independent clusters (2 CEs, 2x2 SEs, 2 SW servers)
• All main services have RAID1, and memory and disks have also been upgraded
• They are in the same rack, attached to different PDUs
• Services can be restarted remotely
• All services and worker nodes are installed and maintained with kickstart+cfengine, which allows a system to be reinstalled within an hour
  – Anything that cannot go in cfengine goes in YAIM pre/local/post, in an effort to eliminate any forgettable manual steps
• All services are monitored
• A backup system for all databases is in place

Page 12

Manchester (cont.)

• We lack protocols and procedures for dealing with situations in a consistent way when they occur
  – Started to write them, beginning with things as simple as switching off machines for maintenance
• Disaster recovery happens only when a disaster happens
• Irregular maintenance periods led to clashes with the generators' routine tests
• RT system used for communication with users, but also to log everything that is done in the T2
  – Bad communication between sysadmins has been a major problem

Page 13

Sheffield

The main weak point for Sheffield is the limited physical access to the cluster: we have it 9:00-17:00 on weekdays only.

• We use a quite expensive SCSI disk for the experiment software area; as it is expensive, we do not have a spare disk in case of failure, so we would need some time to order one and to write all the experiment software back.

• The CE and the MON box have only one power supply and only one disk each.

• In future, perhaps a RAID1 system with 2 PSUs for the CE and the MON box. It would also be good to have a UPS.

• The DPM head node already has 2 PSUs and a RAID5 system with an extra disk.

• We have similar WNs, CE and MON box, so we can find spare parts. We have managed to achieve quite stable reliability.

Page 14

General Status (1)

Site        Middleware  OS   SRM2.2  Space Tokens  SRM brand           CPU (kSI2K)  Storage (TB)  Used Storage (TB)  Storage usage %
Lancaster   Glite3.1    SL4  yes     yes           DPM                 1040         200           39.6               19%
Liverpool   Glite3.1    SL4  yes     yes           dCache -> DPM       559          130           13.7               10%
Manchester  Glite3.1    SL4  yes     yes           dCache/DPM/xrootd   2160         142/104/20    39.2               15%
Sheffield   Glite3.1    SL4  yes     yes           DPM                 182.5        25.6          4.5                17%
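A quick sanity check of the usage column (it is roughly used storage divided by total storage; Manchester's 142/104/20 split across dCache/DPM/xrootd is assumed here to sum to the site total):

```python
# Rough check that "Storage usage %" ~= used storage / total storage,
# using the figures from the table above.
sites = {
    "Lancaster":  (39.6, 200.0),
    "Liverpool":  (13.7, 130.0),
    "Manchester": (39.2, 142.0 + 104.0 + 20.0),  # assumption: the three systems add up
    "Sheffield":  (4.5, 25.6),
}

for site, (used_tb, total_tb) in sites.items():
    print(f"{site:10s} {100 * used_tb / total_tb:4.1f}% used")
```

This reproduces the quoted 19/10/15/17% figures to within about a percentage point of rounding.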

Page 15

General Status (2)

Page 16

General Status (3)

Page 17

General Status (4)

Page 18

Conclusions

• As was written on the building sites of Milan's 3rd underground line: we are working for you!