Northgrid Status Alessandra Forti Gridpp22 UCL 2 April 2009.
Northgrid Status
Alessandra Forti
Gridpp22 UCL
2 April 2009
Outline
• Resilience
• Hardware resilience
• Software changes resilience
• Manpower resilience
• Communication
• Site resilience status
• General status
• Conclusions
Resilience
• Definition:
1. The power or ability to return to the original form, position, etc., after being bent, compressed, or stretched; elasticity.
2. Ability to recover readily from illness, depression, adversity, or the like; buoyancy.
• Translation:
– Hardware resilience: redundancy and capacity
– Manpower resilience: continuity
– Software resilience: simplicity and ease of maintenance
– Communication resilience: effectiveness
Hardware resilience
• The system has to be redundant and have enough capacity to take the load.
• There are many levels of redundancy and capacity, with increasing cost:
– Single-machine components: disks, memory, CPUs
– Full redundancy: replication of services in the same room
– Full redundancy, paranoid: replication of services in different places
• Clearly there is a trade-off between how important a service is and how much money a site has for the replication.
Manpower resilience
• The manpower has to ensure continuity of service. This continuity is lost when people change.
– It takes many months to train a new system administrator
– It takes even longer in the grid environment, where there are no well-defined guidelines, the documentation is dispersed, and most of the knowledge is passed on by word of mouth
• Protocols and procedures for almost every action should be written to ensure continuity.
– How to shut down a service for maintenance
– What to do in case of a security breach
– Who to call if the main link to JANET goes down
– What to do to update the software
– What to do to reinsert a node in the batch system after a memory replacement
– ......
Software resilience
• Simplicity and ease of maintenance are a key component of at least two things:
– Service recovery in case disaster strikes
– A less steep learning curve for new people
• The grid software is neither simple nor easy to maintain. It is complicated, ill-documented, and changes continuously, to say the least.
– dCache is a flagship example of this, and this is why it is being abandoned by many sites.
– But there is also a problem with continuous changes in the software itself: lcg-CE, glite-CE, cream-CE; 4 or 5 storage systems that are almost incompatible with each other; RB or WMS or the experiments' pilot frameworks; SRM yes, SRM no, SRM is dead...
Communication
• Communication has to be effective. If one means of communication is not effective, it should be replaced with a more effective one.
– I was always missing SA1 ACL requests for the SVN repository, so I redirected them to the Manchester helpdesk. Now I respond within 2 hours during working hours.
– System admins in Manchester weren't listening to each other during meetings; now there is a rule to write EVERYTHING in the tickets.
– Atlas putting sites offline was a problem because the action was written in the Atlas shifter elogs. Now they'll write it in the ticket, so the site is made aware immediately of what is happening.
Lancaster
• Twin CEs
• New kit has dual PSUs
• All systems in cfengine
• Daily backup of databases
• Current machine room has new redundant air con
• Temperature sensors with Nagios alarms have been installed
• 2nd machine room with modern chilled racks
– Available in July
• Only one fibre uplink to JANET
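Lancaster's temperature sensors feed Nagios alarms; Nagios checks follow a simple convention (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, plus a one-line status message). A minimal sketch of such a check, where the thresholds and the hard-coded sensor reading are illustrative assumptions, not the site's actual configuration:

```python
def check_temperature(temp_c, warn_c=28.0, crit_c=33.0):
    """Return an (exit_code, message) pair following the Nagios
    plugin convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
    Thresholds are hypothetical examples."""
    if temp_c >= crit_c:
        return 2, "TEMP CRITICAL - machine room at %.1fC" % temp_c
    if temp_c >= warn_c:
        return 1, "TEMP WARNING - machine room at %.1fC" % temp_c
    return 0, "TEMP OK - machine room at %.1fC" % temp_c

if __name__ == "__main__":
    import sys
    # In a real check this reading would come from the sensors
    # (e.g. via IPMI or SNMP) rather than being hard-coded.
    code, message = check_temperature(25.5)
    print(message)
    sys.exit(code)
```

Nagios interprets the exit code to decide whether to raise an alarm, and shows the printed message in its web interface.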
Liverpool
Strong points:
• Reviewed and fixed single points of failure 2 years ago.
• High-spec servers with RAID1 and dual PSUs.
• UPS on critical servers, RAIDs and switches.
• Distributed software servers with a high level of redundancy.
• Active rack monitoring with Nagios, Ganglia and custom scripts.
• RAID6 on SE data servers.
• WAN connection has redundancy and automatic failover recovery.
• Spares for long-lead-time items.
• Capability of maintaining our own hardware.
Liverpool (cont.)
Weak points:
• BDII and MON nodes are old hardware.
• Single CE is a single point of failure.
• Only 0.75 FTE over 3 years dedicated to grid admin.
• Air con is ageing and in need of constant maintenance.
• University has agreed to install new water-cooled racks for future new hardware.
Manchester
• Machine room: 2 generators + 3 UPSs + 3 air-conditioning units
– Uni staff dedicated to the maintenance
• Two independent clusters (2 CEs, 2x2 SEs, 2 SW servers)
• All main services have RAID1, and memory and disks have also been upgraded
• They are in the same rack, attached to different PDUs
• Services can be restarted remotely
• All services and worker nodes are installed and maintained with kickstart+cfengine, which allows a system to be reinstalled within an hour
– Anything that cannot go in cfengine goes in YAIM pre/local/post, in an effort to eliminate any forgettable manual steps
• All services are monitored
• A backup system for all databases is in place
Manchester (cont)
• We lack protocols and procedures for dealing with a situation in the same way each time it occurs
– Started writing them, from things as simple as switching off machines for maintenance
• Disaster recovery happens only when a disaster happens
• Irregular maintenance periods led to clashes with routine generator tests
• RT system used for communication with users, but also to log everything that is done in the T2
– Bad communication between sys admins has been a major problem
Sheffield
The main weak point for Sheffield is the limited physical access to the cluster: we have it 9-17 on weekdays only.
• We use quite expensive SCSI disks for exp-software; as they are expensive, we do not have a spare disk in case of failure, so we need some time to order one, plus to write all the experiment software back.
• The CE and the MON box have only one power supply and only one disk each.
• In future, perhaps a RAID1 system with 2 PSUs for the CE and the MON box. It would also be good to have a UPS.
• The DPM head node already has 2 PSUs and a RAID5 system with an extra disk.
• We have similar WNs, CE and MON box, so we can find spare parts. We have managed to achieve quite stable reliability.
General Status (1)

Site        Middleware  OS   SRM2.2  Space Tokens  SRM brand           CPU (kSI2K)  Storage (TB)  Used Storage (TB)  Storage usage %
Lancaster   Glite3.1    SL4  yes     yes           DPM                 1040         200           39.6               19%
Liverpool   Glite3.1    SL4  yes     yes           dCache -> DPM       559          130           13.7               10%
Manchester  Glite3.1    SL4  yes     yes           dCache/DPM/xrootd   2160         142/104/20    39.2               15%
Sheffield   Glite3.1    SL4  yes     yes           DPM                 182.5        25.6          4.5                17%
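The storage usage column is simply used storage over total storage. A quick sketch that reproduces the published figures for three of the sites (the figures come from the status table; truncating rather than rounding the percentage is an assumption inferred from the numbers):

```python
def usage_pct(used_tb, total_tb):
    # Truncate to a whole percentage, e.g. 13.7/130 -> 10%.
    return int(100.0 * used_tb / total_tb)

# (used TB, total TB) per site, from the General Status table.
sites = {
    "Lancaster": (39.6, 200.0),
    "Liverpool": (13.7, 130.0),
    "Sheffield": (4.5, 25.6),
}
for name, (used, total) in sites.items():
    print("%-10s %d%%" % (name, usage_pct(used, total)))
```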
General Status (2)
General Status (3)
General Status (4)
Conclusions
• As was written on the building sites of Milan's 3rd underground line: "We are working for you!"