Northgrid Status Alessandra Forti Gridpp24 RHUL 15 April 2010.
Northgrid Status
Alessandra Forti, Gridpp24, RHUL, 15 April 2010
Outline
• Apel pies
• Lancaster status
• Liverpool status
• Manchester status
• Sheffield
• Conclusions
Apel pie (1), (2), (3)
[APEL accounting pie charts; no text content transcribed]
Lancaster
• All WN moved to tarball
• Moving all nodes to SL5 solved the "sub-cluster" problems
• Deployed and decommissioned a test SCAS
– Will install glexec when users demand it
• In the middle of deploying a CREAM CE
• Finished tendering for the HEC facility
– Will give us access to 2500 cores
– Extra 280 TB of storage
– The shared facility has Roger Jones as director, so we have a strong voice for GridPP interests
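Since the tarball WNs above are still driven by YAIM, a minimal sketch of the site-level configuration may help; all values and hostnames below are placeholders, not Lancaster's real settings, and the exact node type for the tarball WN flavour should be taken from the YAIM documentation.

```
# Fragment of a YAIM site-info.def (placeholder values)
SITE_NAME=EXAMPLE-SITE
CE_HOST=ce.example.ac.uk
BATCH_SERVER=torque.example.ac.uk
WN_LIST=/opt/glite/yaim/etc/wn-list.conf

# Configuration is then applied with something like:
#   /opt/glite/yaim/bin/yaim -c -s site-info.def -n <WN node type>
```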
Lancaster
• Older storage nodes are being re-tasked• Tarball WN are working well but YAIM is
suboptimal to configure them• Maui continues to be weird for us
– Jobs blocking other jobs– Confused by multiple queues– Jobs don't use their reservations when they are
blocked
• Problems trying to use the same NFS server for experiment software and tarballs.– Now they have been split
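After the split described above, a worker node mounts the two areas from different NFS servers. A minimal sketch of what the mounts might look like (server names and paths are placeholders, not Lancaster's real layout):

```
# /etc/fstab fragment on a worker node after the split:
# experiment software and the tarball installation now come
# from separate NFS servers (placeholder names)
nfs1.example.ac.uk:/exports/exp_soft   /opt/exp_soft  nfs  ro,hard,intr  0 0
nfs2.example.ac.uk:/exports/tarball_wn /opt/glite_wn  nfs  ro,hard,intr  0 0
```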
Liverpool
• What we did (what we were supposed to do)
– Major hardware procurement
• 48 TB unit with 4 Gbit bonded link
• 7×4×8 units = 224 cores, 3 GB memory, 2×1 TB disk
– Scrapped some 32-bit nodes
– CREAM test CE running
• Other things we did
– General guide to capacity publishing
– Horizontal job allocation
– Improved use of VMs
– Grid use of slack local HEP nodes
Liverpool
• Things in progress
– Put CREAM in GOCDB (ready)
– Scrap all 32-bit nodes (gradually)
– Production runs of central computer cluster (other dept involved)
• Problems
– Obsolete equipment
– WMS/ICE fault at RAL
• What's next
– Install/deploy newly procured storage and CPU hardware
– Achieve production runs of central computing cluster
Manchester
• Since last time
– Upgraded WN to SL5
– Eliminated all dCache setup from the nodes
– RAID0 on internal disks
– Increased scratch area
– Unified two DPM instances
– 106 TB, 84 dedicated to ATLAS
– Upgraded to DPM 1.7.2
– Changed network configuration of data servers
– Installed squid cache
– Installed CREAM CE (still in test phase)
– Last HC test in March: 99% efficiency
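The squid cache mentioned above typically sits between the worker nodes and remote data services. A minimal squid.conf sketch, restricting the cache to the farm (the port, sizes, and subnet are placeholders, not Manchester's real settings):

```
# Minimal squid.conf sketch for a site cache (placeholder values)
http_port 3128
cache_mem 256 MB
maximum_object_size 64 MB

# Only the farm's worker nodes may use the cache
acl worker_nodes src 10.0.0.0/16
http_access allow worker_nodes
http_access deny all
```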
Manchester
• Major UK site in ATLAS production, second or third after RAL and Glasgow
• Last HC in March had 99% efficiency
• 80 TB almost empty
– Not many jobs
– But from the stats of the past few days real users also seem fine
[efficiency chart: 96%]
Manchester
• Tender
– European tender submitted 15/9/2009
– Vendors' replies should be in by 16/04/2010 (in two days)
– Additional GridPP3 money can be added
• Included a clause for an increased budget
– Minimum requirements: 4400 HEPSPEC / 240 TB
• Can be exceeded
• Buying only nodes
– Talking to the University about Green funding to replace what we can't replace
• Not easy
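To make the tender minimums above concrete, a small sketch that checks a hypothetical vendor bid against the 4400 HEPSPEC / 240 TB floor; the per-node figures in the example are made up for illustration, only the minimums come from the slide.

```python
# Tender minimums from the slide; the bid numbers below are hypothetical.
MIN_HEPSPEC = 4400
MIN_STORAGE_TB = 240

def bid_meets_minimum(nodes: int, hepspec_per_node: float, storage_tb: float) -> bool:
    """Return True if a bid reaches both tender minimums."""
    total_hepspec = nodes * hepspec_per_node
    return total_hepspec >= MIN_HEPSPEC and storage_tb >= MIN_STORAGE_TB

# Example: 60 nodes at 80 HEPSPEC each, plus 250 TB of storage
print(bid_meets_minimum(60, 80, 250))  # 4800 HEPSPEC, 250 TB -> True
```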
Sheffield
• Storage upgrade
– Storage moved to physics: 24/7 access
– All nodes running SL5, DPM 1.7.3
– 4×25 TB disk pools, 2 TB disks, RAID5, 4 cores
– Memory will be upgraded to 8 GB on all nodes
– 95% reserved for ATLAS
– XFS crashed; problem solved with an additional kernel module
– Sw server 1 TB (RAID1)
– Squid server
Sheffield
• Worker nodes
– 200 old 2.4 GHz, 2 GB, SL5
– 72 TB of local disk per 2 cores
– lcg-CE and MONBOX on SL4
– Additional 32 amp ring has been added
– Fiber link between CICS and physics
• Availability
– 97-98% since January 2008
– 94.5% efficiency in ATLAS
Sheffield Plans
• Additional storage
– 20 TB → brings the total to 120 TB for ATLAS
• Cluster integration
– Local HEP and UKI-NORTHGRID-SHEF-HEP will have joint WNs
– 128 CPU + 72 new nodes ???
– Torque server from the local cluster and lcg-CE from the grid cluster
– Need 2 days DT; waiting for ATLAS approval
– CREAM CE installed; waiting to complete cluster integration