Tier1 - Andrew Sansum, GRIDPP 10, June 2004

Transcript of slides:

Slide 1: Tier1, Andrew Sansum, GRIDPP 10, June 2004

Slide 2: Production Service for HEP (PPARC)
- GRIDPP (2001-2004): GridPP will enable testing of a prototype Grid of significant scale, providing resources for the LHC experiments ALICE, ATLAS, CMS and LHCb, the US-based experiments BaBar, CDF and D0, and lattice theorists from UKQCD
- Tier1 provides access to large-scale compute resources for experiments
- Tier A service for BaBar physics analysis
- LHC data challenges
- Support for a wide range of prototype PP GRID software (e.g. Certificate Authority)
- Close involvement in the European DataGrid project EDG (many testbeds)
- GRIDPP2 (2004-2007): from prototype to production
- Close engagement in the LCG project, preparing for LHC startup
- Continue to provide the Tier A centre for BaBar
- EGEE resource and member of the EGEE testbed
- Ramp up to a production-quality GRID service
- Gradually move to GRID-only access

Slide 3: Tier1 in GRIDPP2 (2004-2007)
- The Tier-1 Centre will provide GRIDPP2 with a large computing resource of a scale and quality that can be categorised as an LCG Regional Computing Centre
- January 2004: GRIDPP2 confirms RAL as host of the Tier1 service
- GRIDPP2 to commence September 2004
- Tier1 hardware budget: 2.3M over 3 years
- Staff increase from 12.1 to 13.5 (+3 CCLRC) by September

Slide 4: So What Exactly is a Tier1?
The Tier1 will differentiate itself from the Tier2s by:
- Providing data management at high QoS, able to host primary/master copies of data
- Providing state-of-the-art network bandwidth
- Contributing to collaborative services/core infrastructure
- Providing high-quality technical support
- Responding rapidly to service faults
- Being able to make long-term service commitments

Slide 5: Tier1 Staffing
- Manage: project/planning/policy/finance
- Disks: servers/filesystems
- CPU: farms and farm systems
- Tapes: robot and interfaces
- Core: critical systems (Oracle/MySQL/AFS/Home/Monitoring)
- Operations: machine rooms/tape ops/interventions
- Network: site infrastructure/Tier1 LAN support
- Support: experiments and their services; deploy Tier1 and UK GRID interfaces
- Hardware: fix systems/hardware support
- Effort allocations of 1, 2, 1.5 and 2.5 FTE are spread across these areas

Slide 6: Current Tier1 Hardware
- CPU: 350 dual-processor Intel PIII and Xeon servers, mainly rack-mounted; about 400 KSI2K
- Disk service: mainly a standard configuration of dual-processor server, dual-channel SCSI interconnect, external IDE/SCSI RAID arrays (Accusys and Infortrend) and ATA drives (mainly Maxtor); about 80TB of disk; cheap and (fairly) cheerful
- Tape service: STK Powderhorn 9310 silo with 8 9940B drives

Slide 7: Network
[Diagram: firewall and site router connecting SuperJanet, the test network (e.g. MBNG) and the rest of the site to the production subnet/VLAN and test subnet/VLAN of servers and workers on the site routable network]

Slide 8: Network
[Diagram: a dedicated Tier1 network behind the firewall and site router, carrying production and test VLANs of servers and workers, alongside SuperJanet, the test network (e.g. MBNG) and the rest of the site]

Slide 9: UKlight
- Connection to RAL in September
- Funded to end 2005, after which it probably merges with SuperJanet 5
- 2.5Gb/s now, 10Gb/s from 2006
- Effectively a dedicated lightpath to CERN
- Probably not for Tier1 production, but suitable for LCG data challenges etc., building experience for the SuperJanet upgrade
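To put the UKlight figures above in context, here is a rough, illustrative transfer-time estimate at the quoted 2.5Gb/s and 10Gb/s link rates. The 10TB dataset size and the 70% usable-bandwidth figure are assumptions made for the sake of the example, not numbers from the talk.

    # Back-of-envelope transfer times over the UKlight lightpath.
    # Link rates (2.5 and 10 Gb/s) are from the slide; the dataset size
    # and the 70% usable-bandwidth assumption are illustrative only.
    def transfer_time_hours(dataset_tb, link_gbps, efficiency=0.7):
        bits = dataset_tb * 1e12 * 8                 # dataset size in bits
        usable_rate = link_gbps * 1e9 * efficiency   # usable bits per second
        return bits / usable_rate / 3600             # seconds -> hours

    for gbps in (2.5, 10.0):
        hours = transfer_time_hours(10, gbps)
        print(f"10 TB at {gbps} Gb/s: about {hours:.1f} hours")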
Slide 10: New Hardware Arrives 7th June
- CPU capacity (500 KSI2K): 256 dual-processor 2.8GHz Xeons, 2/4GB memory, 120GB HDA
- Disk capacity (140TB): Infortrend EonStor SATA/SCSI RAID arrays, 16 x 250GB Western Digital SATA drives per array, two arrays per server (worked through in the capacity sketch after the slides)

Slide 11: Planned Ramp-up
[Chart of the planned capacity ramp-up]

Slide 12: Next Delivery
- Need it in production by the end of the year
- The original schedule of a December delivery seems late
- Will have to start very soon, leaving less chance for testing / new technology
- Exact proportions not agreed, but roughly: 400 KSI2K (300-400 CPUs), 160TB disk, 120TB tape??, network infrastructure?, core servers (H/A??), Red Hat?
- The long-range plan needs reviewing; we also need long-range experiment requirements

Slide 13: CPU Capacity
[Chart of CPU capacity]

Slide 14: Tier1 Disk Capacity (TB)
[Chart of Tier1 disk capacity in TB]

Slide 15: Forthcoming Challenges
- Simplify the service: less duplication
- Improve storage management
- Deploy new fabric management
- Red Hat Enterprise 3 upgrade
- Network upgrade/reconfigure????
- Another procurement/install
- Meet the challenge of LCG professionalism
- LCG data challenges

Slide 16: Clean Up the Spaghetti Diagram
- Simplify interfaces: fewer GRIDs; more is not always better
- How to phase out the Classic service..

Slide 17: Storage: Plus and Minus
- ATA and SATA drives: 2.5% failure per annum, OK
- External RAID arrays: good architecture, choose well
- SCSI interconnect: surprisingly unreliable, change
- Ext2 filesystem: OK, but need a journal; XFS?
- Linux O/S: move to Enterprise 3
- NFS/Xrootd/http/gridftp/bbftp/srb/...: must have SRM
- No SAN: need a SAN (Fibre Channel or iSCSI)
- No management layer: need virtualisation/dCache..
- No HSM: ????

Slide 18: Fabric Management
- Currently run: Kickstart with cascading config files; SURE exception monitoring; Automate for automatic interventions
- Running out of steam with the old systems
- Only 800 systems, but many, many flavours
- Evaluating Quattor: no obvious alternatives, probably deploy
- Less convinced by Lemon, a bit early; running Nagios in parallel (see the check sketch after the slides)

Slide 19: Conclusions
- After several years of relative stability, we must start re-engineering many Tier1 components
- Must start to rationalise support: a limited set of interfaces, operating systems and testbeds; simplify so we can do less, better
- LCG is becoming a big driver: service commitments, increased resilience and availability, data challenges and the move to a steady state
- Major reality check in 2007!
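To make the disk numbers on slides 10 and 12 concrete, the sketch below works out raw capacity per array and per server from the stated configuration (16 x 250GB SATA drives per array, two arrays per server) and estimates how many servers and drives a 140TB or 160TB tranche implies, together with expected annual drive failures at the quoted 2.5% rate. This is raw capacity only; usable space after RAID parity and filesystem overheads would be lower.

    import math

    # Figures from the slides: 16 x 250 GB SATA drives per RAID array,
    # two arrays per disk server, 2.5% drive failures per annum.
    DRIVES_PER_ARRAY = 16
    DRIVE_TB = 0.25
    ARRAYS_PER_SERVER = 2
    ANNUAL_FAILURE_RATE = 0.025

    def servers_needed(target_raw_tb):
        """Disk servers needed to reach the target raw capacity."""
        per_server_tb = DRIVES_PER_ARRAY * DRIVE_TB * ARRAYS_PER_SERVER
        return math.ceil(target_raw_tb / per_server_tb)

    for target in (140, 160):  # June delivery and proposed next delivery
        servers = servers_needed(target)
        drives = servers * ARRAYS_PER_SERVER * DRIVES_PER_ARRAY
        print(f"{target} TB raw: {servers} servers, {drives} drives, "
              f"~{drives * ANNUAL_FAILURE_RATE:.0f} expected drive failures/year")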
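Slide 18 mentions SURE exception monitoring with Nagios now running in parallel. As a hedged illustration of that style of check, and not the Tier1's actual tooling, here is a minimal disk-usage probe written in the shape of a Nagios plugin, using the conventional exit codes 0/1/2 for OK/WARNING/CRITICAL; the mount point and thresholds are placeholders.

    #!/usr/bin/env python
    # Minimal Nagios-style disk-usage check: a sketch of the kind of
    # exception test run on farm nodes, not the actual SURE/Nagios config.
    import os
    import sys

    def check_disk(path="/", warn=0.80, crit=0.90):
        st = os.statvfs(path)                        # filesystem statistics
        used = 1.0 - st.f_bavail / float(st.f_blocks)
        if used >= crit:
            print(f"CRITICAL: {path} is {used:.0%} full")
            return 2
        if used >= warn:
            print(f"WARNING: {path} is {used:.0%} full")
            return 1
        print(f"OK: {path} is {used:.0%} full")
        return 0

    if __name__ == "__main__":
        sys.exit(check_disk())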