
  • Prague Tier-2 operations

    ● Tomáš Kouba, Miloš Lokajíček

    ● GRID 2012, Dubna, 16.7.2012

  • Outline

    ● Who we are, our users
    ● New HW
    ● Services
    ● Internal network
    ● External connectivity
    ● IPv6 testbed

  • Who we are, our users

    ● Who we are
    – Regional Computing Centre for Particle Physics, Institute of Physics of the Academy of Sciences of the Czech Republic
    – basic research in particle physics, solid state physics and optics

    ● Our users
    – scientists from our institute and other institutes of the Academy
    – Charles University, Czech Technical University
    – WLCG (ATLAS, ALICE), EGI (AUGER, CTA), D0 grid

  • WLCG grid structure

    [Diagram: the Prague Tier-2 attached to the KIT Tier-1 and to CERN (Tier-0/1); Taipei and BNL (FNAL) serve as backup Tier-1s; local Tier-3 centres at MFF, FJFI and ÚJF; other Tier-2s reached over the Internet; disk space and computing capacity are indicated.]

    ● Next year goal:
    – support for Tier3 centers
    – user support

  • Capacities over time

                 HEPSPEC2006     %     TB disk             %
    2009             10 340            186
    2010             19 064    100     427               100
    2011             23 484    100     1 714             100
      D0              9 331     40     35                  2
      ATLAS           6 796     29     1 316 (16 MFF)     77
      ALICE           7 357     31     363 (60 Řež)       21
    2012             29 192    100     2 521             100
      D0              9 980     34     35                  1
      ATLAS          11 600     40     1 880 (16 MFF)     74
      ALICE           7 612     26     606 (100 Řež)      24

    (Figures in parentheses: capacity located at MFF / Řež.)
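The percentage columns are simply each experiment's share of that year's totals. A quick cross-check of the 2012 HEPSPEC2006 shares, using only the numbers from the table above:

```python
# Cross-check of the 2012 HEPSPEC2006 shares quoted in the table above.
total_hs06 = 29192
vo_hs06 = {"D0": 9980, "ATLAS": 11600, "ALICE": 7612}

for vo, hs06 in vo_hs06.items():
    print(f"{vo:6s} {100 * hs06 / total_hs06:3.0f} % of HEPSPEC2006")
# prints 34 % (D0), 40 % (ATLAS), 26 % (ALICE), matching the % column
```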

  • New HW in 2012

    • Worker nodes:
    – 23 nodes SGI Rackable C1001-G13
    – 2x Opteron 6274 (16 cores), 64 GB RAM, 2x SAS 300 GB
    – 374 W (full load), more than 5000 HEPSpec in total
    – delivered in a water-cooled rack

    • Disk servers
    – 4 Supermicro nodes (4 servers + 3 JBODs)
    – 837 TB in total (400 TB still delayed because of floods)

    • Infrastructure servers
    – 2x DL360 G7 (Hyper-V server, NFS server)

    • UPS PowerWare 9390 (aka Eaton)
    – 2x 100 kW, energy-saving mode (offline => 98% efficiency)
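A back-of-the-envelope check of the worker-node delivery, using only the figures quoted above (23 nodes, 374 W at full load, more than 5000 HEPSpec in total):

```python
# Rough totals for the 2012 worker-node purchase (numbers from this slide).
nodes = 23
watts_per_node = 374          # measured at full load
hepspec_total = 5000          # "more than 5000 HEPSpec in total"

total_kw = nodes * watts_per_node / 1000.0
print(f"full-load draw   : {total_kw:.1f} kW")                                 # ~8.6 kW
print(f"HEPSpec per node : {hepspec_total / nodes:.0f}")                       # ~217
print(f"HEPSpec per watt : {hepspec_total / (nodes * watts_per_node):.2f}")    # ~0.58
```

So the whole delivery draws under 9 kW at full load, well within the capacity of the 2x 100 kW UPS.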

  • Good sealing crucial

    [Graph: traces for the disk servers (shown switching on and off, divider added), the worker nodes and rubus01.]

  • Services

    ● Batch system: Torque/Maui

    ● UMD services

    – 2x CreamCE

    – MONBox

    – SE DPM (1x head node, 15 disk nodes)

    ● VO specific

    – AUGER dashboard

    – squid (for cvmfs and frontier – ATLAS)

    – VOBOX (ALICE)

    – 2x SAM station (D0)

    ● All nodes are installed automatically over the network (PXE, kickstart, a simple script to finish the installation)

    ● All further configuration performed by CFengine (version 2)

    – We are evaluating Puppet

    ● New services in 2012:

    – CVMFS (problem with full disks, direct access to CERN stratum 1; see the cache-space sketch after this list)

    – UMD worker nodes

    – perfsonar
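On the "full disks" problem with CVMFS: below is a minimal sketch of the kind of cache-space check a worker node could run. The cache path /var/lib/cvmfs and the 10 % free-space threshold are assumptions for illustration; on a real node the location comes from CVMFS_CACHE_BASE in /etc/cvmfs/default.local.

```python
#!/usr/bin/env python3
# Minimal CVMFS cache-space check for a worker node (sketch only).
import shutil
import sys

CACHE_BASE = "/var/lib/cvmfs"   # assumed cache location (CVMFS_CACHE_BASE)
MIN_FREE_FRACTION = 0.10        # arbitrary threshold: warn below 10 % free

usage = shutil.disk_usage(CACHE_BASE)
free_fraction = usage.free / usage.total
print(f"{CACHE_BASE}: {usage.used / 2**30:.1f} GiB used, "
      f"{100 * free_fraction:.1f} % free")

if free_fraction < MIN_FREE_FRACTION:
    print("WARNING: CVMFS cache partition nearly full", file=sys.stderr)
    sys.exit(1)
```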

  • Monitoring

    ● Nagios
    – health of hardware, systems, SW, syslog monitor, SNMP traps
    – important errors by e-mail and SMS, the rest in consolidated mails 3 times per day
    – 7000 services on 466 hosts (a minimal per-state tally is sketched below)
    – WLCG data transfers, job execution
    – Multisite – alternative user interface, mass operations on groups of nodes
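With roughly 7000 services on 466 hosts, a quick per-state tally is often enough to spot trouble. A minimal sketch that counts Nagios service states by parsing status.dat (the file path is an assumption; it differs between installations):

```python
#!/usr/bin/env python3
# Count Nagios services per state by parsing the status.dat file (sketch).
from collections import Counter

STATUS_FILE = "/var/nagios/status.dat"   # assumed location, site-dependent
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

counts = Counter()
in_service = False
with open(STATUS_FILE) as f:
    for raw in f:
        line = raw.strip()
        if line == "servicestatus {":
            in_service = True
        elif line == "}":
            in_service = False
        elif in_service and line.startswith("current_state="):
            state = int(line.split("=", 1)[1])
            counts[STATE_NAMES.get(state, "UNKNOWN")] += 1

for name, n in counts.most_common():
    print(f"{name:8s} {n}")
```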

  • Multisite Nagios UI

  • Netflow – network monitoring

    ● FlowTracker, FlowGrapher
    ● Useful for troubleshooting problems retrospectively
    – e.g. finding the reason for poor ALICE efficiency at our site (a toy aggregation is sketched below)
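FlowTracker and FlowGrapher present this graphically; the underlying idea is just aggregating flow records. A toy example that ranks hosts by transferred bytes from flows exported to CSV (the file name and the src_ip/bytes column names are assumptions for the sketch, not the collector's real export format):

```python
#!/usr/bin/env python3
# Rank hosts by transferred bytes from exported flow records (toy sketch).
import csv
from collections import Counter

FLOW_CSV = "flows.csv"   # hypothetical CSV export with src_ip,dst_ip,bytes columns

bytes_per_src = Counter()
with open(FLOW_CSV) as f:
    for row in csv.DictReader(f):
        bytes_per_src[row["src_ip"]] += int(row["bytes"])

print("Top talkers:")
for ip, nbytes in bytes_per_src.most_common(10):
    print(f"{ip:15s} {nbytes / 2**30:8.2f} GiB")
```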

  • Internal network

    ● CESNET upgraded our main Cisco router
    – 6506 -> 6509
    – supervisor SUP720 -> SUP2T
    – new 8x 10G X2 card
    – planned upgrade of power supplies 2x 3 kW -> 2x 6 kW
    – (2 cards 48x 1 Gbps, 1 card 4x 10 Gbps, FW service module)

    ● FWSM upgraded to support IPv6

    ● MTU increased to 9000 during spring
    – experienced problems with ATLAS data transfers
    – ICMP fragmentation-needed messages were suppressed
    – fixed on the main router (a simple path-MTU check is sketched below)
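A simple way to check that the 9000-byte MTU holds along a path is to send datagrams with the Don't Fragment bit set and see which sizes get through. A minimal Linux-only sketch follows; the target host is a placeholder, and the socket option values are filled in by hand in case the running Python does not export them:

```python
#!/usr/bin/env python3
# Path-MTU probe: send UDP datagrams with DF set; oversized ones fail with
# EMSGSIZE instead of being fragmented.  Linux-specific sketch.
import socket

TARGET = ("remote-host.example.org", 9)   # placeholder host, discard port
PAYLOADS = [1472, 8972]                   # 1500-byte and 9000-byte IP datagrams

IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)  # Linux constant
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)     # set DF, never fragment

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)

for size in PAYLOADS:
    try:
        sock.sendto(b"\x00" * size, TARGET)
        print(f"{size + 28}-byte datagram sent without fragmentation")
    except OSError as exc:
        print(f"{size + 28}-byte datagram rejected ({exc}); path MTU is smaller")
```

Note that the kernel only learns a smaller path MTU from ICMP "fragmentation needed" replies, which is exactly what was being suppressed here; in practice `ping -M do -s 8972 <host>` gives the same check from the command line.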

  • Central router (Cisco 6509)

  • External connectivity

    ● Exclusive: 1 Gbps (to FZK) + 10 Gbps (CESNET)
    ● Shared: 10 Gbps (PASNET – GEANT)

    [Graphs: traffic on the FZU -> FZK, FZK -> FZU and PASNET links.]

    ● Not enough for the ATLAS T2D limit (5 MB/s to/from T1s)
    ● Perfsonar installed

  • External connectivity

  • LHCONE - LHC Open Network Environment

    ● New concept to connect a T2 to other T1s and T2s
    ● Tier1 (11), Tier2 (130) and Tier3 sites all over the world
    ● Initially a hierarchical model: each T2 communicates with one T1
    ● T1s interconnected with the private, redundant optical LHCOPN
    ● Change from the hierarchical to a flat model

    [Diagram: hierarchical model with each T2 attached to a single T1 vs. flat model with T1s and T2s interconnected.]

  • LHCONE cont.

    ● LHCONE is complementary to the well-working LHCOPN
    ● LHCONE carries only LHC data
    ● Realized via an L3 VPN using VRF
    ● Under construction
    – Esnet, Internet2, Geant+NREN, Nordunet, USLHCnet, Surfnet, ASGC, CERN

    ● Evaluation and further improvements in 2013
    ● Our implementation and HW requirements are being discussed with CESNET

  • IPv6 testing

    ● We participate in the HEPiX IPv6 testbed (we focus on an IPv6-only setup)

    ● HW status (so far tested)

    – switches have no problem with IPv6 (only 2 of them can be managed over IPv6)

    – firewall upgrade was needed

    – none of our servers' management interfaces supports IPv6

    – none of the facilities monitored over SNMP supports IPv6 (air conditioning, thermometers, UPS, water cooling unit)

    – none of the disk arrays' management interfaces supports IPv6

    ● DNS, DHCPv6 running fine

    ● NTP server runs fine (lack of stratum 1 NTP servers with IPv6 connectivity)

    ● Many problems with automatic installation (SL5 is simply not ready for IPv6)
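For the IPv6-only setup, a first sanity check is simply which services publish AAAA records and accept connections over IPv6. A minimal sketch; the host names and ports below are placeholders, not our real node names:

```python
#!/usr/bin/env python3
# Check which hosts have AAAA records and accept TCP over IPv6 (sketch).
import socket

HOSTS = [
    ("se1.example.org", 443),     # e.g. a storage element (placeholder)
    ("ce1.example.org", 8443),    # e.g. a CreamCE (placeholder)
    ("ntp1.example.org", None),   # AAAA lookup only
]

for host, port in HOSTS:
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        print(f"{host}: no AAAA record")
        continue
    addr = infos[0][4][0]
    if port is None:
        print(f"{host}: AAAA {addr}")
        continue
    try:
        with socket.create_connection((addr, port), timeout=5):
            print(f"{host}: reachable over IPv6 at [{addr}]:{port}")
    except OSError as exc:
        print(f"{host}: AAAA {addr}, but connect failed ({exc})")
```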

  • IPv6 testing cont.

    ● Running middleware needs regular CRL updates
    – we developed a tool to test CRL availability over IPv6 (an illustrative sketch follows below)

    ● IPv6 testing project was partially supported by CESNET, project number 416R1/2011.
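As an illustration of the CRL check mentioned above (this is not the actual tool, and the CRL URL is a placeholder): resolve only AAAA records for the CRL host and fetch the file over IPv6, so an IPv4-only distribution point shows up as a failure.

```python
#!/usr/bin/env python3
# Fetch a CRL forcing the connection over IPv6 (illustrative sketch only).
import http.client
import socket
from urllib.parse import urlparse

CRL_URL = "http://crl.example.org/ca/example-ca.crl"   # placeholder URL

parsed = urlparse(CRL_URL)
# Resolve AAAA records only; an IPv4-only CRL host raises socket.gaierror here.
infos = socket.getaddrinfo(parsed.hostname, parsed.port or 80,
                           socket.AF_INET6, socket.SOCK_STREAM)
addr = infos[0][4][0]

conn = http.client.HTTPConnection(addr, parsed.port or 80, timeout=10)
conn.request("GET", parsed.path or "/", headers={"Host": parsed.hostname})
resp = conn.getresponse()
print(f"{CRL_URL}: HTTP {resp.status} over IPv6 [{addr}], {len(resp.read())} bytes")
```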
