WLCG Tier-2 site in Prague: a little bit of history, current status and future perspectives

Dagmar Adamova, Jiri Chudoba, Marek Elias, Lukas Fiala, Tomas Kouba, Milos Lokajicek, Jan Svec

Prague, 04.09.2014

Outline

Introducing the WLCG Tier-2 site in Prague

A couple of history flashbacks: we celebrate the 10th anniversary

Current issues

Summary and Outlook

HEP Computing in Prague: site praguelcg2 (a.k.a. the farm GOLIAS)

• A national computing center for processing data from various HEP experiments
– Located in the Institute of Physics (FZU) in Prague
– Basic infrastructure already in 2002, but
– OFFICIALLY STARTED IN 2004: 10th ANNIVERSARY THIS YEAR
• Certified as a Tier-2 center of the LHC Computing Grid (praguelcg2)
– Collaboration with several Grid projects.
• April 2008: WLCG MoU signed by the Czech Republic (ALICE + ATLAS).
• Excellent network connectivity: multiple dedicated 1 – 10 Gb/s connections to collaborating institutions. Connected to LHCONE.
• Provides computing services for ATLAS + ALICE, D0, solid state physics, Auger, STAR ...
• Started in 2002 with:
– 32 dual PIII 1.2 GHz, 1 GB RAM, 18 GB SCSI HDD, 100 Mb/s Ethernet rack servers (29 of these decommissioned in 2009)
– Storage: disk array 1 TB, HP server TC4100

History: 2002 -> 2014

Current numbers

• 1 batch system (Torque + Maui)
• 2 main WLCG VOs: ALICE, ATLAS
– FNAL's D0 (dzero) user group
– Other VOs: Auger, STAR
• ~ 4000 cores published in the Grid
• ~ 3 PB on new disk servers (DPM, XRootD, NFS)
• Regular yearly upscale of resources on the basis of various financial supports, mainly academic grants.
• The WLCG services include:
– APEL publisher, Argus authorization service, BDII, several UIs, ALICE VOBOX, CREAM CEs, Storage Elements
• The use of virtualization at the site is quite extensive.
• ALICE disk XRootD Storage Element ALICE::Prague::SE
– ~ 1.113 PB of disk space in total
– Redirector/client + 3 clients @ FZU, 5 clients @ NPI Rez: a distributed storage cluster

Site Usage

• ATLAS and ALICE – continuous production
• Other projects – shorter campaigns

[Plots: site usage by ALICE and by ATLAS]

Some history flashbacks (celebrating the 10th anniversary)

ALICE PDC 2004 resource statistics: 14 sites

ALICE 2014 resource statistics: 74 sites

ALICE PDC resources statistics - 2005

25 sites in operation

Running jobs (8 November 2005)

Farm       Min   Avg   Max
Sum       1160  1651  1771
CCIN2P3    134   210   231
CERN-L     268   286   304
CNAF       255   362   394
FZK          0   531   600
Houston      0     3    14
Münster      2    58    81
Prague      43    61    71
Sejong       2     2     2
Torino      33    41    43
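
As a rough illustration of how to read the running-jobs table above, the following Python sketch computes each farm's share of the average number of running jobs, relative to the "Sum" row. The numbers are copied from the table; the function name `share_percent` is only illustrative. The listed farms do not add up to the "Sum" value, so it presumably also covers farms not shown in the table.

```python
# Average running jobs per farm on 8 November 2005, copied from the table above.
AVG_RUNNING_JOBS = {
    "CCIN2P3": 210,
    "CERN-L": 286,
    "CNAF": 362,
    "FZK": 531,
    "Houston": 3,
    "Münster": 58,
    "Prague": 61,
    "Sejong": 2,
    "Torino": 41,
}

# "Avg" value of the "Sum" row. The farms listed above do not add up to it,
# so it is assumed here to include farms that the table does not show.
TOTAL_AVG = 1651


def share_percent(farm, total=TOTAL_AVG):
    """Return the farm's share of the average running jobs, in percent."""
    return round(100.0 * AVG_RUNNING_JOBS[farm] / total, 1)


if __name__ == "__main__":
    # Print the farms ordered by their average number of running jobs.
    for farm in sorted(AVG_RUNNING_JOBS, key=AVG_RUNNING_JOBS.get, reverse=True):
        print(f"{farm:8s} {share_percent(farm):5.1f} %")
```

With these inputs, Prague accounts for about 3.7 % of the average running jobs on that day.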

2006
• ALICE vobox set-up
– fixing problems with the vobox proxy (unwanted expirations)
– AliEn services set up
– manually changing the RBs used by the JAs
• Successful participation in the ALICE PDC'06:
– Prague site delivered ~ 5% of total computing resources (6 Tier1s, 30 Tier2s)
• Problems with the fair-share of the site local batch system (then PBSPro)

2007
• Still problems with functioning of the ALICE vobox-proxy
• During the PDC'07, problems with job submission due to malfunctions of the default RBs: the failover submission configured
• Prague site delivered ~ 2.6% of total computing resources (significant increase of the number of Tier-2s)

2008
• migration to gLite3.1 ALICE vobox on 64-bit SLC4 machine
• upgrade of the local CE serving ALICE to lcg-CE 3
• repeating problems with job submission through RBs in Oct.: the site re-configured for the WMS submission
• migration to the Torque batch system on a part of the site: some WNs on 32bit in PBS and some on 64bit in Torque

2009
• installation and tuning of the creamCE
• hybrid state:
– ‘glite’ vobox and WNs, 32bit
– ‘cream’ vobox submitting JAs directly to creamCE, Torque, 64bit
– Dec: ALICE jobs submitted only to the creamCE

2010
• creamCE 1.6 / gLite 3.2 / SL5 64bit installed in Prague: we were the first ALICE Tier-2 where cream1.6 was tested and put in production
• NGI_CZ set in operation

2011
• Start of IPv6 implementation
• The site router got an IPv6 address
• Routing set-up in special VLANs
• ACLs directly implemented in the router
• IPv6 address configuration: DHCPv6
• Set-up of an IPv6 testbed

2012
• Optimization of the ALICE XRootD storage cluster performance
• An extensive tuning of the cluster, motivated by a remarkably different performance of the individual machines:
– data was migrated from the machine to be tuned to free disk arrays at another machine of the cluster.
– the migration procedure was done so that the data was accessible all the time.
– the empty machine was re-configured.
– number of disks in one array reduced.
– set-up of disk failure monitoring.
– RAID controller cache carefully configured.
– readahead option set to a multiple of (stripe_unit * stripe_width) of the underlying RAID array.
– no partition table used, to ensure proper alignment of the file systems: they were created with the right geometry options ("-d su=X,sw=Y" mkfs.xfs switches).
– mounting performed with the noatime option.
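
The readahead and geometry rules in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stripe unit (64 KiB), the stripe width (8 data disks), the device name and the mount point are hypothetical values, not the ones used at the site, and the helper names are invented here. The sketch only builds the command lines (`mkfs.xfs -d su=...,sw=...`, `mount -o noatime`) and computes a readahead value in 512-byte sectors, the unit expected by `blockdev --setra`; it does not run anything.

```python
def full_stripe_kib(stripe_unit_kib, stripe_width):
    """Size of one full RAID stripe: stripe_unit * stripe_width (in KiB)."""
    return stripe_unit_kib * stripe_width


def readahead_sectors(stripe_unit_kib, stripe_width, multiple=1):
    """Readahead as a multiple of the full stripe, in 512-byte sectors."""
    if multiple < 1:
        raise ValueError("multiple must be at least 1")
    stripe_bytes = full_stripe_kib(stripe_unit_kib, stripe_width) * 1024
    return multiple * stripe_bytes // 512


def mkfs_xfs_command(device, stripe_unit_kib, stripe_width):
    """Build (not run) an mkfs.xfs call with explicit RAID geometry."""
    geometry = f"su={stripe_unit_kib}k,sw={stripe_width}"
    return ["mkfs.xfs", "-d", geometry, device]


def mount_command(device, mountpoint):
    """Build (not run) a mount call using the noatime option."""
    return ["mount", "-o", "noatime", device, mountpoint]


if __name__ == "__main__":
    # Hypothetical example: 64 KiB stripe unit, 8 data disks, whole device
    # used without a partition table.
    device, mountpoint = "/dev/sdb", "/data01"
    print(" ".join(mkfs_xfs_command(device, 64, 8)))
    print(" ".join(mount_command(device, mountpoint)))
    print("blockdev --setra", readahead_sectors(64, 8, multiple=2), device)
```

For the example geometry, one full stripe is 512 KiB, so a readahead of two stripes corresponds to 2048 sectors.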

Parameters of one of the optimized XRootD servers before and after tuning

2013
• Almost all machines migrated to SL6
• CVMFS installed on all machines
• Connected to LHCONE

praguelcg2 contribution to WLCG Tier-2 ATLAS+ALICE computing resources

• http://accounting.egi.eu/
• A long-term decline due to problems with financial support

Current issues

Monitoring issues

• A number of monitoring tools in use: NAGIOS, MUNIN, GANGLIA, MRTG, NETFLOW, Gstat, MonALISA
• Nagios:
– IPv6-only or IPv4-only servers connected to the central dual-stack node via Livestatus
– Some checks can be run from IPv4-only or IPv6-only Nagios nodes
• MUNIN2:
– current version 2.0.19
– IPv6 in testing
• Ganglia:
– problems if the proper gai.conf is not present
– gmetad doesn't bind to IPv6 address on aggregators
• NetFlow:
– plan to switch from v5 to v9 to use nfdump + nfsen
• Some new sensors are needed to fully deploy IPv6; some additional work is necessary
• MonALISA repository:
– A simple test version installed, plans for future development

Network monitoring – weathermap

LHCONE link is heavily utilized (capacity 10 Gbps)

Nagios for alerts

Network architecture at FZU

IPv6 deployment

[Plots: outgoing IPv4 local traffic from DPM servers; outgoing IPv6 local traffic from DPM servers]

• Currently on dual-stack: DPM headnode, all production disk nodes, all but 2 subclusters of WNs
• Over IPv6 goes: dpns between disk nodes and headnode, srm between WNs and headnode, actual data transfer via gridftp
• IPv6 enabled on the ALICE vobox
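
To make the dual-stack terminology above concrete, here is a minimal Python sketch that classifies a node as dual-stack, IPv4-only or IPv6-only from the list of addresses configured on it. The function name `stack_type` is invented for this illustration, and the example addresses come from the ranges reserved for documentation (192.0.2.0/24 and 2001:db8::/32); they are not addresses of the site.

```python
import ipaddress


def stack_type(addresses):
    """Classify a node by the IP versions of its configured addresses."""
    versions = {ipaddress.ip_address(addr).version for addr in addresses}
    if versions == {4, 6}:
        return "dual-stack"
    if versions == {4}:
        return "IPv4-only"
    if versions == {6}:
        return "IPv6-only"
    return "no address"


if __name__ == "__main__":
    # Hypothetical nodes with documentation-range addresses.
    nodes = {
        "headnode": ["192.0.2.10", "2001:db8::10"],
        "legacy-wn": ["192.0.2.20"],
        "testbed-wn": ["2001:db8::20"],
    }
    for name, addrs in nodes.items():
        print(name, "->", stack_type(addrs))
```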

Site services management

• Since 2008 services management done with CFEngine version 2
– cfagent Nagios sensor developed: a Python script checking CFEngine logs for fresh records (signals an error if the log is too old)

• CFEngine v2 used for production

• Puppet used for IPv6 testbed

• Migration to the overall Puppet management in progress
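
The idea behind the cfagent Nagios sensor mentioned above can be sketched as follows. This is not the script used at the site: it is a minimal illustration that assumes the file modification time is a good enough proxy for "fresh records", and the function name, log path and age threshold are hypothetical. It follows the standard Nagios plugin convention of exit codes 0 (OK), 2 (CRITICAL) and 3 (UNKNOWN).

```python
import os
import sys
import time

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_log_freshness(path, max_age_s, now=None):
    """Return (code, message) depending on how recently the log was written.

    Assumption: the modification time of the file stands in for the
    timestamp of the latest log record.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - cannot read {path}: {exc}"
    if age > max_age_s:
        return CRITICAL, f"CRITICAL - {path} last written {age:.0f} s ago"
    return OK, f"OK - {path} last written {age:.0f} s ago"


# Hypothetical command-line use: <script> <logfile> <max_age_seconds>
if __name__ == "__main__" and len(sys.argv) == 3:
    code, message = check_log_freshness(sys.argv[1], float(sys.argv[2]))
    print(message)
    sys.exit(code)
```

Nagios would run such a script periodically on each managed node and raise an alert when the configuration agent has stopped writing to its log.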

NGI_CZ

• Since 2010, NGI_CZ is recognized and in operation: https://wiki.metacentrum.cz/metawiki/NGI_CZ#Farma_golias_aka_praguelcg2
– all the events and relevant information about praguelcg2
• 2 sites involved: praguelcg2 and prague_cesnet_lcg2
• significant part of the services provided by the praguelcg2 team

Services provided by NGI_CZ for the EGI infrastructure:
• Accounting (APEL, DGAS, Cesga portal)
• Resources database (GOC DB)
• Operations - https://operations-portal.egi.eu/
• ROD (Regional Operator on Duty)
• Top level BDII
• VOMS servers
• Meta VO
• User support (GGUS/RT) - https://rt4.cesnet.cz/rt/

Middleware versions: UMD 3.0.0, EMI 3.0

Use of external resources

• Not much really to choose from
• Longer term usage of the cluster ‘skurut’ in Prague: site prague_cesnet_lcg2, courtesy of CESNET association
– a long-time established cooperation
• NGI_CZ provided a single opportunity to use ~ 35 TB disk storage in Pilsen
– for testing purposes mostly
– dCache manager used
– Evaluating the effect of switching/tuning TTreeCache, dCap RA
– Not much of help as an extension of home resources

Summary and Outlook

• The Prague Tier-2 site has been performing as a distinguished member of the WLCG collaboration for 10 years now
– A stable upscale of resources
– High-level accessibility, reliable delivery of services, fast response to problems
• In the upcoming years, we will do our best to keep up the reliability and performance level of the services
• Crucial is the high-capacity, state-of-the-art network infrastructure provided by CESNET
• However, the future LHC runs will require a huge upscale of resources, which will be impossible for us to achieve with the expected flat budget
• Like everybody else these days, we are searching for external resources: we got some help from CESNET but need more. As widely recommended, we will very likely try to collaborate with non-HEP scientific projects to get access to additional resources in the future

A couple of current plots

GRID for ALICE in Prague – Monitoring jobs (MonALISA)

RUNNING ALICE JOBS IN PRAGUE in 2013/2014:
Average = 996, maximum = 2227
Total number of processed jobs: ~ 5 million

ALICE Disk Storage Elements – 62 endpoints, ~ 34 PB

Prague scores with the largest Tier-2 storage

GRID for ALICE in Prague – Monitoring storage (MonALISA)

NETWORK TRAFFIC ON PRAGUE ALICE STORAGE CLUSTER in 2013/2014:
(Total disk space capacity 1.113 PB)
Max total traffic IN/write: 195 MB/s
Max total traffic OUT/read: 1.05 GB/s
Total data OUT/read: 5.322 PB