Josep Flix (PIC-Ciemat) PIC T1 and the Spain/Portuguese region in CCRC'08/phase-2 PIC CCRC’08 internal meeting, 19 June 2008 1
PIC T1 and the Spain/Portuguese region
in CCRC'08/phase-2
Josep Flix (PIC-CIEMAT)
CMS :: Plan 2008
http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=23563
CMS :: CCRC’08 Data Transfers
CMS :: CCRC’08 Data Transfers
CMS :: CCRC’08 CERN→T1s
FZK (Prod)
FNAL (furloughs?)
PIC Tier1::Data Transfers 1
• T0→PIC (Prod+Debug):
• CERN→PIC was above the CCRC’08 target on all days, and the extra target was achieved.
• Rate ~70 MB/s (4 T1 sites at this level), with good transfer quality (~90%).
PIC Tier1::Data Transfers 2
• PIC-T1s: AOD replication exercise (20th-23rd May)
• Source dataset replicated in < 4 days (latency), and the extra target (rate) was achieved.
• Latency degraded because aggressive skimming activities at PIC saturated the PIC LAN (see next slides).
PIC Tier1::Data Transfers 3
• T1→T2s goal: exercise the full matrix of regional and non-regional transfers in Prod.
- Only links commissioned according to DDT criteria are used.
- Regional and non-regional T2s enter in the same way, though with different target rates.
• Transfer metric:
- Latency (target based): for the regional T2s, transfer 100% (95% acceptable) of the dataset in 24h (adjusted to different dataset sizes).
- Participation: have 5 out of 8 T2s pass the exercise (a regional T2 should be one of them).
- Rate (2008 target based): 100% (green), 75% (acceptable), <75% (failed).
PIC Tier1::Data Transfers 4
PIC→T2s (30 sites)
PIC Tier1::Data Transfers 5
PIC Tier1::Data Transfers 6
PIC non-OPN traffic May 2008
PIC Tier1::Data Transfers 7
PIC Tier1::Data Transfers 8
PIC Tier1::Data Transfers 9
• PIC→T2s: second rotation cycle (28th May)
[Plot annotation: fake spike]
- Transfers from PIC to FNAL were blocking the PIC-STAR channel.
- FNAL used the PIC FTS for these transfers instead of the FNAL FTS server.
- FTS needs to balance endpoints on STAR channels.
- Transfers to FNAL were suspended and cancelled.
- We need to know at any time whether sites are following CMS transfer policies:
a) Proper usage of FTS servers.
b) Improper use of srmcp commands.
c) This needs to be checked weekly and reported in FacOps and to sites.
PIC Tier1::Data Transfers 10
• T2→T1s (regional):
• Rate (2008 target based): 100% (green), 50% (acceptable), <50% (failed)
- 3 days in a row
- Rates are aggregate per T2 region
- Based on mega-table average rates
[Plot annotation: 5 MB/s/link]
PIC Tier1::Data Transfers 11
• T2→PIC uplinks:
• 2008 target: 3.9 MB/s, > 3 days in a row
• OK
• Note: T2_PT_LIP_Lisbon not transferring (storage issues).
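The "days in a row" criterion for the uplinks can be checked from a list of daily average rates. A minimal sketch follows, assuming the criterion means "at or above the target on at least 3 consecutive days"; the function names and the sample rates are hypothetical, and only the 3.9 MB/s target comes from the slide.

```python
def longest_run_at_or_above(daily_rates_mb_s, target_mb_s):
    """Length of the longest streak of consecutive days at or above the target."""
    longest = current = 0
    for rate in daily_rates_mb_s:
        current = current + 1 if rate >= target_mb_s else 0
        longest = max(longest, current)
    return longest


def uplink_passes(daily_rates_mb_s, target_mb_s=3.9, days_required=3):
    """True if the target was sustained for the required number of consecutive days."""
    return longest_run_at_or_above(daily_rates_mb_s, target_mb_s) >= days_required


# Hypothetical daily averages in MB/s for one T2→PIC uplink.
sample = [4.2, 1.0, 4.0, 5.1, 3.9, 2.0]
print(uplink_passes(sample))
```

A single bad day resets the streak, so a link that is above target on most days can still fail if the good days are not contiguous.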
PIC Tier1::T1 Workflows 1
PIC Tier1::T1 Workflows 2
PIC Tier1::T1 Workflows 3
PIC Tier1::T1 Workflows 4
PIC Tier1::T1 Workflows 5
PIC Tier1::T1 Workflows 6
PIC Tier1::T1 Workflows 7
PIC Tier1::T1 Workflows 8
PIC Tier1::T1 Workflows 9
/store/mc/CSA08/JetET110/GEN-SIM-RAW/        vo-cms.t1d0csa08JetET110RAW
/store/mc/CSA08/JetET110/GEN-SIM-RECO/       vo-cms.t1d0csa08JetET110RECO
/store/mc/CSA08/MuonBeamHalo/GEN-SIM-RAW/    vo-cms.t1d0csa08MuonBeamHaloRAW
/store/mc/CSA08/MuonBeamHalo/GEN-SIM-RECO/   vo-cms.t1d0csa08MuonBeamHaloRECO
/store/mc/CSA08/TkBeamHalo/GEN-SIM-RAW/      vo-cms.t1d0csa08TkBeamHalo
/store/mc/CSA08/TkBeamHalo/GEN-SIM-RECO/     vo-cms.t1d0csa08TkBeamHalo
+ other sub-directories…
• Pin/pre-stage custodial RAW data before launching processing jobs:
- dccp -P to pre-stage and srm-bring-online to pin data.
- Reported bug in srm-get-metadata:
https://hypernews.cern.ch/HyperNews/CMS/get/sc4/950.html
• Tape families setup:
- At PIC we did not follow a fine-grained approach: only 5 file families.
- No operational issues here.
- BUT: creating file families needs an admin to do it (=latency).
- … and some file family creations were announced at the very last minute (=errors).
- We need to think about how this can be automated…
Small dataset
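One small step towards automating the tape family setup would be a check, run before a subscription is approved, that the destination path already has a family defined. The sketch below is only an illustration: it assumes the names listed next to each directory above are the file families, copies that listing into a dictionary, and uses a hypothetical `family_for` helper with longest-prefix matching.

```python
# Directory prefix -> file family, copied from the listing above.
# Treating the second column as the tape family name is an assumption.
TAPE_FAMILIES = {
    "/store/mc/CSA08/JetET110/GEN-SIM-RAW/": "vo-cms.t1d0csa08JetET110RAW",
    "/store/mc/CSA08/JetET110/GEN-SIM-RECO/": "vo-cms.t1d0csa08JetET110RECO",
    "/store/mc/CSA08/MuonBeamHalo/GEN-SIM-RAW/": "vo-cms.t1d0csa08MuonBeamHaloRAW",
    "/store/mc/CSA08/MuonBeamHalo/GEN-SIM-RECO/": "vo-cms.t1d0csa08MuonBeamHaloRECO",
    "/store/mc/CSA08/TkBeamHalo/GEN-SIM-RAW/": "vo-cms.t1d0csa08TkBeamHalo",
    "/store/mc/CSA08/TkBeamHalo/GEN-SIM-RECO/": "vo-cms.t1d0csa08TkBeamHalo",
}


def family_for(lfn, families=TAPE_FAMILIES):
    """Return the family of the longest matching directory prefix, or None."""
    matches = [prefix for prefix in families if lfn.startswith(prefix)]
    if not matches:
        return None
    return families[max(matches, key=len)]


def missing_families(lfns, families=TAPE_FAMILIES):
    """Paths that would land on tape without a family configured."""
    return [lfn for lfn in lfns if family_for(lfn, families) is None]
```

A non-empty result from `missing_families` would be the signal to ask an admin for a new family before data starts flowing, rather than after.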
PIC Tier1::T1 Workflows 10
• Custodial and non-custodial data: T1_ES_PIC_MSS and T1_ES_PIC_Disk
- TFC hacks are used to handle non-custodial samples. Can this work in the long run?
- Space tokens are orthogonal to paths: we can store custodial and non-custodial data under the same path.
- These TFC hacks may be difficult to handle in the future (?).
- The procedure is not error-free: one can subscribe/approve data before the setup is ready…
- We ran into some ‘errors’ in this challenge: from subscriptions, approved subscriptions, and deletion of non-custodial data on MSS, not Disk…
PIC Tier1::T1 Workflows 11
• Skimming jobs overloading PIC’s LAN:
Subject: [T1 contacts] Skimming and Reconstruction activities at T1s -- please, feedback *is* needed
https://hypernews.cern.ch/HyperNews/CMS/get/dataops/328.html ← 100 mails!
[Plots: traffic to WNs; CMS running jobs]
- DCACHE_READ_AHEAD with the default 1 MB buffer: throughput as high as 40-50 MB/s/job.
- DCACHE_READ_AHEAD with the buffer fixed to 128k: throughput is now around 1-2 MB/s/job.
- The problem was affecting all dCache sites.
- PIC exports degraded as the LAN saturated.
- CMPRD MAXPROCS reduced.
- Problem fixed in CMSSW_2_0_8.
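The per-job throughputs quoted above make it easy to estimate how few jobs it takes to fill a link. The sketch below is a back-of-the-envelope calculation only: the 10 Gbit/s link capacity is a hypothetical figure, not PIC's actual LAN capacity, and protocol overhead is ignored.

```python
def jobs_to_saturate(link_gbit_s: float, per_job_mb_s: float) -> float:
    """Number of concurrent jobs whose combined reads fill the link."""
    link_mb_s = link_gbit_s * 1000.0 / 8.0  # Gbit/s -> MB/s, decimal units
    return link_mb_s / per_job_mb_s


ASSUMED_LINK_GBIT_S = 10.0  # hypothetical capacity, for illustration only

for label, rate_mb_s in [("1 MB read-ahead buffer, 40 MB/s/job", 40.0),
                         ("128k read-ahead buffer, 2 MB/s/job", 2.0)]:
    print(f"{label}: {jobs_to_saturate(ASSUMED_LINK_GBIT_S, rate_mb_s):.0f} jobs")
```

Under this assumption, about 31 jobs at 40 MB/s each already fill the link, whereas at 2 MB/s per job it takes 625, which is why the smaller read-ahead buffer relieved the LAN.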
PIC Tier1::T1 Workflows 12
• Skimming jobs overloading PIC’s LAN (revisited):
- With skimming activity saturating the LAN, we forced down the slots for cmprd users.
- However, if different activities (skimming, reprocessing) were run with different DNs, we could limit slot access for a specific user *without* penalizing other production activities. That was not the case in CCRC’08.
- This would also help local monitoring of the different activities.
• User access to T1 resources:
- Beginning of CCRC’08: 90% production and 10% users (share).
- However, a share is not enough: if the farm is empty, it can be fully occupied by users.
- Now: users are limited to only 5% of slots @ PIC; move users to T2s.
- Copying datasets at T1s to local user disks via srmcp is in use:
https://hypernews.cern.ch/HyperNews/CMS/get/sc4/947.html
- We need to protect T1 centers against indiscriminate usage of their resources by non-production and unorganized activities (this we learnt in CCRC’08).
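The difference between a 10% share and a 5% hard limit for users can be seen in a toy model. This is a simplified sketch, not how the PIC batch system is configured: the function names and the slot counts are hypothetical, and real fair-share schedulers also weigh usage history.

```python
def slots_with_share(total, prod_demand, user_demand, user_share_pct=10):
    """Toy fair-share: each side is guaranteed its share, and may also
    use whatever the other side leaves idle. Returns (production, users)."""
    user_guaranteed = total * user_share_pct // 100
    prod_guaranteed = total - user_guaranteed
    unused_by_users = max(0, user_guaranteed - user_demand)
    prod_used = min(prod_demand, prod_guaranteed + unused_by_users)
    user_used = min(user_demand, total - prod_used)
    return prod_used, user_used


def slots_with_cap(total, prod_demand, user_demand, user_cap_pct=5):
    """Toy hard limit: users never exceed the cap, even on an empty farm."""
    user_used = min(user_demand, total * user_cap_pct // 100)
    prod_used = min(prod_demand, total - user_used)
    return prod_used, user_used


# Hypothetical farm of 1000 slots with no production jobs queued.
print(slots_with_share(1000, 0, 2000))  # users take the whole farm
print(slots_with_cap(1000, 0, 2000))    # users stay within the 5% limit
```

With production busy, both policies keep users near their allocation; the difference only shows when production is idle, which is exactly the case described above.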
PIC Tier1::T1 Workflows 13
• Throughput drops on disk servers:
- Seen when active movers > 150 (on PIC servers: Sun Fire X4500 with Solaris 10).
- Also observed at FZK (Jos van Wezel mail to dcache-user-forum).
- It may be a hardware/OS/Java limit or an application limit (PIC runs dCache 1.8.0-12p2 and Java 1.5.0_13 on production dCache pools).
- Relevant for CMS, as our jobs keep *lots* of long-lived active movers on pools.
PIC Tier1::SAM tests
Scheduled downtime
Reliability ~90-95%
T2_ES_CIEMAT in CCRC’08 1
• Around 40k jobs: challenge pre-production, user analysis, CCRC’08 physics group analysis tests (CCRCPG).
• Success rate: 90-95%.
• Slot usage: all slots used during the first half of the challenge.
T2_ES_CIEMAT in CCRC’08 2
• All links from T1s commissioned and used in production with rather high quality.
• Transfer rates for users in good shape: 10 TB of data downloaded to CIEMAT for CSA08 analysis activities (from 6 T1s).
T2_ES_IFCA in CCRC’08 1
• CCRC’08 was useful to stress IFCA and to discover/solve bottlenecks and limits at the infrastructure level (both HW and Grid SW).
• Data transfers: metrics satisfied, but some StoRM failures under high stress and problems with the GPFS configuration.
- Load from WNs to the SE was higher than the load generated by transfers. At the LAN level, IFCA saturated their 1 Gbps and some work was needed to reconfigure their LAN network.
- A transfer of a dataset from IFCA to FNAL was attempted while the same dataset was present at other T1s and at FNAL regional T2s. This generated a big overload on IFCA. Why this routing? T1s should not overload T2s in this way if there are routing alternatives.
• Analysis tests: adequate response. IFCA hit their hardcoded limit in Torque/Maui, which could handle 1024 jobs. The limit was increased.
T2_ES_IFCA in CCRC’08 2
• Physicists from IFCA and Oviedo University participated in chaotic job submission to IFCA.
• CRAB was not able to handle stageout to the user area using SRMv2 on a port other than 8443. DPM uses 8446 and StoRM uses 8444. IFCA joined the T2 stageout tests late because of this problem. Full history in:
https://hypernews.cern.ch/HyperNews/CMS/get/crabFeedback/1229.html
This was discussed in Commissioning, and a CMSSW version including lcg-cp stageout was released. More sites were affected (namely those without SRMv2 clients on WNs).
• Up to 5k jobs/day from analysis + users. 30k queued jobs killed their CE. They needed to re-install and upgrade it.
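The SRMv2 port problem noted above is the kind of thing a site can screen for in advance by looking at the port in its stageout endpoint. The sketch below is only an illustration: the host names are hypothetical, the helper names are made up, and treating a missing port as 8443 is an assumption.

```python
from urllib.parse import urlsplit

DEFAULT_SRM_PORT = 8443  # the only port the affected CRAB stageout handled


def srm_port(surl: str) -> int:
    """Port of an SRM URL; a missing port is assumed to mean the default."""
    port = urlsplit(surl).port
    return port if port is not None else DEFAULT_SRM_PORT


def needs_workaround(surl: str) -> bool:
    """True if stageout to this endpoint would have hit the port limitation."""
    return srm_port(surl) != DEFAULT_SRM_PORT


# Hypothetical endpoints using the ports quoted above for StoRM and DPM.
endpoints = [
    "srm://storm.example.org:8444/srm/managerv2?SFN=/cms/user/file.root",
    "srm://dpm.example.org:8446/srm/managerv2?SFN=/cms/user/file.root",
    "srm://dcache.example.org:8443/srm/managerv2?SFN=/cms/user/file.root",
]
for surl in endpoints:
    print(urlsplit(surl).hostname, needs_workaround(surl))
```

Endpoints flagged this way are the ones that would have needed the lcg-cp based stageout mentioned above.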
T2_PT_* in CCRC’08
• Participation of the Portuguese Tier-2 in CCRC’08 was limited at the Lisbon and Coimbra sites by the lack of available disk space.
• The minimal dataset used in the analysis exercise (~1 TB) was by itself larger than the free space available for CMS Grid activities at Lisbon or Coimbra.
• Regarding data transfers, Coimbra successfully passed the T1-T2 tests with CERN and PIC. Lisbon had problems with its disk server machine, which is to be replaced soon.
• In the coming weeks we will have a big jump in our storage (~200 TB) and computing (~0.4 MSI2K) capacity.
SWE region::Data Transfers
[Transfer quality plots: SWE sites as sources and SWE sites as destinations; sites: PIC, CIEMAT, IFCA, Coimbra, Lisbon]
• Overall: good transfer quality and rates in the SWE region.
• Lisbon still has some problems with its disk server.
• IFCA export at 40% quality: some StoRM instabilities and GPFS configuration problems. These issues are under control at the moment.
SWE Site Availability 1
SWE Site Availability 2
T1_PIC
T2_CIEMAT
T2_COIMBRA
T2_LISBON
T2_IFCA
CMS Summary Activities 1
CMS Summary Activities 2
CMS Summary Activities 3