WLCG past, present, future: the CMS perspective
Peter Kreuzer,CMS Computing Resource Manager
FAPESP Workshop, Sao Paulo, Aug 28, 2013
P. Kreuzer, RWTH
Outline
• The Computing Model so far
• The Resource Challenge 2015
• Short term evolution
 – Multicore scheduling, multithreaded processing
 – CPU Technology
• Long term evolution
 – Data archiving vs disk storage
 – Reorganization of the Tiered functionality
 – Wide Area Data Access
 – Tier-1 Storage Cloud
 – Networking
 – Resource Provisioning
Computing Model so far
• Computing models are based roughly on the MONARC model
 – Developed more than a decade ago
 – Foresaw Tiered Computing Facilities to meet the needs of the LHC Experiments
• Assumed poor networking at the very origin
• Site Readiness was an issue
• Hierarchy of functionality and capability
[Diagram: Tier-0 feeding the Tier-1s, each feeding Tier-2s]
The CMS example
[Diagram: CMS data flow. CMS detector → Tier-0 at 450 MB/s (300 Hz; in 2012: 400 Hz prompt + 600 Hz parked), with the CAF for prompt processing and calibration; Tier-0 → Tier-1s (archival storage, organized processing, ~50k jobs/day) at 30–300 MB/s (aggregate 800 MB/s); Tier-1s ↔ Tier-2s (storage/data serving, organized event simulation, chaotic analysis, ~100–150k jobs/day) at 50–500 MB/s and 100 MB/s; Tier-2s → Tier-3s at 10–20 MB/s, all on the WLCG Grid infrastructure]
• 1 Tier-0, 7 Tier-1 and 50 Tier-2 centers
• So far the resource needs were met via a "flat" budget planning
Resource Distribution vs Tier
• The contribution of CERN in 2012 was:
 – CPU: 21%
 – Disk: 13%
 – Tape: 33%
• The typical CPU fraction of main workflows is:
 – 20% Reco/Digi-Reco, 40% Analysis, 40% Simulation
[Charts, 2009–2013 (2013 = LS1), split by Tier-0/1/2: CMS CPU Pledge vs Tier [kHS06], scale 0–700; CMS Disk Pledge vs Tier [PB], scale 0–60]
Tier-0 Performance 2012
• Large dedicated Tier-0 CPU farm: ~4k job slots + public queues at CERN (max. 7.5k slots)
• Large CAF "EOS" disk: 6 PB usable space
• Dedicated LHCOPN network between CERN and the Tier-1s, easily handling the needs (up to 800 MB/s)
Tier-1 Performance 2012
• Tier-1 centers have performed very well
• Since 2011, the CPU utilization has been beyond 100%, also thanks to production workflows
[Chart: Tier-1 Wall Clock Hours in 2012]
Tier-2 Performance 2012
• Tier-2 centers have been used beyond pledge
• Since the start of the LHC run, the analysis contribution to Tier-2 processing kept increasing, reaching 70% in 2012
[Charts: Tier-2 Wall Clock Hours in 2012; job slots used at Tier-2 per week; analysis jobs at Tier-2 per week]
CMS Tier-2 Model
• Main CMS Workflows
 – User Analysis (70%)
 – Central MC production (30%)
• 49 single Tier-2 centers
 – grouped in 28 Federations or Countries
• CMS Tier-2 Data Model
 – A typical CMS Tier-2 allocates disk space for CMS production and CMS physics groups
 – This space is centrally managed
 – The Tier-2 is "rewarded" with ESP or authorship credit for the amount of allocated disk space
Site Readiness Monitoring
• CMS is monitoring sites via a number of availability metrics and workflow success rates, regularly evaluated and published
• These results feed into an overall Readiness metric that is monitored and used to rank sites
SPRACE Contribution to CMS (I)
• Site Readiness performance:
 – Since January 2013, the average SPRACE Readiness is 98.5%, which is in the top-10 of all CMS Tier-2 centers!
 – This is remarkable!
• Thank you to all SPRACE admins for their contribution to CMS!
Tier-2 Readiness Ranking Jan 1 - Aug 15, 2013
SPRACE Contribution to CMS (II)
• The resource fraction of SPRACE compared to all CMS Tier-2s is at the 3% level
• In terms of CPU consumption at SPRACE, the fraction of CMS analysis jobs in 2012 was 78%
[Chart: CPU consumption at SPRACE in 2012, in hours, split between Analysis and Production]
SPRACE Contribution to CMS (III)
• Fraction of analysis jobs at SPRACE over the total in all CMS Tier-2s
[Chart: percentage of analysis jobs run at SPRACE over all T2 sites, weekly, Jul 2009 – Jun 2013, in the 0–5% range; the Higgs discovery period is marked]
SPRACE Contribution to CMS (IV)
• Data Transfer Rates into SPRACE
[Chart: transfer rates into SPRACE since 2007, averaged over 1-week bins, typically in the 160–480 Mb/s range]
 – Peak performances are around 6.5 Gb/s
SPRACE Contribution to CMS (V)
• SPRACE also contributes to CMS Service Work
 – Automated mapping of all CMS contact people (in SiteDB) to a central ticketing service
 – Design of CMS Computing Shift procedures
 – Participation in CMS Computing Shifts
[Chart: achieved/expected service work [%] over 9 months, Americas sites: Rockefeller, TTU, UERJ, CBPF, UFL, UNESP, Mississippi, FIU, Univ. Maryland, Univ. Colorado; scale 0–3.5]
 – CMS Core Computing jobs can open doors for our younger people, also in industry!
Upgrades
• The Large Hadron Collider is currently being upgraded. Back in operation in 2015, with the energy closer to the 14 TeV design and the luminosity higher than design, there will be a higher rate of interesting events!
• There are also two years of CMS detector upgrades
• Computing is challenged by ambitious resource needs. The evolution of the computing model may help to mitigate some of these needs
Increased CPU Needs in 2015
• Event pile-up is expected to grow with increased instantaneous luminosity
 – Roughly a factor 2.5 increase in reconstruction time with the best code we currently have
• Trigger rates at thresholds similar to 2012 are expected to result in an 800–1200 Hz prompt reconstruction rate
 – This is just the core sample and results in a factor 2.5 increase
• The machine will move from 50 ns to 25 ns bunch spacing
 – This increases the reconstruction time by ~factor 2
• The combination of all these effects is a factor 12
How to mitigate CPU Needs 2015?
• Assume that the reconstruction slow-down from going to 25 ns can be solved
 – confirmed by early indications, so gaining back a factor 2
• We will move ~1/2 of the prompt reconstruction to Tier-1
 – reduces CPU needs at the Tier-0 level by a factor 2
• We assume that the operations model in 2015 will be more like 2012 than 2010
 – More organized processing at Tier-1 centers, which will primarily reconstruct MC and prompt data, while only 1 big reprocessing pass over the full sample will occur at the end of the year (also using the HLT). This is expected to provide another 50% improvement.
• These pretty aggressive plans for improvements bring a factor 6, but a factor 2 increase is still needed for Tier-0 and Tier-1 processing
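The growth and mitigation factors above combine multiplicatively; a quick back-of-the-envelope check (the individual factors are the approximate ones quoted in the slides):

```python
# Growth factors for 2015 CPU needs (approximate, from the slides)
pileup_factor = 2.5    # longer reconstruction at higher pile-up
trigger_factor = 2.5   # higher prompt-reconstruction rate
spacing_factor = 2.0   # 50 ns -> 25 ns bunch spacing

growth = pileup_factor * trigger_factor * spacing_factor  # ~12

# Mitigations: solve the 25 ns slow-down (x2), move half of the prompt
# reconstruction to Tier-1 (x2), a more organized 2012-like model (x1.5)
mitigation = 2.0 * 2.0 * 1.5  # factor 6

remaining = growth / mitigation  # ~factor 2 still needed
print(growth, mitigation, round(remaining, 1))
```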
Resource Request Summary
The future
• October '13 RRB: the 2015 computing resource request will be presented and discussed
 – for CMS, the 2015 request can be accommodated under a flat budget on average, but does require planning two years together
• A review of the Computing Model is in progress and will be presented, in a document common with the other LHC experiments, by the end of summer 2013
• In this presentation we concentrate on the evolution of the CMS software, the technology and the CMS Computing Model
Complicated Environments
• The LHC is already running at a higher number of interactions per crossing than design
 – Experiments have high trigger rates due to a higher fraction of interesting physics
 – This has required capacity, efficiency and constant improvements
 – We hope this continues, but improvements are hard fought
CMS Preparation for 8 TeV, ICHEP 2012, J-R Vlimant, CERN
Software Development for 2012
● Trying to cope with ever-improving LHC operation: more luminosity, more pile-up events, more complex events
● Improvement in the computing performance of the CMS software
● Physics performance of event reconstruction unaffected by technical modifications; improved under algorithm development
 ✔ Phase 1 in 2011 to cope with increased luminosity
 ✔ Phase 2 early 2012 to prepare for increasing luminosity and favor an increased trigger rate
● The main gain was achieved in tracking algorithm optimization
● Algorithm optimization and redesign, compiler architecture, memory-management improvements and the ROOT version all played a constructive-interference role
● Event processing time remained constant with increased pile-up (<30 s/evt)
[Chart annotations: ×~9, ÷~3, ÷~3]
Multicore scheduling
• May be generalized before the 2015 LHC startup
 – reduces the memory requirements on worker nodes
 – stabilizes the scaling, e.g. in the number of files or number of jobs, which is expected to drop by a factor 4 to 8
• Needs to be coordinated at the WLCG level, in order to ensure coherency among experiments and hence provide clear recommendations to sites
 – whole-node or partial-node scheduling
• CMS software readiness for multicore processing
 – CMSSW "forking mode" has been ready for 1.5 years
 – CMSSW is being reengineered as a multi-threaded framework, to be deployed in the Fall as the baseline for the future
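A minimal illustration of why whole-node, multi-core scheduling shrinks the job and file counts: one N-core job with worker processes replaces N single-core jobs, each of which would have produced its own output file. This is a sketch in plain Python; the `reconstruct` stand-in and the numbers are illustrative, not CMSSW code.

```python
from multiprocessing import Pool

def reconstruct(event_id):
    # Stand-in for per-event reconstruction work.
    return event_id * 2

def run_whole_node_job(events, cores):
    """One multi-core job: `cores` workers, a single merged output."""
    with Pool(processes=cores) as pool:
        results = pool.map(reconstruct, events)
    return results  # one output "file" instead of `cores` separate ones

if __name__ == "__main__":
    out = run_whole_node_job(range(100), cores=4)
    # One 4-core job (1 job, 1 file) replaces 4 single-core jobs (4 jobs, 4 files).
    print(len(out), "events reconstructed by one whole-node job")
```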
CPU Processor Evolution
• A major shift in the nature of processors has taken place
 – the performance of single sequential applications has roughly stalled, due to limits on power consumption
• The new trend is to concentrate on improving the "performance per Watt"
 – ARM processors, Intel Xeon Phi, GPGPUs, ...
• One of the challenges for sites will be to handle a potentially heterogeneous environment (e.g. x86 and ARM) in terms of scheduling/grid software
ARM Processors, and others...
• The new emphasis on performance/power in the server market and the ever-growing mobile market gives ARM an opportunity to challenge Intel in that space
 – Several ARMv7-based (32-bit) server products are now available, but the most interesting developments are ARMv8 (64-bit), expected by 2014
• CMS has successfully tested Monte-Carlo production workflows on ARMv7 (32-bit)
 – ATLAS and LHCb have also investigated ARM
 – This is promising; a combined effort on future processor technology is needed in order to guide WLCG sites
 – We need to avoid that grid SW/configuration/scheduling issues prevent us from using such new technologies
• As for Xeon Phi and GPGPUs, the time scale would be beyond 2016
• This assumes "heterogeneous" hardware solutions can be deployed at sites, with appropriate scheduling/grid software solutions
The Evolution of the Model
• Over its development, the evolution of the WLCG production grid has oscillated between structure and flexibility
 – In general, in the first run it was very deterministic
 – Jobs generally went to data that was placed there by operators and later by services
[Diagram: the original MONARC hierarchy (CERN; Tier-1s in Germany, USA/FermiLab, UK, France, Italy, NL, USA/Brookhaven; Tier-2 labs and universities; physics department desktops; Open Science Grid), annotated with the evolutions: ALICE remote access, PD2P/popularity-based placement, CMS full mesh]
Reducing Boundaries
• One of the first changes we are seeing is a flattening of the tiered structure of LHC computing
 – the functional differences between the layers are being reduced, and we want to use the system as one distributed system, not a collection of sites
• One concrete action is to separate the archival functionality from the other site functions
• Status: deployed at 2 Tier-1s
Impact
• Once you have split off the archives, there is no reason for a strict one-to-one mapping of disk and archive at Tier-1s
 – Archives could be used to stage datasets to any disk facility
• The quantum of data we let the archive manage is the dataset (TBs, rather than files of GBs)
• We need to ask how many archival facilities are needed
 – More than 1, but probably fewer than 10
Changes how we think of tiers
• Once you introduce the concept of an archival service that is decoupled from the Tier-1
 – The functional difference between Tier-1 and Tier-2 is based more on availability and support than on the size of services
 – From a functional perspective, the difference between Tier-1 and Tier-2 is small
 – The model begins to look less MONARC-like
Stretches into Other elements
• After Long Shutdown 1, CMS will likely reconstruct about half the data the first time at Tier-1s, in close to real time
 – Very little is unique about the functionality of the Tier-0
 – Some prompt calibration work uses Express data, but even that could probably be exported
[Diagram: T0 feeding three T1s]
ATLAS Computing - Ueda I. - ICHEP 2012.07.07.
Storage Federation
The current system is based on the "Data Grid" concept
• Jobs go to data -- access via LAN
• Replicate data for higher accessibility
 ‣ transfer the whole dataset
• Jobs have to be re-assigned when the data is not available there
"Storage Federation" provides new access modes & redundancy
• Jobs access data on shared storage resources via WAN
• Analysis jobs may not need all the information / all the files
 ‣ Transfer a part of the dataset
 ‣ File- and event-level caching
• A system of Xrootd 'redirectors' is the possible working solution today
 ‣ Work in the past year within the US ATLAS Computing Facility to develop the concept and test performance
 ‣ Tests being extended from regional to global
Wide Area Access
• All experiments will have the capability of wide-area access to data by the restart
 – 3 will use Xrootd-based data federations
• Development work on http-based access is ongoing
 – Similar in concept to a content delivery network; generally referred to as a data federation
(Sudhir Malik, ICHEP 2012, Melbourne, Australia, 4–11 July 2012)
Federation
• Remote access gives us data from one site; we need a federation to access data across all CMS sites
• Status:
 – has been used in production
 – 85% of sites are configured to fall back to open requests via the Xrootd federation
 – 75% of sites have joined the Xrootd federation
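The fall-back behavior the federation enables can be sketched as: try the site's local copy first and, on failure, reopen the file through a federation redirector. The redirector hostname and the open functions below are illustrative stand-ins, not the real CMSSW/Xrootd API.

```python
# Hypothetical redirector; the real CMS endpoint is not part of this sketch.
REDIRECTOR = "root://global-redirector.example"

def open_with_fallback(lfn, local_open, remote_open):
    """Try the site's local copy first; fall back to the federation."""
    try:
        return local_open(lfn)
    except FileNotFoundError:
        # Fall back: the redirector locates a replica at another site.
        return remote_open(f"{REDIRECTOR}/{lfn}")

# Toy stand-ins for storage, to exercise the logic:
local_copies = {"/store/data/a.root": "local bytes"}

def local_open(lfn):
    if lfn not in local_copies:
        raise FileNotFoundError(lfn)
    return local_copies[lfn]

def remote_open(url):
    return f"remote bytes via {url}"

print(open_with_fallback("/store/data/a.root", local_open, remote_open))
print(open_with_fallback("/store/data/b.root", local_open, remote_open))
```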
The Tier-1 Storage "Cloud"
• Once wide-area access to data exists, the boundaries between sites are reduced; CMS is negotiating to make more use of the LHC-OPN
 – Negotiations with sites for moving worker resources inside the OPN domain
 – Calculate and test the amount of access that could be sustained with our share
• Allows the collection of Tier-1s to be used as a single processing facility, working on the same data
• Instead of failing back to the archive, we would fall over to Xrootd if the data were accessible at another Tier-1
Flexibility
• There is some skepticism, because so much effort was invested in getting the data close
• Computing-intensive tasks like reprocessing can be sustained reading data from remote storage
 – The input size is small compared to the run time of the application
 – 50 kB/s per slot is enough to sustain the CMS application
 – Even thousands of cores can be reasonably fed with Gb/s
• Analysis work can be supported as well, as long as the data format allows only the needed objects to be read
• Intelligent I/O and high-capacity networks have changed the rules
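The 50 kB/s-per-slot figure directly implies the "thousands of cores fed with Gb/s" claim; a quick check (assuming decimal units, 1 kB = 1000 bytes):

```python
per_slot_bps = 50_000 * 8        # 50 kB/s per slot, in bits per second
link_bps = 1_000_000_000         # one 1 Gb/s wide-area link

slots_fed = link_bps // per_slot_bps
print(f"A 1 Gb/s link sustains ~{slots_fed} CMS application slots")
```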
Networking
• CERN is deploying a remote computing facility in Budapest
 – 200 Gb/s of networking between the centers, at 35 ms ping time
 – As experiments, we cannot really tell the difference where resources are installed
[Diagram: CERN ↔ Budapest, two 100 Gb/s links]
Networks
• These 100 Gb/s links are the first in production for WLCG
 – They will be the first of many
• We have reduced the differences in site functionality, then the differences in data accessibility, then even the perception that two sites are separate
• We can begin to think of the facility as one big center, not a cluster of centers
Resource Provisioning
• Through this evolution, the low-level services stay largely the same
• Most of the changes come from the actions and expectations of the experiments
[Diagram: lower-level services provide consistent interfaces to facilities: CE and SE at the site, with connections to batch (Globus- and CREAM-based) and to storage (SRM or Xrootd); an information system (BDII), FTS and WMS as higher-level services; VOMS and experiment services on top]
Changing the Services
• The WLCG service architecture has been reasonably stable for over a decade
 – This is beginning to change with new middleware for resource provisioning
• A variety of places are opening their resources to "Cloud"-type provisioning
 – From a site perspective, this is often chosen for cluster-management and flexibility reasons
 – Everything is virtualized, and services are put on top
• Nothing prevents a site from bringing up exactly the same environment currently deployed for the WLCG, but maybe that is not needed
Evolving the Infrastructure
• In the new resource-provisioning model, the pilot infrastructure communicates with the resource-provisioning tools directly
 – requesting groups of machines for periods of time
[Diagram: resource requests from the pilot factory flow either through a cloud interface, which starts VMs with pilots, or through a CE and batch queue, which starts pilots on worker nodes]
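The cloud-interface path in the diagram can be sketched as: the pilot factory sizes a request from the pending workload and asks the cloud for a group of machines for a period of time; each VM boots a pilot that joins the pool. The `CloudInterface` API here is hypothetical, a stand-in rather than OpenStack's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class CloudInterface:
    """Hypothetical cloud API: start VMs that each boot one pilot."""
    running: list = field(default_factory=list)

    def request_machines(self, n, lifetime_hours):
        vms = [f"vm-{len(self.running) + i}" for i in range(n)]
        self.running.extend(vms)
        return vms  # each VM contextualizes and starts a pilot

def provision(cloud, pending_jobs, jobs_per_vm=8, lifetime_hours=24):
    """Pilot factory: size the VM request from the pending workload."""
    needed = -(-pending_jobs // jobs_per_vm)  # ceiling division
    return cloud.request_machines(needed, lifetime_hours)

cloud = CloudInterface()
pilots = provision(cloud, pending_jobs=100)
print(f"{len(pilots)} VMs with pilots started")
```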
Trying this out
• CMS is trying to provision resources like this with the High Level Trigger farm
 – OpenStack interfaced to the pilot systems
• CMS got a production workflow running on ~3k cores, and the facility looks like just another destination, though no grid CE exists
 – It may become an important additional resource in the future
• Several sites have already requested similar connections to local resources
Outlook
• The efficient support by sites for data processing, analysis and storage is essential to the collaboration
• The LHC VOs need to adapt their software to multi-threaded processing and to new technologies
• The general outlook for the evolution of the CMS Computing Model is
 – a breaking of the boundaries between sites
 – less separation of functionality
 – a system more capable of being treated as a single large facility, rather than a cluster of nodes
• This will be more flexible and efficient, and able to incorporate other types of resources
Backup Slides
CMS Resource planning 5 years
[Charts, 2009–2013 (2013 = LS1): CMS Tier-0 CPU Pledge [kHS06], Tier-0 Disk Pledge [PB], Tier-0 Tape Pledge [PB]; CMS Tier-1 CPU Pledge [kHS06], Tier-1 Disk Pledge [PB], Tier-1 Tape Pledge [PB]; CMS Tier-2 CPU Pledge [kHS06], Tier-2 Disk Pledge [PB]]
WLCG Resources 5 years
[Charts, 2009–2013, stacked by experiment (ALICE, ATLAS, CMS, LHCb): Tier-0, Tier-1 and Tier-2 CPU Pledge [kHS06] and Disk Pledge [PB], plus Tier-0 and Tier-1 Tape Pledge [PB]]
WLCG Resource Distribution vs Tier
[Charts, 2009–2013, split by Tier-0/1/2: CPU Pledge vs Tier [kHS06]; Disk Pledge vs Tier [PB]]
• The contribution of CERN in 2012 was:
 – CPU: 24%
 – Disk: 19%
 – Tape: 40%
Production Performance 2012
• 7 billion simulated events at Tier-1 and Tier-2
• Reconstruction and pile-up combination at Tier-1 centers typically happens more than once; this is why the red fields are larger below
 – The successful migration to AOD reduced the data volume
[Charts: monthly number of simulated events in 2012; monthly size of simulated events in 2012 [TB]]
Data Transfer Performance 2012
• CMS has successfully followed the "full mesh" data-transfer philosophy for several years
 – 3290 links "commissioned" between sites (Dec 2012)
 – Tier-2-to-Tier-2 traffic is 50% of the Tier-1-to-Tier-2 traffic
 – On-going project to have dedicated links also to/from Tier-2 sites (LHCONE)
[Charts: average data transfer volume, 0.1–1000 TB/day, 2004-01 to 2012-01, across DC04, SC2, SC3, SC4, load tests, CSA06, CCRC08, general beam, heavy-ion beam and debug transfers; Tier-1-to-Tier-2 transfer rate in 2012 (~600 MB/s) and Tier-2-to-Tier-2 transfer rate in 2012 (~300 MB/s)]
Tier-2 Needs / Storage Needs 2015
• The increase of processing needs at Tier-2 is smaller, because the Tier-2s are sized to process 1 year's data, and we assume the concentration is on 2015 data
 – Tier-2 processing needs will increase in 2016
• Storage increases are smaller, because the storage is expected to hold the total collected data, and we assume the concentration is on the new 2015 data
 – storage needs will increase in 2016
• The following table summarizes the CMS resource request for 2014/15, in comparison with those of the previous couple of years
CMS Data/Workflow Management
CMSSW Reengineering
• To use multicore processing efficiently, three things are required at the CMSSW application level
 – reengineering of algorithms for parallel execution
 – reengineering of data structures to promote locality
 – reengineering of algorithms to factorize computationally intensive kernels and appropriate data structures
• Tracking is the obvious first place in the reconstruction to begin investigations (already started)
• Arguably, such changes may already be in production for Run 2
CMS Workflow Management
• All centrally processed CMS workflows are based on WMAgent + GlideinWMS (pilots)
 – The main evolution will be the deployment of a global glidein queue for all CMS jobs, with workflow-aware scheduling that can prioritize and distribute work according to the type of workflow
• CRAB3
 – Is able to submit to both PanDA and GlideinWMS for scheduling
 – Many improvements expected:
  • improved scalability, error tracking and monitoring
  • asynchronous stage-out (stage-out was the largest failure source in CRAB2)
  • automatic publication
  • a "thin client"
 – Latest status: beta-testing phase
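The asynchronous stage-out idea can be sketched as: the job enqueues its output and returns immediately, while a separate transfer worker drains the queue and performs the copies, so transient storage problems no longer fail the job itself. The queue and names below are illustrative, not the CRAB3 implementation.

```python
import queue
import threading

stageout_queue = queue.Queue()
staged = []

def transfer_worker():
    """Separate service: drains the queue and copies outputs out."""
    while True:
        output = stageout_queue.get()
        if output is None:        # shutdown sentinel
            break
        staged.append(output)     # stand-in for the actual remote copy
        stageout_queue.task_done()

def job_finishes(output_file):
    """The job only enqueues its output; it does not wait for the copy."""
    stageout_queue.put(output_file)

worker = threading.Thread(target=transfer_worker)
worker.start()
for i in range(3):
    job_finishes(f"output_{i}.root")
stageout_queue.put(None)
worker.join()
print(staged)
```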