WLCG past, present, future: the CMS perspective
Peter Kreuzer,CMS Computing Resource Manager
FAPESP Workshop, Sao Paulo, Aug 28, 2013
P. Kreuzer, RWTH
Outline
• The Computing Model so far
• The Resource Challenge 2015
• Short term evolution
 – Multicore scheduling, multithreaded processing
 – CPU Technology
• Long term evolution
 – Data archiving vs disk storage
 – Reorganization of the Tiered functionality
 – Wide Area Data Access
 – Tier-1 Storage Cloud
 – Networking
 – Resource Provisioning
Computing Model so far
• Computing models are based roughly on the MONARC model
 – Developed more than a decade ago
 – Foresaw Tiered Computing Facilities to meet the needs of the LHC Experiments
• Assumed poor networking at the very origin
• Site Readiness was an issue
• Hierarchy of functionality and capability
[Diagram: Tier-0 feeding the Tier-1s, each feeding Tier-2s]
The CMS example
[Diagram: CMS data flow. CMS detector → Tier-0 at 450 MB/s (300 Hz; in 2012: 400 Hz prompt + 600 Hz parked), with the CAF for prompt processing and calibration; Tier-0 → Tier-1s (archival storage, organized processing, ~50k jobs/day) at 30–300 MB/s (aggregate 800 MB/s); Tier-1s ↔ Tier-2s (storage/data serving, organized event simulation, chaotic analysis, ~100–150k jobs/day) at 50–500 MB/s and 100 MB/s; Tier-2s → Tier-3s at 10–20 MB/s, all on the WLCG Grid infrastructure]
• 1 Tier-0, 7 Tier-1 and 50 Tier-2 centers
• So far the resource needs were met via a "flat" budget planning
Resource Distribution vs Tier
• The contribution of CERN in 2012 was:
 – CPU: 21%
 – Disk: 13%
 – Tape: 33%
• The typical CPU fraction of main workflows is:
 – 20% Reco/Digi-Reco, 40% Analysis, 40% Simulation
[Charts, 2009–2013 (2013 = LS1), split by Tier-0/1/2: CMS CPU Pledge vs Tier [kHS06], scale 0–700; CMS Disk Pledge vs Tier [PB], scale 0–60]
Tier-0 Performance 2012
• Large dedicated Tier-0 CPU farm: ~4k job slots + public queues at CERN (max. 7.5k slots)
• Large CAF "EOS" disk: 6 PB usable space
• Dedicated LHCOPN network between CERN and the Tier-1s, easily handling the needs (up to 800 MB/s)
Tier-1 Performance 2012
• Tier-1 centers have performed very well
• Since 2011, the CPU utilization has been beyond 100%, also thanks to production workflows
[Chart: Tier-1 Wall Clock Hours in 2012]
Tier-2 Performance 2012
• Tier-2 centers have been used beyond pledge
• Since the start of the LHC run, the analysis contribution to Tier-2 processing kept increasing, reaching 70% in 2012
[Charts: Tier-2 Wall Clock Hours in 2012; job slots used at Tier-2 per week; analysis jobs at Tier-2 per week]
CMS Tier-2 Model
• Main CMS Workflows
 – User Analysis (70%)
 – Central MC production (30%)
• 49 single Tier-2 centers
 – grouped in 28 Federations or Countries
• CMS Tier-2 Data Model
 – A typical CMS Tier-2 allocates disk space for CMS production and CMS physics groups
 – This space is centrally managed
 – The Tier-2 is "rewarded" with ESP or authorship credit for the amount of allocated disk space
Site Readiness Monitoring
• CMS is monitoring sites via a number of availability metrics and workflow success rates, regularly evaluated and published
• These results feed into an overall Readiness metric that is monitored and used to rank sites
SPRACE Contribution to CMS (I)
• Site Readiness performance:
 – Since January 2013, the average SPRACE Readiness is 98.5%, which is in the top-10 of all CMS Tier-2 centers!
 – This is remarkable!
• Thank you to all SPRACE admins for their contribution to CMS!
Tier-2 Readiness Ranking Jan 1 - Aug 15, 2013
SPRACE Contribution to CMS (II)
• The resource fraction of SPRACE compared to all CMS Tier-2s is at the 3% level
• In terms of CPU consumption at SPRACE, the fraction of CMS analysis jobs in 2012 was 78%
[Chart: CPU consumption at SPRACE in 2012, in hours, split between Analysis and Production]
SPRACE Contribution to CMS (III)
• Fraction of analysis jobs at SPRACE over the total in all CMS Tier-2s
[Chart: percentage of analysis jobs run at SPRACE over all T2 sites, weekly, Jul 2009 – Jun 2013, in the 0–5% range; the Higgs discovery period is marked]
SPRACE Contribution to CMS (IV)
• Data Transfer Rates into SPRACE
[Chart: transfer rates into SPRACE since 2007, averaged over 1-week bins, typically in the 160–480 Mb/s range]
 – Peak performances are around 6.5 Gb/s
SPRACE Contribution to CMS (V)
• SPRACE also contributes to CMS Service Work
 – Automated mapping of all CMS contact people (in SiteDB) to a central ticketing service
 – Design of CMS Computing Shift procedures
 – Participation in CMS Computing Shifts
[Chart: achieved/expected service work [%] over 9 months, Americas sites: Rockefeller, TTU, UERJ, CBPF, UFL, UNESP, Mississippi, FIU, Univ. Maryland, Univ. Colorado; scale 0–3.5]
 – CMS Core Computing jobs can open doors for our younger people, also in industry!
Upgrades
• The Large Hadron Collider is currently being upgraded. Back in operation in 2015, with the energy closer to the 14 TeV design and the luminosity higher than design, there will be a higher rate of interesting events!
• There are also two years of CMS detector upgrades
• Computing is challenged by ambitious resource needs. The evolution of the computing model may help to mitigate some of these needs
Increased CPU Needs in 2015
• Event pile-up is expected to grow with increased instantaneous luminosity
 – Roughly a factor 2.5 increase in reconstruction time with the best code we currently have
• Trigger rates at thresholds similar to 2012 are expected to result in an 800–1200 Hz prompt reconstruction rate
 – This is just the core sample and results in a factor 2.5 increase
• The machine will move from 50 ns to 25 ns bunch spacing
 – This increases the reconstruction time by ~factor 2
• The combination of all these effects is a factor 12
How to mitigate CPU Needs 2015?
• Assume that the reconstruction slow-down from going to 25 ns can be solved
 – confirmed by early indications, so gaining back a factor 2
• We will move ~1/2 of the prompt reconstruction to Tier-1
 – reduces CPU needs at the Tier-0 level by a factor 2
• We assume that the operations model in 2015 will be more like 2012 than 2010
 – More organized processing at Tier-1 centers, which will primarily reconstruct MC and prompt data, while only 1 big reprocessing pass over the full sample will occur at the end of the year (also using the HLT). This is expected to provide another 50% improvement.
• These pretty aggressive plans for improvements bring a factor 6, but a factor 2 increase is still needed for Tier-0 and Tier-1 processing
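The growth and mitigation factors above combine multiplicatively; a quick back-of-the-envelope check (the individual factors are the approximate ones quoted in the slides):

```python
# Growth factors for 2015 CPU needs (approximate, from the slides)
pileup_factor = 2.5    # longer reconstruction at higher pile-up
trigger_factor = 2.5   # higher prompt-reconstruction rate
spacing_factor = 2.0   # 50 ns -> 25 ns bunch spacing

growth = pileup_factor * trigger_factor * spacing_factor  # ~12

# Mitigations: solve the 25 ns slow-down (x2), move half of the prompt
# reconstruction to Tier-1 (x2), a more organized 2012-like model (x1.5)
mitigation = 2.0 * 2.0 * 1.5  # factor 6

remaining = growth / mitigation  # ~factor 2 still needed
print(growth, mitigation, round(remaining, 1))
```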
Resource Request Summary
The future
• October '13 RRB: the 2015 computing resource request will be presented and discussed
 – for CMS, the 2015 request can be accommodated under a flat budget on average, but does require planning two years together
• A review of the Computing Model is in progress and will be presented, in a document common with the other LHC experiments, by the end of summer 2013
• In this presentation we concentrate on the evolution of the CMS software, the technology and the CMS Computing Model
Complicated Environments
• The LHC is already running at a higher number of interactions per crossing than design
 – Experiments have high trigger rates due to a higher fraction of interesting physics
 – This has required capacity, efficiency and constant improvements
 – We hope this continues, but improvements are hard fought
CMS Preparation for 8 TeV, ICHEP 2012, J-R Vlimant, CERN
Software Development for 2012
● Trying to cope with ever-improving LHC operation: more luminosity, more pile-up events, more complex events
● Improvement in the computing performance of the CMS software
● Physics performance of event reconstruction unaffected by technical modifications; improved under algorithm development
 ✔ Phase 1 in 2011 to cope with increased luminosity
 ✔ Phase 2 early 2012 to prepare for increasing luminosity and favor an increased trigger rate
● The main gain was achieved in tracking algorithm optimization
● Algorithm optimization and redesign, compiler architecture, memory-management improvements and the ROOT version all played a constructive-interference role
● Event processing time remained constant with increased pile-up (<30 s/evt)
[Chart annotations: ×~9, ÷~3, ÷~3]
Multicore scheduling
• May be generalized before the 2015 LHC startup
 – reduces the memory requirements on worker nodes
 – stabilizes the scaling, e.g. in the number of files or number of jobs, which is expected to drop by a factor 4 to 8
• Needs to be coordinated at the WLCG level, in order to ensure coherency among experiments and hence provide clear recommendations to sites
 – whole-node or partial-node scheduling
• CMS software readiness for multicore processing
 – CMSSW "forking mode" has been ready for 1.5 years
 – CMSSW is being reengineered as a multi-threaded framework, to be deployed in the Fall as the baseline for the future
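A minimal illustration of why whole-node, multi-core scheduling shrinks the job and file counts: one N-core job with worker processes replaces N single-core jobs, each of which would have produced its own output file. This is a sketch in plain Python; the `reconstruct` stand-in and the numbers are illustrative, not CMSSW code.

```python
from multiprocessing import Pool

def reconstruct(event_id):
    # Stand-in for per-event reconstruction work.
    return event_id * 2

def run_whole_node_job(events, cores):
    """One multi-core job: `cores` workers, a single merged output."""
    with Pool(processes=cores) as pool:
        results = pool.map(reconstruct, events)
    return results  # one output "file" instead of `cores` separate ones

if __name__ == "__main__":
    out = run_whole_node_job(range(100), cores=4)
    # One 4-core job (1 job, 1 file) replaces 4 single-core jobs (4 jobs, 4 files).
    print(len(out), "events reconstructed by one whole-node job")
```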
CPU Processor Evolution
• A major shift in the nature of processors has taken place
 – the performance of single sequential applications has roughly stalled, due to limits on power consumption
• The new trend is to concentrate on improving the "performance per Watt"
 – ARM processors, Intel Xeon Phi, GPGPUs, ...
• One of the challenges for sites will be to handle a potentially heterogeneous environment (e.g. x86 and ARM) in terms of scheduling/grid software
ARM Processors, and others...
• The new emphasis on performance/power in the server market and the ever-growing mobile market gives ARM an opportunity to challenge Intel in that space
 – Several ARMv7-based (32-bit) server products are now available, but the most interesting developments are ARMv8 (64-bit), expected by 2014
• CMS has successfully tested Monte-Carlo production workflows on ARMv7 (32-bit)
 – ATLAS and LHCb have also investigated ARM
 – This is promising; a combined effort on future processor technology is needed in order to guide WLCG sites
 – We need to avoid that grid SW/configuration/scheduling issues prevent us from using such new technologies
• As for Xeon Phi and GPGPUs, the time scale would be beyond 2016
• This assumes "heterogeneous" hardware solutions can be deployed at sites, with appropriate scheduling/grid software solutions
The Evolution of the Model
• Over its development, the evolution of the WLCG production grid has oscillated between structure and flexibility
 – In general, in the first run it was very deterministic
 – Jobs generally went to data that was placed there by operators and later by services
[Diagram: the original MONARC hierarchy (CERN; Tier-1s in Germany, USA/FermiLab, UK, France, Italy, NL, USA/Brookhaven; Tier-2 labs and universities; physics department desktops; Open Science Grid), annotated with the evolutions: ALICE remote access, PD2P/popularity-based placement, CMS full mesh]
Reducing Boundaries
• One of the first changes we are seeing is a flattening of the tiered structure of LHC computing
 – the functional differences between the layers are being reduced, and we want to use the system as one distributed system, not a collection of sites
• One concrete action is to separate the archival functionality from the other site functions
• Status: deployed at 2 Tier-1s
Impact
• Once you have split off the archives, there is no reason for a strict one-to-one mapping of disk and archive at Tier-1s
 – Archives could be used to stage datasets to any disk facility
• The quantum of data we let the archive manage is the dataset (TBs, rather than files of GBs)
• We need to ask how many archival facilities are needed
 – More than 1, but probably fewer than 10
Changes how we think of tiers
• Once you introduce the concept of an archival service that is decoupled from the Tier-1
 – The functional difference between Tier-1 and Tier-2 is based more on availability and support than on the size of services
 – From a functional perspective, the difference between Tier-1 and Tier-2 is small
 – The model begins to look less MONARC-like
Stretches into Other elements
• After Long Shutdown 1, CMS will likely reconstruct about half the data the first time at Tier-1s, in close to real time
 – Very little is unique about the functionality of the Tier-0
 – Some prompt calibration work uses Express data, but even that could probably be exported
[Diagram: T0 feeding three T1s]
ATLAS Computing - Ueda I. - ICHEP 2012.07.07.
Storage Federation
The current system is based on the "Data Grid" concept
• Jobs go to data -- access via LAN
• Replicate data for higher accessibility
 ‣ transfer the whole dataset
• Jobs have to be re-assigned when the data is not available there
"Storage Federation" provides new access modes & redundancy
• Jobs access data on shared storage resources via WAN
• Analysis jobs may not need all the information / all the files
 ‣ Transfer a part of the dataset
 ‣ File- and event-level caching
• A system of Xrootd 'redirectors' is the possible working solution today
 ‣ Work in the past year within the US ATLAS Computing Facility to develop the concept and test performance
 ‣ Tests being extended from regional to global
Wide Area Access
• All experiments will have the capability of wide-area access to data by the restart
 – 3 will use Xrootd-based data federations
• Development work on http-based access is ongoing
 – Similar in concept to a content delivery network; generally referred to as a data federation
(Sudhir Malik, ICHEP 2012, Melbourne, Australia, 4–11 July 2012)
Federation
• Remote access gives us data from one site; we need a federation to access data across all CMS sites
• Status:
 – has been used in production
 – 85% of sites are configured to fall back to open requests via the Xrootd federation
 – 75% of sites have joined the Xrootd federation
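The fall-back behavior the federation enables can be sketched as: try the site's local copy first and, on failure, reopen the file through a federation redirector. The redirector hostname and the open functions below are illustrative stand-ins, not the real CMSSW/Xrootd API.

```python
# Hypothetical redirector; the real CMS endpoint is not part of this sketch.
REDIRECTOR = "root://global-redirector.example"

def open_with_fallback(lfn, local_open, remote_open):
    """Try the site's local copy first; fall back to the federation."""
    try:
        return local_open(lfn)
    except FileNotFoundError:
        # Fall back: the redirector locates a replica at another site.
        return remote_open(f"{REDIRECTOR}/{lfn}")

# Toy stand-ins for storage, to exercise the logic:
local_copies = {"/store/data/a.root": "local bytes"}

def local_open(lfn):
    if lfn not in local_copies:
        raise FileNotFoundError(lfn)
    return local_copies[lfn]

def remote_open(url):
    return f"remote bytes via {url}"

print(open_with_fallback("/store/data/a.root", local_open, remote_open))
print(open_with_fallback("/store/data/b.root", local_open, remote_open))
```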
The Tier-1 Storage "Cloud"
• Once wide-area access to data exists, the boundaries between sites are reduced; CMS is negotiating to make more use of the LHC-OPN
 – Negotiations with sites for moving worker resources inside the OPN domain
 – Calculate and test the amount of access that could be sustained with our share
• Allows the collection of Tier-1s to be used as a single processing facility, working on the same data
• Instead of failing back to the archive, we would fall over to Xrootd if the data were accessible at another Tier-1
Flexibility
• There is some skepticism, because so much effort was invested in getting the data close
• Computing-intensive tasks like reprocessing can be sustained reading data from remote storage
 – The input size is small compared to the run time of the application
 – 50 kB/s per slot is enough to sustain the CMS application
 – Even thousands of cores can be reasonably fed with Gb/s
• Analysis work can be supported as well, as long as the data format allows only the needed objects to be read
• Intelligent I/O and high-capacity networks have changed the rules
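The 50 kB/s-per-slot figure directly implies the "thousands of cores fed with Gb/s" claim; a quick check (assuming decimal units, 1 kB = 1000 bytes):

```python
per_slot_bps = 50_000 * 8        # 50 kB/s per slot, in bits per second
link_bps = 1_000_000_000         # one 1 Gb/s wide-area link

slots_fed = link_bps // per_slot_bps
print(f"A 1 Gb/s link sustains ~{slots_fed} CMS application slots")
```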
Networking
• CERN is deploying a remote computing facility in Budapest
 – 200 Gb/s of networking between the centers, at 35 ms ping time
 – As experiments, we cannot really tell the difference where resources are installed
[Diagram: CERN ↔ Budapest, two 100 Gb/s links]
Networks
• These 100 Gb/s links are the first in production for WLCG
 – They will be the first of many
• We have reduced the differences in site functionality, then the differences in data accessibility, then even the perception that two sites are separate
• We can begin to think of the facility as one big center, not a cluster of centers
Resource Provisioning
• Through this evolution, the low-level services stay largely the same
• Most of the changes come from the actions and expectations of the experiments
[Diagram: lower-level services provide consistent interfaces to facilities: CE and SE at the site, with connections to batch (Globus- and CREAM-based) and to storage (SRM or Xrootd); an information system (BDII), FTS and WMS as higher-level services; VOMS and experiment services on top]
Changing the Services
• The WLCG service architecture has been reasonably stable for over a decade
 – This is beginning to change with new middleware for resource provisioning
• A variety of places are opening their resources to "Cloud"-type provisioning
 – From a site perspective, this is often chosen for cluster-management and flexibility reasons
 – Everything is virtualized, and services are put on top
• Nothing prevents a site from bringing up exactly the same environment currently deployed for the WLCG, but maybe that is not needed
Evolving the Infrastructure
• In the new resource-provisioning model, the pilot infrastructure communicates with the resource-provisioning tools directly
 – requesting groups of machines for periods of time
[Diagram: resource requests from the pilot factory flow either through a cloud interface, which starts VMs with pilots, or through a CE and batch queue, which starts pilots on worker nodes]
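The cloud-interface path in the diagram can be sketched as: the pilot factory sizes a request from the pending workload and asks the cloud for a group of machines for a period of time; each VM boots a pilot that joins the pool. The `CloudInterface` API here is hypothetical, a stand-in rather than OpenStack's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class CloudInterface:
    """Hypothetical cloud API: start VMs that each boot one pilot."""
    running: list = field(default_factory=list)

    def request_machines(self, n, lifetime_hours):
        vms = [f"vm-{len(self.running) + i}" for i in range(n)]
        self.running.extend(vms)
        return vms  # each VM contextualizes and starts a pilot

def provision(cloud, pending_jobs, jobs_per_vm=8, lifetime_hours=24):
    """Pilot factory: size the VM request from the pending workload."""
    needed = -(-pending_jobs // jobs_per_vm)  # ceiling division
    return cloud.request_machines(needed, lifetime_hours)

cloud = CloudInterface()
pilots = provision(cloud, pending_jobs=100)
print(f"{len(pilots)} VMs with pilots started")
```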
Trying this out
• CMS is trying to provision resources like this with the High Level Trigger farm
 – OpenStack interfaced to the pilot systems
• CMS got a production workflow running on ~3k cores, and the facility looks like just another destination, though no grid CE exists
 – It may become an important additional resource in the future
• Several sites have already requested similar connections to local resources
Outlook
• The efficient support by sites for data processing, analysis and storage is essential to the collaboration
• The LHC VOs need to adapt their software to multi-threaded processing and to new technologies
• The general outlook for the evolution of the CMS Computing Model is
 – a breaking of the boundaries between sites
 – less separation of functionality
 – a system more capable of being treated as a single large facility, rather than a cluster of nodes
• This will be more flexible and efficient, and able to incorporate other types of resources
Backup Slides
CMS Resource planning 5 years
[Charts, 2009–2013 (2013 = LS1): CMS Tier-0 CPU Pledge [kHS06], Tier-0 Disk Pledge [PB], Tier-0 Tape Pledge [PB]; CMS Tier-1 CPU Pledge [kHS06], Tier-1 Disk Pledge [PB], Tier-1 Tape Pledge [PB]; CMS Tier-2 CPU Pledge [kHS06], Tier-2 Disk Pledge [PB]]
WLCG Resources 5 years
[Charts, 2009–2013, stacked by experiment (ALICE, ATLAS, CMS, LHCb): Tier-0, Tier-1 and Tier-2 CPU Pledge [kHS06] and Disk Pledge [PB], plus Tier-0 and Tier-1 Tape Pledge [PB]]
WLCG Resource Distribution vs Tier
[Charts, 2009–2013, split by Tier-0/1/2: CPU Pledge vs Tier [kHS06]; Disk Pledge vs Tier [PB]]
• The contribution of CERN in 2012 was:
 – CPU: 24%
 – Disk: 19%
 – Tape: 40%
Production Performance 2012
• 7 billion simulated events at Tier-1 and Tier-2
• Reconstruction and pile-up combination at Tier-1 centers typically happens more than once; this is why the red fields are larger below
 – The successful migration to AOD reduced the data volume
[Charts: monthly number of simulated events in 2012; monthly size of simulated events in 2012 [TB]]
Data Transfer Performance 2012
• CMS has successfully followed the "full mesh" data-transfer philosophy for several years
 – 3290 links "commissioned" between sites (Dec 2012)
 – Tier-2-to-Tier-2 traffic is 50% of the Tier-1-to-Tier-2 traffic
 – On-going project to have dedicated links also to/from Tier-2 sites (LHCONE)
[Charts: average data transfer volume, 0.1–1000 TB/day, 2004-01 to 2012-01, across DC04, SC2, SC3, SC4, load tests, CSA06, CCRC08, general beam, heavy-ion beam and debug transfers; Tier-1-to-Tier-2 transfer rate in 2012 (~600 MB/s) and Tier-2-to-Tier-2 transfer rate in 2012 (~300 MB/s)]
Tier-2 Needs / Storage Needs 2015
• The increase of processing needs at Tier-2 is smaller, because the Tier-2s are sized to process 1 year's data, and we assume the concentration is on 2015 data
 – Tier-2 processing needs will increase in 2016
• Storage increases are smaller, because the storage is expected to hold the total collected data, and we assume the concentration is on the new 2015 data
 – storage needs will increase in 2016
• The following table summarizes the CMS resource request for 2014/15, in comparison with those of the previous couple of years
CMS Data/Workflow Management
CMSSW Reengineering
• To use multicore processing efficiently, three things are required at the CMSSW application level
 – reengineering of algorithms for parallel execution
 – reengineering of data structures to promote locality
 – reengineering of algorithms to factorize computationally intensive kernels and appropriate data structures
• Tracking is the obvious first place in the reconstruction to begin investigations (already started)
• Arguably, such changes may already be in production for Run 2
CMS Workflow Management
• All centrally processed CMS workflows are based on WMAgent + GlideinWMS (pilots)
 – The main evolution will be the deployment of a global glidein queue for all CMS jobs, with workflow-aware scheduling that can prioritize and distribute work according to the type of workflow
• CRAB3
 – Is able to submit to both PanDA and GlideinWMS for scheduling
 – Many improvements expected:
  • improved scalability, error tracking and monitoring
  • asynchronous stage-out (stage-out was the largest failure source in CRAB2)
  • automatic publication
  • a "thin client"
 – Latest status: beta-testing phase
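The asynchronous stage-out idea can be sketched as: the job enqueues its output and returns immediately, while a separate transfer worker drains the queue and performs the copies, so transient storage problems no longer fail the job itself. The queue and names below are illustrative, not the CRAB3 implementation.

```python
import queue
import threading

stageout_queue = queue.Queue()
staged = []

def transfer_worker():
    """Separate service: drains the queue and copies outputs out."""
    while True:
        output = stageout_queue.get()
        if output is None:        # shutdown sentinel
            break
        staged.append(output)     # stand-in for the actual remote copy
        stageout_queue.task_done()

def job_finishes(output_file):
    """The job only enqueues its output; it does not wait for the copy."""
    stageout_queue.put(output_file)

worker = threading.Thread(target=transfer_worker)
worker.start()
for i in range(3):
    job_finishes(f"output_{i}.root")
stageout_queue.put(None)
worker.join()
print(staged)
```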