6-October-2005 AICA@Udine L.Perini
1
Grid Computing and
High Energy Physics
LHC and the Experiments
The LHC Computing Grid
The Data Challenges and some first results from the ATLAS experiment
The Experiments and the challenges of their computing
• GRID is a "novel" technology
  – Many High Energy Physics experiments are now using it at some level, but…
  – The experiments at the new LHC collider at CERN (start of data taking foreseen for mid-2007) are the first ones with a computing model designed from the beginning for the GRID
• This talk will concentrate on the GRID as planned and used by the LHC experiments
  – And in the end on one experiment (ATLAS) specifically
CERN
CERN (founded 1954) = "Conseil Européen pour la Recherche Nucléaire",
now the "European Organisation for Nuclear Research"
CERN today:
  Annual budget: ~1000 MSFr (~700 M€)
  Staff members: 2650
  Member states: 20
  + 225 Fellows, + 270 Associates
  + 6000 CERN users
The Large Hadron Collider: a 27 km circumference tunnel
Particle Physics: establish a periodic system of the fundamental building blocks, and understand the forces
LHC: protons colliding at E = 14 TeV
Creating conditions similar to the Big Bang
The most powerful microscope
From raw data to physics results
[Figure: an e+ e- → Z0 → f f̄ event, traced through the processing chain]
• Simulation (Monte-Carlo): basic physics → fragmentation, decay → interaction with detector material → detector response
• Raw data: the digitized detector output (e.g. 2037 2446 1733 1699 4003 3611 …)
• Reconstruction: convert to physics quantities – apply calibration and alignment, pattern recognition, particle identification
• Analysis: physics analysis → results
Challenge 1: Large, distributed community
ATLAS, CMS, LHCb: ~5000 physicists around the world – around the clock
"Offline" software effort: 1000 person-years per experiment
Software life span: 20 years
LHC users and participating institutes
Europe: 267 institutes, 4603 users
Elsewhere: 208 institutes, 1632 users
LCG: The worldwide Grid project
ATLAS is not one experiment:
Higgs
Extra Dimensions
Heavy Ion Physics
QCD
Electroweak
B physics
SUSY
First physics analysis expected to start in 2008
Challenge 2: Data Volume
Annual data storage: 12-14 PetaBytes/year
[Figure: a CD stack holding 1 year of LHC data would be ~20 km tall – compared with a balloon at 30 km, Concorde at 15 km, and Mt. Blanc at 4.8 km]
(50 CD-ROMs = 35 GB = a 6 cm stack)
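The ~20 km figure can be checked directly from the numbers on the slide. A minimal sketch in Python, assuming the quoted 50 CD-ROMs = 35 GB = 6 cm and taking the middle of the 12-14 PB/year range:

```python
# Sanity check of the CD-stack comparison (slide figures assumed:
# 50 CD-ROMs hold 35 GB and stack 6 cm high).
gb_per_cd = 35 / 50          # 0.7 GB per CD
cm_per_cd = 6 / 50           # 0.12 cm per CD
annual_pb = 13               # middle of the 12-14 PB/year range
annual_gb = annual_pb * 1e6  # 1 PB = 10^6 GB
n_cds = annual_gb / gb_per_cd
stack_km = n_cds * cm_per_cd / 1e5   # cm -> km
print(f"~{n_cds/1e6:.1f} million CDs, stack ~{stack_km:.0f} km")
```

This gives a stack of roughly 22 km, consistent with the "~20 km" on the slide.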
Challenge 3: Find the Needle in a Haystack
The Higgs signal is ~9 orders of magnitude below the rate of all interactions!
Rare phenomena – huge background – complex events
Therefore: Provide mountains of CPU
For LHC computing (calibration, reconstruction, simulation, analysis), some 100 Million SPECint2000 are needed!
1 SPECint2000 = 0.1 SPECint95 = 1 CERN-unit = 4 MIPS
– a 3 GHz Pentium 4 has ~1000 SPECint2000
Produced by Intel today in ~6 hours
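Using the conversion on the slide, the headline number works out to roughly a hundred thousand PC processors:

```python
# Scale of the CPU requirement quoted on the slide.
total_si2k = 100e6        # ~100 million SPECint2000 needed for LHC computing
per_cpu_si2k = 1000       # a 3 GHz Pentium 4 delivers ~1000 SPECint2000
n_cpus = total_si2k / per_cpu_si2k
print(f"~{n_cpus:,.0f} Pentium-4-class CPUs")
```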
The CERN Computing Centre
Even with technology-driven improvements in performance and costs, CERN can provide nowhere near enough capacity for LHC!
~2,400 processors, ~200 TBytes of disk, ~12 PB of magnetic tape
LCG: the LHC Computing Grid
• A single BIG computing centre is not the best solution for the challenges we have seen
  – Single points of failure
  – Difficult to handle costs
  – Countries dislike paying checks without having local returns and sharing of responsibilities…
• Use the Grid idea and plan a really distributed computing system: LCG
  – In June 2005 the Grid-based Computing Technical Design Reports of the 4 experiments and of LCG were published
The LCG Project
• Approved by the CERN Council in September 2001
• Phase 1 (2001-2004):
  – Development and prototyping of a distributed production prototype at CERN and elsewhere, operated as a platform for the data challenges
  – Leading to a Technical Design Report, which serves as a basis for agreeing the relations between the distributed Grid nodes and their co-ordinated deployment and exploitation
• Phase 2 (2005-2007):
  – Installation and operation of the full world-wide initial production Grid system, requiring continued manpower efforts and substantial material resources
• A Memorandum of Understanding
  – Has been developed, defining the Worldwide LHC Computing Grid Collaboration of CERN as host lab and the major computing centres
  – Defines the organizational structure for Phase 2 of the project
What is the Grid?
• Resource Sharing
  – On a global scale, across the labs/universities
• Secure Access
  – Needs a high level of trust
• Resource Use
  – Load balancing, making most efficient use
• The "Death of Distance"
  – Requires excellent networking
• Open Standards
  – Allow constructive distributed development
• There is not (yet) a single Grid
[Figure: network land-speed records – 5.44 Gbps; 1.1 TB in 30 min.; 6.25 Gbps on 20 April 2004]
How will it work? The GRID middleware:
• Finds convenient places for the scientist's "job" (computing task) to be run
• Optimises use of the widely dispersed resources
• Organises efficient access to scientific data
• Deals with authentication to the different sites that the scientists will be using
• Interfaces to local site authorisation and resource allocation policies
• Runs the jobs
• Monitors progress
• Recovers from problems
… and …
• Tells you when the work is complete and transfers the result back!
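The steps above can be sketched as a toy job-brokering loop. Every class, field, and site entry here is invented for illustration; this mimics the idea of matchmaking, monitoring, and retry, not any actual LCG/EGEE middleware API:

```python
# Toy sketch of the middleware job lifecycle described above (illustrative
# names only; not a real grid API).
import random

class GridBroker:
    def __init__(self, sites):
        self.sites = sites                  # candidate computing sites

    def match(self, job):
        # find a convenient place: the least-loaded site holding the dataset
        candidates = [s for s in self.sites if job["dataset"] in s["datasets"]]
        return min(candidates, key=lambda s: s["load"])

    def run(self, job, max_retries=3):
        for _ in range(max_retries):        # recover from problems: retry
            site = self.match(job)
            if random.random() > site["failure_rate"]:
                return {"site": site["name"], "status": "done"}  # result back
        return {"status": "failed"}

random.seed(0)                              # reproducible demo
sites = [
    {"name": "CERN", "load": 0.9, "failure_rate": 0.1, "datasets": {"dc2"}},
    {"name": "CNAF", "load": 0.4, "failure_rate": 0.1, "datasets": {"dc2"}},
]
result = GridBroker(sites).run({"dataset": "dc2"})
print(result)
```

Real brokers add data-location weighting, authentication, and accounting on top of this basic match-run-retry loop.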
The LHC Computing Grid Project - LCG
Collaboration: LHC Experiments; Grid projects in Europe and the US; Regional & national centres
Choices:
  Adopt Grid technology
  Go for a "Tier" hierarchy
  Use Intel CPUs in standard PCs
  Use the LINUX operating system
Goal: Prepare and deploy the computing environment to help the experiments analyse the data from the LHC detectors
[Figure: the Tier hierarchy – CERN Tier 0; Tier 1 centres at CERN and in Germany, USA, UK, France, Italy, Taipei, Japan; Tier 2 labs and universities forming grids for regional groups; Tier 3 physics department resources and desktops forming grids for physics study groups]
Cooperation with other projects
• Grid Software
  – Globus, Condor and VDT have provided key components of the middleware used. Key members participate in OSG and EGEE.
  – Enabling Grids for E-sciencE (EGEE) includes a substantial middleware activity.
• Grid Operational Groupings
  – The majority of the resources used are made available as part of the EGEE Grid (~140 sites, 12,000 processors). EGEE also supports Core Infrastructure Centres and Regional Operations Centres.
  – The US LHC programmes contribute to and depend on the Open Science Grid (OSG). Formal relationship with LCG through the US-ATLAS and US-CMS computing projects.
• Network Services
  – LCG will be one of the most demanding applications of national research networks such as the pan-European backbone network, GÉANT.
  – The Nordic Data Grid Facility (NDGF) will begin operation in 2006. Prototype work is based on the NorduGrid middleware ARC.
Grid Projects
Until deployments provide interoperability, the experiments must provide it themselves.
ATLAS must span 3 major Grid deployments.
EGEE
• Proposal submitted to the EU IST 6th framework
• Project started April 1st 2004
• Total approved budget of approximately 32 M€ over 2 years of activities
• Deployment and operation of Grid Infrastructure (SA1)
• Re-engineering of grid middleware (WSRF environment) (JRA1)
• Dissemination, Training and Applications (NA4)
• Italy takes part in all 3 areas of activity, with global financing of 4.7 M€
• EGEE2 project with similar funding submitted for a further 2 years of work
11 regional federations covering 70 partners in 26 countries
EGEE Activities
JRA1: Middleware Engineering and Integration
JRA2: Quality Assurance
JRA3: Security
JRA4: Network Services Development
SA1: Grid Operations, Support and Management
SA2: Network Resource Provision
NA1: Management
NA2: Dissemination and Outreach
NA3: User Training and Education
NA4: Application Identification and Support
NA5: Policy and International Cooperation
Funding split: 24% Joint Research, 28% Networking, 48% Services
Emphasis in EGEE is on operating a production grid and supporting the end-users
Starts 1st April 2004 for 2 years (1st phase) with EU funding of ~32 M€
LCG/EGEE coordination
• LCG Project Leader in the EGEE Project Management Board
  – Most of the other members are representatives of HEP funding agencies and CERN
• EGEE Project Director in the LCG Project Overview Board
• Middleware and Operations are common to both LCG and EGEE
• Cross-representation on Project Executive Boards
  – EGEE Technical Director in the LCG PEB
• EGEE HEP applications hosted in the CERN/EP division
The Hierarchical "Tier" Model
• Tier-0 at CERN
  – Record RAW data (1.25 GB/s for ALICE)
  – Distribute a second copy to the Tier-1s
  – Calibrate and do first-pass reconstruction
• Tier-1 centres (11 defined)
  – Manage permanent storage – RAW, simulated, processed
  – Capacity for reprocessing, bulk analysis
• Tier-2 centres (>~100 identified)
  – Monte Carlo event simulation
  – End-user analysis
• Tier-3
  – Facilities at universities and laboratories
  – Access to data and processing in Tier-2s, Tier-1s
  – Outside the scope of the project
Tier-1s

Tier-1 Centre (experiments served with priority) | ALICE | ATLAS | CMS | LHCb
TRIUMF, Canada                                   |       |   X   |     |
GridKA, Germany                                  |   X   |   X   |  X  |  X
CC-IN2P3, France                                 |   X   |   X   |  X  |  X
CNAF, Italy                                      |   X   |   X   |  X  |  X
SARA/NIKHEF, NL                                  |   X   |   X   |     |  X
Nordic Data Grid Facility (NDGF)                 |   X   |   X   |  X  |
ASCC, Taipei                                     |       |   X   |  X  |
RAL, UK                                          |   X   |   X   |  X  |  X
BNL, US                                          |       |   X   |     |
FNAL, US                                         |       |       |  X  |
PIC, Spain                                       |       |   X   |  X  |  X
Tier-2s
~100 identified – number still growing
The Eventflow

Experiment | Rate [Hz] | RAW [MB] | ESD/rDST/RECO [MB] | AOD [kB] | Monte Carlo [MB/evt] | Monte Carlo [% of real]
ALICE HI   |   100     |  12.5    |  2.5               |  250     |  300                 | 100
ALICE pp   |   100     |   1      |  0.04              |    4     |    0.4               | 100
ATLAS      |   200     |   1.6    |  0.5               |  100     |    2                 |  20
CMS        |   150     |   1.5    |  0.25              |   50     |    2                 | 100
LHCb       |  2000     |   0.025  |  0.025             |    0.5   |   20                 |  –

50 days running in 2007
10^7 seconds/year pp from 2008 on → ~10^9 events/experiment
10^6 seconds/year heavy ion
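Multiplying the trigger rates by the RAW event sizes and the running time gives the raw-data volumes behind the 12-14 PB/year figure quoted earlier. A sketch using the table's numbers (derived data formats and extra copies are not counted here, which is why the sum comes out lower than the total storage figure):

```python
# RAW data volume per experiment: rate [Hz] x event size [MB] x seconds/year
# (10^7 s/year for pp running, 10^6 s/year for heavy ions, as on the slide).
experiments = {               # name: (rate_hz, raw_mb, seconds_per_year)
    "ALICE HI": (100,  12.5,  1e6),
    "ALICE pp": (100,   1.0,  1e7),
    "ATLAS":    (200,   1.6,  1e7),
    "CMS":      (150,   1.5,  1e7),
    "LHCb":     (2000, 0.025, 1e7),
}
total_pb = 0.0
for name, (rate, raw_mb, secs) in experiments.items():
    pb = rate * raw_mb * secs / 1e9       # MB -> PB
    total_pb += pb
    print(f"{name:9s} ~{pb:5.2f} PB/year RAW")
print(f"total     ~{total_pb:5.2f} PB/year RAW (before ESD/AOD/MC and copies)")
```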
CPU Requirements
[Figure: stacked bar chart of CPU needs in MSI2000, growing from ~50 in 2007 to ~350 in 2010, broken down by experiment (ALICE, ATLAS, CMS, LHCb) at CERN, the Tier-1s, and the Tier-2s; 58% pledged]
Disk Requirements
[Figure: stacked bar chart of disk needs in PB, growing from ~20 in 2007 to ~160 in 2010, broken down by experiment (ALICE, ATLAS, CMS, LHCb) at CERN, the Tier-1s, and the Tier-2s; 54% pledged]
Tape Requirements
[Figure: stacked bar chart of tape needs in PB, growing to ~160 by 2010, broken down by experiment (ALICE, ATLAS, CMS, LHCb) at CERN and the Tier-1s; 75% pledged]
Experiments' Requirements
• Single Virtual Organization (VO) across the Grid
• Standard interfaces for Grid access to Storage Elements (SEs) and Computing Elements (CEs)
• Need of a reliable Workload Management System (WMS) to efficiently exploit distributed resources
• Non-event data, such as calibration and alignment data but also detector construction descriptions, will be held in databases
  – Read/write access to central (Oracle) databases at Tier-0, read access at Tier-1s, with a local database cache at Tier-2s
• Analysis scenarios and specific requirements are still evolving
  – Prototype work is in progress (ARDA)
• Online requirements are outside the scope of LCG, but there are connections:
  – Raw data transfer and buffering
  – Database management and data export
  – Some potential use of Event Filter Farms for offline processing
Architecture – Grid services
• Storage Element
  – Mass Storage System (MSS): CASTOR, Enstore, HPSS, dCache, etc.
  – Storage Resource Manager (SRM) provides a common way to access the MSS, independent of implementation
  – File Transfer Services (FTS) provided e.g. by GridFTP or srmCopy
• Computing Element
  – Interface to local batch system, e.g. Globus gatekeeper
  – Accounting, status query, job monitoring
• Virtual Organization Management
  – Virtual Organization Management Services (VOMS)
  – Authentication and authorization based on the VOMS model
• Grid Catalogue Services
  – Mapping of Globally Unique Identifiers (GUIDs) to local file names
  – Hierarchical namespace, access control
• Interoperability
  – EGEE and OSG both use the Virtual Data Toolkit (VDT)
  – Different implementations are hidden by common interfaces
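The catalogue idea – GUIDs mapped to site-local replicas – can be sketched in a few lines. The class below is a toy illustration of the concept, not the API of any real grid catalogue:

```python
# Toy sketch of a GUID-to-replica mapping, the core idea behind grid file
# catalogues (illustrative only; not a real catalogue interface).
class FileCatalogue:
    def __init__(self):
        self.by_guid = {}          # GUID -> set of (site, local file name)

    def register(self, guid, site, local_name):
        self.by_guid.setdefault(guid, set()).add((site, local_name))

    def replicas(self, guid):
        # a job broker would pick the replica closest to the chosen site
        return sorted(self.by_guid.get(guid, set()))

cat = FileCatalogue()
cat.register("guid-1234", "CERN", "/castor/cern.ch/atlas/raw/f1")
cat.register("guid-1234", "CNAF", "/storage/cnaf.it/atlas/raw/f1")
print(cat.replicas("guid-1234"))
```

The common-interface point above is exactly this: jobs refer to the GUID, and the catalogue hides where and under what local name the replicas actually live.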
Prototypes
• It is important that the hardware and software systems developed in the framework of LCG be exercised in more and more demanding challenges
  – Data Challenges have now been done by all experiments
  – Though the main goal was to validate the distributed computing model and to gradually build the computing systems, the results have been used for physics performance studies and for detector, trigger, and DAQ design
  – Limitations of the Grids have been identified and are being addressed
• Presently, a series of Service Challenges aims at realistic end-to-end testing of experiment use-cases over an extended period, leading to stable production services
Data Challenges
• ALICE
  – PDC04 using AliEn services, native or interfaced to the LCG Grid: 400,000 jobs run, producing 40 TB of data for the Physics Performance Report
  – PDC05: event simulation, first-pass reconstruction, transmission to Tier-1 sites, second-pass reconstruction (calibration and storage), analysis with PROOF – using Grid services from LCG SC3 and AliEn
• ATLAS
  – Using tools and resources from LCG, NorduGrid, and Grid3 at 133 sites in 30 countries with over 10,000 processors, 235,000 jobs produced more than 30 TB of data via an automatic production system in 2004
  – In 2005, production for the Physics Workshop in Rome – next slides
• CMS
  – 100 TB of simulated data reconstructed at a rate of 25 Hz, distributed to the Tier-1 sites and reprocessed there
• LHCb
  – LCG provided more than 50% of the capacity for the first data challenge 2004-2005. The production used the DIRAC system.
ATLAS Production System
[Figure: production system architecture – supervisors (Windmill) talk via jabber/soap to per-Grid executors (Lexor for LCG, Dulcinea for NorduGrid, Capone for Grid3, plus an LSF executor), fed from the production database (ProdDB) and AMI; the Don Quijote data management system spans the three Grid replica catalogues (RLS)]
A big problem is data management:
Must cope with >= 3 Grid catalogues
Demands are even greater for analysis
ATLAS: massive productions on 3 Grids
• July-September 2004: DC2 Geant-4 simulation (long jobs)
  – 40% on the LCG/EGEE Grid, 30% on Grid3 and 30% on NorduGrid
• October-December 2004: DC2 digitization and reconstruction (short jobs)
• February-May 2005: Rome production (a mix of jobs, as digitization and reconstruction started as soon as samples had been simulated)
  – 65% on the LCG/EGEE Grid, 24% on Grid3, 11% on NorduGrid
CPU consumption for the CPU-intensive simulation phase (till May 20th):
  Grid3: 80 kSI2K·years; NorduGrid: 22 kSI2K·years; LCG total: 178 kSI2K·years; Total: 280 kSI2K·years
Note: this CPU was almost fully consumed in 40 days, and the results were used for the real physics analysis presented at the Rome Workshop, with the participation of >400 ATLAS physicists.
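As a back-of-the-envelope check (using the ~1 kSI2K per 3 GHz Pentium 4 conversion from the CPU slide, an assumption carried over, not stated here), 280 kSI2K·years delivered in 40 days implies a large sustained capacity:

```python
# Average sustained capacity implied by "280 kSI2K-years consumed in 40 days".
work_ksi2k_years = 280
days = 40
avg_ksi2k = work_ksi2k_years * 365 / days     # sustained kSI2K over the period
# with ~1 kSI2K per 3 GHz Pentium 4, this is roughly 2500+ Pentium-4-class
# CPUs busy around the clock for the whole 40 days
print(f"~{avg_ksi2k:.0f} kSI2K sustained")
```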
Rome production statistics
[Figure: pie chart of the number of jobs per Grid – LCG 34%, LCG-CG 31%, Grid3 24%, NorduGrid 11%]
• 73 data sets containing 6.1M events simulated and reconstructed (without pile-up)
• Total simulated data: 8.5M events
• Pile-up done later (1.3M events done, 50K reconstructed)
This is the first successful use of the Grid by a large user community; it has, however, also revealed several shortcomings which now need to be fixed, as LHC turn-on is only two years ahead!
Very instructive comments from the user feedback were presented at the Workshop (obviously this was one of the main themes and purposes of the meeting).
All this is available on the Web.
ATLAS Rome production: countries (sites)
Austria (1), Canada (3), CERN (1), Czech Republic (2), Denmark (3), France (4), Germany (1+2), Greece (1), Hungary (1), Italy (17), Netherlands (2), Norway (2), Poland (1), Portugal (1), Russia (2), Slovakia (1), Slovenia (1), Spain (3), Sweden (5), Switzerland (1+1), Taiwan (1), UK (8), USA (19)
In total: 22 countries, 84 sites
(per-Grid subsets shown on the slide: 17 countries, 51 sites; 7 countries, 14 sites)
Status and plans for ATLAS production on LCG
• The global efficiency of the ATLAS production for Rome was good in the WMS area (>95%), while improvements are still needed in the Data Management area (~75%)
  – WMS speed, however, also needs improvement
• ATLAS is ready to test new EGEE middleware components as soon as they are released from the internal certification process:
  – The File Transfer Service and the LCG File Catalogue, together with the new ATLAS Data Management layer
  – The new (gLite) version of the WMS, with support for bulk submission, task queue and pull model
• Accounting, monitoring and priority (VOMS role- and group-based) systems are expected to be in production use for the new big production rounds in mid-2006
Conclusions
• The HEP experiments at the LHC collider are committed to GRID-based computing
• The LHC Computing Grid Project is providing the common effort needed for supporting them
• EU- and US-funded Grid projects develop, maintain and deploy the middleware
• In the last year the Data Challenges have demonstrated the feasibility of huge real productions
  – Still much work needs to be done in the next 2 years to meet the challenge of the real data to be analyzed