6-October-2005 AICA@Udine L.Perini
1
Grid Computing and
High Energy Physics
LHC and the Experiments
The LHC Computing Grid
The Data Challenges and some first results from the ATLAS experiment
The Experiments and the challenges of their computing
• GRID is a "novel" technology
  – Many High Energy Physics experiments are now using it at some level, but…
  – The experiments at the new LHC collider at CERN (start of data taking foreseen for mid-2007) are the first ones with a computing model designed from the beginning for the GRID
• This talk will concentrate on the GRID as planned and used by the LHC experiments
  – And in the end on one experiment (ATLAS) specifically
CERN
CERN (founded 1954) = "Conseil Européen pour la Recherche Nucléaire",
now the "European Organisation for Nuclear Research"
CERN today:
  Annual budget: ~1000 MSFr (~700 M€)
  Staff members: 2650
  Member states: 20
  + 225 Fellows, + 270 Associates
  + 6000 CERN users
The Large Hadron Collider: a 27 km circumference tunnel
Particle Physics: establish a periodic system of the fundamental building blocks, and understand the forces
LHC: protons colliding at E = 14 TeV
Creating conditions similar to the Big Bang
The most powerful microscope
From raw data to physics results
[Figure: an e+ e- → Z0 → f f̄ event, traced through the processing chain]
• Simulation (Monte-Carlo): basic physics → fragmentation, decay → interaction with detector material → detector response
• Raw data: the digitized detector output (e.g. 2037 2446 1733 1699 4003 3611 …)
• Reconstruction: convert to physics quantities – apply calibration and alignment, pattern recognition, particle identification
• Analysis: physics analysis → results
Challenge 1: Large, distributed community
ATLAS, CMS, LHCb: ~5000 physicists around the world – around the clock
"Offline" software effort: 1000 person-years per experiment
Software life span: 20 years
LHC users and participating institutes
Europe: 267 institutes, 4603 users
Elsewhere: 208 institutes, 1632 users
LCG: The worldwide Grid project
ATLAS is not one experiment:
Higgs
Extra Dimensions
Heavy Ion Physics
QCD
Electroweak
B physics
SUSY
First physics analysis expected to start in 2008
Challenge 2: Data Volume
Annual data storage: 12-14 PetaBytes/year
[Figure: a CD stack holding 1 year of LHC data would be ~20 km tall – compared with a balloon at 30 km, Concorde at 15 km, and Mt. Blanc at 4.8 km]
(50 CD-ROMs = 35 GB = a 6 cm stack)
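The ~20 km figure can be checked directly from the numbers on the slide. A minimal sketch in Python, assuming the quoted 50 CD-ROMs = 35 GB = 6 cm and taking the middle of the 12-14 PB/year range:

```python
# Sanity check of the CD-stack comparison (slide figures assumed:
# 50 CD-ROMs hold 35 GB and stack 6 cm high).
gb_per_cd = 35 / 50          # 0.7 GB per CD
cm_per_cd = 6 / 50           # 0.12 cm per CD
annual_pb = 13               # middle of the 12-14 PB/year range
annual_gb = annual_pb * 1e6  # 1 PB = 10^6 GB
n_cds = annual_gb / gb_per_cd
stack_km = n_cds * cm_per_cd / 1e5   # cm -> km
print(f"~{n_cds/1e6:.1f} million CDs, stack ~{stack_km:.0f} km")
```

This gives a stack of roughly 22 km, consistent with the "~20 km" on the slide.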
Challenge 3: Find the Needle in a Haystack
The Higgs signal is ~9 orders of magnitude below the rate of all interactions!
Rare phenomena – huge background – complex events
Therefore: Provide mountains of CPU
For LHC computing (calibration, reconstruction, simulation, analysis), some 100 Million SPECint2000 are needed!
1 SPECint2000 = 0.1 SPECint95 = 1 CERN-unit = 4 MIPS
– a 3 GHz Pentium 4 has ~1000 SPECint2000
Produced by Intel today in ~6 hours
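Using the conversion on the slide, the headline number works out to roughly a hundred thousand PC processors:

```python
# Scale of the CPU requirement quoted on the slide.
total_si2k = 100e6        # ~100 million SPECint2000 needed for LHC computing
per_cpu_si2k = 1000       # a 3 GHz Pentium 4 delivers ~1000 SPECint2000
n_cpus = total_si2k / per_cpu_si2k
print(f"~{n_cpus:,.0f} Pentium-4-class CPUs")
```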
The CERN Computing Centre
Even with technology-driven improvements in performance and costs, CERN can provide nowhere near enough capacity for LHC!
~2,400 processors, ~200 TBytes of disk, ~12 PB of magnetic tape
LCG: the LHC Computing Grid
• A single BIG computing centre is not the best solution for the challenges we have seen
  – Single points of failure
  – Difficult to handle costs
  – Countries dislike paying checks without having local returns and sharing of responsibilities…
• Use the Grid idea and plan a really distributed computing system: LCG
  – In June 2005 the Grid-based Computing Technical Design Reports of the 4 experiments and of LCG were published
The LCG Project
• Approved by the CERN Council in September 2001
• Phase 1 (2001-2004):
  – Development and prototyping of a distributed production prototype at CERN and elsewhere, operated as a platform for the data challenges
  – Leading to a Technical Design Report, which serves as a basis for agreeing the relations between the distributed Grid nodes and their co-ordinated deployment and exploitation
• Phase 2 (2005-2007):
  – Installation and operation of the full world-wide initial production Grid system, requiring continued manpower efforts and substantial material resources
• A Memorandum of Understanding
  – Has been developed, defining the Worldwide LHC Computing Grid Collaboration of CERN as host lab and the major computing centres
  – Defines the organizational structure for Phase 2 of the project
What is the Grid?
• Resource Sharing
  – On a global scale, across the labs/universities
• Secure Access
  – Needs a high level of trust
• Resource Use
  – Load balancing, making most efficient use
• The "Death of Distance"
  – Requires excellent networking
• Open Standards
  – Allow constructive distributed development
• There is not (yet) a single Grid
[Figure: network land-speed records – 5.44 Gbps; 1.1 TB in 30 min.; 6.25 Gbps on 20 April 2004]
How will it work? The GRID middleware:
• Finds convenient places for the scientist's "job" (computing task) to be run
• Optimises use of the widely dispersed resources
• Organises efficient access to scientific data
• Deals with authentication to the different sites that the scientists will be using
• Interfaces to local site authorisation and resource allocation policies
• Runs the jobs
• Monitors progress
• Recovers from problems
… and …
• Tells you when the work is complete and transfers the result back!
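The steps above can be sketched as a toy job-brokering loop. Every class, field, and site entry here is invented for illustration; this mimics the idea of matchmaking, monitoring, and retry, not any actual LCG/EGEE middleware API:

```python
# Toy sketch of the middleware job lifecycle described above (illustrative
# names only; not a real grid API).
import random

class GridBroker:
    def __init__(self, sites):
        self.sites = sites                  # candidate computing sites

    def match(self, job):
        # find a convenient place: the least-loaded site holding the dataset
        candidates = [s for s in self.sites if job["dataset"] in s["datasets"]]
        return min(candidates, key=lambda s: s["load"])

    def run(self, job, max_retries=3):
        for _ in range(max_retries):        # recover from problems: retry
            site = self.match(job)
            if random.random() > site["failure_rate"]:
                return {"site": site["name"], "status": "done"}  # result back
        return {"status": "failed"}

random.seed(0)                              # reproducible demo
sites = [
    {"name": "CERN", "load": 0.9, "failure_rate": 0.1, "datasets": {"dc2"}},
    {"name": "CNAF", "load": 0.4, "failure_rate": 0.1, "datasets": {"dc2"}},
]
result = GridBroker(sites).run({"dataset": "dc2"})
print(result)
```

Real brokers add data-location weighting, authentication, and accounting on top of this basic match-run-retry loop.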
The LHC Computing Grid Project - LCG
Collaboration: LHC Experiments; Grid projects in Europe and the US; Regional & national centres
Choices:
  Adopt Grid technology
  Go for a "Tier" hierarchy
  Use Intel CPUs in standard PCs
  Use the LINUX operating system
Goal: Prepare and deploy the computing environment to help the experiments analyse the data from the LHC detectors
[Figure: the Tier hierarchy – CERN Tier 0; Tier 1 centres at CERN and in Germany, USA, UK, France, Italy, Taipei, Japan; Tier 2 labs and universities forming grids for regional groups; Tier 3 physics department resources and desktops forming grids for physics study groups]
Cooperation with other projects
• Grid Software
  – Globus, Condor and VDT have provided key components of the middleware used. Key members participate in OSG and EGEE.
  – Enabling Grids for E-sciencE (EGEE) includes a substantial middleware activity.
• Grid Operational Groupings
  – The majority of the resources used are made available as part of the EGEE Grid (~140 sites, 12,000 processors). EGEE also supports Core Infrastructure Centres and Regional Operations Centres.
  – The US LHC programmes contribute to and depend on the Open Science Grid (OSG). Formal relationship with LCG through the US-ATLAS and US-CMS computing projects.
• Network Services
  – LCG will be one of the most demanding applications of national research networks such as the pan-European backbone network, GÉANT.
  – The Nordic Data Grid Facility (NDGF) will begin operation in 2006. Prototype work is based on the NorduGrid middleware ARC.
Grid Projects
Until deployments provide interoperability, the experiments must provide it themselves.
ATLAS must span 3 major Grid deployments.
EGEE
• Proposal submitted to the EU IST 6th framework
• Project started April 1st 2004
• Total approved budget of approximately 32 M€ over 2 years of activities
• Deployment and operation of Grid Infrastructure (SA1)
• Re-engineering of grid middleware (WSRF environment) (JRA1)
• Dissemination, Training and Applications (NA4)
• Italy takes part in all 3 areas of activity, with global financing of 4.7 M€
• EGEE2 project with similar funding submitted for a further 2 years of work
11 regional federations covering 70 partners in 26 countries
EGEE Activities
JRA1: Middleware Engineering and Integration
JRA2: Quality Assurance
JRA3: Security
JRA4: Network Services Development
SA1: Grid Operations, Support and Management
SA2: Network Resource Provision
NA1: Management
NA2: Dissemination and Outreach
NA3: User Training and Education
NA4: Application Identification and Support
NA5: Policy and International Cooperation
Funding split: 24% Joint Research, 28% Networking, 48% Services
Emphasis in EGEE is on operating a production grid and supporting the end-users
Starts 1st April 2004 for 2 years (1st phase) with EU funding of ~32 M€
LCG/EGEE coordination
• LCG Project Leader in the EGEE Project Management Board
  – Most of the other members are representatives of HEP funding agencies and CERN
• EGEE Project Director in the LCG Project Overview Board
• Middleware and Operations are common to both LCG and EGEE
• Cross-representation on Project Executive Boards
  – EGEE Technical Director in the LCG PEB
• EGEE HEP applications hosted in the CERN/EP division
The Hierarchical "Tier" Model
• Tier-0 at CERN
  – Record RAW data (1.25 GB/s for ALICE)
  – Distribute a second copy to the Tier-1s
  – Calibrate and do first-pass reconstruction
• Tier-1 centres (11 defined)
  – Manage permanent storage – RAW, simulated, processed
  – Capacity for reprocessing, bulk analysis
• Tier-2 centres (>~100 identified)
  – Monte Carlo event simulation
  – End-user analysis
• Tier-3
  – Facilities at universities and laboratories
  – Access to data and processing in Tier-2s, Tier-1s
  – Outside the scope of the project
Tier-1s

Tier-1 Centre (experiments served with priority) | ALICE | ATLAS | CMS | LHCb
TRIUMF, Canada                                   |       |   X   |     |
GridKA, Germany                                  |   X   |   X   |  X  |  X
CC-IN2P3, France                                 |   X   |   X   |  X  |  X
CNAF, Italy                                      |   X   |   X   |  X  |  X
SARA/NIKHEF, NL                                  |   X   |   X   |     |  X
Nordic Data Grid Facility (NDGF)                 |   X   |   X   |  X  |
ASCC, Taipei                                     |       |   X   |  X  |
RAL, UK                                          |   X   |   X   |  X  |  X
BNL, US                                          |       |   X   |     |
FNAL, US                                         |       |       |  X  |
PIC, Spain                                       |       |   X   |  X  |  X
Tier-2s
~100 identified – number still growing
The Eventflow

Experiment | Rate [Hz] | RAW [MB] | ESD/rDST/RECO [MB] | AOD [kB] | Monte Carlo [MB/evt] | Monte Carlo [% of real]
ALICE HI   |   100     |  12.5    |  2.5               |  250     |  300                 | 100
ALICE pp   |   100     |   1      |  0.04              |    4     |    0.4               | 100
ATLAS      |   200     |   1.6    |  0.5               |  100     |    2                 |  20
CMS        |   150     |   1.5    |  0.25              |   50     |    2                 | 100
LHCb       |  2000     |   0.025  |  0.025             |    0.5   |   20                 |  –

50 days running in 2007
10^7 seconds/year pp from 2008 on → ~10^9 events/experiment
10^6 seconds/year heavy ion
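Multiplying the trigger rates by the RAW event sizes and the running time gives the raw-data volumes behind the 12-14 PB/year figure quoted earlier. A sketch using the table's numbers (derived data formats and extra copies are not counted here, which is why the sum comes out lower than the total storage figure):

```python
# RAW data volume per experiment: rate [Hz] x event size [MB] x seconds/year
# (10^7 s/year for pp running, 10^6 s/year for heavy ions, as on the slide).
experiments = {               # name: (rate_hz, raw_mb, seconds_per_year)
    "ALICE HI": (100,  12.5,  1e6),
    "ALICE pp": (100,   1.0,  1e7),
    "ATLAS":    (200,   1.6,  1e7),
    "CMS":      (150,   1.5,  1e7),
    "LHCb":     (2000, 0.025, 1e7),
}
total_pb = 0.0
for name, (rate, raw_mb, secs) in experiments.items():
    pb = rate * raw_mb * secs / 1e9       # MB -> PB
    total_pb += pb
    print(f"{name:9s} ~{pb:5.2f} PB/year RAW")
print(f"total     ~{total_pb:5.2f} PB/year RAW (before ESD/AOD/MC and copies)")
```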
CPU Requirements
[Figure: stacked bar chart of CPU needs in MSI2000, growing from ~50 in 2007 to ~350 in 2010, broken down by experiment (ALICE, ATLAS, CMS, LHCb) at CERN, the Tier-1s, and the Tier-2s; 58% pledged]
Disk Requirements
[Figure: stacked bar chart of disk needs in PB, growing from ~20 in 2007 to ~160 in 2010, broken down by experiment (ALICE, ATLAS, CMS, LHCb) at CERN, the Tier-1s, and the Tier-2s; 54% pledged]
Tape Requirements
[Figure: stacked bar chart of tape needs in PB, growing to ~160 by 2010, broken down by experiment (ALICE, ATLAS, CMS, LHCb) at CERN and the Tier-1s; 75% pledged]
Experiments' Requirements
• Single Virtual Organization (VO) across the Grid
• Standard interfaces for Grid access to Storage Elements (SEs) and Computing Elements (CEs)
• Need of a reliable Workload Management System (WMS) to efficiently exploit distributed resources
• Non-event data, such as calibration and alignment data but also detector construction descriptions, will be held in databases
  – Read/write access to central (Oracle) databases at Tier-0, read access at Tier-1s, with a local database cache at Tier-2s
• Analysis scenarios and specific requirements are still evolving
  – Prototype work is in progress (ARDA)
• Online requirements are outside the scope of LCG, but there are connections:
  – Raw data transfer and buffering
  – Database management and data export
  – Some potential use of Event Filter Farms for offline processing
Architecture – Grid services
• Storage Element
  – Mass Storage System (MSS): CASTOR, Enstore, HPSS, dCache, etc.
  – Storage Resource Manager (SRM) provides a common way to access the MSS, independent of implementation
  – File Transfer Services (FTS) provided e.g. by GridFTP or srmCopy
• Computing Element
  – Interface to local batch system, e.g. Globus gatekeeper
  – Accounting, status query, job monitoring
• Virtual Organization Management
  – Virtual Organization Management Services (VOMS)
  – Authentication and authorization based on the VOMS model
• Grid Catalogue Services
  – Mapping of Globally Unique Identifiers (GUIDs) to local file names
  – Hierarchical namespace, access control
• Interoperability
  – EGEE and OSG both use the Virtual Data Toolkit (VDT)
  – Different implementations are hidden by common interfaces
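The catalogue idea – GUIDs mapped to site-local replicas – can be sketched in a few lines. The class below is a toy illustration of the concept, not the API of any real grid catalogue:

```python
# Toy sketch of a GUID-to-replica mapping, the core idea behind grid file
# catalogues (illustrative only; not a real catalogue interface).
class FileCatalogue:
    def __init__(self):
        self.by_guid = {}          # GUID -> set of (site, local file name)

    def register(self, guid, site, local_name):
        self.by_guid.setdefault(guid, set()).add((site, local_name))

    def replicas(self, guid):
        # a job broker would pick the replica closest to the chosen site
        return sorted(self.by_guid.get(guid, set()))

cat = FileCatalogue()
cat.register("guid-1234", "CERN", "/castor/cern.ch/atlas/raw/f1")
cat.register("guid-1234", "CNAF", "/storage/cnaf.it/atlas/raw/f1")
print(cat.replicas("guid-1234"))
```

The common-interface point above is exactly this: jobs refer to the GUID, and the catalogue hides where and under what local name the replicas actually live.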
Prototypes
• It is important that the hardware and software systems developed in the framework of LCG be exercised in more and more demanding challenges
  – Data Challenges have now been done by all experiments
  – Though the main goal was to validate the distributed computing model and to gradually build the computing systems, the results have been used for physics performance studies and for detector, trigger, and DAQ design
  – Limitations of the Grids have been identified and are being addressed
• Presently, a series of Service Challenges aims at realistic end-to-end testing of experiment use-cases over an extended period, leading to stable production services
Data Challenges
• ALICE
  – PDC04 using AliEn services, native or interfaced to the LCG Grid: 400,000 jobs run, producing 40 TB of data for the Physics Performance Report
  – PDC05: event simulation, first-pass reconstruction, transmission to Tier-1 sites, second-pass reconstruction (calibration and storage), analysis with PROOF – using Grid services from LCG SC3 and AliEn
• ATLAS
  – Using tools and resources from LCG, NorduGrid, and Grid3 at 133 sites in 30 countries with over 10,000 processors, 235,000 jobs produced more than 30 TB of data via an automatic production system in 2004
  – In 2005, production for the Physics Workshop in Rome – next slides
• CMS
  – 100 TB of simulated data reconstructed at a rate of 25 Hz, distributed to the Tier-1 sites and reprocessed there
• LHCb
  – LCG provided more than 50% of the capacity for the first data challenge 2004-2005. The production used the DIRAC system.
ATLAS Production System
[Figure: production system architecture – supervisors (Windmill) talk via jabber/soap to per-Grid executors (Lexor for LCG, Dulcinea for NorduGrid, Capone for Grid3, plus an LSF executor), fed from the production database (ProdDB) and AMI; the Don Quijote data management system spans the three Grid replica catalogues (RLS)]
A big problem is data management:
Must cope with >= 3 Grid catalogues
Demands are even greater for analysis
ATLAS: massive productions on 3 Grids
• July-September 2004: DC2 Geant-4 simulation (long jobs)
  – 40% on the LCG/EGEE Grid, 30% on Grid3 and 30% on NorduGrid
• October-December 2004: DC2 digitization and reconstruction (short jobs)
• February-May 2005: Rome production (a mix of jobs, as digitization and reconstruction started as soon as samples had been simulated)
  – 65% on the LCG/EGEE Grid, 24% on Grid3, 11% on NorduGrid
CPU consumption for the CPU-intensive simulation phase (till May 20th):
  Grid3: 80 kSI2K·years; NorduGrid: 22 kSI2K·years; LCG total: 178 kSI2K·years; Total: 280 kSI2K·years
Note: this CPU was almost fully consumed in 40 days, and the results were used for the real physics analysis presented at the Rome Workshop, with the participation of >400 ATLAS physicists.
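As a back-of-the-envelope check (using the ~1 kSI2K per 3 GHz Pentium 4 conversion from the CPU slide, an assumption carried over, not stated here), 280 kSI2K·years delivered in 40 days implies a large sustained capacity:

```python
# Average sustained capacity implied by "280 kSI2K-years consumed in 40 days".
work_ksi2k_years = 280
days = 40
avg_ksi2k = work_ksi2k_years * 365 / days     # sustained kSI2K over the period
# with ~1 kSI2K per 3 GHz Pentium 4, this is roughly 2500+ Pentium-4-class
# CPUs busy around the clock for the whole 40 days
print(f"~{avg_ksi2k:.0f} kSI2K sustained")
```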
Rome production statistics
[Figure: pie chart of the number of jobs per Grid – LCG 34%, LCG-CG 31%, Grid3 24%, NorduGrid 11%]
• 73 data sets containing 6.1M events simulated and reconstructed (without pile-up)
• Total simulated data: 8.5M events
• Pile-up done later (1.3M events done, 50K reconstructed)
This is the first successful use of the Grid by a large user community; it has, however, also revealed several shortcomings which now need to be fixed, as LHC turn-on is only two years ahead!
Very instructive comments from the user feedback were presented at the Workshop (obviously this was one of the main themes and purposes of the meeting).
All this is available on the Web.
ATLAS Rome production: countries (sites)
Austria (1), Canada (3), CERN (1), Czech Republic (2), Denmark (3), France (4), Germany (1+2), Greece (1), Hungary (1), Italy (17), Netherlands (2), Norway (2), Poland (1), Portugal (1), Russia (2), Slovakia (1), Slovenia (1), Spain (3), Sweden (5), Switzerland (1+1), Taiwan (1), UK (8), USA (19)
In total: 22 countries, 84 sites
(per-Grid subsets shown on the slide: 17 countries, 51 sites; 7 countries, 14 sites)
Status and plans for ATLAS production on LCG
• The global efficiency of the ATLAS production for Rome was good in the WMS area (>95%), while improvements are still needed in the Data Management area (~75%)
  – WMS speed, however, also needs improvement
• ATLAS is ready to test new EGEE middleware components as soon as they are released from the internal certification process:
  – The File Transfer Service and the LCG File Catalogue, together with the new ATLAS Data Management layer
  – The new (gLite) version of the WMS, with support for bulk submission, task queue and pull model
• Accounting, monitoring and priority (VOMS role- and group-based) systems are expected to be in production use for the new big production rounds in mid-2006
Conclusions
• The HEP experiments at the LHC collider are committed to GRID-based computing
• The LHC Computing Grid Project is providing the common effort needed for supporting them
• EU- and US-funded Grid projects develop, maintain and deploy the middleware
• In the last year the Data Challenges have demonstrated the feasibility of huge real productions
  – Still much work needs to be done in the next 2 years to meet the challenge of the real data to be analyzed