Alexandre Vaniachine (ANL)
ATLAS Collaboration
Invited talk at ACAT’2002, Moscow, Russia, June 25, 2002
Alexandre Vaniachine (ANL), [email protected]
Data Challenges in ATLAS Computing
Outline & Acknowledgements
• World Wide computing model
• Data persistency
• Application framework
• Data Challenges: Physics + Grid
• Grid integration in Data Challenges
• Data QA and Grid validation
Thanks to all ATLAS collaborators whose contributions I used in my talk
Core Domains in ATLAS Computing
ATLAS Computing is right in the middle of its first period of Data Challenges
A Data Challenge (DC) for software is analogous to a test beam for the detector: many components have to be brought together to work
[Diagram: the three core domains: Application, Data, Grid]
The separation of the data and the algorithms in the ATLAS software architecture determines our core domains:
• Persistency solutions for event data storage
• Software framework for data processing algorithms
• Grid computing for the data processing flow
World Wide Computing Model
The focus of my presentation is on the integration of these three core software domains in the ATLAS Data Challenges towards a highly functional software suite, plus a World Wide computing model which gives all ATLAS members equal possibility and equal quality of access to ATLAS data
ATLAS Computing Challenge
The emerging World Wide computing model is an answer to the LHC computing challenge:
For ATLAS the raw data alone constitute 1.3 PB/year; adding “reconstructed” events and Monte Carlo data results in ~10 PB/year (~3 PB on disk)
The required CPU estimates, including analysis, are ~1.6M SpecInt95
CERN alone can handle only a fraction of these resources
Computing infrastructure, which was centralized in the past, will now be distributed (in contrast to the reverse trend for experiments that were more distributed in the past)
Validation of the new Grid computing paradigm in the period before the LHC requires Data Challenges of increasing scope and complexity
These Data Challenges will use as much as possible the Grid middleware being developed in Grid projects around the world
Ensuring that the ‘application’ software is independent of the underlying persistency technology is one of the defining characteristics of the ATLAS software architecture (the “transient/persistent” split)
Integrated operation of the framework & database domains demonstrated the capability of
• switching between persistency technologies
• reading the same data from different frameworks
Implementation: the data description (persistent dictionary) is stored together with the data; the application framework uses the transient data dictionary for transient/persistent conversion
The Grid integration problem is very similar to the transient/persistent issue, since all objects become just a byte stream, either on disk or on the net
Technology Independence
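To make the transient/persistent split concrete, here is a minimal, illustrative sketch (plain Python with invented class names, not the ATLAS code): a technology-specific converter writes a self-describing record in which the data description travels with the data, so any framework can rebuild the transient object.

```python
# Illustrative sketch (not ATLAS code) of the transient/persistent split:
# a persistent record carries its own data description ("dictionary"),
# and technology-specific converters translate to/from transient objects.

class TransientTrack:
    """Transient (in-memory) representation used by the algorithms."""
    def __init__(self, pt, eta, phi):
        self.pt, self.eta, self.phi = pt, eta, phi

class Converter:
    """Base class: one converter per (class, storage technology) pair."""
    def create_persistent(self, obj):
        raise NotImplementedError
    def create_transient(self, record):
        raise NotImplementedError

class DictRecordConverter(Converter):
    """Stores the data description together with the data, so the
    persistent form is self-describing and framework-independent."""
    def create_persistent(self, obj):
        return {
            "dictionary": {"class": type(obj).__name__,
                           "attributes": sorted(vars(obj))},
            "data": dict(vars(obj)),
        }
    def create_transient(self, record):
        cls = {"TransientTrack": TransientTrack}[record["dictionary"]["class"]]
        return cls(**record["data"])

# Algorithms only ever see TransientTrack; the storage technology behind
# the converter (Objectivity, ROOT I/O, ...) can change without touching them.
record = DictRecordConverter().create_persistent(TransientTrack(42.0, 1.1, 0.3))
track = DictRecordConverter().create_transient(record)
```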
ATLAS Database Architecture
Independent of underlying persistency technology
Ready for Grid integration
Data description is stored together with the data
For some time ATLAS has had both a ‘baseline’ technology (Objectivity) and a baseline evaluation strategy
• We implemented persistency in Objectivity for DC0
• A ROOT-based conversion service (AthenaROOT) provides the persistency technology for Data Challenge 1
The technology strategy is to adopt the LHC-wide LHC Computing Grid (LCG) common persistence infrastructure (a hybrid relational and ROOT-based streaming layer) as soon as this is feasible
ATLAS is committed to ‘common solutions’ and looks forward to LCG being the vehicle for providing these in an effective way
Changing the persistency mechanism (e.g. Objectivity -> ROOT I/O) requires a change of “converter”, but of nothing else
The ‘ease’ of the baseline change demonstrates the benefits of decoupling transient and persistent representations
In principle, our architecture is capable of providing language independence (in the long term)
Change of Persistency Baseline
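A schematic sketch of what “a change of converter, but of nothing else” means in practice (hypothetical names, not the Athena configuration): the persistency backend is selected in one place, and the algorithms and the transient data model never see the difference.

```python
# Schematic sketch (hypothetical names, not the Athena API): switching the
# persistency baseline means swapping the registered conversion service;
# the algorithms and the transient data model are untouched.

class RootIOConversionSvc:
    def write(self, key, obj):   # would stream via ROOT I/O in reality
        print(f"ROOT I/O: writing {key}")

class ObjectivityConversionSvc:
    def write(self, key, obj):   # legacy baseline used for DC0
        print(f"Objectivity: writing {key}")

CONVERSION_SERVICES = {
    "Objectivity": ObjectivityConversionSvc,
    "AthenaROOT": RootIOConversionSvc,
}

def make_persistency_service(technology="AthenaROOT"):
    """Only this configuration choice changes when the baseline changes."""
    return CONVERSION_SERVICES[technology]()

svc = make_persistency_service("AthenaROOT")   # DC1 baseline
svc.write("McEventCollection", object())
```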
Athena Software Framework
ATLAS Computing is steadily progressing towards a highly functional software suite and implementing the World Wide model
(Note that a legacy software suite was produced and still exists and is used: so it can be done for the ATLAS detector!)
Athena Software Framework is used in Data Challenges for:
• generator events production
• fast simulation
• data conversion
• production QA
• reconstruction (off-line and High Level Trigger)
Work in progress: integrating detector simulations
Future directions: Grid integration
Athena Architecture Features
• Separation of data and algorithms
• Memory management
• Transient/persistent separation
Athena has a common code base with the GAUDI framework (LHCb)
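The sketch below illustrates the Gaudi/Athena-style pattern in miniature (invented Python stand-ins, not the real C++ API): algorithms never own event data; they read from and write to a transient event store, and the framework drives the per-event loop and memory management.

```python
# A minimal sketch of the Gaudi/Athena-style pattern (hypothetical classes):
# algorithms are decoupled from the data they process via a transient store,
# and the framework drives the event loop.

class TransientEventStore(dict):
    """Stand-in for the transient store that decouples data from algorithms."""

class Algorithm:
    def initialize(self): pass
    def execute(self, store): raise NotImplementedError
    def finalize(self): pass

class FastSimAlg(Algorithm):
    def execute(self, store):
        generated = store["McEventCollection"]                 # read input objects
        store["FastSimTracks"] = [e * 0.9 for e in generated]  # record output objects

def event_loop(algorithms, events):
    for alg in algorithms:
        alg.initialize()
    for event in events:
        store = TransientEventStore(McEventCollection=event)   # fresh store per event
        for alg in algorithms:
            alg.execute(store)
    for alg in algorithms:
        alg.finalize()

event_loop([FastSimAlg()], events=[[10.0, 20.0], [5.0]])
```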
ATLAS Detector Simulations
Scale of the problem:
• 25.5 million distinct volume copies
• 23 thousand different volume objects
• 4,673 different volume types
• managing up to a few hundred pile-up events
• one million hits per event on average
Universal Simulation Box
With all interfaces clearly defined, simulations become “Geant-neutral”
You can in principle run G3, G4, Fluka, parameterized simulation with no effect on the end users
G4 robustness test completed in DC0
[Diagram: the detector simulation program takes MC events (HepMC) and the detector description as input and produces MC truth, hits, and digitisation output]
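A toy sketch of the “Geant-neutral” idea (hypothetical interface, not the ATLAS simulation code): once the inputs and outputs of the simulation box are fixed, the engine behind it can be exchanged without affecting the end users.

```python
# Sketch only (hypothetical interface): with the interfaces of the simulation
# "box" fixed -- HepMC events and detector description in, MC truth and hits
# out -- the engine behind it is interchangeable.

class SimulationEngine:
    """Geant-neutral contract: any engine maps generator events to hits."""
    def simulate(self, hepmc_event, detector_description):
        raise NotImplementedError

class Geant4Engine(SimulationEngine):
    def simulate(self, hepmc_event, detector_description):
        # full simulation would go here
        return {"mc_truth": hepmc_event, "hits": ["hit"] * 100}

class ParameterizedEngine(SimulationEngine):
    def simulate(self, hepmc_event, detector_description):
        # fast, approximate detector response
        return {"mc_truth": hepmc_event, "hits": []}

def run_simulation(engine, events, detector_description):
    # End users see the same inputs and outputs whichever engine runs
    # (G3, G4, Fluka, or a parameterization).
    return [engine.simulate(evt, detector_description) for evt in events]

output = run_simulation(Geant4Engine(), events=["evt1", "evt2"],
                        detector_description="geometry.root")
```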
Data Challenges
Data Challenges prompted increasing integration of grid components in ATLAS software
DC0 was used to test software readiness and production pipeline continuity/robustness
The scale was limited to < 1M events
Physics oriented: output for leptonic-channel analyses and legacy Physics TDR data
Despite the centralized production in DC0 we started deployment of our DC infrastructure (organized in 13 work packages), covering in particular Grid-related areas such as:
• production tools
• Grid tools for metadata bookkeeping and replica management
We started distributed production on the Grid in DC1
DC0 Data Flow
• Multiple production pipelines
• Independent data transformation steps
• Quality Assurance procedures
Data Challenge 1
Reconstruction & analysis on a large scale: exercise the data model, study ROOT I/O performance, identify bottlenecks, exercise distributed analysis, …
Produce data for the High Level Trigger (HLT) TDR & Physics groups
• Study performance of Athena and of algorithms for use in the High Level Trigger
• Test of ‘data flow’ through the HLT: byte stream -> HLT algorithms -> recorded data
• High statistics needed (background rejection studies)
• Scale: ~10M simulated events in 10-20 days, O(1000) PCs
Exercising the LHC computing model: involvement of CERN & outside-CERN sites
Deployment of the ATLAS Grid infrastructure: outside sites are essential for this event scale
Phase 1 (started in June)
• ~10M generator-level events (all data produced at CERN)
• ~10M simulated detector-response events (June – July)
• ~10M reconstructed-object events
Phase 2 (September – December)
• Introduction and use of the new Event Data Model and Detector Description
• More countries/sites/processors
• Distributed reconstruction
• Additional samples including pile-up
• Distributed analyses
• Further tests of GEANT4
DC1 Phase 1 Resources
Organization & infrastructure are in place, led by the CERN ATLAS group
• 2000 processors, 1.5×10^11 SI95·sec: adequate for ~4×10^7 simulated events
• 2/3 of the data produced outside of CERN
• production on a global scale: Asia, Australia, Europe and North America
• 17 countries, 26 production sites
Australia: Melbourne
Canada: Alberta, TRIUMF
Czech Republic: Prague
Denmark: Copenhagen
France: CCIN2P3 Lyon
Switzerland: CERN
Taiwan: Academia Sinica
UK: RAL, Lancaster, Liverpool (MAP)
USA: BNL, . . .
Germany: Karlsruhe
Italy (INFN): CNAF, Milan, Roma1, Naples
Japan: Tokyo
Norway: Oslo
Portugal: FCUL Lisboa
Russia (RIVK BAK): JINR Dubna, ITEP Moscow, SINP MSU Moscow, IHEP Protvino
Spain: IFIC Valencia
Sweden: Stockholm
Data Challenge 2
Schedule: Spring – Autumn 2003
Major physics goals:
• Physics samples have ‘hidden’ new physics
• Geant4 will play a major role
• Testing calibration and alignment procedures
Scope increased with respect to what was achieved in DC0 & DC1
• Scale: a sample of 10^8 events
• System complexity at ~50% of the 2006-2007 system
Distributed production, simulation, reconstruction and analysis:
• Use of Grid testbeds which will be built in the context of Phase 1 of the LHC Computing Grid Project
• Automatic ‘splitting’ and ‘gathering’ of long jobs, best available sites for each job
• Monitoring on a ‘gridified’ logging and bookkeeping system, interface to a full ‘replica catalog’ system, transparent access to the data for different MSS systems
• Grid certificates
Grid Integration in Data Challenges
Grid and Data Challenge communities have overlapping objectives:
Grid middleware
– testbed deployment, packaging, basic sequential services, user portals
Data management
– replicas, reliable file transfers, catalogs
Resource management
– job submission, scheduling, fault tolerance
Quality Assurance
– data reproducibility, application and data signatures, Grid QA
Grid Middleware ?
Grid Middleware !
ATLAS Grid Testbeds
US-ATLAS Grid Testbed
EU DataGrid
NorduGrid
For more information see presentations by Roger Jones and Aleksandr Konstantinov
Interfacing Athena to the GRID
Areas of work:
• Data access (persistency)
• Event selection
• GANGA (job configuration & monitoring, resource estimation & booking, job scheduling, etc.)
• Grappa: Grid user interface for Athena
[Diagram: the Athena/GAUDI application connected through the GANGA/Grappa GUI to Grid services, virtual data, algorithms, and histograms/monitoring/results]
Making the Athena framework work in the GRID environment requires an architectural design & components that make use of the Grid services
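As a hedged illustration of that idea (all names below are invented for this sketch; the real components are GANGA, Grappa and the Grid services they wrap), a Grid-aware data-access component could hide replica lookup behind the framework interface:

```python
# Hedged sketch (invented names, not GANGA/Grappa code): Grid services sit
# behind framework components, so a job asks for a logical dataset and a
# data-access service resolves it via a replica catalog.

class ReplicaCatalog:
    """Stand-in for a Grid replica catalog: logical name -> physical replicas."""
    def __init__(self, replicas):
        self._replicas = replicas
    def lookup(self, logical_name):
        return self._replicas.get(logical_name, [])

class GridDataAccessSvc:
    """Framework-side component that hides replica selection from algorithms."""
    def __init__(self, catalog):
        self.catalog = catalog
    def open(self, logical_name):
        replicas = self.catalog.lookup(logical_name)
        if not replicas:
            raise FileNotFoundError(logical_name)
        return replicas[0]        # e.g. pick the first/closest replica

catalog = ReplicaCatalog({"dc1.002000.simul.0001.root":
                          ["gsiftp://site-a.example.org/data/0001.root",
                           "gsiftp://site-b.example.org/data/0001.root"]})
print(GridDataAccessSvc(catalog).open("dc1.002000.simul.0001.root"))
```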
Data Management Architecture
AMI: ATLAS Metadata Interface
MAGDA: MAnager for Grid-based DAta
VDC: Virtual Data Catalog
AMI Architecture
Data warehousing principle (star architecture)
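A minimal sketch of the data-warehousing (star) idea, with a schema invented for illustration rather than the actual AMI tables: a central dataset table references small dimension tables that can be queried and extended independently.

```python
# Illustrative only (schema invented for this sketch, not the real AMI schema):
# a "star" layout keeps one central table of datasets with foreign keys into
# small dimension tables.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE site    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dataset (id INTEGER PRIMARY KEY, logical_name TEXT,
                      nevents INTEGER,
                      project_id INTEGER REFERENCES project(id),
                      site_id    INTEGER REFERENCES site(id));
""")
db.execute("INSERT INTO project VALUES (1, 'DC1')")
db.execute("INSERT INTO site VALUES (1, 'CERN')")
db.execute("INSERT INTO dataset VALUES (1, 'dc1.002000.simul', 100000, 1, 1)")

# Typical bookkeeping query: all DC1 datasets and where they were produced.
for row in db.execute("""SELECT d.logical_name, d.nevents, s.name
                         FROM dataset d
                         JOIN project p ON d.project_id = p.id
                         JOIN site    s ON d.site_id = s.id
                         WHERE p.name = 'DC1'"""):
    print(row)
```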
MAGDA Architecture
Component-based architecture emphasizing fault-tolerance
VDC Architecture
Two-layer architecture
Introducing Virtual Data
The recipes for producing the data (jobOptions, kumacs) have to be fully tested, and the produced data have to be validated through a QA step
Preparing production recipes takes time and effort and encapsulates considerable knowledge; in DC0, more time was spent assembling the proper recipes than running the production jobs
Once you have the proper recipes, producing the data is straightforward
After the data have been produced, what do we do with the developed recipes? Do we really need to save them?
Data are primary, recipes are secondary
Virtual Data Perspective
GriPhyN project (www.griphyn.org) provides a different perspective:
• recipes are as valuable as the data
• production recipes are the Virtual Data
If you have the recipes you do not need the data (you can reproduce them): recipes are primary, data are secondary
Do not throw away the recipes: save them (in the VDC)
From the OO perspective: methods (recipes) are encapsulated together with the data in Virtual Data Objects
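A toy illustration of the virtual-data perspective (names invented for this sketch): the recipe and its parameters are what get stored, and the data are materialized from them on demand, precisely because they can always be reproduced.

```python
# Toy illustration of the virtual-data idea (invented names): the recipe is
# stored with, or instead of, the data, and the data are (re)materialized
# from the recipe on demand.

class VirtualDataObject:
    def __init__(self, recipe, parameters):
        self.recipe = recipe            # the transformation, e.g. a jobOptions recipe
        self.parameters = parameters    # its parameter collection
        self._data = None               # not yet materialized

    def materialize(self):
        """Run the recipe only if the data do not already exist."""
        if self._data is None:
            self._data = self.recipe(**self.parameters)
        return self._data

def simulate(seed, nevents):            # stands in for a production recipe
    return [seed + i for i in range(nevents)]

vdo = VirtualDataObject(simulate, {"seed": 1234, "nevents": 5})
print(vdo.materialize())                # produced now, reproducible later
```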
VDC-based Production System
High-throughput features:
• scatter-gather data processing architecture
Fault-tolerance features:
• independent agents
• pull model for agent task assignment (vs. push)
• local caching of output and input data (except Objectivity input)
ATLAS DC0 and DC1 parameter settings for simulations are recorded in the Virtual Data Catalog database using normalized components: parameter collections structured “orthogonally” by
• data reproducibility
• application complexity
• Grid location
Automatic “garbage collection” by the job scheduler:
• agents pull the next derivation from the VDC
• after the data have been materialized, agents register “success” in the VDC
• when a previous invocation has not been completed within the specified timeout period, it is invoked again
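The pull model and the timeout-based re-invocation described above can be sketched as follows (a simplification with invented interfaces, not the actual VDC production code):

```python
# Sketch of the pull model (simplified, invented interfaces): independent
# agents ask the catalog for the next derivation, and a derivation whose
# invocation timed out without success becomes available again.

import time

class VirtualDataCatalog:
    def __init__(self, derivations, timeout=3600.0):
        self.started = {d: None for d in derivations}   # derivation -> start time
        self.done = set()
        self.timeout = timeout

    def pull_next(self):
        """Hand out an unassigned derivation, or one that timed out uncompleted."""
        now = time.time()
        for deriv, started in self.started.items():
            if deriv in self.done:
                continue
            if started is None or now - started > self.timeout:
                self.started[deriv] = now
                return deriv
        return None

    def register_success(self, deriv):
        self.done.add(deriv)

vdc = VirtualDataCatalog(["simul.0001", "simul.0002"], timeout=60.0)
while (job := vdc.pull_next()) is not None:
    # ... an agent would materialize the data for `job` locally here ...
    vdc.register_success(job)
```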
[Diagram: tree-like data flow. Athena generators -> HepMC.root; atlsim -> digis.zebra + geometry.zebra; Athena conversion -> digis.root + geometry.root; Athena recon -> recon.root; Athena Atlfast -> Atlfast.root + filtering.ntuple; Atlfast recon -> recon.root; Athena QA -> QA.ntuple at each stage]
Exercising the rich possibilities of data processing composed of multiple independent data transformation steps
Tree-like Data Flow
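One compact way to express such a tree-like flow is as a table of independent transformation steps with declared inputs and outputs; the step and file names below are taken from the diagram, while the wiring shown is our own simplified sketch:

```python
# Simplified sketch of the tree-like flow above: each transformation step is
# independent and declares its inputs and outputs, so steps whose inputs are
# available can be scattered across sites and their outputs gathered later.

PIPELINE = {
    "Athena generators": {"in": [],                "out": ["HepMC.root"]},
    "atlsim":            {"in": ["HepMC.root"],    "out": ["digis.zebra", "geometry.zebra"]},
    "Athena conversion": {"in": ["digis.zebra", "geometry.zebra"],
                          "out": ["digis.root", "geometry.root"]},
    "Athena recon":      {"in": ["digis.root"],    "out": ["recon.root"]},
    "Athena Atlfast":    {"in": ["HepMC.root"],    "out": ["Atlfast.root", "filtering.ntuple"]},
    "Atlfast recon":     {"in": ["Atlfast.root"],  "out": ["recon.root"]},
    "Athena QA":         {"in": ["recon.root"],    "out": ["QA.ntuple"]},
}

def ready_steps(available_files):
    """Steps whose inputs are all available can run independently."""
    return [step for step, io in PIPELINE.items()
            if all(f in available_files for f in io["in"])]

print(ready_steps({"HepMC.root"}))  # ['Athena generators', 'atlsim', 'Athena Atlfast']
```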
Data Reproducibility
The goal is to validate DC sample production by ensuring the reproducibility of simulations run at different sites
We need a tool capable of establishing the similarity or the identity of two samples produced in different conditions, e.g. at different sites
This is a very important (and sometimes overlooked) component of Grid computing deployment
It is complementary to the software and/or data digital-signature approaches, which are still in the R&D phase
Grid Production Validation
Simulations are run in different conditions, for instance with the same generation input but at different production sites
For each sample, reconstruction (i.e. Atrecon) is run to produce standard CBNT ntuples
The validation application launches specialized independent analyses for the ATLAS subsystems
For each sample, standard histograms are produced
Comparison Procedure
[Plots: test sample, reference sample, superimposed samples, contributions to χ²]
The comparison procedure ends with a χ²-bar-chart summary
It gives a pretty nice overview of how the samples compare:
Summary of Comparison
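For illustration, a minimal version of the comparison step (our own sketch, not the actual validation code) superimposes the binned test and reference samples and looks at the per-bin χ² contributions and their sum:

```python
# Minimal sketch of the comparison step: per-bin chi-square contributions
# between a test and a reference histogram of similar size, plus their sum.

def chi2_contributions(test, reference):
    """Per-bin (t - r)^2 / (t + r) for two binned samples with similar statistics."""
    contributions = []
    for t, r in zip(test, reference):
        contributions.append((t - r) ** 2 / (t + r) if (t + r) > 0 else 0.0)
    return contributions

test_hist      = [105, 230, 480, 260, 95]    # bin contents, test sample
reference_hist = [110, 225, 470, 250, 100]   # bin contents, reference sample

per_bin = chi2_contributions(test_hist, reference_hist)
print("chi2 per bin:", [round(c, 3) for c in per_bin])
print("total chi2 / ndf:", round(sum(per_bin) / len(per_bin), 3))
```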
Example of Finding
Comparing the energy in the calorimeters for Z -> 2l samples from DC0 and DC1
It works!
The difference is caused by the cut at generation level
Summary
ATLAS computing is in the middle of its first period of Data Challenges of increasing scope and complexity, and is steadily progressing towards a highly functional software suite plus a World Wide computing model which gives all ATLAS members equal possibility and equal quality of access to ATLAS data
These Data Challenges are executed at the prototype tier centers and use as much as possible the Grid middleware being developed in Grid projects around the world
In close collaboration between the Grid and Data Challenge communities ATLAS is testing large-scale testbed prototypes, deploying prototype components to integrate and test Grid software in a production environment, and running Data Challenge 1 production in 26 prototype tier centers in 17 countries on four continents
Quite promising start for ATLAS Data Challenges!