Computing on the Grid and in the Clouds
Rocío Rama Ballesteros, CERN IT-SDC (Support for Distributed Computing Group)

Overview
- The computational problem
- The challenge
- Grid computing
- The WLCG
- Operational experience
- Future perspectives

The Computational Problem: where does it come from?

The Source of all Data
- The LHC delivers collisions at 40 MHz; each collision is an event.
- Raw data: was a detector element hit? ADC counts, time signals.
- Reconstructed data: momentum of tracks (4-vectors), origin, energy in clusters (jets), particle type, calibration information.
- An event's lifetime (data-flow diagram): data acquisition at ~1 GB/s, flows of 1-2 GB/s and up to 4 GB/s for first reconstruction, data quality and calibration, and 4-6 GB/s to permanent storage for reconstruction and archival.
- See Anna Sfyrla's summer student lecture "From Raw Data to Physics".
- Monte Carlo simulation takes as much computing as everything else!

The Computing Challenge
- Scale: data and computing.
- Complexity.

Data Volume & Rates
- 30+ PB per year, plus simulation; preservation for processing.
- 340k+ cores.
- (Plot, log scale: data growth compared with what was understood when we started.) Big Data!
- We duplicate raw data, add simulated data and many derived data products, recreate them as the software improves, and replicate them so physicists can access them: a few PB of raw data becomes ~100 PB!

Large Distributed Community
- LHC users and computer centres are spread across the world, and all of them have computers and storage.

Go Distributed! Why?
- Technical and political/financial reasons: no single centre could provide all the computing (buildings, power, cooling, cost, ...).
- The community is distributed, computing is already available at all institutes, and funding for computing is also distributed.
- How do you distribute it all? With big data, with hundreds of computing centres, with a global user community, and there is always new data!

The Grid
- "Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations" (Ian Foster and Carl Kesselman).
- Share computing resources and storage resources: many computers act together as a single one!

Main Ideas: multi-institutional organizations
- Sites (Site 1, Site 2, Site 3) have different services, different policies, different AAA (Authentication, Authorisation, Accounting), and different scale and expertise.

Virtual Organizations
- Users from organizations A and B create a Virtual Organization (VO); users keep their unique identity but also carry the identity of the VO.
- Organizations A and B support the Virtual Organization.
- Grid interfaces are placed at the organizational boundary; they map the generic grid functions/information/credentials to the local security functions/information/credentials.
- The result: multi-institutional e-Science infrastructures.

The Grid: multi-institutional organizations
- Sites have to trust each other, VOs have to trust sites, and sites have to trust VOs.
- For simplicity: sites deal with VO permissions, VOs deal with users, and sites can override VO decisions.
- Trust each other? Security!

Security
- How do you exchange secret keys among 340 sites (global, with hundreds of nodes each), 200 user communities (non-local) and users worldwide, and keep them secret?
- You don't: you use public key based security (see the sketch below).
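The idea behind public key based security can be shown in a few lines: a user signs a request with a private key that never leaves their hands, and any site can verify it with the matching public key, so no shared secret ever has to be distributed. The sketch below is a minimal illustration only, using the third-party Python `cryptography` package; roughly speaking, WLCG wraps the same primitives in X.509 certificates issued by trusted Certificate Authorities, VO membership attributes and short-lived proxy credentials.

```python
# Minimal illustration of public-key based trust: sign with a private key,
# verify with the public key. Requires the third-party 'cryptography' package.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# A user's key pair (in WLCG this role is played by an X.509 certificate
# from a trusted CA, not by a raw key pair generated on the fly).
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# The user signs a request with the private key, which is never shared.
request = b"run job 42 as a member of the ATLAS VO"  # illustrative payload
signature = private_key.sign(
    request,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# Any site holding only the public key can check that the request is genuine;
# no secret has to be distributed to 340 sites and kept secret there.
public_key.verify(
    signature,
    request,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)
print("signature verified")
```

A mismatched key or a tampered request makes `verify` raise `InvalidSignature`, which is exactly the property the grid relies on: a site can check who is asking without ever sharing a secret with them.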
How does all of this work? Middleware!
- Between the users, the VO and the sites (Site 1, Site 2, Site 3) sits the middleware: software in the middle that makes communication between users and services possible.
- It connects sophisticated and diverse back-end services to potentially simple, heterogeneous front-end services.
- It deals with the diversity of services (storage systems, batch systems) integrated across multiple organizations: lack of centralized control, geographical distribution, different policy environments, international issues.

Original Grid Services
- Security services: Certificate Management Service, VO Membership Service, Authentication Service, Authorization Service.
- Information services: Information System, Messaging Service, Site Availability Monitor, Accounting Service, monitoring tools (experiment dashboards, site monitoring).
- Data management services: Storage Element, File Catalogue Service, File Transfer Service, grid file access tools, GridFTP service, database and DB replication services, POOL Object Persistency Service.
- Job management services: Compute Element, Workload Management Service, VO Agent Service, Application Software Install Service, Pilot Factory.
- Experiments invested considerable effort into integrating their software with grid services and hiding the complexity from users.

Managing Jobs on the Grid
- Classic model: every VO/experiment's workload management system submits a job to a site's Computing Element, which submits it to the batch system, which schedules it onto a worker node.
- Pilot model: a Pilot Factory submits pilots through the Computing Element and the batch system; once a pilot is scheduled onto a worker node it requests a job from the experiment/VO workload management task queue, which sends it a real job to run.
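To make the pilot model above concrete, here is a minimal sketch of the "pull" idea: pilots reach worker nodes through the ordinary Computing Element and batch system, and only then fetch real payloads from a central task queue owned by the VO. The class and payload names below (TaskQueue, Pilot, and so on) are illustrative, not taken from any real workload management system.

```python
# Minimal sketch of the pilot "pull" model: pilots are submitted like ordinary
# batch jobs, and only once they are running do they fetch real payloads from
# a central task queue owned by the VO/experiment. Illustrative only.
from collections import deque


class TaskQueue:
    """Central queue of payload jobs, owned by the experiment/VO."""

    def __init__(self):
        self._jobs = deque()

    def add_job(self, job):
        self._jobs.append(job)

    def request_job(self):
        """Hand out the next payload, or None if the queue is empty."""
        return self._jobs.popleft() if self._jobs else None


class Pilot:
    """A pilot running on a worker node, scheduled by the site batch system."""

    def __init__(self, worker_node, task_queue):
        self.worker_node = worker_node
        self.task_queue = task_queue

    def run(self):
        # Keep pulling payloads until the central queue has nothing left.
        while (job := self.task_queue.request_job()) is not None:
            print(f"{self.worker_node}: running payload '{job}'")
        print(f"{self.worker_node}: no more work, pilot exits")


if __name__ == "__main__":
    queue = TaskQueue()
    for name in ("reco run 1234", "MC simulation batch 7", "user analysis 42"):
        queue.add_job(name)

    # In reality the pilot factory submits pilots through the Computing Element
    # and the site batch system; here we simply start two pilots directly.
    for node in ("worker-node-01", "worker-node-02"):
        Pilot(node, queue).run()
```

The design point is that a payload is matched to a pilot only once the pilot is actually running, so transient site problems are absorbed by the pilots and the experiment keeps full control of its own queue.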
The Brief History of WLCG
- MONARC project: defined the initial hierarchical architecture.
- Growing interest in Grid technology; the HEP community was a main driver in launching the DataGrid project.
- EU DataGrid project: middleware and a testbed for an operational grid.
- LHC Computing Grid: deploying the results of DataGrid for the LHC experiments.
- EU EGEE project, phase 1: a shared production infrastructure building upon the LCG.
- EU EGEE project, phase 2: focus on scale, stability, interoperation/interoperability.
- EU EGEE project, phase 3: efficient operations with less central coordination.
- EGI and EMI: sustainability.

WLCG: the Worldwide LHC Computing Grid
- An international collaboration to distribute, store and analyze LHC data.
- Links computer centres worldwide that provide computing and storage resources into a single infrastructure accessible by all LHC physicists.
- The biggest scientific grid project in the world, built on infrastructures such as EGI, OSG and NDGF.

A Tiered Architecture
- Tier-0 (CERN, ~15% of resources): data recording, initial data reconstruction, data distribution.
- Tier-1 (13 centres, ~40%): permanent storage, re-processing, analysis; connected by 10 Gb fibres.
- Tier-2 (~160 centres, ~45%): simulation, end-user analysis.

LHC Networking
- Relies upon the OPN, GEANT, ESnet, NRENs and other national and international providers.

Computing Model Evolution
- Original model: a static, strict hierarchy with multi-hop data flows; lesser demands on Tier-2 networking; the virtue of simplicity; designed for ~80 sites.
- Today: bandwidths of Gb/s and data flows not limited to the hierarchy; a flatter, mostly mesh structure; sites contribute based on capability; greater flexibility and efficiency; available resources are used more fully.

Operational Experience (ATLAS, January-July 2012)
- Up to ~100k concurrent ATLAS jobs.
- More than 1500 distinct ATLAS users doing analysis on the Grid.
- Available resources fully used and stressed (beyond pledges in some cases).
- Massive production of 8 TeV Monte Carlo samples.
- A very effective and flexible computing model and operations team accommodate high trigger rates and pile-up, intense MC simulation, and analysis demands from worldwide users (through e.g. dynamic data placement).

Conclusions
- Grid computing and the WLCG have proven themselves during the LHC's first data-taking run.
- Grid computing works for our community and has a future.
- The model changed from a tree to a mesh structure: networks improved much faster than CPUs.
- There is a shift from resource provider to user community: new tasks, new responsibilities, new tool-chains.
- Lots of challenges for our generation!