Grids Today, Clouds on the Horizon
[ As tools for computational physics ]
Jamie.Shiers@cern.ch
Grid Support Group, IT Department, CERN
CCP, Ouro Preto, August 2008
Abstract
• By the time of CCP 2008, the LHC should have been cooled down to its operating temperature of 1.9K
• Collisions (pp) at 5 + 5 TeV should be a matter of weeks away, with 7 + 7 TeV running in 2009 and beyond [ news at end! ]
• In order to process the data from this machine we have put our “Higgs in one basket” – the Grid
• After many years of preparation, we have run a final series of readiness challenges, aimed at 2008 needs (CCRC’08)
• But change – as always – is on the horizon:
  • Funding models for Grids – EGEE (EELA etc.) → EGI
  • New paradigms – Clouds versus Grids?
  • Multi-core h/w – impact on deployment models
  • A comparison with LEP – hints for future challenges…
Background
• Neither “Grids” nor “Clouds” are new concepts in the area of computing
• Some people even claim that they are synonymous
• Others say that “Clouds = Grids + Business model” (or extension)
• Using the concrete example of Computing for the experiments at the Large Hadron Collider at CERN, I will attempt to define:
  • What is meant by a production Grid
  • Some fundamental differences between (at least today’s) Clouds and Grids
  • Why these are of interest to a wide range of applications, possibly including yours…
So What is the Grid?
• According to Michel Vetterli, the chair of the WLCG Collaboration Board, the Grid is
• “The Star Trek Computer”
• Another vision is that it should work as “the Ultimate Application Accelerator”
• On the downside, one must also consider the complications of a vast, distributed system…
• Issues related to the deployment and operation of such a system must be carefully balanced against the benefits…
WLCG Service Deployment – Lessons Learned
What’s Special about “the Grid”
From the viewpoint of a “large consumer”:
Grids have proven to be an excellent way of federating resources across computer centres of varying sizes into much larger quasi-homogeneous infrastructures.
This matches well with the needs of international science, allowing resources at participating institutes to meet the needs of the entire collaboration.
This in turn adds value to the individual sites, leading to a positive feedback situation.
[Legendary] Grid Classification…
• Grid Computing (potentially) offers value to a wide range of applications, broadly classified as follows:
• Provisioned
  • Large scale, long term “Grand Challenge”
  • e.g. LHC (“space microscopes”), space telescopes, …
• Scheduled
  • Require large resources for short periods
  • Far too expensive to provision for a single ‘application’
  • Not (always) time critical – disaster response?
• Opportunistic
  • Includes the above, but also other areas which are less “real time”
• You can find numerous examples of “Mission Critical” applications in each of these categories (e.g. EGEE User Forum!)
• “Mission Critical” as in “Life or Death”
Common Misconceptions
• Talking of a middleware component / release / distribution as if it were a service
• Believing / assuming that service reliability can be added post facto
• Thinking that robust and reliable services take more manpower than crummy ones – or are more expensive
• That more monitoring (alone) makes services more reliable…
• “The Grid” certainly adds significant extra complexity, as per Ian Foster’s 3 laws of Gridability…
Ticklist for a new service
• User support procedures (GGUS)
  • Troubleshooting guides + FAQs
  • User guides
• Operations team training
  • Site admins
  • CIC personnel
  • GGUS personnel
• Monitoring
  • Service status reporting
  • Performance data
• Accounting
  • Usage data
• Service parameters
  • Scope – global/local/regional
  • SLAs
  • Impact of service outage
  • Security implications
• Contact info
  • Developers
  • Support contact
  • Escalation procedure to developers
• Interoperation
  • ???
• First level support procedures
  • How to start/stop/restart the service
  • How to check it’s up
  • Which logs are useful to send to CIC/developers, and where they are
• SFT tests
  • Client validation
  • Server validation
  • Procedure to analyse these
  • Error messages and likely causes
• Tools for the CIC to spot problems
  • GIIS monitor validation rules (e.g. only one “global” component)
  • Definition of normal behaviour
  • Metrics
  • CIC dashboard
  • Alarms
• Deployment info
  • RPM list
  • Configuration details (for YAIM)
  • Security audit
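The ticklist above is essentially structured data, so readiness can be checked mechanically. A minimal sketch (the `READINESS_CHECKLIST` structure and both helper functions are invented for illustration, condensing a few of the categories above – this is not a real WLCG tool):

```python
# Hypothetical sketch: the service "ticklist" as data, with a simple validator.
# Category and item names are condensed from the slide; nothing here is a real tool.

READINESS_CHECKLIST = {
    "user support": ["GGUS procedures", "troubleshooting guide", "user guide"],
    "operations": ["site admin training", "CIC training", "GGUS training"],
    "monitoring": ["service status reporting", "performance data"],
    "first level support": ["start/stop/restart procedure", "up check", "log locations"],
    "deployment": ["RPM list", "yaim configuration", "security audit"],
}

def missing_items(done: set[str]) -> dict[str, list[str]]:
    """Return, per category, the checklist items not yet completed."""
    gaps = {}
    for category, items in READINESS_CHECKLIST.items():
        left = [i for i in items if i not in done]
        if left:
            gaps[category] = left
    return gaps

def is_ready(done: set[str]) -> bool:
    """A new service is ready only when every category is fully ticked."""
    return not missing_items(done)
```

The point of encoding the list rather than keeping it on a wiki page is that the same structure can drive both a dashboard and a deployment gate.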
Pros & Cons – Managed Services
• Pros: predictable service level and interventions; fewer interventions; lower stress level and more productivity; good match of expectations with reality; steady and measurable improvements in service quality; more time to work on the physics; more and better science; …
• Cons (without managed services): stress, anger, frustration, burn-out; numerous unpredictable interventions, including additional corrective interventions; unpredictable service level; loss of service; less time to work on physics; less and worse science; loss and/or corruption of data; …
September 2, 2007 M.Kasemann
WLCG CCRC’08 – Motivation and Goals
What happens when:
• The LHC is operating and the experiments take data? (15 PB/year after reduction)
• All experiments want to use the computing infrastructure simultaneously?
• The data rates and volumes to be handled at the Tier0, Tier1 and Tier2 centres are the sum of ALICE, ATLAS, CMS and LHCb, as specified in the experiments’ computing models?
• Each experiment has done data challenges, computing challenges, tests, dress rehearsals, … on a schedule defined by the experiment
• This will stop: once the LHC starts to operate, we will no longer be the masters of our schedule
• We need to prepare for this … together
• A combined challenge by all experiments should be used to demonstrate the readiness of the WLCG computing infrastructure before the start of data taking, at a scale comparable to the data taking in 2008
• This should be done well in advance of the start of data taking, in order to identify flaws and bottlenecks and allow time to fix them
• We must do this challenge as a WLCG collaboration: centres and experiments
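A back-of-the-envelope check of what the quoted 15 PB/year means as a sustained rate (the calculation is illustrative; real transfer targets also include reprocessing passes and replication on top of this average):

```python
# Average sustained rate implied by 15 PB/year of reduced data (SI units).
PB = 1e15                                # petabyte in bytes
SECONDS_PER_YEAR = 365.25 * 24 * 3600    # ~3.16e7 s

yearly_volume = 15 * PB
avg_rate_MB_s = yearly_volume / SECONDS_PER_YEAR / 1e6
# roughly 475 MB/s sustained, around the clock, before replication overheads
```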
[Figure: “Data Handling and Computation for Physics Analysis” – data flow from the detector through the event filter (selection & reconstruction) to raw data, event summary data and processed data, with event reprocessing and event simulation feeding batch and interactive physics analysis, which extracts analysis objects by physics topic. les.robertson@cern.ch, LCG, HEPiX Rome, 05 Apr 2006]
LCG Service Hierarchy
Tier-0 – the accelerator centre
• Data acquisition & initial processing
• Long-term data curation
• Data distribution to Tier-1 centres
Canada – Triumf (Vancouver)
France – IN2P3 (Lyon)
Germany –Karlsruhe
Italy – CNAF (Bologna)
Netherlands – NIKHEF/SARA (Amsterdam)
Nordic countries – distributed Tier-1
Spain – PIC (Barcelona)
Taiwan – Academia Sinica (Taipei)
UK – CLRC (Oxford)
US – FermiLab (Illinois)
– Brookhaven (NY)
Tier-1 – “online” to the data acquisition process; high availability
• Managed mass storage – grid-enabled data service
• All re-processing passes
• Data-heavy analysis
• National, regional support
Tier-2 – ~100 centres in ~40 countries
• Simulation
• End-user analysis – batch and interactive
• Services, including data archive and delivery, from Tier-1s
23.01.07 G. Dissertori
LHC: One Ring to Bind them…
Outline: Introduction – Status of LHCb, ATLAS, ALICE, CMS – Conclusions
LHC: 27 km long, 100 m underground
• ATLAS – General Purpose, pp, heavy ions
• CMS+TOTEM – General Purpose, pp, heavy ions
• ALICE – Heavy ions, pp
• LHCb – pp, B-Physics, CP Violation
Status of the LHC in May
And now…
CCRC’08 – Objectives
• Primary objective was to demonstrate that we (sites, experiments) could run together at 2008 production scale
• This includes testing all “functional blocks”:
  • Experiment to CERN MSS; CERN to Tier1; Tier1 to Tier2s, etc.
• Two challenge phases were foreseen:
  1. February: not all 2008 resources in place – still adapting to new versions of some services (e.g. SRM v2.2) & experiment s/w
  2. May: all 2008 resources in place – full 2008 workload, all aspects of the experiments’ production chains
• N.B. failure to meet target(s) would necessarily result in discussions on de-scoping!
• Fortunately, the results suggest that this is not needed, although much still needs to be done before, and during, May!
Startup woes – the BaBar experience (becla@slac.stanford.edu, CHEP2K, Padua)
How We Measured Our Success
• Agreed up-front on specific targets and metrics – these were 3-fold and helped integrate different aspects of the service (CCRC’08 wiki):
  1. Explicit “scaling factors” set by the experiments for each functional block: discussed in detail together with sites to ensure that the necessary resources and configuration were in place;
  2. Targets for the lists of “critical services” defined by the experiments – those essential for their production, with an analysis of the impact of service degradation or interruption (WLCG Design, Implementation & Deployment standards);
  3. WLCG “Memorandum of Understanding” (MoU) targets – services to be provided by sites, target availability, time to intervene / resolve problems, …
• Clearly some rationalization of these would be useful – significant but not complete overlap
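The first kind of metric – a per-block “scaling factor” agreed up front, compared against what was achieved – lends itself to a trivial check. A sketch, with block names and numbers invented for the example (they are not the real CCRC’08 targets):

```python
# Illustrative check of "scaling factor" metrics: each functional block has an
# agreed target rate; flag the blocks that fell short. Names/values are invented.

def blocks_below_target(targets: dict[str, float],
                        achieved: dict[str, float]) -> list[str]:
    """Names of functional blocks whose achieved rate missed the agreed target."""
    return [b for b, t in targets.items() if achieved.get(b, 0.0) < t]

targets  = {"CERN->Tier1 export": 1.3, "Tier1->Tier2 fan-out": 1.0}  # GB/s, invented
achieved = {"CERN->Tier1 export": 1.4, "Tier1->Tier2 fan-out": 0.8}  # GB/s, invented
```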
Critical Service Follow-up
• Targets (not commitments) proposed for Tier0 services
• Similar targets requested for Tier1s/Tier2s
• Experience from the first week of CCRC’08 suggests targets for problem resolution should not be too high (if ~achievable)
• The MoU lists targets for responding to problems (12 hours for T1s)
  • Tier1s: 95% of problems resolved < 1 working day?
  • Tier2s: 90% of problems resolved < 1 working day?
• Post-mortem triggered when targets not met!
Time Interval | Issue (Tier0 Services)                                   | Target
End 2008      | Consistent use of all WLCG Service Standards             | 100%
30’           | Operator response to alarm / call to x5011 / alarm e-mail | 99%
1 hour        | Operator response to alarm / call to x5011 / alarm e-mail | 100%
4 hours       | Expert intervention in response to above                 | 95%
8 hours       | Problem resolved                                         | 90%
24 hours      | Problem resolved                                         | 99%
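Targets of this shape (“N% of problems resolved within T hours”) reduce to a simple fraction over the observed resolution times. A minimal sketch, assuming problem resolution times are available in hours; the thresholds come from the slides, while the sample times are invented:

```python
# Check MoU-style targets: what fraction of problems met a resolution deadline,
# and does that fraction reach the required level? Sample data is invented.

def fraction_within(resolution_hours: list[float], limit: float) -> float:
    """Fraction of problems resolved within `limit` hours (1.0 if no problems)."""
    if not resolution_hours:
        return 1.0
    return sum(t <= limit for t in resolution_hours) / len(resolution_hours)

def meets_target(resolution_hours: list[float],
                 limit_hours: float, required_fraction: float) -> bool:
    return fraction_within(resolution_hours, limit_hours) >= required_fraction

tier1_times = [2, 5, 7, 30, 4]   # hours to resolution, invented sample
# Tier1 target from the slide: 95% resolved within one working day (8 hours here)
```

With this sample, one 30-hour outlier out of five problems already breaks a 95% target, which is why the slides note that targets “should not be too high (if ~achievable)”.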
Week-3: all experiments
Week-2: Results
[Charts: Day 1; all days (throughput); all days (errors)]
On WLCG Readiness
• The service runs smoothly – most of the time
• Problems are typically handled rather rapidly, with a decreasing number that require escalation
• We have a well-proven “Service Model” that allows us to handle anything from “Steady State” to “Crisis” situations
• We have repeatedly proven that we can – typically rather rapidly – work through even the most challenging “Crisis Situation”
• Typically short-term work-arounds with longer term solutions
Activities                      | Results
1. Crisis & problems            | Stress, burn-out, fire-fighting, crisis management
2. Planning, new opportunities  | Vision, balance, control, discipline
3. Interruptions, e-mail, …     | Out of control, victimised
4. Trivia, time wasting         | Irresponsible, …
Power & Cooling – Post CCRC’08
Site      | Comments
IN2P3     | Had a serious problem this w/e with A/C. Had to stop about 300 WNs – waiting for action this week to repair the A/C machine. Will keep info posted on website.
INFN CNAF | Suffered a serious problem: the UPS was too heavy & the floor collapsed!
• IMHO, a “light-weight” post-mortem, as prepared by various sites for events during the May run of CCRC’08, should be circulated for both of these cases.
• I believe that this was in fact agreed (by the WLCG MB) during the February run of CCRC’08
I think we should assume that power & cooling problems are part of life and plan accordingly
Julia Andreeva, CERN, 12.06.2008 – CCRC’08 Post-mortem Workshop, WLCG workshop
Jumping to the conclusions
• The main monitoring sources for the challenge were experiment-specific monitoring tools.
• For activities at CERN (Tier0, CAF), Lemon was widely used.
• SAM and SLS were used by all experiments for monitoring the status of the services and sites.
• In general these worked quite well and provided enough information to follow the challenge, to see whether the targets were met, and to spot problems rather quickly.
• In most cases problems were spotted by people on shift using the monitoring UIs; alarms are not yet common practice.
• We do not yet have a straightforward way to show what is going on in the experiments to people external to the VO, or even to (non-expert) users inside the VO.
• For performance measurements, apart from Lemon for CERN-related activities and the T0–T1 transfer display in GridView, nothing was provided to show the combined picture of experiment metrics sharing the same resources.
• Sites are still a bit disoriented: they do not have a clear idea how to understand their own role/performance and whether they are serving the VOs well.
• Work is ongoing to address the last points.
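The “combined picture” the slide says is missing is, at its core, an aggregation problem: fold many per-site test samples (SAM-style OK/FAIL results) into one availability number per site and rank sites against a threshold. A hedged sketch – the function names and sample data are invented, and real SAM availability also weights by service criticality, which this ignores:

```python
# Illustrative aggregation of per-site test results into availability numbers.
# Site names, data and the simple pass-fraction model are all invented.

def availability(samples: list[bool]) -> float:
    """Fraction of test samples that passed (1.0 if no samples yet)."""
    return sum(samples) / len(samples) if samples else 1.0

def sites_below(results_by_site: dict[str, list[bool]],
                threshold: float) -> list[str]:
    """Sites whose availability falls below the threshold, sorted by name."""
    return sorted(s for s, r in results_by_site.items()
                  if availability(r) < threshold)

results = {"SiteA": [True] * 95 + [False] * 5,    # 95% – invented numbers
           "SiteB": [True] * 80 + [False] * 20}   # 80%
```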
CCRC’08 – Conclusions (LHCC referees)
• The WLCG service is running (reasonably) smoothly
• The functionality matches: what has been tested so far – and what is (known to be) required
We have a good baseline on which to build
• (Big) improvements over the past year are a good indication of what can be expected over the next!
• (Very) detailed analysis of results compared to up-front metrics – in particular from experiments!
Strengths
• CCRC’08 and accompanying experiment “dress rehearsals” have in most cases demonstrated that the services / sites / experiments are ready for higher loads than are expected from 2008 pp data taking
• The middleware process is working well!
• The database services are working well!
• We have a well tested service model and have demonstrated steady improvement over a long time
Weaknesses
• Some of the services – including but not limited to storage / data management – are still not sufficiently robust. We have (so far) failed to define and regularly update a clear table of versions + release + patch level. This is nevertheless the target, with a weekly update at the joint EGEE-OSG-WLCG operations meeting
• Communication is still an issue / concern. This requires work / attention from everybody – it is not a one-way flow.
• Not all activities (e.g. reprocessing, chaotic end-user analysis) were fully demonstrated even in May, nor was there sufficient overlap between all experiments (and all activities). Work continues (July and beyond)…
• There were a large number of Tier0 service upgrades in June – not always well scheduled and / or motivated. We must balance stability with needed fixes
Opportunities
• There is no technical reason why we cannot solve the non-technical problems in the storage area (i.e. define recommended versions that have been released and tested)
• Communication – certainly no silver bullet expected. Need solutions that scale to the number of sites / players involved, that can adapt to constraints of time zones and affordable technology (audio & video conferencing, for example…)
• Improvements in monitoring and automation to reduce human expert involvement to a sustainable level (medium – long-term?)
• We still need to maintain a high(-er) level view of the overall WLCG service – a purely component view is not compatible with a highly complex service with many inter-dependencies
Threats
• The biggest threat we see is to fall back from reliable service mode into “fire-fighting” at the first sign of (major?) problems.
• This in the past has been accompanied by memos to the highest level, triggering time and effort consuming response / post-mortems, but is not sustainable and is much less efficient than the proven service mode.
• This requires close collaboration and concerted effort – as has been the case through many years of data and service challenges, and as we have seen at previous machines. We will use a daily operations meeting as a focus / dispatching point plus constant interactions with experiments / sites.
A Comparison with LEP…
• In January 1989, we were expecting e+e- collisions in the summer of that year…
• The “MUSCLE” report was 1 year old and “Computing at CERN in the 1990s” was yet to be published (July 1989)
It took quite some time for the offline environment (CERNLIB+experiment s/w) to reach maturity
• Some key components had not even been designed!• Not one single line of “FATMEN” had been written by this stage!
Major changes in the computing environment were about to strike!
• We had just migrated to CERNVM – the Web was around the corner, as was distributed computing (SHIFT)
• (Not to mention OO & early LHC computing!) HEPiX was two years from being born!
Summary
• CCRC’08 has proven to be a very valuable exercise for demonstrating readiness for 2008 data taking, including identifying (and fixing) holes in the service
• With justification, we can be confident of our readiness – from steady operation through to unexpected “crises” (which we will quickly defuse & resolve…)
• Communication & coordination have been key
• It has been – at least at times – very hard work, but also extremely rewarding!
• We are looking forward to collisions
CCRC ’09 - Outlook
• SL(C)5
• CREAM
• Oracle 11g
• SRM v2.2++ & other DM fixes…
• SCAS [new authorization framework]
• …
• 2009 resources
• 2009 hardware
• Revisions to Computing Models
• EGEE III transitioning to more distributed operations
• Continued commissioning, 7+7 TeV, transitioning to normal(?) data-taking (albeit low luminosity?)
• New DG, …
But what of Grids vs Clouds?
• An issue that was already a concern at the time of LEP was the effective “brain-drain” from collaborating institutes to CERN or other host laboratories
• Grid computing has (eventually) shown that participating institutes can really make a major contribution (scientifically and through computing resources)
• This gives valuable feedback to the funding bodies – one of the arguments against “CERNtralizing” everything
• Can these arguments apply to cloud computing?
  • Not (obviously) with today’s commercial offerings…
  • But maybe we are simply talking about a different level of abstraction – moving from today’s complex Grid environment to really much simpler ones
    • Which would be good!
  • But again, what large-scale database / data management application really works without knowing (and controlling) the implementation in detail?
The Road Ahead
• We have already begun discussions on the LHC luminosity upgrade – the “SLHC”
• This will essentially be a staged upgrade of the entire CERN accelerator complex lasting ~the next decade
• I would therefore guess that the LHC / SLHC data taking period will continue until at least the late 2020s
• What (apart from Unix) will still be around on this timescale?
• I mentioned in passing multi-core – this already has a significant impact on our service deployment model – but this is a “known known” – it’s the “unknown unknowns” we have to worry about…
LHC Update: Press Release from 07/08 at 14:45 UTC
• CERN announces start-up date for the LHC!
• The first attempt to circulate a beam in the Large Hadron Collider (LHC) will be made on 10 September
  • At an injection energy of 450 GeV
  • All 8 sectors at the operating temperature of 1.9 K (end July)
• Next phase is ‘synchronization’ with the SPS – first test the weekend of 9 August
• The (10 Sep) event will be webcast through http://webcast.cern.ch, and distributed through the Eurovision network
• For further information and accreditation procedures: http://www.cern.ch/lhc-first-beam
The LHC rap
• http://www.youtube.com/watch?v=j50ZssEojtM