Grids Today, Clouds on the Horizon
[ As tools for computational physics ]
Jamie.Shiers@cern.ch
Grid Support Group, IT Department, CERN
CCP, Ouro Preto, August 2008
Abstract
• By the time of CCP 2008, the LHC should have been cooled down to its operating temperature of 1.9K
• Collisions (pp) at 5 + 5 TeV should be a matter of weeks away, with 7 + 7 TeV running in 2009 and beyond [ news at end! ]
• In order to process the data from this machine we have put our “Higgs in one basket” – the Grid
• After many years of preparation, we have run a final series of readiness challenges, aimed at 2008 needs (CCRC’08)
• But change – as always – is on the horizon:
  • Funding models for Grids – EGEE (EELA etc.) → EGI
  • New paradigms – Clouds versus Grids?
  • Multi-core h/w – impact on deployment models
  • A comparison with LEP – hints for future challenges…
Background
• Neither “Grids” nor “Clouds” are new concepts in the area of computing
• Some people even claim that they are synonymous
• Others say that “Clouds = Grids + Business model” (or extension)
• Using the concrete example of Computing for the experiments at the Large Hadron Collider at CERN, I will attempt to define:
  • What is meant by a production Grid
  • Some fundamental differences between (at least today’s) Clouds and Grids
  • Why these are of interest to a wide range of applications, possibly including yours…
So What is the Grid?
• According to Michel Vetterli, the chair of the WLCG Collaboration Board, the Grid is
• “The Star Trek Computer”
• Another vision is that it should work as “the Ultimate Application Accelerator”
• On the downside, one must also consider the complications of a vast, distributed system…
• Issues related to the deployment and operation of such a system must be carefully balanced against the benefits…
WLCG Service Deployment – Lessons Learned
What’s Special about “the Grid”
From the viewpoint of a “large consumer”:
Grids have proven to be an excellent way of federating resources across computer centres of varying sizes into much larger quasi-homogeneous infrastructures.
This matches well with the needs of international science, allowing resources at participating institutes to meet the needs of the entire collaboration.
This in turn adds value to the individual sites, leading to a positive feedback situation.
[Legendary] Grid Classification…
• Grid Computing (potentially) offers value to a wide range of applications, broadly classified as follows:
• Provisioned
  • Large scale, long term “Grand Challenge”
  • e.g. LHC (“space microscopes”), space telescopes, …
• Scheduled
  • Require large resources for short periods
  • Far too expensive to provision for a single ‘application’
  • Not (always) time critical – disaster response?
• Opportunistic
  • Includes the above, but also other areas which are less “real time”
• You can find numerous examples of “Mission Critical” applications in each of these categories (e.g. EGEE User Forum!)
• “Mission Critical” as in “Life or Death”
Common Misconceptions
• Talking of a middleware component / release / distribution as if it were a service
• Believing / assuming that service reliability can be added post facto
• Thinking that robust and reliable services take more manpower than crummy ones – or are more expensive
• That more monitoring (alone) makes services more reliable…
• “The Grid” certainly adds significant extra complexity, as per Ian Foster’s 3 laws of Gridability…
Ticklist for a new service
• User support procedures (GGUS)
  • Troubleshooting guides + FAQs
  • User guides
• Operations team training
  • Site admins
  • CIC personnel
  • GGUS personnel
• Monitoring
  • Service status reporting
  • Performance data
• Accounting
  • Usage data
• Service parameters
  • Scope – global/local/regional
  • SLAs
  • Impact of service outage
  • Security implications
• Contact info
  • Developers
  • Support contact
  • Escalation procedure to developers
• Interoperation
  • ???
• First level support procedures
  • How to start/stop/restart the service
  • How to check it’s up
  • Which logs are useful to send to CIC/developers, and where they are
• SFT tests
  • Client validation
  • Server validation
  • Procedure to analyse these
  • Error messages and likely causes
• Tools for the CIC to spot problems
  • GIIS monitor validation rules (e.g. only one “global” component)
  • Definition of normal behaviour
  • Metrics
  • CIC dashboard
  • Alarms
• Deployment info
  • RPM list
  • Configuration details (for YAIM)
  • Security audit
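The ticklist above is essentially structured data, so readiness can be checked mechanically. A minimal sketch (the `READINESS_CHECKLIST` structure and both helper functions are invented for illustration, condensing a few of the categories above – this is not a real WLCG tool):

```python
# Hypothetical sketch: the service "ticklist" as data, with a simple validator.
# Category and item names are condensed from the slide; nothing here is a real tool.

READINESS_CHECKLIST = {
    "user support": ["GGUS procedures", "troubleshooting guide", "user guide"],
    "operations": ["site admin training", "CIC training", "GGUS training"],
    "monitoring": ["service status reporting", "performance data"],
    "first level support": ["start/stop/restart procedure", "up check", "log locations"],
    "deployment": ["RPM list", "yaim configuration", "security audit"],
}

def missing_items(done: set[str]) -> dict[str, list[str]]:
    """Return, per category, the checklist items not yet completed."""
    gaps = {}
    for category, items in READINESS_CHECKLIST.items():
        left = [i for i in items if i not in done]
        if left:
            gaps[category] = left
    return gaps

def is_ready(done: set[str]) -> bool:
    """A new service is ready only when every category is fully ticked."""
    return not missing_items(done)
```

The point of encoding the list rather than keeping it on a wiki page is that the same structure can drive both a dashboard and a deployment gate.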
Pros & Cons – Managed Services
• Pros: predictable service level and interventions; fewer interventions; lower stress level and more productivity; good match of expectations with reality; steady and measurable improvements in service quality; more time to work on the physics; more and better science; …
• Cons (without managed services): stress, anger, frustration, burn-out; numerous unpredictable interventions, including additional corrective interventions; unpredictable service level; loss of service; less time to work on physics; less and worse science; loss and/or corruption of data; …
September 2, 2007 M.Kasemann
WLCG CCRC’08 – Motivation and Goals
What happens when:
• The LHC is operating and the experiments take data? (15 PB/year after reduction)
• All experiments want to use the computing infrastructure simultaneously?
• The data rates and volumes to be handled at the Tier0, Tier1 and Tier2 centres are the sum of ALICE, ATLAS, CMS and LHCb, as specified in the experiments’ computing models?
• Each experiment has done data challenges, computing challenges, tests, dress rehearsals, … on a schedule defined by the experiment
• This will stop: once the LHC starts to operate, we will no longer be the masters of our schedule
• We need to prepare for this … together
• A combined challenge by all experiments should be used to demonstrate the readiness of the WLCG computing infrastructure before the start of data taking, at a scale comparable to the data taking in 2008
• This should be done well in advance of the start of data taking, in order to identify flaws and bottlenecks and allow time to fix them
• We must do this challenge as a WLCG collaboration: centres and experiments
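A back-of-the-envelope check of what the quoted 15 PB/year means as a sustained rate (the calculation is illustrative; real transfer targets also include reprocessing passes and replication on top of this average):

```python
# Average sustained rate implied by 15 PB/year of reduced data (SI units).
PB = 1e15                                # petabyte in bytes
SECONDS_PER_YEAR = 365.25 * 24 * 3600    # ~3.16e7 s

yearly_volume = 15 * PB
avg_rate_MB_s = yearly_volume / SECONDS_PER_YEAR / 1e6
# roughly 475 MB/s sustained, around the clock, before replication overheads
```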
[Figure: “Data Handling and Computation for Physics Analysis” – data flow from the detector through the event filter (selection & reconstruction) to raw data, event summary data and processed data, with event reprocessing and event simulation feeding batch and interactive physics analysis, which extracts analysis objects by physics topic. les.robertson@cern.ch, LCG, HEPiX Rome, 05 Apr 2006]
LCG Service Hierarchy
Tier-0 – the accelerator centre
• Data acquisition & initial processing
• Long-term data curation
• Data distribution to Tier-1 centres
Canada – Triumf (Vancouver)
France – IN2P3 (Lyon)
Germany –Karlsruhe
Italy – CNAF (Bologna)
Netherlands – NIKHEF/SARA (Amsterdam)
Nordic countries – distributed Tier-1
Spain – PIC (Barcelona)
Taiwan – Academia Sinica (Taipei)
UK – CLRC (Oxford)
US – FermiLab (Illinois)
– Brookhaven (NY)
Tier-1 – “online” to the data acquisition process; high availability
• Managed mass storage – grid-enabled data service
• All re-processing passes
• Data-heavy analysis
• National, regional support
Tier-2 – ~100 centres in ~40 countries
• Simulation
• End-user analysis – batch and interactive
• Services, including data archive and delivery, from Tier-1s
23.01.07 G. Dissertori
LHC: One Ring to Bind them…
Outline: Introduction – Status of LHCb, ATLAS, ALICE, CMS – Conclusions
LHC: 27 km long, 100 m underground
• ATLAS – General Purpose, pp, heavy ions
• CMS+TOTEM – General Purpose, pp, heavy ions
• ALICE – Heavy ions, pp
• LHCb – pp, B-Physics, CP Violation
Status of the LHC in May
And now…
CCRC’08 – Objectives
• Primary objective was to demonstrate that we (sites, experiments) could run together at 2008 production scale
• This includes testing all “functional blocks”:
  • Experiment to CERN MSS; CERN to Tier1; Tier1 to Tier2s, etc.
• Two challenge phases were foreseen:
  1. February: not all 2008 resources in place – still adapting to new versions of some services (e.g. SRM v2.2) & experiment s/w
  2. May: all 2008 resources in place – full 2008 workload, all aspects of the experiments’ production chains
• N.B. failure to meet target(s) would necessarily result in discussions on de-scoping!
• Fortunately, the results suggest that this is not needed, although much still needs to be done before, and during, May!
Startup woes – the BaBar experience (becla@slac.stanford.edu, CHEP2K, Padua)
How We Measured Our Success
• Agreed up-front on specific targets and metrics – these were 3-fold and helped integrate different aspects of the service (CCRC’08 wiki):
  1. Explicit “scaling factors” set by the experiments for each functional block: discussed in detail together with sites to ensure that the necessary resources and configuration were in place;
  2. Targets for the lists of “critical services” defined by the experiments – those essential for their production, with an analysis of the impact of service degradation or interruption (WLCG Design, Implementation & Deployment standards);
  3. WLCG “Memorandum of Understanding” (MoU) targets – services to be provided by sites, target availability, time to intervene / resolve problems, …
• Clearly some rationalization of these would be useful – significant but not complete overlap
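The first kind of metric – a per-block “scaling factor” agreed up front, compared against what was achieved – lends itself to a trivial check. A sketch, with block names and numbers invented for the example (they are not the real CCRC’08 targets):

```python
# Illustrative check of "scaling factor" metrics: each functional block has an
# agreed target rate; flag the blocks that fell short. Names/values are invented.

def blocks_below_target(targets: dict[str, float],
                        achieved: dict[str, float]) -> list[str]:
    """Names of functional blocks whose achieved rate missed the agreed target."""
    return [b for b, t in targets.items() if achieved.get(b, 0.0) < t]

targets  = {"CERN->Tier1 export": 1.3, "Tier1->Tier2 fan-out": 1.0}  # GB/s, invented
achieved = {"CERN->Tier1 export": 1.4, "Tier1->Tier2 fan-out": 0.8}  # GB/s, invented
```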
Critical Service Follow-up
• Targets (not commitments) proposed for Tier0 services
• Similar targets requested for Tier1s/Tier2s
• Experience from the first week of CCRC’08 suggests targets for problem resolution should not be too high (if ~achievable)
• The MoU lists targets for responding to problems (12 hours for T1s)
  • Tier1s: 95% of problems resolved < 1 working day?
  • Tier2s: 90% of problems resolved < 1 working day?
• Post-mortem triggered when targets not met!
Time Interval | Issue (Tier0 Services)                                   | Target
End 2008      | Consistent use of all WLCG Service Standards             | 100%
30’           | Operator response to alarm / call to x5011 / alarm e-mail | 99%
1 hour        | Operator response to alarm / call to x5011 / alarm e-mail | 100%
4 hours       | Expert intervention in response to above                 | 95%
8 hours       | Problem resolved                                         | 90%
24 hours      | Problem resolved                                         | 99%
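Targets of this shape (“N% of problems resolved within T hours”) reduce to a simple fraction over the observed resolution times. A minimal sketch, assuming problem resolution times are available in hours; the thresholds come from the slides, while the sample times are invented:

```python
# Check MoU-style targets: what fraction of problems met a resolution deadline,
# and does that fraction reach the required level? Sample data is invented.

def fraction_within(resolution_hours: list[float], limit: float) -> float:
    """Fraction of problems resolved within `limit` hours (1.0 if no problems)."""
    if not resolution_hours:
        return 1.0
    return sum(t <= limit for t in resolution_hours) / len(resolution_hours)

def meets_target(resolution_hours: list[float],
                 limit_hours: float, required_fraction: float) -> bool:
    return fraction_within(resolution_hours, limit_hours) >= required_fraction

tier1_times = [2, 5, 7, 30, 4]   # hours to resolution, invented sample
# Tier1 target from the slide: 95% resolved within one working day (8 hours here)
```

With this sample, one 30-hour outlier out of five problems already breaks a 95% target, which is why the slides note that targets “should not be too high (if ~achievable)”.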
Week-3: all experiments
Week-2: Results
[Charts: Day 1; all days (throughput); all days (errors)]
On WLCG Readiness
• The service runs smoothly – most of the time
• Problems are typically handled rather rapidly, with a decreasing number that require escalation
• We have a well-proven “Service Model” that allows us to handle anything from “Steady State” to “Crisis” situations
• We have repeatedly proven that we can – typically rather rapidly – work through even the most challenging “Crisis Situation”
• Typically short-term work-arounds with longer term solutions
Activities                      | Results
1. Crisis & problems            | Stress, burn-out, fire-fighting, crisis management
2. Planning, new opportunities  | Vision, balance, control, discipline
3. Interruptions, e-mail, …     | Out of control, victimised
4. Trivia, time wasting         | Irresponsible, …
Power & Cooling – Post CCRC’08
Site      | Comments
IN2P3     | Had a serious problem this w/e with A/C. Had to stop about 300 WNs – waiting for action this week to repair the A/C machine. Will keep info posted on website.
INFN CNAF | Suffered a serious problem: the UPS was too heavy & the floor collapsed!
• IMHO, a “light-weight” post-mortem, as prepared by various sites for events during the May run of CCRC’08, should be circulated for both of these cases.
• I believe that this was in fact agreed (by the WLCG MB) during the February run of CCRC’08
I think we should assume that power & cooling problems are part of life and plan accordingly
Julia Andreeva, CERN, 12.06.2008 – CCRC’08 Post-mortem Workshop, WLCG workshop
Jumping to the conclusions
• The main monitoring sources for the challenge were experiment-specific monitoring tools.
• For activities at CERN (Tier0, CAF), Lemon was widely used.
• SAM and SLS were used by all experiments for monitoring the status of the services and sites.
• In general these worked quite well and provided enough information to follow the challenge, to see whether the targets were met, and to spot problems rather quickly.
• In most cases problems were spotted by people on shift using the monitoring UIs; alarms are not yet common practice.
• We do not yet have a straightforward way to show what is going on in the experiments to people external to the VO, or even to (non-expert) users inside the VO.
• For performance measurements, apart from Lemon for CERN-related activities and the T0–T1 transfer display in GridView, nothing was provided to show the combined picture of experiment metrics sharing the same resources.
• Sites are still a bit disoriented: they do not have a clear idea how to understand their own role/performance and whether they are serving the VOs well.
• Work is ongoing to address the last points.
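The “combined picture” the slide says is missing is, at its core, an aggregation problem: fold many per-site test samples (SAM-style OK/FAIL results) into one availability number per site and rank sites against a threshold. A hedged sketch – the function names and sample data are invented, and real SAM availability also weights by service criticality, which this ignores:

```python
# Illustrative aggregation of per-site test results into availability numbers.
# Site names, data and the simple pass-fraction model are all invented.

def availability(samples: list[bool]) -> float:
    """Fraction of test samples that passed (1.0 if no samples yet)."""
    return sum(samples) / len(samples) if samples else 1.0

def sites_below(results_by_site: dict[str, list[bool]],
                threshold: float) -> list[str]:
    """Sites whose availability falls below the threshold, sorted by name."""
    return sorted(s for s, r in results_by_site.items()
                  if availability(r) < threshold)

results = {"SiteA": [True] * 95 + [False] * 5,    # 95% – invented numbers
           "SiteB": [True] * 80 + [False] * 20}   # 80%
```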
CCRC’08 – Conclusions (LHCC referees)
• The WLCG service is running (reasonably) smoothly
• The functionality matches: what has been tested so far – and what is (known to be) required
We have a good baseline on which to build
• (Big) improvements over the past year are a good indication of what can be expected over the next!
• (Very) detailed analysis of results compared to up-front metrics – in particular from experiments!
Strengths
• CCRC’08 and accompanying experiment “dress rehearsals” have in most cases demonstrated that the services / sites / experiments are ready for higher loads than are expected from 2008 pp data taking
• The middleware process is working well!
• The database services are working well!
• We have a well tested service model and have demonstrated steady improvement over a long time
Weaknesses
• Some of the services – including but not limited to storage / data management – are still not sufficiently robust. We have (so far) failed to define and regularly update a clear table of versions + release + patch level. This is nevertheless the target, with a weekly update at the joint EGEE-OSG-WLCG operations meeting
• Communication is still an issue / concern. This requires work / attention from everybody – it is not a one-way flow.
• Not all activities (e.g. reprocessing, chaotic end-user analysis) were fully demonstrated even in May, nor was there sufficient overlap between all experiments (and all activities). Work continues (July and beyond)…
• There were a large number of Tier0 service upgrades in June – not always well scheduled and / or motivated. We must balance stability with needed fixes
Opportunities
• There is no technical reason why we cannot solve the non-technical problems in the storage area (i.e. define recommended versions that have been released and tested)
• Communication – certainly no silver bullet expected. Need solutions that scale to the number of sites / players involved, that can adapt to constraints of time zones and affordable technology (audio & video conferencing, for example…)
• Improvements in monitoring and automation to reduce human expert involvement to a sustainable level (medium – long-term?)
• We still need to maintain a high(-er) level view of the overall WLCG service – a purely component view is not compatible with a highly complex service with many inter-dependencies
Threats
• The biggest threat we see is to fall back from reliable service mode into “fire-fighting” at the first sign of (major?) problems.
• This in the past has been accompanied by memos to the highest level, triggering time and effort consuming response / post-mortems, but is not sustainable and is much less efficient than the proven service mode.
• This requires close collaboration and concerted effort – as has been the case through many years of data and service challenges, and as we have seen at previous machines. We will use a daily operations meeting as a focus / dispatching point plus constant interactions with experiments / sites.
A Comparison with LEP…
• In January 1989, we were expecting e+e- collisions in the summer of that year…
• The “MUSCLE” report was 1 year old and “Computing at CERN in the 1990s” was yet to be published (July 1989)
It took quite some time for the offline environment (CERNLIB+experiment s/w) to reach maturity
• Some key components had not even been designed!• Not one single line of “FATMEN” had been written by this stage!
Major changes in the computing environment were about to strike!
• We had just migrated to CERNVM – the Web was around the corner, as was distributed computing (SHIFT)
• (Not to mention OO & early LHC computing!) HEPiX was two years from being born!
Summary
• CCRC’08 has proven to be a very valuable exercise for demonstrating readiness for 2008 data taking, including identifying (and fixing) holes in the service
• With justification, we can be confident of our readiness – from steady operation through to unexpected “crises” (which we will quickly defuse & resolve…)
• Communication & coordination have been key
• It has been – at least at times – very hard work, but also extremely rewarding!
• We are looking forward to collisions
CCRC ’09 - Outlook
• SL(C)5
• CREAM
• Oracle 11g
• SRM v2.2++ & other DM fixes…
• SCAS [new authorization framework]
• …
• 2009 resources
• 2009 hardware
• Revisions to Computing Models
• EGEE III transitioning to more distributed operations
• Continued commissioning, 7+7 TeV, transitioning to normal(?) data-taking (albeit low luminosity?)
• New DG, …
But what of Grids vs Clouds?
• An issue that was already a concern at the time of LEP was the effective “brain-drain” from collaborating institutes to CERN or other host laboratories
• Grid computing has (eventually) shown that participating institutes can really make a major contribution (scientifically and through computing resources)
• This gives valuable feedback to the funding bodies – one of the arguments against “CERNtralizing” everything
• Can these arguments apply to cloud computing?
  • Not (obviously) with today’s commercial offerings…
  • But maybe we are simply talking about a different level of abstraction – moving from today’s complex Grid environment to really much simpler ones
    • Which would be good!
  • But again, what large-scale database / data management application really works without knowing (and controlling) the implementation in detail?
The Road Ahead
• We have already begun discussions on the LHC luminosity upgrade – the “SLHC”
• This will essentially be a staged upgrade of the entire CERN accelerator complex lasting ~the next decade
• I would therefore guess that the LHC / SLHC data taking period will continue until at least the late 2020s
• What (apart from Unix) will still be around on this timescale?
• I mentioned in passing multi-core – this already has a significant impact on our service deployment model – but this is a “known known” – it’s the “unknown unknowns” we have to worry about…
LHC Update: Press Release from 07/08 at 14:45 UTC
• CERN announces start-up date for the LHC!
• The first attempt to circulate a beam in the Large Hadron Collider (LHC) will be made on 10 September
  • At an injection energy of 450 GeV
  • All 8 sectors at the operating temperature of 1.9 K (end July)
• Next phase is ‘synchronization’ with the SPS – first test the weekend of 9 August
• The (10 Sep) event will be webcast through http://webcast.cern.ch, and distributed through the Eurovision network
• For further information and accreditation procedures: http://www.cern.ch/lhc-first-beam
The LHC rap
• http://www.youtube.com/watch?v=j50ZssEojtM