Introduction to HammerCloud for The LHCb Experiment

Dan van der Ster

CERN IT Experiment Support

3 June 2010


Outline

• Introduction to HammerCloud
  – Motivation, History, Use-Cases

• How HammerCloud works
  – Design and Implementation Details

• Interface Tour for Users and Admins

• Possibilities for an LHCb Plugin


Introduction to HammerCloud

• HammerCloud (HC) is a Distributed Analysis (DA) testing system serving two use-cases:
  – Robot-like Functional Testing: frequent “ping” jobs to all sites to perform basic site validation
  – DA Stress Testing: on-demand large-scale stress tests using real analysis jobs to test one or many sites simultaneously to:
    • Help commission new sites
    • Evaluate changes to site infrastructure
    • Evaluate SW changes
    • Compare site performances…


HammerCloud and Job Robots

• HammerCloud is part of an evolution of job robots:
  – CMS Job Robot inspired the ATLAS GangaRobot (functional testing)
  – In ~Sept 2008, a form of the ATLAS GangaRobot was used to manually stress test the Italian ATLAS Tier-2s:
    • 5 users manually submitting hundreds of instrumented jobs simultaneously (SIMD)
    • Manual results collection and summarization
    • Early results were shown to be very useful:
      – One early test showed a bimodal performance plot that was later traced to a faulty network switch which negatively affected the performance of some worker nodes (WNs). The need for an automated DA stress testing system was clear.
  – HammerCloud was born in November 2008 to deliver on-demand stress tests to ATLAS sites:
    • Since then HC has run >1300 “Tests” using more than 4 million jobs.
    • ATLAS has invested >200k CPU-days in HC tests
  – CMS has also agreed to use HC: in April a prototype was delivered, and now scale tests are about to begin.


HC and ATLAS during STEP’09


HammerCloud Use-Cases

• Provides On-Demand and Automated Testing

• HC Operators define test templates: FUNCTIONAL and STRESS

• Functional Tests are automatically scheduled
  – Results are published on the HC website and can be pushed to other systems (e.g. SAM)

• Stress tests are generally scheduled on demand as needed by:
  – Central VO managers
  – Cloud/Regional managers
  – Site managers

• For all tests, a detailed report summarizing the job success rates and performances is produced.


HammerCloud Components

• The HC UI is implemented as a Django web app:
  – View test results
  – View cloud/site evolution
  – DB Admin

• State is maintained in a MySQL DB

• HC Logic (job submission, monitoring, resubmission) is implemented on top of the Ganga Grid Programming Interface (GPI)


HammerCloud Logic

• An HC Test is described by:
  – The analysis code to run (typically a real analysis from the user community)
  – The dataset pattern (which can be resolved to a set of datasets appropriate for the analysis code)
  – The list of sites to be tested, and the target number of jobs to run concurrently per site
  – A start time and an end time

• Test execution proceeds in 4 steps:
  – Generate: Test description is converted to a set of submittable jobs (e.g. Ganga job objects, one for each site under test)
  – Submit: the job objects are submitted
  – Run: jobs are monitored, outputs recorded to the HC DB, jobs are resubmitted to achieve the target number of running jobs per site
  – Exit: at the test end time, leftover jobs are killed

• Concurrently, the HC Web shows real time test results
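
The four steps can be illustrated with a small simulation. This is only a sketch in plain Python, not HammerCloud or Ganga code: the names `HCTest`, `Job`, `generate`, `submit`, `top_up` and `exit_test` are hypothetical, and job submission is simulated by changing a status string.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class HCTest:
    """The ingredients of a test description (field names are hypothetical)."""
    analysis_code: str
    dataset_pattern: str
    target_jobs_per_site: dict  # site name -> jobs to keep running concurrently
    start: datetime
    end: datetime


@dataclass
class Job:
    site: str
    status: str = "new"  # new -> running -> completed | killed


def generate(test):
    """Generate: turn the description into one initial job per site."""
    return [Job(site) for site in test.target_jobs_per_site]


def submit(jobs):
    """Submit: hand new jobs to the backend (simulated by a status change)."""
    for job in jobs:
        if job.status == "new":
            job.status = "running"


def top_up(test, jobs):
    """Run: resubmit until each site has its target number of running jobs."""
    for site, target in test.target_jobs_per_site.items():
        running = sum(1 for j in jobs if j.site == site and j.status == "running")
        new_jobs = [Job(site) for _ in range(target - running)]
        submit(new_jobs)
        jobs.extend(new_jobs)


def exit_test(jobs):
    """Exit: kill whatever is still running at the test end time."""
    for job in jobs:
        if job.status == "running":
            job.status = "killed"


start = datetime(2010, 6, 3, 12, 0)
test = HCTest("analysis.py", "data.*", {"SITE_A": 3, "SITE_B": 2},
              start, start + timedelta(hours=1))
jobs = generate(test)
submit(jobs)
top_up(test, jobs)              # brings both sites up to their targets
jobs[0].status = "completed"    # pretend one job at SITE_A finished
top_up(test, jobs)              # a replacement job is submitted to SITE_A
exit_test(jobs)                 # in a real system this happens at test.end
```

In a real deployment the Run step would loop until the end time and record job outputs to the database; the sketch calls `top_up` twice by hand to keep the example short.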


An HC-LHCb Plugin

• What customizations would be needed for an HC-LHCb plugin?

• HC is built upon Ganga and exploits its job management features:
  – job repository, job configuration via python, job submission, job monitoring in background thread(s)

• Given the existing GangaLHCb plugins, modifications to HC itself would be relatively minor, e.g.
  – HC Test Generation:
    • Query a data discovery service to form a job processing random input data
  – HC Test Running:
    • Changes to extract LHCb-specific job metrics from Ganga
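
As a rough illustration of the test generation change, the sketch below picks random input files for one job. It is an assumption-laden example: `discover_files` is a stand-in that returns made-up file names rather than calling any real LHCb data discovery service, and `pick_random_input` is a hypothetical helper.

```python
import random


def discover_files(dataset_pattern):
    """Stand-in for a data discovery query; returns made-up file names."""
    return [f"{dataset_pattern}/file_{i:03d}.dst" for i in range(20)]


def pick_random_input(dataset_pattern, n_files, rng):
    """Choose a random subset of the discovered files for one test job."""
    candidates = discover_files(dataset_pattern)
    return rng.sample(candidates, min(n_files, len(candidates)))


rng = random.Random(42)  # seeded only so the sketch is reproducible
inputs = pick_random_input("/lhcb/example", 5, rng)
```

Randomizing the input per job is a plausible way to avoid every test job reading the same files, which would otherwise let caching at a site flatter the measured performance.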



Interface Tour

1. The Public User Interface


HC Home

• The HC Homepage lists the running and scheduled tests.


Viewing a Test

• The test overview gives a quick summary of: Overall job efficiency, CPU/Walltime, Events/WrapperTime

• Also shows a summary of the jobs running at each site involved in the test.
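
The overview numbers can be thought of as simple ratios over the recorded jobs. The slides do not give the exact definitions used by HC, so the sketch below assumes them: efficiency as completed jobs over all finished jobs, and the two ratios summed over completed jobs. The record fields (`cpu_s`, `wall_s`, `wrapper_s`, `events`) are hypothetical.

```python
def summarize(jobs):
    """Compute overview numbers from a list of per-job records (dicts)."""
    finished = [j for j in jobs if j["status"] in ("completed", "failed")]
    completed = [j for j in finished if j["status"] == "completed"]
    cpu = sum(j["cpu_s"] for j in completed)
    wall = sum(j["wall_s"] for j in completed)
    wrapper = sum(j["wrapper_s"] for j in completed)
    events = sum(j["events"] for j in completed)
    return {
        # Assumed definition: completed jobs / all finished jobs
        "efficiency": len(completed) / len(finished) if finished else 0.0,
        "cpu_over_walltime": cpu / wall if wall else 0.0,
        "events_per_wrapper_second": events / wrapper if wrapper else 0.0,
    }


example_jobs = [
    {"status": "completed", "cpu_s": 300.0, "wall_s": 400.0, "wrapper_s": 500.0, "events": 1000},
    {"status": "completed", "cpu_s": 100.0, "wall_s": 400.0, "wrapper_s": 500.0, "events": 1000},
    {"status": "failed", "cpu_s": 0.0, "wall_s": 10.0, "wrapper_s": 20.0, "events": 0},
    {"status": "running", "cpu_s": 0.0, "wall_s": 0.0, "wrapper_s": 0.0, "events": 0},
]
summary = summarize(example_jobs)
```

Running jobs are left out of every ratio here, so the numbers only move when a job reaches a final state.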


Viewing a Test: Summary Stats

• The Test Overview page also gives summary statistics by site
• Here you can see some example metrics (for CMS)


Viewing a Test: Per-Site Plots

• View plots of the recorded metrics for each site


Viewing a Test: Metric Comparisons

• View the plots for all sites for a specific metric

• Used to compare site-by-site


Modify a Running Test

• Authorized users can modify the parameters of a test at run time
  – E.g. change the end time, or number of running jobs per site


Clone a Previous Test

• Cloning a previous test is simple
  – Useful to repeat the test or to run an identical test at a different set of sites


Overall HC Plots

• Historical plots show previous test statistics
• Currently shows # running jobs per site. Plots showing the evolution of the performance metrics are in development.


HC Robot View

• The “Robot” view is used to show the success rates of functional test jobs over the past 24 hrs. (Similar to SSB)

• Clicking a site takes you to the list of Robot jobs executed at that site
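
A per-site success rate over a 24-hour window could be computed along these lines. This is a sketch under assumptions, not the HC implementation: each job is a hypothetical `(site, finished_at, ok)` tuple and `success_rates` is an illustrative helper.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def success_rates(jobs, now, window=timedelta(hours=24)):
    """Return {site: fraction of successful jobs} for jobs finished in the window."""
    counts = defaultdict(lambda: [0, 0])  # site -> [succeeded, total]
    for site, finished_at, ok in jobs:
        if now - window <= finished_at <= now:
            counts[site][1] += 1
            if ok:
                counts[site][0] += 1
    return {site: good / total for site, (good, total) in counts.items()}


now = datetime(2010, 6, 3, 12, 0)
robot_jobs = [
    ("SITE_A", now - timedelta(hours=1), True),
    ("SITE_A", now - timedelta(hours=5), False),
    ("SITE_A", now - timedelta(hours=30), False),  # outside the window, ignored
    ("SITE_B", now - timedelta(hours=2), True),
]
rates = success_rates(robot_jobs, now)
```

Sites with no jobs in the window do not appear in the result at all, which a display layer would need to treat differently from a site whose jobs all failed.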



Interface Tour

2. Admin Interface


HC Admin: Operator and User Views

• HC Operators can administer all tables in the HC DB via a web interface

• HC Users have more limited access


HC Admin: Tests and Templates

Above: List all Test Templates
Below: List all Tests


HC Admin: Edit a Test Template

• Test templates are defined via the Admin UI

• All of the parameters of a test are here, plus:
  – An active flag indicating that a template should be auto-scheduled
  – A default lifetime: auto-scheduled test instances of this template will run for this time period

• Normally, functional test templates include the list of sites to be tested, whereas stress test templates do not include a list of sites.
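
The roles of the active flag and the default lifetime can be sketched as follows. The `Template` fields and the `auto_schedule` function are hypothetical names chosen for illustration; the actual HC database schema is not shown in these slides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Template:
    name: str
    category: str        # "FUNCTIONAL" or "STRESS"
    sites: list          # normally empty for stress templates
    active: bool         # True -> eligible for auto-scheduling
    lifetime: timedelta  # default run time of an auto-scheduled instance


def auto_schedule(templates, now):
    """Create one (name, sites, start, end) test instance per active template."""
    return [(t.name, list(t.sites), now, now + t.lifetime)
            for t in templates if t.active]


templates = [
    Template("functional-ping", "FUNCTIONAL", ["SITE_A", "SITE_B"], True,
             timedelta(hours=24)),
    Template("stress-analysis", "STRESS", [], False, timedelta(hours=48)),
]
now = datetime(2010, 6, 3, 12, 0)
scheduled = auto_schedule(templates, now)
```

Only the functional template is scheduled here; the inactive stress template stays available for on-demand use.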


HC Admin: Adding a new Test

• Adding a new test on-demand is simple. Select the test template of interest, a start time, and an end time.

• If needed, Tests can be further customized after the template is copied over.
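
Creating a test from a template amounts to a copy followed by optional overrides. The sketch below models this with plain dictionaries; `new_test_from_template` and the parameter names are assumptions made for the example, not part of HC.

```python
import copy
from datetime import datetime


def new_test_from_template(template, start, end, **overrides):
    """Copy the template's parameters into a new test, then apply customizations."""
    if end <= start:
        raise ValueError("end time must be after start time")
    test = copy.deepcopy(template)  # the template itself stays unchanged
    test.update(start=start, end=end)
    test.update(overrides)
    return test


stress_template = {"analysis_code": "analysis.py", "dataset_pattern": "data.*",
                   "sites": [], "jobs_per_site": 50}
new_test = new_test_from_template(
    stress_template,
    start=datetime(2010, 6, 3, 12, 0),
    end=datetime(2010, 6, 4, 12, 0),
    sites=["SITE_A"],  # customization applied after the copy
)
```

The deep copy matters because a stress template typically carries no site list, so each test adds its own without altering the shared template.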


Summary

• HammerCloud is a DA functional and stress testing system used widely by ATLAS and coming soon for CMS

• Two basic use-cases:
  – Continuous stream of test jobs to measure site availability
  – Enable central managers to define standardized (stress) tests, and empower site managers to invoke those tests on-demand.

• An HC-LHCb plugin would leverage the existing GangaLHCb work
  – A prototype plugin would not take significant effort
