Commissioning the CERN IT Agile Infrastructure with experiment workloads Ramón Medrano Llamas...
IT-SDC : Support for Distributed Computing
Commissioning the CERN IT Agile Infrastructure with experiment workloads
Ramón Medrano Llamas
IT-SDC-OL
14.10.2013
Agenda
- The Agile Infrastructure
- Workload Management Systems
- Dynamic provisioning
- Conclusions
The Agile Infrastructure
- Private IaaS cloud
- OpenStack based
- Federates Meyrin and Wigner
- 15,000 hypervisors by 2015
- 300,000 VMs by 2015
- Configuration management tools: Puppet, Foreman
Workload Management Systems
- Pilot-based systems
  - ATLAS: PanDA
  - CMS: glideinWMS
- Both have an HTCondor backend
- Using Nova and EC2 APIs
PanDA integration
- Manual HTCondor cluster deployment
- Long-lived worker nodes
- Condor for pilot submission
- CVMFS + EOS
- Same setup on HLT, Helix Nebula, Rackspace…
glidein integration
- Dynamic cluster deployment (EC2)
- Worker nodes automatically managed
- Condor for batch orchestration
- CVMFS + EOS
Deployment
Support from the AI Team
- Got 1,600 cores from the OpenStack team
- Testing Essex, Folsom, Grizzly
- Complete freedom to access resources
- Consultancy at any time
- Rapid bug report-solution cycle
Testing strategy
- Standard HammerCloud benchmark
- Compared with other clouds, bare metal
- 690,000 ATLAS jobs
- 337,000 CMS jobs
Testing summary
| | ATLAS | CMS |
|---|---|---|
| Scheduler | PanDA | glideinWMS |
| Cluster management | Static | Dynamic |
| Cluster size | 770 cores | 200 cores |
| Jobs submitted | 694,698 | 337,080 |
| Failure rate | 9.95% | 0.31% |
| Job type | Simulation | Simulation |
| Typical job duration | 31 min. | 9 min. |
| Duration variance | 17.8 min. | 4.8 min. |
| Most common error | Failed to read LFC | App. Error 8020 |
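The summary figures can be cross-checked with a little arithmetic. The sketch below derives approximate failed-job counts and jobs submitted per core from the table; the dictionary layout and the function name `derived_metrics` are illustrative, and the derived values are rounded estimates, not numbers reported on the slides.

```python
# Figures copied from the testing summary table.
SUMMARY = {
    "ATLAS": {"cores": 770, "jobs": 694_698, "failure_rate_pct": 9.95},
    "CMS": {"cores": 200, "jobs": 337_080, "failure_rate_pct": 0.31},
}


def derived_metrics(row):
    """Return estimated failed jobs and jobs submitted per core."""
    failed = round(row["jobs"] * row["failure_rate_pct"] / 100)
    per_core = round(row["jobs"] / row["cores"], 1)
    return {"failed_jobs": failed, "jobs_per_core": per_core}


for name, row in SUMMARY.items():
    print(name, derived_metrics(row))
```

On these inputs the ATLAS run implies roughly 69,000 failed jobs and the CMS run roughly 1,000, which puts the two failure rates in absolute terms.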
Performance testing: ATLAS, Ibex
| Site | Wallclock (s) | CPU efficiency (%) | Failure rate (%) |
|---|---|---|---|
| OPENSTACK_CLOUD | 3,114 | 78.8 | 2.1 |
| BNL_CLOUD | 1,505 | 80.2 | - |
| IAAS | 1,539 | 61.5 | - |
| CERN-PROD | 1,540 | 78.5 | - |
| BNL_CVMFS_1 | 1,660 | 67.5 | - |
- Tested late 2012
- Overcommitment of resources
  - CPU efficiency not reliable in this context
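One way to see why CPU efficiency misleads here: assuming the table's CPU efficiency is CPU time divided by wallclock time (the usual definition, though the slide does not state it), the implied CPU time on OPENSTACK_CLOUD is about twice that on CERN-PROD for the same benchmark, even though both report roughly 78% efficiency. The sketch below does that arithmetic; the function name is illustrative.

```python
def implied_cpu_seconds(wallclock_s, cpu_efficiency_pct):
    """CPU time implied by wallclock and efficiency, assuming
    efficiency = CPU time / wallclock time."""
    return wallclock_s * cpu_efficiency_pct / 100


# Ibex figures for the two sites, taken from the table.
openstack = implied_cpu_seconds(3114, 78.8)
cern_prod = implied_cpu_seconds(1540, 78.5)

print(f"OPENSTACK_CLOUD: {openstack:.0f} s CPU")
print(f"CERN-PROD:       {cern_prod:.0f} s CPU")
print(f"ratio:           {openstack / cern_prod:.2f}")
```

The efficiency column stays flat while the job takes about twice as long, so wallclock is the more informative metric for this comparison.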
Performance testing: ATLAS, Grizzly
| Site | Wallclock (s) | CPU efficiency (%) | Failure rate (%) |
|---|---|---|---|
| OPENSTACK_CLOUD | 1,827 | 82.3 | 13.7 |
| BNL_CLOUD | 1,960 | 69.9 | - |
| IAAS | 1,417 | 67.5 | - |
| CERN-PROD | 1,499 | 82.3 | - |
| BNL_CVMFS_1 | 1,611 | 72.6 | - |
- Tested late 2013
- Good improvement in performance and predictability
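The size of the improvement can be estimated from the two ATLAS tables. The sketch below compares OPENSTACK_CLOUD wallclock between the Ibex and Grizzly rounds and relates each to CERN-PROD; the data layout and the helper `pct_change` are illustrative, and the derived percentages are not figures from the slides.

```python
# Wallclock (s) from the two ATLAS tables: (Ibex, Grizzly).
WALLCLOCK = {
    "OPENSTACK_CLOUD": (3114, 1827),
    "CERN-PROD": (1540, 1499),
}


def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


for site, (ibex, grizzly) in WALLCLOCK.items():
    print(f"{site}: {pct_change(ibex, grizzly):+.1f}%")

# Cloud wallclock relative to CERN-PROD in each test round.
for label, idx in (("Ibex", 0), ("Grizzly", 1)):
    ratio = WALLCLOCK["OPENSTACK_CLOUD"][idx] / WALLCLOCK["CERN-PROD"][idx]
    print(f"{label}: {ratio:.2f}x CERN-PROD wallclock")
```

The cloud goes from about twice the CERN-PROD wallclock to about 1.2 times, while CERN-PROD itself barely changes between rounds.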
Performance testing: CMS, Ibex
| Site | Wallclock (s) | CPU efficiency (%) | Failure rate (%) |
|---|---|---|---|
| T2_CH_CERN_AI | 616 | 91.1 | 0.0 |
| T2_CH_CERN | 914 | 82.8 | - |
| T1_US_FNAL | 742 | 91.8 | - |
| T1_DE_KIT | 783 | 91.6 | |
- Tested mid 2013
- Reliability is incredibly good
Performance testing: ATLAS wallclock
[Figure: wallclock (s), axis from 0 to 2,500, per site (RACKSPACE_CLOUD, OPENSTACK_CLOUD, BNL_CLOUD, IAAS, CERN-PROD, BNL_CVMFS_1), broken down by job phase: get job, stage in, running, stage out, cleanup.]
Reliability testing
- Few failures
- Infrastructure vs. WMS failure
- Need new monitoring techniques
- Difficult to measure with state-of-the-art tools
Dynamic provisioning
- glideinWMS scales clusters automatically
- Slowness with non-batch requests
- PanDA was still not ready for it
- Studying APF and Cloud Scheduler
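The idea of scaling a cluster with demand can be illustrated with a toy decision rule. This is not glideinWMS's actual algorithm; the function `vms_to_start`, its parameters (`idle_jobs`, `running_vms`, `slots_per_vm`, `max_vms`) and the numbers in the example call are all hypothetical.

```python
import math


def vms_to_start(idle_jobs, running_vms, slots_per_vm, max_vms):
    """Hypothetical rule: request enough VMs to cover the idle jobs,
    without exceeding the quota of max_vms."""
    needed = math.ceil(idle_jobs / slots_per_vm)
    headroom = max(max_vms - running_vms, 0)
    return min(needed, headroom)


# 50 idle jobs would need 13 four-slot VMs, but only 5 fit under the quota.
print(vms_to_start(idle_jobs=50, running_vms=20, slots_per_vm=4, max_vms=25))
```

A real provisioner would also have to retire idle VMs and cope with slow or failed boot requests, which this sketch leaves out.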
Conclusions
- The infrastructure can be used successfully
- Scalability tests passed
Future work
- Unification of image lifetime
- Federation of other OpenStack clouds
- Accounting tools for clouds
- Better understanding of failures
Questions?