ATLAS & CMS Online Clouds
Olivier Chaze (CERN-PH-CMD) & Alessandro Di Girolamo (CERN IT-SDC-OL)
Exploiting the experiments' online farms for offline activities during LS1 and beyond
The Large Hadron Collider
[Figure: the LHC ring, situated ~100 m underground.]
The experiment data flow
40 MHz collision rate (~1000 TB/sec)
↓ Trigger Level 1 – special hardware
75 kHz (75 GB/sec)
↓ Trigger Level 2 – embedded processors
5 kHz (5 GB/sec)
↓ Trigger Level 3 – farm of commodity CPUs
400 Hz (400 MB/sec)
↓ Tier0 (CERN Computing Centre): data recording & offline analysis
…similar for each experiment...
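The rates above imply the following per-level rejection factors; a back-of-the-envelope check using only the numbers quoted in this slide (the script itself is just illustrative):

```python
# Rough rejection factors implied by the trigger rates quoted above.
rates_hz = [
    ("collisions", 40e6),   # 40 MHz
    ("level 1",    75e3),   # 75 kHz
    ("level 2",    5e3),    # 5 kHz
    ("level 3",    400.0),  # 400 Hz
]

for (prev_name, prev_rate), (name, rate) in zip(rates_hz, rates_hz[1:]):
    print("%s -> %s: rejection factor ~%.1f" % (prev_name, name, prev_rate / rate))
# collisions -> level 1: ~533.3, level 1 -> level 2: ~15.0, level 2 -> level 3: ~12.5
```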
Resources overview
High Level Trigger experiment farms
• ATLAS P1: 15k cores (28k with Hyper-Threading, 25% reserved for TDAQ)
• CMS P5: 13k cores (21k with Hyper-Threading)
When available: ~50% bigger than the Tier0, doubling the capacity of the biggest Tier1 of the experiments
Network connectivity to the IT Computing Centre (Tier0)
• P1 ↔ CERN IT CC (the so-called Castor link): 70 Gbps (20 Gbps reserved for Sim@P1)
• P5 ↔ CERN IT CC: 20 Gbps (80 Gbps foreseen in the next months)
Why
Experiments are always resource hungry:
• ATLAS + CMS: more than 250k jobs running in parallel
… exploit all the available resources!
Teamwork
Experts from the Trigger & Data Acquisition teams of the Experiments
Experts from other institutes: BNL RACF, Imperial College, …
Experts of WLCG (Worldwide LHC Computing Grid)
Why Cloud?
Cloud as an overlay infrastructure
• provides necessary management of VM resources
• support & control of physical hosts remain with TDAQ
• delegate Grid support
Easy to quickly switch between HLT ↔ Grid
• during LS1: periodic full-scale tests of TDAQ sw upgrades
• can be used in the future also during short LHC stop
OpenStack: common solution, big community!
• CMS, ATLAS, BNL, CERN IT, …
• sharing experiences
• …and support if needed
OpenStack
Glance: VM base image storage and management
• central image repository (and distribution) for Nova
Nova: central operations controller for hypervisors and VMs
• CLI tools, VM scheduler, compute node client (a boot sketch follows below)
• network in multi-host mode for CMS
Horizon / high-level control tools
• WebUI for OpenStack infrastructure/project/VM control (limited use)
RabbitMQ: message broker used by the OpenStack services
ATLAS: OpenStack version currently used: Folsom
CMS: OpenStack version currently used: Grizzly
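As an illustration of how the Nova CLI tools mentioned above are typically driven, the hedged sketch below boots a small batch of VMs by shelling out to the `nova` client (Folsom/Grizzly era). The image name, flavor and naming scheme are placeholders, not the ones used at P1/P5:

```python
#!/usr/bin/env python
# Hedged sketch: boot a batch of VMs through the Nova CLI.
# Image name, flavor and VM naming scheme are made up for illustration only;
# credentials are assumed to be set via the usual OS_* environment variables.
import subprocess

IMAGE = "sim-at-p1-base"      # hypothetical Glance image name
FLAVOR = "m1.large"           # hypothetical flavor
N_VMS = 10

for i in range(N_VMS):
    name = "worker-%03d" % i
    # Equivalent to typing: nova boot --image <image> --flavor <flavor> <name>
    subprocess.check_call(
        ["nova", "boot", "--image", IMAGE, "--flavor", FLAVOR, name])
```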
Network challenges
Avoid any interference with:
• Detector Control System operations
• internal Control network
Each compute node is connected to two networks
• one subnet per rack per network
• Routers allow traffic to registered machines only
ATLAS
• a new dedicated VLAN has been set up
• VMs are registered on this network
CMS
• VMs aren't registered
• SNAT rules are defined on the hypervisors to bypass the network limitations (see the sketch below)
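A hedged sketch of what such an SNAT rule on a CMS hypervisor could look like; the interface name and hypervisor address are placeholders, only the 10.29.0.0/16 Nova network range is taken from the deployment described here:

```python
#!/usr/bin/env python
# Hedged sketch: add an SNAT rule so that VM traffic leaves with the
# hypervisor's (registered) address. Interface and hypervisor address are
# placeholders; only the 10.29.0.0/16 VM range comes from this deployment.
import subprocess

VM_NET = "10.29.0.0/16"        # Nova network used by the VMs
OUT_IFACE = "eth0"             # hypervisor uplink interface (placeholder)
HV_ADDR = "10.176.0.42"        # hypervisor's registered address (placeholder)

subprocess.check_call([
    "iptables", "-t", "nat", "-A", "POSTROUTING",
    "-s", VM_NET, "-o", OUT_IFACE,
    "-j", "SNAT", "--to-source", HV_ADDR,
])
```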
CMS online Cloud
[Architecture diagram: ~1.3k compute nodes, each running Nova Compute, Nova Network, Nova Metadata and libvirt/KVM, with SNAT and a bridge (br928) giving the VMs connectivity; the nodes sit on the data network (10.179.0.0/22) and the control network (10.176.0.0/25). Four controllers, kept highly available with Corosync/Pacemaker, host the Nova APIs, Nova scheduler, conductor, Keystone, Horizon dashboard, RabbitMQ and NAT, and act as gateways 10.29.0.1 / 10.29.0.2 for the Nova network (10.29.0.0/16). A Glance server and a MySQL Cluster (two management nodes, four data nodes in two groups) complete the setup, with a 20 Gb link towards the IT Computing Centre and the GPN network.]
[Diagram: VMs on the online farm connect to the Grid services at the IT Computing Centre: CVMFS, glideins, Condor, Castor/EOS.]
VM image
SL5 x86_64 based (KVM hypervisor)
• post-boot contextualization method: script injected into the base image (Puppet in the future)
• pre-caching of images on the hypervisors: bzip2-compressed QCOW2 images of about 350 MB (a decompression sketch follows below)
Experiment-specific SW distribution with CVMFS
• CVMFS: network file system based on HTTP and optimized to deliver experiment software
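A minimal sketch of the pre-caching step on a hypervisor: decompressing a pre-fetched bzip2-compressed QCOW2 base image into the local image cache. The paths are placeholders; only the bzip2/QCOW2 choice and the ~350 MB image size come from the slide above:

```python
#!/usr/bin/env python
# Hedged sketch: decompress a pre-fetched bzip2 QCOW2 base image on a
# hypervisor. Paths are placeholders, not the actual Sim@P1 locations.
import bz2
import shutil

COMPRESSED = "/var/cache/simatp1/base-image.qcow2.bz2"      # placeholder path
TARGET = "/var/lib/nova/instances/_base/base-image.qcow2"   # placeholder path

with bz2.BZ2File(COMPRESSED, "rb") as src, open(TARGET, "wb") as dst:
    shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)   # stream in 16 MB chunks
```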
ATLAS: Sim@P1 running jobs: 1 June – 20 Oct
[Plot: number of running Sim@P1 jobs versus time, peaking at ~17k; annotations mark the Sim@P1 deployment, the ATLAS TDAQ technical runs in July and August, and the LHC P1 cooling intervention.]
Overall: ~55% of the time available for Sim@P1
Last month dedicated to interventions on the infrastructure
ATLAS: Sim@P1: Start & Stop
[Plot: running job slots between Aug 27 and Sep 2, 2013, with a plateau of ~17.1k job slots; the ungraceful shutdown of the VM group (rack by rack) takes 10 min at 29 Hz, restoring the VM group from the dormant state takes 45 min at 6 Hz; job flow ~0.8 Hz.]
• Restoring Sim@P1: all VMs up and running within 45 min (6 Hz); MC production job flow of 0.8 Hz, now improved to almost 1.5 Hz
• Shutdown: within 10 min (29 Hz) the infrastructure is back to TDAQ
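The start/stop rates quoted above follow directly from the number of job slots and the elapsed time; a quick check using only the figures from this slide:

```python
# Back-of-the-envelope check of the start/stop rates quoted above.
job_slots = 17100.0          # ~17.1k job slots

restore_s = 45 * 60          # restore from the dormant state: 45 minutes
shutdown_s = 10 * 60         # ungraceful rack-by-rack shutdown: 10 minutes

print("restore  rate: ~%.1f slots/s" % (job_slots / restore_s))   # ~6.3 Hz
print("shutdown rate: ~%.1f slots/s" % (job_slots / shutdown_s))  # ~28.5 Hz, i.e. the ~29 Hz quoted
```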
ATLAS: Sim@P1 completed jobs: 1 June – 20 Oct
Total successful jobs: 1.65M; efficiency: 80%; total wall clock: 63.8 Gseconds
Wall clock spent in failed jobs: 10.3%
• 78% lost heartbeat: intrinsic to the opportunistic nature of the resources
Comparison with CERN-PROD: total wall clock 83.3 Gsec, wall clock in failed jobs 6%
Overall: ~ 55% of time available for Sim@P1
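Using only the numbers quoted on this slide, a quick comparison of the wall clock delivered by Sim@P1 against CERN-PROD over the same period:

```python
# Comparison of the wall clock delivered by Sim@P1 and by CERN-PROD,
# using only the numbers quoted on this slide.
simatp1_gsec = 63.8     # Sim@P1 total wall clock [Gseconds]
cernprod_gsec = 83.3    # CERN-PROD total wall clock [Gseconds]

print("Sim@P1 delivered ~%.0f%% of CERN-PROD's wall clock"
      % (100.0 * simatp1_gsec / cernprod_gsec))   # ~77%
```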
Conclusions
The experiments' online Clouds are a reality
Cloud solution: no impact on data taking, easy switch of activity; quick onto them, quick out of them, e.g.:
• from 0 to 17k Sim@P1 jobs running in 3.5 hours
• from 17k Sim@P1 jobs running to TDAQ ready in 10 min
• contributing to computing as one big Tier1 or CERN-PROD!
Operations: still a lot of (small) things to do
• integrate OpenStack with the online control to allow dynamic allocation of resources to the Cloud
• open questions not unique to the experiments' online clouds: an opportunity to unify solutions to minimize manpower!
BackUp
Sim@P1: Dedicated Network Infrastructure
[Diagram: racks of compute nodes (CN) in SDX1, each with a 1 Gbps data switch and a 1 Gbps control switch on the ATCN; the data switches connect through the data core and the P1 Castor router to the IT Castor router over a 20-80 Gbps link; the ATLAS HLT SubFarm Output nodes (SFOs) use a 10 Gbps connection; the ATLAS side (control core) provides Puppet and the package repository, while the IT Grid side provides Condor, PanDA, EOS/Castor and CVMFS; access is restricted with ACLs.]
Sim@P1 VMs will use a dedicated 1 Gbps physical network connecting the P1 rack data switches to the “Castor router”, with VLAN isolation.
Cloud Infrastructure of Point 1 (SDX1)
[Diagram: the Point 1 cloud control infrastructure, including Keystone, Horizon and a RabbitMQ cluster.]
Keystone (2012.2.4)
• No issues with stability/performance observed at any scale
• Initial configuration of tenants / users / services / endpoints might deserve some higher-level automation
  - some automatic configuration scripts were already available in 2013Q1 from third parties, but we found using the Keystone CLI directly more convenient and transparent (a sketch follows below)
• Simple replication of the Keystone MySQL DB works fine for maintaining redundant Keystone instances
Slide from Alex Zaytsev (BNL RACF)
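A hedged sketch of the kind of direct Keystone CLI calls referred to above, driven from Python; all tenant, user and service values are placeholders, not the actual Sim@P1 configuration, and exact option names varied slightly between releases:

```python
#!/usr/bin/env python
# Hedged sketch: initial Keystone configuration driven through the keystone
# CLI (Folsom-era client). Names and passwords below are placeholders.
import subprocess

def keystone(*args):
    subprocess.check_call(["keystone"] + list(args))

keystone("tenant-create", "--name", "simatp1")                        # placeholder tenant
keystone("user-create", "--name", "operator", "--pass", "CHANGEME")   # placeholder user
keystone("service-create", "--name", "nova", "--type", "compute",
         "--description", "Nova Compute Service")
# The corresponding endpoint-create call (region plus public/internal/admin
# URLs of the controller) would follow here; values deliberately omitted.
```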
Nova (2012.2.4)
• One bug fix needed to be applied
  - back port from the Grizzly release, “Handle compute node records with no timestamp”: https://github.com/openstack/nova/commit/fad69df25ffcea2a44cbf3ef636a68863a2d64d9
• The prefix for the VMs' MAC addresses had to be changed in order to match the range pre-allocated for the Sim@P1 project
  - no configuration option for this, a direct patch to the Python code was needed
• Configuring the server environment for a Nova Controller supporting more than 1k hypervisors / 1k VMs requires raising the default limits on the maximum number of open files for several system users
  - not documented / handled automatically by the OpenStack recommended configuration procedures, but pretty straightforward to figure out (a check sketch follows below)
• A RabbitMQ cluster of at least two nodes was required in order to scale beyond 1k hypervisors per single Nova Controller
  - the RabbitMQ configuration procedure / stability is version sensitive; we had to try several versions (currently v3.1.3-1) before achieving a stable cluster configuration
Overall: stable long-term operations with only one Cloud controller (plus one hot-spare backup instance) for the entire Point 1
Slide from Alex Zaytsev (BNL RACF)
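Not from the slides themselves, but a minimal sketch of how one might verify the per-process open-file limit mentioned above on a controller node; the 4096-descriptor threshold is an illustrative assumption, not the value used in the actual deployment:

```python
#!/usr/bin/env python
# Hedged sketch: check the open-file limit on a Nova controller node.
# The 4096-descriptor threshold is an illustrative assumption only.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft=%d hard=%d" % (soft, hard))

if soft < 4096:
    print("WARNING: soft limit looks too low to serve >1k hypervisors; "
          "raise it (e.g. via /etc/security/limits.conf) for the relevant users")
```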
Glance (2012.2.4)
• A single Glance instance (with a single 1 Gbps uplink) works nicely as a central distribution point up to the scale of about 100 hypervisors / 100 VM instances
• Scaling beyond that (1.3k hypervisors, 2.1k VM instances) requires either
  - a dedicated group of cache servers between Glance and the hypervisors, or
  - a custom mechanism for pre-deployment of the base images on all compute nodes (multi-level replication)
• Since we operate with only one base image at a time, which changes rarely (approximately once a month), we built a custom image deployment mechanism, leaving the central Glance instances with the functionality of image repositories but not of central image distribution points (see the fan-out sketch below)
  - no additional cache servers needed
  - we distribute bzip2-compressed QCOW2 images that are only about 350 MB in size
  - pre-placement of a new image on all the hypervisors takes only about 15 minutes in total, despite the 1 Gbps network limitation on both the Glance instances and at the level of every rack of compute nodes
• The snapshot functionality of Glance is used only for making persistent changes to the base image; no changes are saved for VM instances during production operations
Slide from Alex Zaytsev (BNL RACF)
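A hedged sketch of the multi-level fan-out idea: the controller pushes the compressed image to one node per rack, which then redistributes it within its rack. Host names and paths are placeholders; only the bzip2-compressed QCOW2 image and the rack-level 1 Gbps limit come from the slide above:

```python
#!/usr/bin/env python
# Hedged sketch of a two-level image pre-placement: push the compressed base
# image to one "head" node per rack, then let each head fan it out to the
# other nodes in its rack. Host names and paths are placeholders; passwordless
# SSH between the nodes is assumed.
import subprocess

IMAGE = "/var/lib/glance/images/base-image.qcow2.bz2"     # placeholder path
RACKS = {
    "rack01": ["cn0101", "cn0102", "cn0103"],             # placeholder hosts
    "rack02": ["cn0201", "cn0202", "cn0203"],
}

for rack, nodes in RACKS.items():
    head, others = nodes[0], nodes[1:]
    # Level 1: controller -> rack head node
    subprocess.check_call(["scp", IMAGE, "%s:/tmp/base.qcow2.bz2" % head])
    # Level 2: rack head node -> remaining nodes in the same rack
    for node in others:
        subprocess.check_call(
            ["ssh", head, "scp /tmp/base.qcow2.bz2 %s:/tmp/" % node])
```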
Horizon (2012.2.4)
• Very early version of the web interface; many security features are missing
  - such as no native HTTPS support
• Not currently used for production at Sim@P1
• Several configuration / stability issues encountered
  - such as debug mode having to be enabled for Horizon to function properly
• Limited feature set of the web interface
  - no way to perform non-trivial network configuration purely via the web interface
  - no convenient way to handle large groups of VMs (1-2k+), e.g. to display VM instances in a tree structured according to the availability zones / instance names
  - no convenient way to perform bulk operations on large subgroups of VMs (hundreds) within a production group of 1-2k VMs (a CLI workaround is sketched below)
• All of these problems are presumably already addressed in recent OpenStack releases
Slide from Alex Zaytsev (BNL RACF)
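Since Horizon offered no convenient bulk operations at the time, here is a hedged sketch of how one might act on a large subgroup of VMs from the Nova CLI instead; the instance-name prefix and the table-parsing of `nova list` output are assumptions for illustration:

```python
#!/usr/bin/env python
# Hedged sketch: bulk-delete VMs whose names match a prefix, using the Nova
# CLI instead of Horizon. The "worker-" prefix is a placeholder naming scheme.
import subprocess

PREFIX = "worker-"   # placeholder prefix identifying the VM group

# `nova list` prints a table of the form: | ID | Name | Status | ... |
listing = subprocess.check_output(["nova", "list"]).decode()
for line in listing.splitlines():
    cols = [c.strip() for c in line.split("|")]
    if len(cols) > 2 and cols[2].startswith(PREFIX):
        subprocess.check_call(["nova", "delete", cols[1]])   # delete by ID
```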