Compute services at CERN · •Self service for end users within the policies and allocations...

16
Compute services at CERN 1 Gavin McCance

Transcript of Compute services at CERN · •Self service for end users within the policies and allocations...

Page 1: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

Compute services at CERN

1

Gavin McCance

Page 2: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

Tier-0 compute services 2017

2

• Universal resource provisioning layer for bare metal, containers and

VMs

• HTCondor as the single end user interface

• Continue investing in automation and other communities for scaling

with fixed staff

• Self service for end users within the policies and allocations

OpenStack Resource Provisioning

HTCondor

Public Cloud

VMsContainersHPC

LSF

Volunteer

Computing

IT &

Experiment

Services

End Users CI/CD

APIs

CLIs

GUIs

Experiment Pilot Factories

Page 3: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

Tier-0 compute services 2017

3

• Universal resource provisioning layer for bare metal, containers and

VMs

• HTCondor as the single end user interface

• Continue investing in automation and other communities for scaling

with fixed staff

• Self service for end users within the policies and allocations

OpenStack Resource Provisioning

HTCondor

Public Cloud

VMsContainersHPC

LSF

Volunteer

Computing

IT &

Experiment

Services

End Users CI/CD

APIs

CLIs

GUIs

Experiment Pilot Factories

Page 4: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

4X

Compute Workloads

HPCServerHosting

HighThroughput

Opportunistic

CostperUnitWork

Accessibility

PotentialUsage

Page 5: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

5X

Compute Workloads

HPCServerHosting

HighThroughput

Opportunistic

CostperUnitWork

Accessibility

PotentialUsage

Page 6: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

High-throughput batch service

• 113k CPU cores and increasing

• LHC: Prompt Tier-0 calibration / hot events

• LHC: Tier-0 bulk reconstruction

• LHC: Our share of WLCG Grid quota

• All other CERN experiments (e.g. Compass)

• Associated experiments (e.g. AMS)

• Various local CERN groups in EP (ATLAS,

LHCb, etc)

• Typically delivering around 500k jobs per day

6

Page 7: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

High-throughput batch service

• Batch service balances the fair-share across all competing applications according to CERN resource policies

• Users interaction pattern: “submit a job,

sits in queue, runs, get result back”

• Currently migrating from proprietary product (LSF) to Open Source HTCondor

• We also run misc Griddy services (Compute Element and Argus) interfacing WLCG Grid to local site resources

7

Page 8: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

Batch HTC challenges and plans

• Scaling the service for Run3 and beyond

• Understanding and improving overall CPU

efficiency (currently 60%)

• Moving 1300 users to Condor (~end of

Run2) and retiring LSF

• Expanding into the external cloud / hosting…

8

Page 9: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

Computing scale challenge

9

0

20

40

60

80

100

120

140

160

Run1 Run2 Run3 Run4

GRID

ATLAS

CMS

LHCb

ALICE

• Efficiency

studies…!

• Expand to public

cloud?

• Volunteer

computing?

Page 10: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

To the public cloud and beyond..

• Investigating extending HTCondor pool to include

external cloud resources

• IBM Softlayer -> T-Systems -> HNScience Cloud

• Public procurement challenge

• From technical HTCondor point of view, adding

external nodes was quite easy

• More work involved from puppet infrastructure,

monitoring, cloud APIs, and coordination with

experiments

• Standard tooling (cf. interfaces) critical

10

Page 11: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

Recent activity: T-Systems

• WAN largely used

• Sometimes even saturated

11

• Mixture of “CPU-intensive”

and “network-intensive” tasks

– MC workloads tend to

dominate: easier to manage?

• Batch resources

fully loaded

– shared among VOs

Running Cores

ITTF IT-CM-IS

Page 12: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

12X

Compute Workloads

HPCServerHosting

HighThroughput

Opportunistic

CostperUnitWork

Accessibility

PotentialUsage

Page 13: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

LHC@home

• Scavenged resources

• Allows us to get additional computing

resources

• Volunteers (e.g. home PCs)

• Institute desktops

• Supercomputing backfill

• Small farms with easy deployment!

• Unpredictable but significant resources

• Target CPU bound simulations (not

data intensive)

• Over 50% of LHC compute is simulation!

• Outreach benefits!

13

Page 14: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

Lots of potential resources…

14

Page 15: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

Opportunistic challenges and plans

• Resources free but ops cost is not

• Reduce ops overhead - leverage same

technologies as we’re using to expand into

the cloud (e.g. Condor)

• Ramping up capacity and ensure the BOINC

infrastructure can scale

• Continue to work with experiments to find

workload that’s suitable for this use-case

15

Page 16: Compute services at CERN · •Self service for end users within the policies and allocations OpenStack Resource Provisioning HTCondor Public Cloud HPC Containers VMs LSF Volunteer

Compute Workloads

16X

Compute Workloads

HPCServerHosting

HighThroughput

Opportunistic

CostperUnitWork

Accessibility

PotentialUsage

Carolina