CERN Batch System, Monitoring and Accounting - HEPiX Fall 2012

Jérôme Belleman, CERN – IT-PES, October 2012
CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it

Page 1

CERN Batch System, Monitoring and Accounting

HEPiX Fall 2012

Jérôme Belleman, CERN – IT-PES

October 2012

Page 2

Context

Growing community

Busier batch system

Agile Infrastructure project

Page 3

Outline

1 Batch System Challenges

2 Batch Monitoring Tools

3 Batch Accounting Overhaul

Page 4

Section 1

Batch System Challenges

Page 5

CERN Batch Setup

Platform LSF 7.0.6

All resources to one cluster

Different shares for different customers: public, grid and several for CERN experiments

[Diagram: LSF master node, NFS server and LSF master failover; worker nodes (WN) running local jobs and grid jobs]
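
To make the idea of per-customer shares concrete, here is a minimal sketch of how a share tree can be flattened into cluster fractions. The tree layout, the group names (`exp_a`, `exp_b`, `exp_c`) and all numbers are hypothetical; the slides do not give the actual share values, and this is not how LSF itself is configured.

```python
def effective_shares(tree, parent_fraction=1.0, prefix=""):
    """Flatten a share tree into the cluster fraction each leaf receives.

    A node's value is either a number (leaf share) or a tuple
    (share, children) for a group with sub-groups.
    """
    # Shares are relative to their siblings, so normalise by the sibling total.
    total = sum(v[0] if isinstance(v, tuple) else v for v in tree.values())
    result = {}
    for name, value in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(value, tuple):
            share, children = value
            fraction = parent_fraction * share / total
            result.update(effective_shares(children, fraction, path))
        else:
            result[path] = parent_fraction * value / total
    return result


# Hypothetical share values, for illustration only.
example_tree = {
    "public": 10,
    "grid": 30,
    "experiments": (60, {"exp_a": 2, "exp_b": 1, "exp_c": 1}),
}

shares = effective_shares(example_tree)
for path, fraction in shares.items():
    print(f"{path}: {fraction:.1%}")
```

Because shares are normalised among siblings at each level, changing one experiment's share only redistributes capacity inside the `experiments` group.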

Page 6

A Large Batch System

> 4 000 physical nodes

> 60 000 cores, some SMT-enabled (25% overcommit)

> 55 000 job slots, > 400 000 jobs/day:
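
As a rough back-of-the-envelope reading of these figures, the sketch below derives average rates from the two lower bounds on the slide. The mean job length assumes every slot is busy around the clock, which is an idealisation and not a number from the talk.

```python
job_slots = 55_000        # lower bound from the slide ("> 55 000")
jobs_per_day = 400_000    # lower bound from the slide ("> 400 000")

seconds_per_day = 24 * 60 * 60

# Average submission/completion rate the scheduler has to sustain.
jobs_per_second = jobs_per_day / seconds_per_day

# How many jobs each slot runs per day on average.
jobs_per_slot_per_day = jobs_per_day / job_slots

# Mean job length IF all slots were permanently occupied (an assumption).
mean_job_hours_if_full = 24 / jobs_per_slot_per_day

print(f"{jobs_per_second:.1f} jobs/s")
print(f"{jobs_per_slot_per_day:.1f} jobs per slot per day")
print(f"{mean_job_hours_if_full:.1f} h mean job length at full occupancy")
```

With these inputs the average works out to roughly 4.6 jobs per second and about 7.3 jobs per slot per day, which corresponds to a mean job length of about 3.3 hours under the full-occupancy assumption.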

Page 7

Future of the Batch Service

Agile Infrastructure Project:

Virtualise resources in the Computer Centre (CC): batch nodes to be fat VMs

Uniform IaaS layer

Configuration management with Puppet

Page 8

Today’s Operational Issues

High submission and query load → Slow response

Ensuring fairshare scheduling

Complex LSF setup

Poor dynamism requiring daily reconfiguration

Scalability

Page 9

Possible Alternatives to LSF

Goal for 5 years:

4 000 → 12 000 physical nodes

60 000 → 300 000 cores

Support frequent structural changes

Possible alternatives (unordered):

LSF 8

Condor

Grid Engine

Torque

SLURM ←
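
The five-year goal can be restated as growth factors. The sketch below uses only the four numbers on the slide; the compound annual rate assumes smooth growth over exactly five years, which the slide does not claim.

```python
nodes_now, nodes_goal = 4_000, 12_000
cores_now, cores_goal = 60_000, 300_000
years = 5

node_factor = nodes_goal / nodes_now      # overall growth in nodes
core_factor = cores_goal / cores_now      # overall growth in cores

# Cores per node implied by the current and target figures.
density_now = cores_now / nodes_now
density_goal = cores_goal / nodes_goal

# Equivalent compound annual growth rate (assumes smooth growth).
node_cagr = node_factor ** (1 / years) - 1
core_cagr = core_factor ** (1 / years) - 1

print(f"nodes x{node_factor:.0f}, cores x{core_factor:.0f}")
print(f"cores per node: {density_now:.0f} -> {density_goal:.0f}")
print(f"annual growth: nodes {node_cagr:.1%}, cores {core_cagr:.1%}")
```

The core count grows faster than the node count (5x against 3x), so the target implies denser nodes, going from 15 to 25 cores per node on average.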

Page 10

Evaluating SLURM

From the SLURM Web site:

Free

65 000 physical nodes

120 000 jobs/hour

Active community

Extensible via plug-ins

Test bed:

Implement and test hierarchical fairshare model

Controllably submit queries and jobs

Reproducible load

Scale number of hosts, jobs, slots and queries
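
One way to obtain a reproducible load is to generate the submission schedule from a seeded random generator, so that the same seed always yields the same plan. The sketch below is an illustration under that assumption, not the test bed used at CERN: the Poisson arrival model, the function name and the parameters are hypothetical, and it only computes timestamps without contacting any scheduler.

```python
import random


def submission_times(rate_per_hour, duration_s, seed):
    """Return job submission offsets in seconds for a Poisson arrival process."""
    rng = random.Random(seed)          # private generator: same seed, same plan
    rate_per_s = rate_per_hour / 3600
    t = 0.0
    times = []
    while True:
        # Exponential inter-arrival gaps give Poisson-distributed arrivals.
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return times
        times.append(t)


# One minute of load at the rate quoted from the SLURM Web site.
plan_a = submission_times(rate_per_hour=120_000, duration_s=60, seed=2012)
plan_b = submission_times(rate_per_hour=120_000, duration_s=60, seed=2012)

print(len(plan_a), "submissions planned")
print("identical plans:", plan_a == plan_b)
```

Scaling the test then amounts to changing `rate_per_hour` and `duration_s` while keeping the seed fixed, so two runs against different scheduler configurations see the same sequence of submissions.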

Page 11

Section 2

Batch Monitoring Tools

Page 12

Technology Overview

Oracle, Python, Matplotlib & Django → Stats

Cassandra → Fairshare monitoring

OpenTSDB → Live monitoring

Splunk → Historical usage

Page 13

Live Monitoring with OpenTSDB
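
For readers unfamiliar with OpenTSDB, data points can be sent as text lines of the form `put <metric> <timestamp> <value> <tag>=<value> ...`. The sketch below only formats such a line; the metric name `batch.jobs.running` and the tags are hypothetical and are not taken from the CERN setup.

```python
import time


def tsdb_put_line(metric, value, tags, timestamp=None):
    """Format one data point as an OpenTSDB telnet-style 'put' line."""
    if not tags:
        # OpenTSDB requires at least one tag per data point.
        raise ValueError("at least one tag is required")
    ts = int(time.time()) if timestamp is None else int(timestamp)
    # Sort tags so the output is deterministic.
    tag_str = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"put {metric} {ts} {value} {tag_str}"


line = tsdb_put_line(
    "batch.jobs.running",                        # hypothetical metric name
    41250,
    {"queue": "grid", "cluster": "example"},     # hypothetical tags
    timestamp=1350000000,
)
print(line)
# put batch.jobs.running 1350000000 41250 cluster=example queue=grid
```

Tags such as queue or cluster are what allow the same metric to be plotted per customer or aggregated over the whole batch system.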

Page 14

Historical Usage with Splunk

Page 15

Section 3

Batch Accounting Overhaul

Page 16

New Batch Accounting: Goals

Make portable to other schedulers

Publish local job information

Publish correct normalisation factor per job

Use the new APEL software

Remove complexity, improve consistency
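
To illustrate what a per-job normalisation factor means, the sketch below scales each job's CPU time by a factor attached to the job record, so that work done on faster and slower hosts becomes comparable. The record layout, the field names and the factor values are hypothetical; the slides do not define how the factor is computed or how APEL consumes it.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    job_id: str
    cpu_seconds: float
    norm_factor: float   # hypothetical: host speed relative to a reference host


def normalised_cpu_seconds(job):
    """Scale raw CPU time by the job's own normalisation factor."""
    return job.cpu_seconds * job.norm_factor


# Two jobs with equal raw CPU time on hosts of different speed.
jobs = [
    JobRecord("1001", cpu_seconds=3600.0, norm_factor=1.0),
    JobRecord("1002", cpu_seconds=3600.0, norm_factor=1.5),
]

total_raw = sum(job.cpu_seconds for job in jobs)
total_normalised = sum(normalised_cpu_seconds(job) for job in jobs)

print(f"raw: {total_raw:.0f} s, normalised: {total_normalised:.0f} s")
```

Storing the factor with each job, as opposed to applying one average factor to a whole cluster, keeps the accounting correct when hosts of different speeds share the same batch system.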

Page 17

Old vs. New Batch Accounting

[Diagram; recoverable labels: CEs, BLAH File, LRMS Acct. File, Acct. Reports, Acct. Page, APEL, Acct. Portal, Daily, Filter, Local APEL, SSM Messaging]

Page 19

Old vs. New Batch Accounting

[Diagram; recoverable labels: CEs, BLAH File, LRMS Acct. File, Acct. Reports, Acct. Page, APEL, Acct. Portal, Real-Time, Filter, Local APEL, SSM Messaging]

Page 20

Conclusion

We need to scale

We’re moving to new infrastructure tools

CERN batch service being prepared for future challenges

Page 21

Thanks!

Questions?