Transcript of "Cluster Consolidation at NERSC" (29 slides)

Slide 1: Cluster Consolidation at NERSC

Larry Pezzaglia
NERSC Computational Systems Group
[email protected]
HEPiX Spring 2014

Slide 2: Snapshot of NERSC

- Located at Lawrence Berkeley National Laboratory, NERSC is the
  production computing facility for the US DOE Office of Science
- NERSC serves ~5000 users, ~400 projects, and ~500 codes
- Focus is on “unique” resources:
  - Expert computing and other services
  - 24x7 monitoring
  - High-end computing and storage systems
- Known for:
  - Excellent services and user support
  - Diverse workload
- NERSC provides Hopper (a Cray XE6), Edison (a Cray XC30), and three
  data-intensive systems: Carver, PDSF, and Genepool

Slide 3: The NERSC Cluster Model

Slide 4: Cluster Expansion

- In 2012, NERSC purchased a new system, “Mendel”, to systematically
  expand its cluster resources:
  - 500+ Sandy Bridge nodes, 8000+ cores
  - FDR InfiniBand interconnect
- Mendel transparently expands production clusters and services
- Carver, PDSF, and Genepool (the “parent systems”) schedule jobs on
  portions of Mendel
- Mendel provides multiple software environments to match those on each
  parent system
- This model was presented at the 2013 Cray User Group meeting:
  - http://cug.org/proceedings/cug2013_proceedings/includes/files/pap184-file1.pdf
  - http://cug.org/proceedings/cug2013_proceedings/includes/files/pap184-file2.pdf

Slide 5: Data-Intensive Systems

[Diagram: the three data-intensive clusters (Carver, the general purpose
cluster; PDSF, the high energy physics/nuclear physics cluster; and
Genepool, the bioinformatics cluster), each connected to the NERSC
network, ESnet, and the 12+ PB NERSC Global Filesystems]

Slide 6: Data-Intensive Systems

[Diagram: the same three clusters as the previous slide, now with the
Mendel expansion cluster attached to Carver, PDSF, and Genepool, sharing
the NERSC network, ESnet, and the 12+ PB NERSC Global Filesystems]

Slide 7: The Mendel Approach

- We use tools to construct convenient management abstractions and tuned
  user environments on top of this platform (see the sketch after this
  slide):
- Familiar open-source software:
  - xCAT to provision and manage nodes
  - Cfengine3 to provide configuration management (versioned with SVN)
- NERSC-developed BSD-licensed software:
  - avs_image_mgr to handle xCAT image management and versioning
    http://github.com/lpezzaglia/avs_image_mgr
  - CHOS to provide multiple compute environments concurrently and
    seamlessly
    http://github.com/scanon/chos
  - minimond to collect trending data for troubleshooting and analysis
    http://github.com/lpezzaglia/minimond
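
A minimal sketch of how these pieces interact, assuming typical xCAT and
Cfengine3 usage; the image and node group names are illustrative, not
NERSC's actual ones:

    # Provision a node group with the unified base OS image via xCAT
    nodeset mendel-compute osimage=mendel-base-sl6-netboot
    rpower mendel-compute boot
    # At boot, each node's scripts run the Cfengine agent, which applies
    # the SVN-versioned policy that differentiates the node for its role
    cf-agent --no-lock --inform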

Slide 8: NERSC Cluster Model

[Diagram: a layered stack. At the bottom, the unified Mendel hardware
platform (Hardware/Network) and the unified Mendel base OS (Base OS).
Above that, per-system add-ons applied at boot time (Boot-time
Differentiation): xCAT policy, Cfengine policy, and UGE for PDSF; xCAT
policy, Cfengine policy, and UGE for Genepool; xCAT policy, Cfengine
policy, and TORQUE for Carver. Above that, the CHOS layer: PDSF sl64 and
sl53 environments, Genepool compute and login environments, and a Carver
compute environment. At the top, user applications: PDSF SL 6.4 and
SL 5.3 apps, Genepool Debian 6 apps and Debian 6 logins, and Carver
SL 5.5 apps]

Slide 9: Extending the Model

Slide 10: Motivation for Consolidation

- Easy Mendel administration highlighted the operational burden of
  managing legacy clusters
  - Changing configurations with pdsh scales poorly (see the sketch
    after this slide)
  - Mendel demonstrated the value of leveraging automation to manage
    complex systems
- We pursued further cluster consolidation efforts
  - Staff efficiency is highly valued
  - Legacy hardware also gains benefits through Mendel membership
- Cost: upfront effort to consolidate clusters and risk of user
  disruption
- Reward: reduced long-term sysadmin burden and increased system
  consistency
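
To illustrate the scaling problem, here is the kind of imperative pdsh
change the slide refers to; the hostnames and setting are hypothetical:

    # Push a one-off change to every node over ssh in parallel
    pdsh -w "gpnode[001-400]" \
        'echo "vm.min_free_kbytes = 262144" >> /etc/sysctl.conf && sysctl -p'
    # Nodes that are down or unreachable silently miss the change, and
    # nothing reconciles them later; under configuration management the
    # setting instead lives in versioned policy that every node
    # converges to on its next agent run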

Slide 11: Extending the Model

- In spring 2014, we merged Genepool, a legacy parent cluster, into
  Mendel's management system
- The combined cluster is now managed as a single integrated system with:
  - ~1000 nodes
  - Multi-generational, multi-vendor hardware
  - Multiple separate interconnects
  - A unified xCAT+Cfengine management interface
- Constrained by a 24x7, disruption-sensitive environment
- Change activated in a single all-day maintenance

Slide 12: Consolidated Cluster Model

[Diagram: the same layered stack as slide 8, except that the bottom
layer is now multi-vendor, multi-generational hardware unified under
xCAT management abstractions rather than a single homogeneous platform;
the unified Mendel base OS, per-system boot-time add-ons, CHOS
environments, and user applications above it are unchanged]

Slide 13: Specific challenges

Genepool and Mendel differ in several respects:

                        Mendel (New)               Genepool (Legacy)
  Production            FDR InfiniBand             Gigabit Ethernet
  interconnect
  Provisioning/IPMI     Dedicated GbE network      Dedicated IPMI network;
  network                                          provisions over the
                                                   production network
  OS                    SL 6.3 base OS with CHOS   Debian 6 without CHOS
  Hardware              Homogeneous platform       Many hardware
                                                   configurations

Slide 14: Approach

- xCAT's hierarchical management features are suited to managing
  dissimilar hardware
- Expansion required changes to our software stack:
  1. Expand hardware support in the base OS image
  2. Configure an xCAT service node
  3. Expand Cfengine rules and boot scripts
  4. Perform thorough testing
  5. Reboot the Genepool nodes through Mendel's management system (a
     sketch of this step follows below)
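
Step 5 might look like the following from the xCAT management node; the
node group and osimage names are placeholders, not the production values:

    # Point the legacy nodes at the unified base image and reboot them
    # into Mendel's management system
    nodeset genepool osimage=mendel-base-sl6-netboot
    rpower genepool boot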

Slide 15: Base OS modifications

- Support for all Genepool hardware was added to the base OS image (see
  the sketch after this slide):
  - Kernel modules for disk and network controllers
  - Initrd code to handle Genepool network characteristics
- An xCAT add-on was created for the xCAT service node
- Changes were made with avs_image_mgr, which:
  - Provides a full revision history of every file
  - Provides the ability to roll back to any previous image
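
One plausible shape for the hardware-support changes, using standard
xCAT image commands; the osimage name and driver list are assumptions,
and in practice avs_image_mgr wraps steps like these so every resulting
image is versioned and revertible:

    # Add NIC and disk controller drivers to the netboot image definition
    chdef -t osimage mendel-base-sl6-netboot netdrivers="igb,bnx2,mpt2sas"
    genimage mendel-base-sl6-netboot    # rebuild the image and its initrd
    packimage mendel-base-sl6-netboot   # repack the root filesystem for netboot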

Slide 16: xCAT modifications

- An xCAT Service Node (SN) handles Genepool provisioning/management
  under the direction of the Management Node (MN)
- Only the SN requires connectivity to the Genepool networks
- The MN only requires connectivity to the SN
- The SN provides DHCP/TFTP/HTTP/xCATd services
- xCAT commands, such as power/console operations, are transparently
  routed through the SN
- The SN is provisioned through the Mendel cluster model (see the
  sketch after this slide):
  1. The Mendel base OS image is booted
  2. The xCAT add-on is activated
  3. Cfengine rules apply SN-specific configurations
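
A minimal sketch of the xCAT hierarchy described above, with hypothetical
hostnames; the SN is marked as providing DHCP/TFTP/name services and the
legacy nodes are told to route management traffic through it:

    # Declare the service node and the services it offers
    chdef -t node -o gpsn01 groups=service,all \
        setupdhcp=1 setuptftp=1 setupnameserver=1
    # Point the Genepool nodes at the SN instead of the MN
    chdef -t group -o genepool servicenode=gpsn01 xcatmaster=gpsn01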

Slide 17: Postscripts and Cfengine

- Postscripts were extended to support Genepool characteristics (an
  illustrative postscript follows below):
  - GPFS cluster configuration
  - Multipath access to local disk arrays
  - Local filesystem configurations
  - Multiple hardware configurations
- Cfengine rules were augmented to support the additional node classes
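
For illustration, a hypothetical postscript of the kind listed above;
"confgpfs" and the GPFS primary server name are assumptions, and it
would be attached to nodes with something like
"chdef -t group -o genepool -p postscripts=confgpfs":

    #!/bin/bash
    # confgpfs: join a freshly booted node to the GPFS cluster by
    # restoring its copy of the cluster configuration, then start GPFS
    /usr/lpp/mmfs/bin/mmsdrrestore -p gpfs-primary -R /usr/bin/scp
    /usr/lpp/mmfs/bin/mmstartup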

Slide 18: The Combined Cluster

[Diagram: the Mendel compute nodes, service nodes, image builders, and
login nodes sit on the Mendel FDR InfiniBand production network and the
Mendel GbE management network. The xCAT management node issues xCAT
commands to an xCAT service node, which bridges to the Genepool GbE
IPMI network and the Genepool GbE boot/production network serving the
legacy compute and interactive nodes]

Slide 19: Cluster Automation

- The combined Mendel+Genepool system is complex:
  - Many different node classes
  - Each node class represents a unique software/hardware combination
  - Configuration complexity grows with the number of node classes

Slide 20: Cluster Automation

- The quantity of node classes exceeds what a human administrator can
  hold in immediate memory
- We must build abstractions to retain system manageability as
  complexity increases
- Configuration management has become a necessity:
  - We manage a single integrated system, not a collection of nodes
  - Every change must be considered in a system-wide context
  - Cfengine must broker change rollout (see the sketch after this
    slide)
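
A sketch of what brokered rollout means in practice, with the repository
layout assumed:

    # Changes reach nodes only through versioned policy, never by hand
    svn commit -m "raise NFS timeout for genepool computes" policy/
    # Every node converges on its regular scheduled agent run; for an
    # urgent fix, an admin can trigger a single run immediately
    cf-agent --no-lock --inform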

Slide 21: Development Updates

Slide 22: CHOS development

- CHOS enables concurrent support of multiple Linux environments on a
  single server (basic usage is sketched after this slide)
- A core component of the Mendel cluster model
- Under active development
- Recent changes include:
  - Ability to exit CHOS from within a CHOS environment
  - Build system improvements
  - pam_chos configurability enhancements
  - EL7 kernel support in a testing branch
- Planned features include:
  - Scripts to transform an installed EL system into a CHOS environment
  - A framework for user-supplied CHOS environments
  - Reduced kernel module scope
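
Basic CHOS usage, per its documentation; the environment name "sl64" is
an assumption about what this system's /etc/chos lists:

    # Select an environment for future logins and batch jobs
    echo sl64 > ~/.chos
    # Enter a shell inside the selected environment right now
    chos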

Slide 23: Data Collection

- Collecting trending data for historical analysis is growing
  increasingly important
- NERSC developed minimond to systematize this process:
  - Collects ~1000 statistics per node
  - Modular framework for sending metrics to multiple data aggregation
    services
  - Supported output methods: plain text and Ganglia (via gmetric or
    EmbeddedGmetric)
  - AMQP support is planned
- Only absolute counter values are recorded; calculation of derived
  statistics must be performed on a remote analysis server (see the
  sketch after this slide)
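
Not minimond itself, but a short sketch of the approach it systematizes:
sample a raw kernel counter and ship the absolute value to Ganglia via
gmetric, leaving rate calculations to the analysis side:

    # Read cumulative CPU jiffies from the first line of /proc/stat
    read cpu user nice system rest < /proc/stat
    # Publish the absolute counter; a remote server derives the rates
    gmetric --name cpu_user_jiffies --value "$user" --type double --units jiffies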

Slide 24: Metrics Graphs

[Graph: GPFS mmpmon operations]

Slide 25: Metrics Graphs

[Graph: GPFS mmpmon throughput]

Slide 26: Metrics Graphs

[Graph: IB gateway throughput]

Slide 27: Conclusions

- The extended NERSC cluster model enables systematic management of
  several multi-vendor, multi-interconnect, and multi-generational
  clusters as a single integrated system
- A unified management interface abstracts away complex details
- Implementation involved minimal user disruption
- Extending the model was far easier than separately managing both
  clusters
- The new model dramatically simplifies operations

Slide 28: Acknowledgements

This work was supported by the Director, Office of Science, Office of
Advanced Scientific Computing Research of the U.S. Department of Energy
under Contract No. DE-AC02-05CH11231.

Slide 29: National Energy Research Scientific Computing Center