Innovation Intelligence®
Michigan!/usr/group:
Compute Clusters—Building Blocks of the
Public Cloud
Jeff Marraccini, Vice President, Computer Systems
August 2015
About me, and yes, the disclaimer
• Work at Altair Engineering in Troy for 18 years
• My team manages a number of clusters
• I manage staff that handle our internal
clusters in a 2,300+ employee company, so:
My employer may not agree with all my
opinions – they are my own. I am also
a generalist. Check with others before
spending money on a cluster.
• Like many things, there are no “one size
fits all” solutions with HPC! Please research!
[Diagram: cluster fabric connecting Head Node, Storage, Exec Nodes, and Visualization Nodes]
Thank you, and seeing this stuff for real
Michigan!/usr/group contributed to my career – thank you!
Past and present members contributed to tools we use daily. Preaching to
the choir here: knowledge exchange empowers us all.
Tours:
I cannot show too much live while we are recording.
I would be glad to give you a tour if you are in the Troy, MI area – please
message me at [email protected]. Must agree not to reveal operational
specifics.
Overview of today’s talk
• Why clusters?
• Some history
• “Private cloud” clusters
• Architecture
• Failures
• The Virtual Machine era
• The Container / Docker era
• “Public cloud” clusters
• Facebook and the Open Data Center
• Appliance Computing
• Resources to learn more
Why clusters? And what’s the big deal?
• Mainframe costs, even today
• Individual server performance and Moore’s Law
• Networking + computers + “cluster software” = often vast power
• What do we do with these 3-5 year old computers on a 7-10 year budget
cycle?
• Sony PlayStations, Apple XServes, Raspberry Pi
• Operating systems (usually) no longer as expensive as the computer
Universities, government agencies, companies,
and basements near you…
• They got us started…
• NASA Beowulf (you may be using a BSD/Linux Ethernet driver based on
Donald J. Becker’s work at NASA!)
• NSA fed back scalability ideas (!!), early adopter
• Older operating systems: Tandem, Digital VAX/VMS & OpenVMS, Some UNIX,
Microsoft Windows Server Clusters
• Universities world wide – open source contributions
• Military projects
• Basement clusters run by grad & undergrad students
• LucasFilm and related special effects firms
• MASSIVE (Peter Jackson/WETA Digital!) – got us into 10GbE message passing
What do they do?
• Scientific and engineering computing – the start of it all
• Render farms – special effects for movies, TV, commercials, games, live
TV and sports overlays…
• Media conversion (YouTube!)
• Web services, E-Mail at scale
• Bitcoin and other cryptocurrencies
• Databases, “Big Data”
• Scale-out storage (EMC Isilon is an InfiniBand cluster!)
• Building and testing software (my workplace)!
• Social media (combining a lot of the above)
• Cracking passwords, encryption
• Neural networks / expert systems / IBM WATSON
Some of the largest clusters are…
• Tens to hundreds of thousands of cores
• NSA (probably), along with other governments’ security arms
• Other classified installations
• CERN
• Research labs (NCSA near Chicago is one)
• Public clouds (Google, Amazon, Microsoft, Rackspace, IBM, others)
• Thousands to tens of thousands of cores
• Square Kilometer Array (Australia / South Africa, just got back from there)
• Weather forecasting
• Japan’s Earth project (early 2000’s)
• Render “farms”
• Large organizations (corporate, universities, “smaller” public cloud providers)
• Small businesses often have dozens to hundreds of cores, and may not
realize it if leasing private and/or public cloud services!
10,000 hands working in the space of a living room
“Cluster programming is a lot like putting a large puzzle together with 10,000 hands in the space of a living room, keeping them in sync”
- Altair developer when I reported a memory leak
Software development complexities & architecture
• Message passing (MPI) libraries achieve huge scales
• Shared memory with proprietary interconnect (Some Cray, NEC, SGI
Altix)
• Process Migration (LinuxPMI, OpenMOSIX, some Cray, NEC, SGI Altix
UV)
• systemd (with cgroups) is really nice on clusters: parallel daemon
startup reduces boot and restaging latency, and unit files reduce shell
script complexity
• Ansible, Salt, and other configuration automation tools for sysadmin
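The message-passing style mentioned above can be sketched with nothing but the Python standard library. This is only an analogy, not real MPI (a production code would use an MPI library over InfiniBand or Ethernet); the function names here are hypothetical.

```python
# Illustrative scatter/reduce in the style of MPI, using only the Python
# standard library. Real cluster codes would use an MPI implementation;
# this just shows the pattern: split the data, compute partial results
# in parallel "ranks", then combine them.
from multiprocessing import Process, Queue

def worker(rank, chunk, results):
    # Each "rank" computes a partial result on its own slice of the data.
    results.put((rank, sum(chunk)))

def scatter_reduce(data, nranks=4):
    results = Queue()
    # Scatter: strided slices partition the data across the ranks.
    chunks = [data[i::nranks] for i in range(nranks)]
    procs = [Process(target=worker, args=(r, chunks[r], results))
             for r in range(nranks)]
    for p in procs:
        p.start()
    # Reduce: combine one partial result per rank.
    total = sum(results.get()[1] for _ in procs)
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(scatter_reduce(list(range(1000))))
```

The same scatter/compute/reduce shape is what MPI programs do across thousands of nodes, with the fabric carrying the messages instead of in-host pipes.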
“Private Cloud”
• Internal use clusters
• Sometimes accessible via remote access, Virtual Private Networks
• “Secret sauce” behind internal tools, some of which now have public
cloud front ends
• Requires a forging of networking, storage, and computing teams
• Oracle 10g databases were often IT’s first exposure to clustering
• Scalable internal storage (EMC Isilon, ExaGrid, HP 3PAR, Ceph, etc.)
High Availability Private Cluster Block Diagram
Firewall
• Protects often unpatched cluster software and firmware
• Load balancer
• Remote access
Head Nodes
• 1
• 2
• Authentication, Scheduling, Staging, Reloading, Push notifications, Periodic Check-pointing
Switch Fabrics
• 1
• 2
• InfiniBand, 1/10/40/100Gb Ethernet, Proprietary (Cray!)
Execution Nodes
• 1 … N
• Local storage, local “scratch”
Shared Storage Pools
• Staging
• Check-points
640-core half-rack SuperMicro
TwinBlade chassis w/ 100TB usable
storage, QDR InfiniBand, ~9 kW
2× this for high availability
Altair’s Internal Clusters
• We use PBS Professional for all (it is our product!)
• HyperWorks Unlimited – “cluster in a box” – many around the world,
hundreds to 2048 cores, single rack or virtual clusters in public clouds
• Legacy “E-Compute” & Compute Manager (newer) – several clusters of
a few hundred cores each
• HyperWorks – several hundred cores, Windows, Linux, Mac - Michigan
and India, 80+ compilations (400K+ files/each), thousands of tests daily
• Test clusters – 128-256 cores, often restaged, scrounged older hardware
A regular cluster (or a basement one!)
Head Node
• Authentication, Scheduling, Staging, (Reloading, Push notifications, Periodic Check-pointing)
Cluster fabric(s)
• Ethernet switch
• InfiniBand switch
• Storage Area Network
Execution Nodes
• 1 … N (could be varying hardware)
• Local storage (maybe!)
Shared Storage Pools
• Staging
• Checkpoints (maybe!)
• Could be FreeNAS, Lustre, Isilon…
Could well ALL be running on a single
virtual machine hypervisor for dev &
test!
An Engineer’s Patience
96 core job running on part of the cluster from the previous slide:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
182651.XXXXXXXX YYYYYYYY radioss- ZZZZZZZZZZ 29362 8 96 -- -- R 36:41
node006/0*12+node007/0*12+node009/0*12+node010/0*12
+node011/0*12+node012/0*12+node013/0*12+node014/0*12
Without oversubscription, that cluster may run ten 96-core jobs at once.
Most jobs on it run longer than a day – some for a couple weeks.
We are very paranoid when someone opens the cabinet doors on it…
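A toy sketch of how a scheduler might place a 96-core job onto 12-core nodes like the ones in the qstat output above. This is hypothetical first-fit allocation only; a real scheduler such as PBS Professional also weighs memory, walltime, fairshare, and queue policy.

```python
# Minimal first-fit node allocation sketch (hypothetical).
# `nodes` maps node name -> free cores; a successful allocation
# decrements the chosen nodes, and no node is ever oversubscribed.
def allocate(job_cores, nodes):
    need = job_cores
    picked = {}
    for name, free in nodes.items():
        if need == 0:
            break
        take = min(free, need)
        if take > 0:
            picked[name] = take
            need -= take
    if need > 0:
        return None  # does not fit: the job waits in the queue
    for name, take in picked.items():
        nodes[name] -= take
    return picked

# Eight 12-core nodes, as in the qstat listing above.
cluster = {f"node{n:03d}": 12 for n in range(1, 9)}
print(allocate(96, cluster))  # spans all eight nodes, 12 cores each
print(allocate(1, cluster))   # None: the cluster is now full
```

Once ten such jobs hold all the cores, the eleventh waits, which is why long-running jobs make everyone nervous about open cabinet doors.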
The Fabric – cluster scaling and speed
• InfiniBand (10-56Gb/s, low latency)
• Myrinet (obsolete, fiber-optic)
• PCIe
• Ethernet (10GbE/40GbE/100GbE)
• Proprietary (CrayLink and others)
• Virtual network switches
256-core SGI half rack, QDR InfiniBand, Nvidia GPUs,
Ethernet 1Gb/s mgmt, no HA. Surprisingly quiet in full use!
Storage
• Varying needs = varying capacities (Computational Fluid Dynamics/CFD,
“crash”, chemistry, optimization, Bitcoin, hash cracking…)
• Cluster storage is HARD, especially scale out – “Big Data” approaches are
not good back-end storage for scientific/engineering computing (yet)
• Reliability – high availability is often more than 2× the cost
• Local storage limits (blades, enterprise SSD, 2.5” HDD)
• Spinning it down when portions idle = complex
Management
• Staging the nodes – potentially thousands during install and upgrades
• Herding cats = scheduling different user communities’ requirements
Failures and recovery
• Staging jobs in/out – a CFD project may be 1TB+ of output * 200 jobs
• Push notifications, “Is it done yet?”
• Portals
• Continuous resource monitoring
• Check-pointing
• Energy efficiency
When it breaks
• Nodes will fail
• We have hardware failures every week, bigger clusters may have hourly
failures or even more
• Check-pointing = costly in storage and processing time, see
http://www.csm.ornl.gov/~engelman/publications/wang10hybrid2.pdf
• Restoring from a checkpoint may be unreliable
• Restaging
• Job migration
• Jeff’s “I meant to type 11 and typed 1” glitch
• The dreaded faulty InfiniBand cable
• “If you monitor me, my job slows down!”
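Application-level check-pointing, mentioned above as costly but necessary, can be sketched in a few lines. This is a hypothetical illustration (real HPC codes typically write their own restart files or use checkpoint libraries); the function names are mine, and the atomic rename is the key detail: a node failure mid-write must never corrupt the last good checkpoint.

```python
# Sketch of periodic application check-pointing with safe restart.
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    # Write to a temp file, then atomically rename over the old
    # checkpoint, so an interrupted write never leaves a corrupt file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def run(path, steps=100, every=10):
    state = {"step": 0, "acc": 0}
    if os.path.exists(path):            # restarting after a failure:
        with open(path, "rb") as f:     # resume from the last checkpoint
            state = pickle.load(f)
    while state["step"] < steps:
        state["acc"] += state["step"]   # stand-in for real computation
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

The `every` interval is the trade-off the slide describes: checkpoint too often and you burn storage bandwidth and compute time; too rarely and a failure throws away hours of work.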
The Virtual Machine Cluster
• Great way to demo cluster software, Ansible/Salt, etc.
• SIMH & OpenVMS (Jeff’s VMS cluster on a Surface Pro 3 tablet)
• Multics may now be emulated, see http://multicians.org/
• Virtual network switches work great on multi-core hosts
• “Pull” the virtual network cable, see if the storage busts
• Test your upgrades
• Learn without spending $50,000+
• Hypervisors add I/O latency
• Fabric support limited
• = Scalability limited
The Container / Docker – More than a fad
• Famous “Pets” vs “Livestock” (some call “Cattle”) argument for
application design
• Single operating system per host, operating system ensures containers
are sandboxed from each other AND they have cluster fabric access!
• Multiple containers (load balancer + web server + app server + database
server + log server) may be spun up and scaled with appropriate app
design
• Still have to patch the containers if there are vulnerabilities inside!
Ansible, etc. useful!
“I’m out of oomph” -> BURSTING
• “Promise” of the Public Cloud
• Credit card financed computing
• Possibly loosely coupled
• Fabric compromises
• Getting better!
Internal cluster → VPN → Amazon AWS/Microsoft Azure
[Diagram: cloud execution nodes, cloud fabric, cloud storage]
Spread out clusters
• May be in the “Public Cloud” or at multiple “Private Cloud” sites
(research centers, remote data centers, leased private capacity)
• Redundancy – Hadoop and derivatives quickly copy object data and
store archival copies, etc.
• Scalability, 100Gb/s inter-data-center links now common
• Lots of “dark fiber” available for leasing
• Watch out for latency sensitive implementations
Facebook and Open Compute Project
• Mainly useful for big organizations
• Power efficiency, reduce impact
• Shared power supplies
• Optimized cooling
• Storage & node spin-down
• Designed to fail and be easily serviceable
• Quick upgrades
• Scalability beyond conventional designs
• Might slow commodity server price drops as volumes decrease
• http://www.opencompute.org/
Appliances and Platform as a Service (PaaS)
• “Cluster in a box” (well, racks!) or cloud
• Bursting
• Project-based computing
• Nimble
• Geek skills embedded
• Easy portal / front ends
Where do we go from here?
• Public library access to Lynda.com – Amazon AWS & Microsoft Azure
“Up and Running” courses
• SIMH hobbyist OpenVMS cluster: https://vanalboom.org/node/18
• OpenStack on virtual machines: http://www.openstack.org/ and
http://docs.openstack.org/developer/devstack/#quick-start
• Example appliance: http://www.altair.com/hwul/
• PBS Professional, IBM LSF, Grid Engine, other cluster mgmt. software
• OpenStack Ceph scalable block storage: http://ceph.com/
• Lustre storage free software: http://wiki.lustre.org/
Aside from security, the ability to build and maintain private and public
cluster systems is near the top of the pay scale in IT!