Innovation Intelligence®
Michigan!/usr/group:
Compute Clusters—Building Blocks of the
Public Cloud
Jeff Marraccini, Vice President, Computer Systems
August 2015
About me, and yes, the disclaimer
• Work at Altair Engineering in Troy for 18 years
• My team manages a number of clusters
• I manage staff that handle our internal
clusters in a 2,300+ employee company, so:
My employer may not agree with all my
opinions – they are my own. I am also
a generalist. Check with others before
spending money on a cluster.
• Like many things, there are no “one size
fits all” solutions with HPC! Please research!
[Diagram: cluster fabric connecting Head Node, Storage, Exec Nodes, and Visualization Nodes]
Thank you, and seeing this stuff for real
Michigan!/usr/group contributed to my career – thank you!
Past and present members contributed to tools we use daily. Preaching to
the choir here: knowledge exchange empowers us all.
Tours:
I cannot show too much live while we are recording.
I would be glad to give you a tour if you are in the Troy, MI area – please
message me at [email protected]. Must agree not to reveal operational
specifics.
Overview of today’s talk
• Why clusters?
• Some history
• “Private cloud” clusters
• Architecture
• Failures
• The Virtual Machine era
• The Container / Docker era
• “Public cloud” clusters
• Facebook and the Open Data Center
• Appliance Computing
• Resources to learn more
Why clusters? And what’s the big deal?
• Mainframe costs, even today
• Individual server performance and Moore’s Law
• Networking + computers + “cluster software” = often vast power
• What do we do with these 3-5 year old computers on a 7-10 year budget
cycle?
• Sony PlayStations, Apple XServes, Raspberry Pi
• Operating systems (usually) no longer as expensive as the computer
Universities, government agencies, companies,
and basements near you…
• They got us started…
• NASA Beowulf (you may be using a BSD/Linux Ethernet driver based on
Donald J. Becker’s work at NASA!)
• NSA fed back scalability ideas (!!), early adopter
• Older operating systems: Tandem, Digital VAX/VMS & OpenVMS, Some UNIX,
Microsoft Windows Server Clusters
• Universities world wide – open source contributions
• Military projects
• Basement clusters run by grad & undergrad students
• LucasFilm and related special effects firms
• MASSIVE (Peter Jackson/WETA Digital!) – got us into 10GbE message passing
What do they do?
• Scientific and engineering computing – the start of it all
• Render farms – special effects for movies, TV, commercials, games, live
TV and sports overlays…
• Media conversion (YouTube!)
• Web services, E-Mail at scale
• Bitcoin and other cryptocurrencies
• Databases, “Big Data”
• Scale-out storage (EMC Isilon is an InfiniBand cluster!)
• Building and testing software (my workplace)!
• Social media (combining a lot of the above)
• Cracking passwords, encryption
• Neural networks / expert systems / IBM WATSON
Some of the largest clusters are…
• Tens to hundreds of thousands of cores
• NSA (probably), along with other governments’ security arms
• Other classified installations
• CERN
• Research labs (NCSA near Chicago is one)
• Public clouds (Google, Amazon, Microsoft, Rackspace, IBM, others)
• Thousands to tens of thousands of cores
• Square Kilometer Array (Australia / South Africa, just got back from there)
• Weather forecasting
• Japan’s Earth project (early 2000’s)
• Render “farms”
• Large organizations (corporate, universities, “smaller” public cloud providers)
• Small businesses often have dozens to hundreds of cores, and may not
realize it if leasing private and/or public cloud services!
10,000 hands working in the space of a living room
“Cluster programming is a lot like putting a large puzzle together with 10,000 hands in the space of a living room, keeping them in sync”
- Altair developer when I reported a memory leak
Software development complexities & architecture
• Message passing (MPI) libraries achieve huge scales
• Shared memory with proprietary interconnect (Some Cray, NEC, SGI
Altix)
• Process Migration (LinuxPMI, OpenMOSIX, some Cray, NEC, SGI Altix
UV)
• systemd (with cgroups) is really nice on clusters: parallel daemon
startup reduces boot and restaging latency, and unit files reduce shell
script complexity
• Ansible, Salt, and other configuration automation tools for sysadmin
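The message-passing style mentioned above can be sketched with nothing but the Python standard library. This is only an analogy, not real MPI (a production code would use an MPI library over InfiniBand or Ethernet); the function names here are hypothetical.

```python
# Illustrative scatter/reduce in the style of MPI, using only the Python
# standard library. Real cluster codes would use an MPI implementation;
# this just shows the pattern: split the data, compute partial results
# in parallel "ranks", then combine them.
from multiprocessing import Process, Queue

def worker(rank, chunk, results):
    # Each "rank" computes a partial result on its own slice of the data.
    results.put((rank, sum(chunk)))

def scatter_reduce(data, nranks=4):
    results = Queue()
    # Scatter: strided slices partition the data across the ranks.
    chunks = [data[i::nranks] for i in range(nranks)]
    procs = [Process(target=worker, args=(r, chunks[r], results))
             for r in range(nranks)]
    for p in procs:
        p.start()
    # Reduce: combine one partial result per rank.
    total = sum(results.get()[1] for _ in procs)
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(scatter_reduce(list(range(1000))))
```

The same scatter/compute/reduce shape is what MPI programs do across thousands of nodes, with the fabric carrying the messages instead of in-host pipes.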
“Private Cloud”
• Internal use clusters
• Sometimes accessible via remote access, Virtual Private Networks
• “Secret sauce” behind internal tools, some of which now have public
cloud front ends
• Requires a forging of networking, storage, and computing teams
• Oracle 10g databases were often IT’s first exposure to clustering
• Scalable internal storage (EMC Isilon, ExaGrid, HP 3PAR, Ceph, etc.)
High Availability Private Cluster Block Diagram
Firewall
• Protects often unpatched cluster software and firmware
• Load balancer
• Remote access
Head Nodes
• 1
• 2
• Authentication, Scheduling, Staging, Reloading, Push notifications, Periodic Check-pointing
Switch Fabrics
• 1
• 2
• InfiniBand, 1/10/40/100Gb Ethernet, Proprietary (Cray!)
Execution Nodes
• 1 … N
• Local storage, local “scratch”
Shared Storage Pools
• Staging
• Check-points
640-core half-rack SuperMicro
TwinBlade chassis w/ 100TB usable
storage, QDR InfiniBand, ~9 kW
2× this for high availability
Altair’s Internal Clusters
• We use PBS Professional for all (it is our product!)
• HyperWorks Unlimited – “cluster in a box” – many around the world,
hundreds to 2048 cores, single rack or virtual clusters in public clouds
• Legacy “E-Compute” & Compute Manager (newer) – several clusters of
a few hundred cores each
• HyperWorks – several hundred cores, Windows, Linux, Mac - Michigan
and India, 80+ compilations (400K+ files/each), thousands of tests daily
• Test clusters – 128-256 cores, often restaged, scrounged older hardware
A regular cluster (or a basement one!)
Head Node
• Authentication, Scheduling, Staging, (Reloading, Push notifications, Periodic Check-pointing)
Cluster fabric(s)
• Ethernet switch
• InfiniBand switch
• Storage Area Network
Execution Nodes
• 1 … N (could be varying hardware)
• Local storage (maybe!)
Shared Storage Pools
• Staging
• Checkpoints (maybe!)
• Could be FreeNAS, Lustre, Isilon…
Could well ALL be running on a single
virtual machine hypervisor for dev &
test!
An Engineer’s Patience
96 core job running on part of the cluster from the previous slide:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
182651.XXXXXXXX YYYYYYYY radioss- ZZZZZZZZZZ 29362 8 96 -- -- R 36:41
node006/0*12+node007/0*12+node009/0*12+node010/0*12
+node011/0*12+node012/0*12+node013/0*12+node014/0*12
Without oversubscription, that cluster may run ten 96-core jobs at once.
Most jobs on it run longer than a day – some for a couple weeks.
We are very paranoid when someone opens the cabinet doors on it…
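A toy sketch of how a scheduler might place a 96-core job onto 12-core nodes like the ones in the qstat output above. This is hypothetical first-fit allocation only; a real scheduler such as PBS Professional also weighs memory, walltime, fairshare, and queue policy.

```python
# Minimal first-fit node allocation sketch (hypothetical).
# `nodes` maps node name -> free cores; a successful allocation
# decrements the chosen nodes, and no node is ever oversubscribed.
def allocate(job_cores, nodes):
    need = job_cores
    picked = {}
    for name, free in nodes.items():
        if need == 0:
            break
        take = min(free, need)
        if take > 0:
            picked[name] = take
            need -= take
    if need > 0:
        return None  # does not fit: the job waits in the queue
    for name, take in picked.items():
        nodes[name] -= take
    return picked

# Eight 12-core nodes, as in the qstat listing above.
cluster = {f"node{n:03d}": 12 for n in range(1, 9)}
print(allocate(96, cluster))  # spans all eight nodes, 12 cores each
print(allocate(1, cluster))   # None: the cluster is now full
```

Once ten such jobs hold all the cores, the eleventh waits, which is why long-running jobs make everyone nervous about open cabinet doors.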
The Fabric – cluster scaling and speed
• InfiniBand (10-56Gb/s, low latency)
• Myrinet (obsolete, fiber-optic)
• PCIe
• Ethernet (10GbE/40GbE/100GbE)
• Proprietary (CrayLink and others)
• Virtual network switches
256-core SGI half rack, QDR InfiniBand, Nvidia GPUs,
Ethernet 1Gb/s mgmt, no HA. Surprisingly quiet in full use!
Storage
• Varying needs = varying capacities (Computational Fluid Dynamics/CFD,
“crash”, chemistry, optimization, Bitcoin, hash cracking…)
• Cluster storage is HARD, especially scale out – “Big Data” approaches are
not good back-end storage for scientific/engineering computing (yet)
• Reliability – high availability is often more than 2× the cost
• Local storage limits (blades, enterprise SSD, 2.5” HDD)
• Spinning it down when portions idle = complex
Management
• Staging the nodes – potentially thousands during install and upgrades
• Herding cats = scheduling different user communities’ requirements
Failures and recovery
• Staging jobs in/out – a CFD project may be 1TB+ of output * 200 jobs
• Push notifications, “Is it done yet?”
• Portals
• Continuous resource monitoring
• Check-pointing
• Energy efficiency
When it breaks
• Nodes will fail
• We have hardware failures every week, bigger clusters may have hourly
failures or even more
• Check-pointing = costly in storage and processing time, see
http://www.csm.ornl.gov/~engelman/publications/wang10hybrid2.pdf
• Restoring from a checkpoint may be unreliable
• Restaging
• Job migration
• Jeff’s “I meant to type 11 and typed 1” glitch
• The dreaded faulty InfiniBand cable
• “If you monitor me, my job slows down!”
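Application-level check-pointing, mentioned above as costly but necessary, can be sketched in a few lines. This is a hypothetical illustration (real HPC codes typically write their own restart files or use checkpoint libraries); the function names are mine, and the atomic rename is the key detail: a node failure mid-write must never corrupt the last good checkpoint.

```python
# Sketch of periodic application check-pointing with safe restart.
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    # Write to a temp file, then atomically rename over the old
    # checkpoint, so an interrupted write never leaves a corrupt file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def run(path, steps=100, every=10):
    state = {"step": 0, "acc": 0}
    if os.path.exists(path):            # restarting after a failure:
        with open(path, "rb") as f:     # resume from the last checkpoint
            state = pickle.load(f)
    while state["step"] < steps:
        state["acc"] += state["step"]   # stand-in for real computation
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

The `every` interval is the trade-off the slide describes: checkpoint too often and you burn storage bandwidth and compute time; too rarely and a failure throws away hours of work.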
The Virtual Machine Cluster
• Great way to demo cluster software, Ansible/Salt, etc.
• SIMH & OpenVMS (Jeff’s VMS cluster on a Surface Pro 3 tablet)
• Multics may now be emulated, see http://multicians.org/
• Virtual network switches work great on multi-core hosts
• “Pull” the virtual network cable, see if the storage busts
• Test your upgrades
• Learn without spending $50,000+
• Hypervisors add I/O latency
• Fabric support limited
• = Scalability limited
The Container / Docker – More than a fad
• Famous “Pets” vs “Livestock” (some call “Cattle”) argument for
application design
• Single operating system per host, operating system ensures containers
are sandboxed from each other AND they have cluster fabric access!
• Multiple containers (load balancer + web server + app server + database
server + log server) may be spun up and scaled with appropriate app
design
• Still have to patch the containers if there are vulnerabilities inside!
Ansible, etc. useful!
“I’m out of oomph” -> BURSTING
• “Promise” of the Public Cloud
• Credit card financed computing
• Possibly loosely coupled
• Fabric compromises
• Getting better!
Internal cluster → VPN → Amazon AWS/Microsoft Azure
[Diagram: cloud execution nodes, cloud fabric, cloud storage]
Spread out clusters
• May be in the “Public Cloud” or at multiple “Private Cloud” sites
(research centers, remote data centers, leased private capacity)
• Redundancy – Hadoop and derivatives quickly copy object data and
store archival copies, etc.
• Scalability, 100Gb/s inter-data-center links now common
• Lots of “dark fiber” available for leasing
• Watch out for latency sensitive implementations
Facebook and Open Compute Project
• Mainly useful for big organizations
• Power efficiency, reduce impact
• Shared power supplies
• Optimized cooling
• Storage & node spin-down
• Designed to fail and be easily serviceable
• Quick upgrades
• Scalability beyond conventional designs
• Might slow commodity server price drops as volumes decrease
• http://www.opencompute.org/
Appliances and Platform as a Service (PaaS)
• “Cluster in a box” (well, racks!) or cloud
• Bursting
• Project-based computing
• Nimble
• Geek skills embedded
• Easy portal / front ends
Where do we go from here?
• Public library access to Lynda.com – Amazon AWS & Microsoft Azure
“Up and Running” courses
• SIMH hobbyist OpenVMS cluster: https://vanalboom.org/node/18
• OpenStack on virtual machines: http://www.openstack.org/ and
http://docs.openstack.org/developer/devstack/#quick-start
• Example appliance: http://www.altair.com/hwul/
• PBS Professional, IBM LSF, Grid Engine, other cluster mgmt. software
• OpenStack Ceph scalable block storage: http://ceph.com/
• Lustre storage free software: http://wiki.lustre.org/
Aside from security, the ability to build and maintain private and public
cluster systems is near the top of the pay scale in IT!