
A 1.7 Petaflops Warm-Water-Cooled System: Operational Experiences and Scientific Results

Łukasz Flis, Karol Krawentek, Marek Magryś

ACC Cyfronet AGH-UST

• established in 1973
• part of AGH University of Science and Technology in Krakow, Poland
• provides free computing resources for scientific institutions
• centre of competence in HPC and Grid Computing
• IT service management expertise (ITIL, ISO 20k)
• member of PIONIER consortium
• operator of Krakow MAN
• home for supercomputers

International projects

PL-Grid infrastructure

• Polish national IT infrastructure supporting e-Science
– based upon the resources of the most powerful academic computing centres
– compatible and interoperable with the European Grid
– offering grid and cloud computing paradigms
– coordinated by Cyfronet
• Benefits for users
– unified infrastructure built from 5 separate compute centres
– unified access to software, compute and storage resources
– non-trivial quality of service
• Challenges
– unified monitoring, accounting and security
– creating an environment of cooperation rather than competition
• Federation – the key to success

PLGrid Core project

Competence Centre in the Field of Distributed Computing Grid Infrastructures

• Duration: 01.01.2014 – 30.11.2015
• Project Coordinator: Academic Computer Centre CYFRONET AGH

The main objective of the project is to support the development of ACC Cyfronet AGH as a specialized competence centre in the field of distributed computing infrastructures, with particular emphasis on grid technologies, cloud computing and infrastructures supporting computations on big data.

ZEUS

374 TFLOPS, #211 on the TOP500 list, #1 in Poland

Zeus usage

[Pie chart: Zeus usage by scientific field – chemistry, physics, medicine, technical sciences, astronomy, biology, computer science, electronics and telecommunications, metallurgy, mathematics, other; the three largest shares are 44.84%, 41.45% and 7.87%]

Why upgrade?

• Job size growth
• Users hate waiting for resources
• New projects, new requirements
• Follow the advances in HPC
• Power costs

New building

Requirements for the new system

• Petascale system
• Low TCO
• Energy efficiency
• Density
• Expandability
• Good MTBF
• Hardware:
– core count
– memory size
– network topology
– storage

Requirements: Liquid Cooling

• Water: up to 1000x more efficient heat exchange than air
• Less energy needed to move the coolant (rough comparison below)
• Hardware (CPUs, DIMMs) can handle ~80°C
• Challenge: cool 100% of HW with liquid
– network switches
– PSUs
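As an illustrative comparison (the numbers below are assumptions, not figures from the slides): removing 1 kW of heat at a coolant temperature rise of 10 K requires a mass flow of $\dot{m} = Q/(c_p \Delta T)$.

$$\dot{m}_{\mathrm{water}} = \frac{1000\,\mathrm{W}}{4186\,\mathrm{J\,kg^{-1}K^{-1}} \cdot 10\,\mathrm{K}} \approx 0.024\,\mathrm{kg/s} \approx 1.4\,\mathrm{L/min}, \qquad \dot{m}_{\mathrm{air}} = \frac{1000\,\mathrm{W}}{1005\,\mathrm{J\,kg^{-1}K^{-1}} \cdot 10\,\mathrm{K}} \approx 0.10\,\mathrm{kg/s} \approx 5000\,\mathrm{L/min}.$$

Moving roughly three orders of magnitude less coolant volume is why pumps cost far less energy to run than fans.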

Requirements: MTBF

• The less movement the better
– fewer pumps
– fewer fans
– fewer HDDs
• Example (see the calculation below)
– pump MTBF: 50,000 hrs
– fan MTBF: 50,000 hrs
– 1800-node system MTBF: 7 hrs
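The 7-hour figure is the usual series-system approximation: with N identical moving parts, each with MTBF M, some part fails on average every M/N hours. Assuming on the order of four pumps/fans per node (an assumption, not stated on the slide):

$$\mathrm{MTBF}_{\mathrm{system}} \approx \frac{M}{N} = \frac{50\,000\,\mathrm{h}}{1800 \times 4} \approx 7\,\mathrm{h}.$$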

Requirements: Compute

• Max job size ~10k cores
• Fastest CPUs, but compatible with old codes
– Two-socket nodes
– No accelerators at this point
• Newest memory
– At least 4 GB/core
• Fast interconnect
– InfiniBand FDR
– No need for a full CBB fat tree

Requirements: Topology

[Topology diagram: a service island containing the service and storage nodes, and compute islands of 576 nodes each, all connected through the core InfiniBand switches]

Why Apollo 8000?

• Most energy efficient
• The only solution with 100% warm-water cooling
• Highest density
• Lowest TCO

Even more Apollo

• Focuses also on the ‘1’ in PUE! (see the definition below)
– Power distribution
– Fewer fans
– Detailed monitoring
• ‘Energy to solution’
• Dry node maintenance
• Fewer cables
• Prefabricated piping
• Simplified management
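For reference, the standard definition (not from the slides): PUE is the ratio of total facility power to IT power, so improving the ‘1’ means reducing power that is already counted as IT load, such as node fans and PSU losses.

$$\mathrm{PUE} = \frac{P_{\mathrm{facility}}}{P_{\mathrm{IT}}}, \qquad \mathrm{PUE} < 1.05 \;\Rightarrow\; \text{cooling and distribution overhead below } 5\% \text{ of the IT load}.$$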

Prometheus

• HP Apollo 8000
• 13 m², 15 racks (3 CDU, 12 compute)
• 1.65 PFLOPS
• PUE < 1.05, 680 kW peak power
• 1728 nodes, Intel Haswell E5-2680v3
• 41472 cores, 13824 per island (quick consistency check below)
• 216 TB DDR4 RAM
• System prepared for expansion
• CentOS 7
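These figures are mutually consistent, assuming the published E5-2680v3 parameters (12 cores per socket, 2.5 GHz base clock, 16 double-precision FLOP/cycle with AVX2 FMA) and 128 GiB per node:

$$1728 \times 2 \times 12 = 41472\ \text{cores}, \qquad 576 \times 24 = 13824\ \text{cores per island},$$
$$41472 \times 2.5\,\mathrm{GHz} \times 16\,\tfrac{\mathrm{FLOP}}{\mathrm{cycle}} \approx 1.66\ \mathrm{PFLOPS}, \qquad 1728 \times 128\,\mathrm{GiB} = 216\ \mathrm{TiB}.$$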

Prometheus storage

• Diskless compute nodes
• Separate tender for storage
– Lustre-based
– 2 file systems:
• Scratch: 120 GB/s, 5 PB usable space
• Archive: 60 GB/s, 5 PB usable space
– HSM-ready
• NFS for home directories and software

Deployment timeline

• Day 0 - Contract signed (20.10.2014)
• Day 23 - Installation of the primary loop starts
• Day 35 - First delivery (service island)
• Day 56 - Apollo piping arrives
• Day 98 - 1st and 2nd islands delivered
• Day 101 - 3rd island delivered
• Day 111 - Basic acceptance ends
• Official launch event on 27.04.2015

Facility preparation

• Primary loop installation took 5 weeks
• Secondary (prefabricated) loop took just 1 week
• Upgrade of the raised floor done "just in case"
• Additional pipes for leakage/condensation drain
• Water dam with emergency drain
• A lot of space needed for the hardware deliveries (over 100 pallets)

Secondary loop

Challenges

• Power infrastructure being built in parallel
• Boot over InfiniBand
– UEFI, high-frequency port flapping
– OpenSM overloaded with port events
• BIOS settings being lost occasionally
• Node location in APM is tricky
• 5 dead IB cables (2‰)
• 8 broken nodes (4‰)
• 24h work during the weekend

Solutions

• Boot to RAM over IB, image distribution over HTTP
– Whole machine boots up in 10 min with just 1 boot server
• Hostname/IP generator based on a MAC collector (see the sketch after this list)
– Data automatically collected from APM and iLO
• Graphical monitoring of power, temperature and network traffic
– SNMP data source
– GUI allows easy problem location
– Now synced with SLURM
• Spectacular iLO LED blinking system developed for the official launch
• 24h work during the weekend
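A minimal sketch of what such a generator could look like, assuming the MAC-to-location records have already been collected from APM/iLO into a simple list; every name, address range and naming scheme below is hypothetical, not Cyfronet's actual tooling:

```python
import ipaddress

# Hypothetical records collected from APM/iLO: (mac, rack, chassis, slot)
collected = [
    ("9c:dc:71:00:00:02", 1, 1, 2),
    ("9c:dc:71:00:00:01", 1, 1, 1),
]

MGMT_NET = ipaddress.ip_network("10.1.0.0/16")  # assumed management network


def assign(records):
    """Derive a deterministic hostname and IP for each node from its physical location."""
    mac_map, hosts_lines = {}, []
    for mac, rack, chassis, slot in sorted(records, key=lambda r: r[1:]):
        hostname = f"p{rack:02d}c{chassis:02d}n{slot:02d}"        # e.g. p01c01n01
        ip = MGMT_NET.network_address + 256 * rack + 16 * chassis + slot
        mac_map[mac.lower()] = (hostname, str(ip))
        hosts_lines.append(f"{ip}\t{hostname}")
    return mac_map, hosts_lines


mac_map, hosts_lines = assign(collected)
print("\n".join(hosts_lines))   # fragment ready for /etc/hosts or a DHCP reservation file
```

Deriving the identity from the physical location rather than from the MAC itself keeps a node's name and address stable across board swaps; the MAC map is only needed to hand the right identity to the right DHCP/boot request.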

System expansion

• Prometheus expansion already ordered
• 4th island
– 432 regular nodes (2 CPUs, 128 GB RAM)
– 72 nodes with GPGPUs (2x Nvidia Tesla K40XL)
• Installation to begin in September
• 2.4 PFLOPS total performance (Rpeak)
• 2232 nodes, 53568 CPU cores, 279 TB RAM (arithmetic below)
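The totals are the three original islands plus the fourth, assuming 128 GiB per node across the whole machine (the slides state 128 GB only for the new regular nodes):

$$1728 + 432 + 72 = 2232\ \text{nodes}, \qquad 2232 \times 24 = 53568\ \text{CPU cores}, \qquad 2232 \times 128\,\mathrm{GiB} = 279\ \mathrm{TiB}.$$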

Future plans

• Push the system to its limits
• Further improvements of the monitoring tools
• Continue to move users from the previous system
• Detailed energy and temperature monitoring
• Energy-aware scheduling
• Survive the summer and measure performance
• Collect the annual energy and PUE figures
• HP-CAST 25 presentation?