Tier 2 Computer Centres (transcript, 2017-11-01)

Page 1

Tier 2 Computer Centres

www.hpc-uk.ac.uk

CSD3 Cambridge Service for Data Driven Discovery

Page 2

Tier 2 Computer Centres: a community resource, founded on cooperation and collaboration

Each centre will give a short introduction covering (some of):
• USP
• Contact Details
• Hardware
• Access Mechanisms
• RSE Support

Open Access Call – 12th Oct (Technical Assessment – 21st Sep) https://www.epsrc.ac.uk/funding/calls/tier2openaccess/

Page 3

Andy Turner, [email protected]

Page 4

280-node HPE (SGI) ICE XA:
• 10,080 cores (2 x 18-core Xeon per node)
• 128 GiB memory per node
• DDN Lustre file system
• Single-rail FDR InfiniBand hypercube

1.9 PiB Tier-2 Data Facility:
• DDN Web Object Scaler (WOS) appliances
• Link to other Tier-1/2 facilities

[Photo: Callum Bennetts / Maverick Photography]

Simple access routes:
• Free Instant Access for testing
• (Driving Test access coming soon)
• EPSRC RAP: Open Access Call

http://www.cirrus.ac.uk

Page 5

Cirrus RSE Support

User Support

• Freely available to all users from any institution

• Provided by EPCC experts in a wide range of areas

• Easily accessed through helpdesk: just ask for the help you need

• Help provided directly to researchers or to RSEs working with researchers

Technical Projects

• Explore new technologies, software, tools

• Add new capabilities to Cirrus

• Benchmark and profile commonly used applications

• Work with the user community and other RSEs

Keen to work with RSEs at other institutions to help them support local users on Cirrus

Page 6

http://gw4.ac.uk/isambard

James Price, University of Bristol ([email protected])

Page 7

The System
• Exploring Arm processor technology
• Provided by Cray
• 10,000+ ARMv8 cores
• Cray software tools
  • Compiler, math libraries, tools...
• Technology comparison:
  • x86, Xeon Phi (KNL), NVIDIA P100 GPUs
• Sonexion 3000 SSU (~450 TB)
• Phase 1 installed March 2017
• The Arm part arrives early 2018

• Early access nodes from September 2017

Page 8

User Support
• 4 x 0.5 FTEs from GW4 consortium
• Cray/Arm centre of excellence
• Training (porting/optimising for Arm)
• Hackathons

Target codes
• Will focus on the main codes from ARCHER
• Already running on Arm:
  • VASP
  • CP2K
  • GROMACS
  • Unified Model (UM)
  • OpenFOAM
  • CloverLeaf
  • TeaLeaf
  • SNAP

• Many more codes ported by the wider Arm HPC user community

Access
• 25% of the machine time will be available to users from the EPSRC community
• EPSRC RAP: Open Access Call

Page 9

HPC Midlands Plus
www.hpc-midlands-plus.ac.uk

[email protected]

Page 10

Centre Facilities
• System supplied by ClusterVision-Huawei
• x86 system
  • 14,336 x86 cores, consisting of 512 nodes, each with
    • 2 x Intel Xeon E5-2680 v4 CPUs with 14 cores per CPU
    • 128 GB RAM per node
  • 3:1 blocking EDR InfiniBand network, giving 756-core non-blocking islands (see the check below)
  • 1 PB GPFS filestore
• 15% of the system made available via EPSRC RAP and seedcorn time
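The headline figures above are internally consistent. A quick check in Python using only the numbers quoted on this slide (the 27-node island size is derived, not stated):

# Consistency check of the HPC Midlands Plus figures quoted above.
nodes = 512
cores_per_node = 2 * 14               # 2 x Intel Xeon E5-2680 v4, 14 cores each

total_cores = nodes * cores_per_node
print(total_cores)                    # 14336, matching the quoted core count

island_cores = 756                    # quoted non-blocking island size
print(island_cores / cores_per_node)  # 27.0, i.e. each 1:1 island spans 27 nodes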

Page 11

Centre Facilities
• OpenPOWER system
  • 5 x (2 x 10)-core 2.86 GHz POWER8 systems, each with 1 TB RAM, connected to the InfiniBand network
  • one with 2 x P100 GPGPUs
  • Dedicated 10 TB SSD GPFS filestore for pre-staging files
• Aim of the system is threefold:
  • Data analysis of large datasets
  • Testbed for codes that are memory-bandwidth limited
  • On-the-fly data processing
• Comprehensive software stack installed: www.hpc-midlands-plus.ac.uk/software-list
• 4 FTE RSE support for academics at consortium universities

Page 12

Dr Paul Richmond, EPSRC Research Software Engineering Fellow

http://www.jade.ac.uk

Page 13

The JADE System
• 22 NVIDIA DGX-1
  • 3.740 PetaFLOPs (FP16)
  • 2.816 Terabytes HBM GPU memory (see the check below)
• 1 PB filestore
• P100 GPUs, optimised for deep learning
  • NVLink between devices
  • PCIe to host (dense nodes)
• Use cases
  • 50% ML (Deep Learning)
  • 30% MD
  • 20% Other
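As a rough sanity check of the aggregate figures above (a sketch only: the per-GPU numbers are standard P100/DGX-1 specifications, not taken from this slide; 8 GPUs per DGX-1, ~21.2 TFLOPS peak FP16 and 16 GB HBM2 per GPU):

# Back-of-the-envelope check of the JADE aggregate figures.
nodes = 22                   # DGX-1 systems quoted above
gpus_per_node = 8            # standard DGX-1 configuration (assumption)
fp16_tflops_per_gpu = 21.2   # P100 peak FP16 throughput (assumption)
hbm_gb_per_gpu = 16          # P100 HBM2 capacity (assumption)

total_gpus = nodes * gpus_per_node              # 176 GPUs
print(total_gpus * fp16_tflops_per_gpu / 1000)  # ~3.73 PFLOPS FP16
print(total_gpus * hbm_gb_per_gpu / 1000)       # 2.816 TB HBM

Both results line up with the 3.740 PetaFLOPs and 2.816 Terabytes quoted above.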

Page 14

Hosting and Access

• Atos has been selected as the provider
  • Following the procurement committee's review of tenders
  • Running costs to be recouped through selling time to industrial users
• Hosted at STFC Daresbury
• Will run the SLURM scheduler, scheduling at the node level (a job-script sketch follows below)
• Resource allocation
  • Open to all without charge
  • Some priority to supporting institutions
  • Light-touch review process (similar to DiRAC)
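A minimal sketch of what node-level scheduling looks like from the user side, assuming standard SLURM directives; the gres name, time limit and workload are illustrative and not taken from JADE documentation:

import subprocess
import tempfile

# Whole-node job script: one exclusive DGX-1 node with all eight GPUs.
job_script = """#!/bin/bash
#SBATCH --job-name=whole-node-example
#SBATCH --nodes=1
#SBATCH --exclusive           # node-level scheduling: no node sharing
#SBATCH --gres=gpu:8          # request every GPU in the node (assumed gres name)
#SBATCH --time=01:00:00

srun python train.py          # placeholder workload
"""

with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(job_script)
    script_path = f.name

# Hand the script to the scheduler; requires sbatch on the login node.
subprocess.run(["sbatch", script_path], check=True)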

Page 15

Governance and RSE Support

• All CIs have committed RSE support time for their local institutions
  • To support local users of the JADE system
• Training: some commitment to training offered by some CIs (EPCC; Paul Richmond, EPSRC RSE Fellow)
• Organisation Committee: RSE representative from each institution
• Software support and requests via GitHub issue tracker
• Governance via steering committee
  • Responsible for open calls

http://docs.jade.ac.uk

Page 16

Tier 2 Hub in Materials and Molecular Modelling (MMM Hub): Thomas
www.thomasyoungcentre.org

Page 17

Rationale for a Tier 2 Hub in MMM

• Increased growth in UK MMM research created an unprecedented need for HPC, particularly for medium-sized, high-throughput simulations

• These were predominantly run on ARCHER (30% VASP); Tier-3 resources were too constrained

• The aim of the installation of “Thomas” was to rebalance the ecosystem for the MMM community

• It has created a UK-wide Hub for MMM that serves the entire UK MMM community

• The Hub will build a community to foster collaborative research and the cross-fertilisation of ideas

• Support and software engineering training is offered

Page 18

“Thomas” Cluster 17,280 cores, 720 nodes; 24 cores/node, 128GB RAM/node

[Diagram: Thomas service architecture. Intel OPA interconnect, 1:1 within 36-node blocks and 3:1 between blocks; OSS storage servers serving the Thomas scratch (428 TB) plus home and software filestores.]

www.thomasyoungcentre.org

Technical performance:
• 523.404 Tflop/s
• 5.5 GiB/s I/O bandwidth

Page 19

Access and Sharing
• Access models/mechanisms:
  • 75% of machine cycles are available to the university partners providing funding for Thomas' hosting and operations costs
    • Funding partners: Imperial, King's, QMUL and UCL, Belfast, Kent, Oxford, Southampton
  • 25% of cycles are available to the wider UK MMM community
    • Allocations to non-partner researchers and groups across the UK will be handled via existing consortia (MCC & UKCP), not the Tier 2 RAC
• Tier 2 to Tier 1 integration via SAFE will be developed over the coming year

www.thomasyoungcentre.org

Page 20

Thomas Support Team
• Coordinator (Karen Stoneham) based at the TYC
• UCL RITS Research Computing Team support (x9)
• Online training & contact details
• User group oversees the service at regular meetings
• 'Points of Contact' at each partner institution managing allocations and account approval

www.thomasyoungcentre.org

Page 21

CSD3 Cambridge Service for Data Driven Discovery

www.csd3.cam.ac.uk

Mike Payne, University of Cambridge
[email protected]

Page 22

CSD3 Cambridge Service for Data Driven Discovery

USPs
• co-locate ‘big compute’ and ‘big data’
• facilitate complex computational tasks/workflows

Hardware

• 12,288 cores (2 x 16-core Intel Skylake, 384 GB per node)
• 12,288 cores (2 x 16-core Intel Skylake, 192 GB per node)
• 342 Intel Knights Landing nodes, 96 GB per node
  • Intel Omni-Path
• 90 x Intel Xeon nodes with 4 x NVIDIA P100 (16 GB) and 96 GB per node
  • EDR InfiniBand
• 50-node Hadoop cluster
• Hierarchical storage (burst buffers/SSDs/etc.)
• 5 PB disk + 10 PB tape

www.csd3.cam.ac.uk

Page 23

Access Mechanisms
• Pump priming/Proof of Concept
• EPSRC Open Access
• EPSRC Grants (other Research Councils?)
• Cash (for academic/industrial/commercial users)

[email protected]

Aspirations

It is our intention that, over the lifetime of the CSD3 service, an increasing proportion of the computational workload will consist of more complex computational tasks that exploit multiple capabilities of the system.

You, as RSEs, are uniquely placed to develop new computational methodologies, along with the innovative researchers you know. The CSD3 system is available to you for developing and testing your methodology and for demonstrating its capability.

RSE Support
• Led by Filippo Spiga
• 3 FTEs (plus additional support in some of our partner institutions)
• Collaborative/cooperative support model
