Tier 2 Computer Centres 2... · 2017-11-01 · 280 node HPE(SGI) ICE XA: •10,080 cores (2 18-core...
Tier 2 Computer Centres
www.hpc-uk.ac.uk
Tier 2 Computer Centres: a community resource… founded on cooperation and collaboration
Each centre will give a short introduction covering (some of): • USP • Contact Details • Hardware • Access Mechanisms • RSE Support
Open Access Call – 12th Oct (Technical Assessment – 21st Sep) https://www.epsrc.ac.uk/funding/calls/tier2openaccess/
Andy Turner, [email protected]
280-node HPE (SGI) ICE XA: • 10,080 cores (2× 18-core Xeon per node) • 128 GiB memory per node • DDN Lustre file system • Single-rail FDR InfiniBand hypercube
1.9 PiB Tier-2 Data Facility: • DDN Web Object Scaler appliances • Link to other Tier-1/2 facilities
Simple access routes • Free Instant Access for testing • (Driving Test access coming soon) • EPSRC RAP: Open Access Call
http://www.cirrus.ac.uk
Cirrus RSE Support
User Support
• Freely available to all users from any institution
• Provided by EPCC experts in a wide range of areas
• Easily accessed through helpdesk: just ask for the help you need
• Help provided directly to researchers or to RSEs working with researchers
Technical Projects
• Explore new technologies, software, tools
• Add new capabilities to Cirrus
• Benchmark and profile commonly used applications
• Work with the user community and other RSEs
Keen to work with RSEs at other institutions to help them support local users on Cirrus
http://gw4.ac.uk/isambard
James Price, University of Bristol
[email protected]
The System
• Exploring Arm processor technology
• Provided by Cray
• 10,000+ ARMv8 cores
• Cray software tools: compiler, math libraries, tools...
• Technology comparison: x86, Xeon Phi (KNL), NVIDIA P100 GPUs
• Sonexion 3000 SSU (~450 TB)
• Phase 1 installed March 2017
• The Arm part arrives early 2018
• Early access nodes from September 2017
User Support • 4 × 0.5 FTE from the GW4 consortium • Cray/Arm centre of excellence • Training (porting/optimising for Arm) • Hackathons
Target codes
• Will focus on the main codes from ARCHER
• Already running on Arm: VASP, CP2K, GROMACS, Unified Model (UM), OpenFOAM, CloverLeaf, TeaLeaf, SNAP
• Many more codes ported by the wider Arm HPC user community
Access
• 25% of the machine time will be available to users from the EPSRC community
• EPSRC RAP: Open Access Call
HPC Midlands Plus
www.hpc-midlands-plus.ac.uk
Centre Facilities
• System supplied by ClusterVision-Huawei
• x86 system: 14,336 x86 cores
• consisting of 512 nodes, each with
• 2× Intel Xeon E5-2680 v4 CPUs with 14 cores per CPU
• 128 GB RAM per node
• 3:1 blocking EDR InfiniBand network, giving 756-core non-blocking islands
• 1 PB GPFS filestore
• 15% of the system made available via EPSRC RAP and seedcorn time
Centre Facilities
• OpenPOWER system
• 5× (2× 10)-core 2.86 GHz POWER8 systems, each with 1 TB RAM, connected to the InfiniBand network
• one with 2× P100 GPGPUs
• Dedicated 10 TB SSD GPFS filestore for pre-staging files
• Aim of the system is threefold:
• Data analysis of large datasets
• Testbed for codes that are memory-bandwidth limited
• On-the-fly data processing
• Comprehensive software stack installed: www.hpc-midlands-plus.ac.uk/software-list
• 4 FTE RSE support for academics at consortium universities
Dr Paul Richmond EPSRC Research Software Engineering Fellow
http://www.jade.ac.uk
The JADE System
• 22× NVIDIA DGX-1
• 3.740 PetaFLOPs (FP16)
• 2.816 Terabytes HBM GPU memory
• 1 PB filestore
• P100 GPUs, optimised for Deep Learning
• NVLink between devices; PCIe to host (dense nodes)
• Use cases: 50% ML (Deep Learning), 30% MD, 20% other
Hosting and Access
• ATOS has been selected as the provider, following the procurement committee's review of tenders • Running costs to be recouped by selling time to industrial users
• Hosted at STFC Daresbury • Will run the Slurm scheduler, with scheduling at the node level
• Resource allocation: open to all without charge • Some priority to supporting institutions • Light-touch review process (similar to DiRAC)
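Node-level scheduling means a job receives a whole DGX-1 (all 8 GPUs) rather than individual GPUs. A minimal sketch of what such a Slurm submission script could look like; the partition, module, and script names are hypothetical, not JADE's actual configuration:

```shell
#!/bin/bash
# Request one full DGX-1 node: under node-level scheduling,
# all 8 GPUs in the node belong to this job.
#SBATCH --job-name=dl-train        # hypothetical job name
#SBATCH --nodes=1                  # whole-node allocation
#SBATCH --time=08:00:00            # wall-clock limit
#SBATCH --partition=dgx            # hypothetical partition name

# Environment and training script below are illustrative only.
module load cuda
python train.py --gpus 8
```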
Governance and RSE Support
• All CIs have committed RSE support time at their local institutions, to support local users of the JADE system • Training: some commitment to training offered by some CIs (EPCC, Paul Richmond EPSRC RSE Fellow) • Organisation Committee: RSE representative from each institution • Software support and requests via the GitHub issue tracker
• Governance via steering committee • Responsible for open calls
http://docs.jade.ac.uk
Tier 2 Hub in Materials and Molecular Modelling (MMM Hub): "Thomas" www.thomasyoungcentre.org
Rationale for a Tier 2 Hub in MMM
• Increased growth in UK MMM research created an unprecedented need for HPC, particularly for medium-sized, high-throughput simulations
• These were predominantly run on ARCHER (30% VASP). Tier 3 sources were too constrained
• The aim of the installation of “Thomas” was to rebalance the ecosystem for the MMM community
• It has created a UK-wide Hub for MMM that serves the entire UK MMM community
• The Hub will build a community to foster collaborative research and the cross-fertilisation of ideas
• Support and software engineering training is offered
"Thomas" Cluster: 17,280 cores, 720 nodes; 24 cores/node, 128 GB RAM/node
[Diagram: Thomas Service Architecture — Intel OPA fabric, 1:1 within 36-node blocks and 3:1 between blocks; Lustre OSS pairs serving the Thomas scratch (428 TB), home and software filesystems]
Performance
• Technical performance: 523.404 TFLOP/s
• 5.5 GiB/s I/O bandwidth
Access and Sharing • Access models/mechanisms:
75% of machine cycles are available to the university partners providing funding for Thomas's hosting and operations costs
Funding partners Imperial, King’s, QMUL and UCL, Belfast, Kent, Oxford, Southampton
25% of cycles are available to the wider UK MMM Community
Allocations to non-partner researchers and groups across the UK will be handled via existing consortia (MCC & UKCP), not T2 RAC
Tier 2 to Tier 1 integration via SAFE will be developed over the coming year
Thomas Support Team
• Coordinator (Karen Stoneham) based at the TYC
• UCL RITS Research Computing Team support (×9)
• Online training & contact details
• User group oversees the service at regular meetings
• 'Points of Contact' at each partner institution managing allocations and account approval
CSD3 Cambridge Service for Data Driven Discovery
www.csd3.cam.ac.uk
Mike Payne, University of Cambridge [email protected]
USPs • co-locate ‘big compute’ and ‘big data’ • facilitate complex computational tasks/workflows
Hardware
• 12,288 cores (2 x 16 core Intel Skylake/384 GB per node)
• 12,288 cores (2 x 16 core Intel Skylake/192 GB per node)
• 342 Intel Knights Landing nodes / 96 GB • Intel Omni-Path
• 90 × Intel Xeon nodes with 4 NVIDIA P100 (16 GB) / 96 GB • EDR InfiniBand
• 50 node Hadoop cluster
• Hierarchical storage (burst buffers/SSDs/etc.) • 5 PB disk + 10 PB tape
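As a quick sanity check on the headline figures quoted across these slides, the per-node core counts multiply out to the stated totals. The JADE line assumes NVIDIA's rated 170 TFLOPS FP16 per DGX-1 node, which is not stated on the slide itself:

```python
# Cross-check the core totals quoted in the slides.
systems = {
    # name: (nodes, cores_per_node, total_cores_quoted)
    "Cirrus":        (280, 2 * 18, 10_080),  # HPE/SGI ICE XA
    "HPC Midlands+": (512, 2 * 14, 14_336),  # Xeon E5-2680 v4, 14 cores/CPU
    "Thomas":        (720, 24,     17_280),
}

for name, (nodes, cores_per_node, quoted) in systems.items():
    assert nodes * cores_per_node == quoted, name

# JADE: 22 DGX-1 nodes at an assumed 170 TFLOPS FP16 each
# reproduces the quoted 3.740 PetaFLOPs.
jade_pf16 = 22 * 170 / 1000
print(jade_pf16)  # 3.74
```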
Access Mechanisms
• Pump priming / Proof of Concept
• EPSRC Open Access
• EPSRC Grants (other RCs?)
• Cash (for academic/industrial/commercial users)
[email protected]
Aspirations
It is our intention that over the lifetime of the CSD3 service an increasing proportion of the computational workload will be more complex computational tasks that exploit multiple capabilities on the system.
You, as RSEs, together with the innovative researchers you work with, are uniquely placed to develop new computational methodologies. The CSD3 system is available to you for developing and testing your methodology and for demonstrating its capability.
RSE Support
• Led by Filippo Spiga
• 3 FTEs (plus additional support in some of our partner institutions)
• Collaborative/cooperative support model