Appro Supercomputer Solutions - GPU Technology Conference

Transcript of: on-demand.gputechconf.com/gtc/2012/presentations/S0618...

Page 1:

Appro Supercomputer Solutions

Steven Lyness, VP HPC Solutions Engineering

Appro and Tsukuba University Accelerator Cluster Collaboration

Page 2:

Company Overview Appro Celebrates 20 Years of HPC Success….

About Appro Over 20 Years of Experience

1991 – 2000: OEM Server Manufacturer

2001 – 2007: Branded Servers and Clusters Solutions Manufacturer

2007 – 2012: End-To-End Supercomputer Solutions

Moving Forward….

Page 3:

• Over 2 PFLOPS (peak) from just the five Top100 systems added to the Top500 in November

• Variety of technologies: −Intel, AMD, NVIDIA

−Multiple server form factors

−InfiniBand and GigE

−Fat Tree and 3D Torus

• Excellent Linpack efficiency on non-optimized Sandy Bridge systems

−85.5% Fat Tree

−83% - 85% 3D Torus
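
(For reference, Linpack efficiency here is the usual Top500 ratio of measured to theoretical peak performance, efficiency = Rmax / Rpeak, so 85.5% means the measured Linpack result reached 85.5% of the system's peak FLOPS.)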

Appro on Top 500

Page 4:

Appro Milestones: Installations in 2012

Site Peak Performance

Los Alamos (LANL) > 1.8 PFLOPs

Sandia (SNL) > 1.2 PFLOPs

Livermore (LLNL) > 1.5 PFLOPs

Japan (Tsukuba, Kyoto) > 1 PFLOPs

Page 5:

• HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)

Apr. 2011 – Mar. 2014, 3-year project

Project Leader: Prof. M. Sato (Director, CCS, Univ. of Tsukuba)

• Develop next generation GPU system: 15 members

Project Office for Exascale Computing System Development (Leader: Prof. T. Boku)

GPU cluster based on Tightly Coupled Accelerators architecture

• Develop large scale GPU applications: 15 members

Project Office for Exascale Computational Sciences (Leader: Prof. M. Umemura)

Elementary Particle Physics, Astrophysics, Bioscience, Nuclear/Quantum Physics, Global Environmental Science, High Performance Computing


About University Of Tsukuba HA-PACS Project

Page 6:

:: Problem Definition

University of Tsukuba – HA-PACS Project

• Many technology discussions to determine the KEY requirements:

Fixed budget

High Availability

Latest Processor / High Flops

1:2 CPU:Accelerator Ratio

High Bandwidth to the Accelerator

High bandwidth, low latency interconnect

Apps Could take advantage of “more than QDR IB”

High IO Bandwidth to storage

“Easy to Manage”

Page 7:

Solution Keys

Fixed Budget Considerations

Need to find a balance between:

Performance - Flops, bandwidth (memory, IO)

Capacity (CPU Qty, GPU Qty, Memory per core, IO, Storage)

Availability Features

Ease of Management / Supportability

Architecture needed: High Availability

Nodes (PS, Fans)

IPC networks (Ex. InfiniBand)

Service Networks (Provisioning and Management)

Page 8:


Challenge: Create a Solution with High Availability

− Redundant power supplies

− Redundant hot swap fan trays

− Redundant Hot swap disk drives

− Redundant Networks

Solution: Appro Xtreme-X™ Supercomputer, the flagship product line using the GreenBlade™ sub-rack component used for the DoE TLCC2 project

Expanded to add support for new custom blade nodes

Meeting Key Requirements

Page 9:

:: Appro Xtreme-X™ Supercomputer Solution Architecture

Unified scalable cluster architecture that can be provisioned and managed as a stand-alone supercomputer.

Improved power & cooling efficiency to dramatically lower total cost of ownership.

Offers high performance and high availability features with lower latency and higher bandwidth.

Appro HPC Software Stack - Complete HPC Cluster Software tools combined with the Appro Cluster Engine™ (ACE) Management Software including the following capabilities:

System Management

Network Management

Server Management

Cluster Management

Storage Management

Page 10:


Optimal Performance

Meeting Key Requirements

Peak Performance

CPU Contribution: Sandy Bridge-EP 2.6 GHz E5-2670 Processor (332 GFlops per node)

GPU Contribution: 665 GFlops per NVIDIA M2090

Four (4) M2090s per node, or 2.66 TFlops per node

Combined Peak Performance is 3 TFlops per node

Two Hundred and Sixty-Eight (268) nodes provide 802 TFlops

Accelerator Performance: DEDICATED PCI-e Gen3 X16 for each NVIDIA GPU

The M2090 uses Gen2, so up to 8 GB/s per GPU is available

IO Performance: 2 x QDR (Mellanox ConnectX-3) – Up to 4 GB/s per link (on a PCI-e Gen3 X8 bus)

GigE for Operations networks
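
As a quick check of the quoted figures (using the exact 332.8 GFLOPS CPU number given later in the deck):

332.8 GFLOPS (CPU) + 4 x 665 GFLOPS (GPU) = 2,992.8 GFLOPS ≈ 3 TFlops per node
268 nodes x 2,992.8 GFLOPS ≈ 802 TFlops system peak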

Page 11:

Up to 4x 2P GB812X blades

− Expandability for HDD, SSD, GPU, MIC

Six Cooling Fan Units

− Hot swappable & redundant

Up to six 1600W power supplies

− Platinum-rated; 95%+ efficient

− Hot swappable & redundant

Support for one or a redundant pair of iSCB platform manager modules with enhanced management capabilities

− Active & dynamic fan control

− Power monitoring

− Remote power control

− Integrated console server

Appro GreenBlade™ Sub-Rack With Accelerator Expansion Blades

Page 12:

Appro GreenBlade™ Subrack


• Server Board

−Increased memory footprint (2 DPC)

−Provides access to two (2) PCI-e Gen3 X16 PER SOCKET

• Provides for increased IO capability

−QDR or FDR InfiniBand on the motherboard

−Internal RAID Adapter on Gen3 bus

• Up to two (2) 2.5” Hard drives

NOTE: Can run diskless/stateless because of the Appro Cluster Engine, but local scratch was needed

iSCB Modules

Page 13:


Challenge: Create a server node with

− Latest Generation of processors: Need for flops AND IO capacity

− HIGH bandwidth to the Accelerators

− High Memory capacity

Solution: High Bandwidth Intel Sandy Bridge-EP for CPU and the NVIDIA Tesla for GPU

Working with Intel® EPSD EARLY on to design a motherboard

− Washington Pass (S2600WP) Motherboard with:

Dual Sandy Bridge-EP (E5-2600 series) sockets

Expose four (4) PCI-e Gen3 X16 for Accelerator Connectivity

Expose one (1) PCI-e Gen3 X8 for Expansion slot/IO

Two (2) DIMMs per channel (16 DIMMs total)

− 2U form factor for fit and air flow/cooling


Server Node Design

Meeting Key Requirements

Page 14:

[Block diagram: two Sandy Bridge-EP sockets linked by QPI, each with four channels of 1,600 MHz DDR3 (51.2 GB/sec per socket, 2 DIMMs per channel); four PCI-e Gen3 x16 links to 4 x NVIDIA M2090; a PCI-e Gen3 x8 link for the 2x QDR IB adapter; Patsburg PCH (DMI/ESI) with dual GbE, BMC, and BIOS.]

Intel® EPSD S2600WP Motherboard

Meeting Key Requirements

Page 15:

GreenBlade Node Design

[Node diagram callouts: QDR InfiniBand (Port 0), QDR InfiniBand (Port 1), GigE – Cluster Management / Operations Network (Prime), GigE – Cluster Management / Operations Network (Secondary), HDD0, HDD1]

Page 16:

:: Network Availability

Meeting Key Requirements

Challenge: To provide cost-effective redundant networks to eliminate/reduce failures (improve MTTI)

Solution: − Build system with redundant operations Ethernet networks

Redundant on-board GigE each with access to IPMI

Redundant iSCB Modules for baseboard management, node control and monitoring

− Build system with redundant InfiniBand networks

DUAL QDR for price/performance

Selected Mellanox due to Gen3 X8 support (dual port adapter)

Page 17:

:: Operations Networking

Meeting Key Requirements

[Topology diagram: Management Node(s) and Login Node(s) attach to a 10GigE switch with a link to the External Network; each rack group (Rack (1), Rack (2) and Rack (3) through Rack (N-2), Rack (N-1) and Rack (N)) has a Sub Management Node (GreenBlade™ GB812X) and 48-port leaf switches connecting the Compute Nodes over GbE.]

Page 18:

:: Ease of Use

Meeting Key Requirements

Challenge • Need the system to install quickly to get into production

• Most sites have limited “people resources”

• Need to be able to keep the system running and doing science

Solution • Appro HPC Software Stack

− Tested and Validated

− Full stack from HW layer to Application layer

− Allows for quick bring up of a cluster

Page 19:

Appro HPC Software Stack

User Applications

Compilers: Intel® Cluster Studio, PGI (PGI CDK), GNU, PathScale

Message Passing: MVAPICH2, OpenMPI, Intel® MPI (Intel® Cluster Studio)

Appro Cluster Engine (ACE™) Virtual Clusters

OS: Linux (Red Hat, CentOS, SuSE)

OS Provisioning: ACE™

Remote Power Mgmt: PowerMan

Console Mgmt: ACE™, ConMan

Job Scheduling: Grid Engine, PBS Pro, SLURM

Storage: NFS (3.x), Lustre, PanFS, Local FS (ext3, ext4, XFS)

Cluster Monitoring: ACE™ (iSCB and OpenIPMI)

Performance Monitoring: HPCC, IOR, netperf, Perfctr, PAPI/IPM

Appro Xtreme-X™ Supercomputer – Building Blocks

Appro Turn-Key Integration & Delivery Services: HW and SW integration, pre-acceptance testing, dismantle, packing and shipping

Appro HPC Professional Services: On-site Installation services and/or Customized services

Page 20:

:: Summary

Appro Key Advantages

• Partnering with Key technology partners to offer cutting-edge

integrated solutions:

− Performance

Storage: IOR

Networking: Bandwidth, latencies and message rates

− Features

High Availability (high standard MTBF, redundancy - PS)

Ease of Management

− Flexibility

− Price /Performance

− Training Programs

Pre-Sales (Sell everything it does and ONLY that)

Installation and Tuning

Post Install Support

Page 21:


Turn-Key Solution Summary

Appro Cluster Engine™ (ACE) Management Software Suite

Capability Computing

Hybrid Computing

Capacity Computing

Data Intensive

Computing

Appro Xtreme-X™ Supercomputer addressing 4 HPC Workload Configurations

Appro HPC Software Stack

Turn-Key Integration & Delivery Services

- Node, Rack, Switch, Interconnect, cable, network, storage, software, Burning-in - Pre-acceptance testing, performance validation, dismantle, packing and shipping

Appro HPC Professional Services - On-site Installation services and/or Customized services

Appro Xtreme-X™ Supercomputer

Page 22:

Appro Supercomputer Solutions

Questions?

Steve Lyness, VP HPC Solutions Engineering

Ask Now or see us at Table #54

Learn More at www.appro.com

Page 23:

Taisuke Boku

Center for Computational Sciences

University of Tsukuba [email protected]

HA-PACS: Next Step for Scientific Frontier by Accelerated Computing

Page 24:

Project plan of HA-PACS

HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences): Accelerating critical problems in various scientific fields at the Center for Computational Sciences, University of Tsukuba

− The target application fields will be partially limited

− Current target: QCD, Astro, QM/MM (quantum mechanics / molecular mechanics, for life science)

Two parts − HA-PACS base cluster:

for development of GPU-accelerated code for the target fields, and for performing production runs of that code

− HA-PACS/TCA: (TCA = Tightly Coupled Accelerators)

for elementary research on new technology for accelerated computing

Our original communication system based on PCI-Express named “PEARL”, and a prototype communication chip named “PEACH2”

Page 25:

GPU Computing: current trend of HPC

GPU clusters in TOP500 on Nov. 2011 − 2nd 天河 Tianhe-1A (Rpeak=4.70 PFLOPS)

− 4th 星雲Nebulae (Rpeak=2.98 PFLOPS)

− 5th TSUBAME2.0 (Rpeak=2.29 PFLOPS)

− (1st K Computer Rpeak=11.28 PFLOPS)

Features − high peak performance / cost ratio

− high peak performance / power ratio

− large scale applications with GPU acceleration do not yet run in production on GPU clusters ⇒ Our first target is to develop large scale applications accelerated by GPU in real computational sciences

Page 26:

Problems of GPU Cluster

Problems of GPGPU for HPC − Data I/O performance limitation

Ex) GPGPU: PCIe gen2 x16

Peak Performance: 8GB/s (I/O) ⇔ 665 GFLOPS (NVIDIA M2090)

− Memory size limitation Ex) M2090: 6 GByte vs CPU: 4 – 128 GByte

− Communication between accelerators: no direct path (external) ⇒ communication latency via CPU becomes large

Ex) GPGPU: GPU mem ⇒ CPU mem ⇒ (MPI) ⇒ CPU mem ⇒ GPU mem

Research on direct communication between GPUs is required

Another target of ours is to develop a direct communication system between external GPUs as a feasibility study for future accelerated computing
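
As an illustration of the 3-hop path described above, here is a minimal hedged sketch (not HA-PACS code; the function name, buffer names, and sizes are illustrative) of a conventional MPI + CUDA exchange that stages data through host memory on both sides:

    // Conventional GPU-to-GPU exchange: GPU mem -> CPU mem -> (MPI) -> CPU mem -> GPU mem
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    void gpu_exchange(double *d_send, double *d_recv, size_t n, int peer)
    {
        size_t bytes = n * sizeof(double);
        double *h_send = (double *)malloc(bytes);
        double *h_recv = (double *)malloc(bytes);

        cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);   // hop 1: GPU mem -> CPU mem
        MPI_Sendrecv(h_send, (int)n, MPI_DOUBLE, peer, 0,
                     h_recv, (int)n, MPI_DOUBLE, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);             // hop 2: CPU mem -> CPU mem over the network
        cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);   // hop 3: CPU mem -> GPU mem

        free(h_send);
        free(h_recv);
    }

Every word crosses the PCIe bus twice and host memory twice, which is exactly the overhead the TCA work presented later in this deck aims to remove.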

Page 27:

Project Formation

HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)

Apr. 2011 – Mar. 2014, 3-year project

Project Leader: Prof. M. Sato (Director, CCS, Univ. of Tsukuba)

Develop next generation GPU system : 15 members

Project Office for Exascale Computing System Development (Leader: Prof. T. Boku)

GPU cluster based on Tightly Coupled Accelerators architecture

Develop large scale GPU applications : 15 members

Project Office for Exascale Computational Sciences (Leader: Prof. M. Umemura)

Elementary Particle Physics, Astrophysics, Bioscience, Nuclear/Quantum Physics, Global Environmental Science, High Performance Computing

Page 28:

HA-PACS base cluster (Feb. 2012)

Page 29:

HA-PACS base cluster


Front view

Side view

Page 30:

HA-PACS base cluster


Rear view of one blade chassis with 4 blades

Front view of 3 blade chassis

Rear view of Infiniband switch and cables (yellow=fibre, black=copper)

Page 31:

HA-PACS: base cluster (computation node)

[Node block diagram: CPUs (2.6 GHz x 8 flop/clock, AVX): 20.8 GFLOPS x 16 cores = 332.8 GFLOPS; GPUs: 665 GFLOPS x 4 = 2,660 GFLOPS; node total: 3 TFLOPS. Host memory: (16 GB, 12.8 GB/s) x 8 = 128 GB, 102.4 GB/s; GPU memory: (6 GB, 177 GB/s) x 4 = 24 GB, 708 GB/s; 8 GB/s PCIe per GPU.]

Page 32:

Intel Xeon E5 (SandyBridge-EP) x 2

− 8 cores/socket (16 cores/node) with 2.6 GHz

− AVX (256-bit SIMD) on each core ⇒ peak perf./socket = 2.6 GHz x 8 flop/clock x 8 cores = 166.4 GFLOPS ⇒ peak perf./node = 332.8 GFLOPS

− Each socket supports up to 40 lanes of PCIe gen3 ⇒ great performance to connect multiple GPUs without I/O performance bottleneck ⇒ the current NVIDIA M2090 supports just PCIe gen2, but the next generation (Kepler) will support PCIe gen3

− M2090 x4 can be connected to 2 SandyBridge-EP sockets while still leaving PCIe gen3 x8 x2 free ⇒ InfiniBand QDR x 2
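
A brief worked check of the per-node peak, assuming Sandy Bridge-EP's 8 double-precision flops per clock per core (4-wide AVX add plus 4-wide AVX multiply):

2.6 GHz x 8 flop/clock = 20.8 GFLOPS per core
20.8 GFLOPS x 8 cores = 166.4 GFLOPS per socket
166.4 GFLOPS x 2 sockets = 332.8 GFLOPS per node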


HA-PACS: base cluster unit(CPU)

Page 33:

HA-PACS: base cluster unit(GPU)

NVIDIA M2090 x 4

− Number of processor core: 512

− Processor core clock: 1.3 GHz

− DP 665 GFLOPS, SP 1331GFLOPS

− PCI Express gen2 ×16 system interface

− Board power dissipation: <= 225 W

− Memory clock: 1.85 GHz, size: 6GB with ECC, 177GB/s

− Shared/L1 Cache: 64KB, L2 Cache: 768KB

Page 34:

HA-PACS: base cluster unit(blade node)

[Blade node photos (front and rear views). Callouts: 2x 2.6 GHz 8-core SandyBridge-EP, 2x + 2x NVIDIA Tesla M2090, 1x PCIe slot for HCA, 2x 2.5" HDD, air flow path, power supply unit and fan. Enclosure: 8U, 4 nodes, 3 PSU (hot swappable), 6 fans (hot swappable).]

Page 35:

Basic performance data

MPI pingpong

− 6.4 GB/s (N1/2= 8KB)

− with dual rail Infiniband QDR (Mellanox ConnectX-3)

− actually FDR for HCA and QDR for switch

PCIe benchmark (Device -> Host memory copy), aggregated perf. for 4 GPUs simultaneously

− 24 GB/s (N1/2= 20KB)

− PCIe gen2 x16 x4, theoretical peak = 8 GB/s x4 = 32 GB/s

Stream (memory)

− 74.6 GB/s

− theoretical peak = 102.4 GB/s
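
For context, a hedged sketch of the kind of measurement behind the aggregated PCIe figure above: each of the four GPUs copies a pinned buffer to the host concurrently and the total bytes moved are divided by the elapsed wall-clock time (buffer size and structure are illustrative, not the actual benchmark used here):

    // Aggregate device-to-host copy bandwidth across 4 GPUs (illustrative sketch)
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <chrono>

    int main()
    {
        const int ngpu = 4;
        const size_t bytes = 256UL << 20;        // 256 MiB per GPU (assumed size)
        void *d[ngpu]; void *h[ngpu]; cudaStream_t s[ngpu];

        for (int i = 0; i < ngpu; ++i) {
            cudaSetDevice(i);
            cudaMalloc(&d[i], bytes);
            cudaMallocHost(&h[i], bytes);        // pinned host buffers, needed for full PCIe rate
            cudaStreamCreate(&s[i]);
        }

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < ngpu; ++i) {         // start all four copies concurrently
            cudaSetDevice(i);
            cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
        }
        for (int i = 0; i < ngpu; ++i) {         // wait for all copies to finish
            cudaSetDevice(i);
            cudaStreamSynchronize(s[i]);
        }
        auto t1 = std::chrono::steady_clock::now();

        double sec = std::chrono::duration<double>(t1 - t0).count();
        std::printf("aggregate D->H bandwidth: %.1f GB/s\n", ngpu * bytes / sec / 1e9);
        return 0;
    }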

Page 36:

PCIe Host:Device communication performance


Slower start on Host->Device compared with Device->Host

Page 37:

HA-PACS Application (1): Elementary Particle Physics

Multi-scale physics; finite temperature and density

Investigate hierarchical properties via direct construction of nuclei in lattice QCD; GPUs are used to solve large sparse linear systems of equations

Phase analysis of QCD at finite temperature and density; GPUs perform matrix-matrix products of dense matrices

[Figures: quark, proton/neutron, nucleus; expected QCD phase diagram]

Page 38:

HA-PACS Applications (2):Astrophysics


(A) Collisional N-body Simulation (B) Radiation Transfer

Computations of the accelerations of particles and their time derivatives (jerks) are time consuming.

Direct (brute force) calculations of acceleration and jerks are required to achieve the required numerical accuracy

Globular Clusters

Massive Black Holes in Galaxies

Accelerations and jerks are computed on GPU

• Understanding of the formation of massive black holes in galaxies

• Numerical simulations of complicated gravitational interactions between stars and multiple black holes in galaxy centers.

• Fossil object as a clue to investigate the primordial universe

• Formation of the most primordial objects, formed more than 10 billion years ago.

Calculation of the physical effects of photons emitted by stars and galaxies onto the surrounding matter.

So far this has been poorly investigated due to its huge computational cost, though it is of critical importance in the formation of stars and galaxies. Computations of the radiation intensity and the resulting chemical reactions based on ray-tracing methods can be highly accelerated with GPUs owing to their high concurrency.

First Stars and Re-ionization of the Universe

Accretion Disks around Black Holes

• Understanding of the formation of the first stars in the universe and the subsequent re-ionization of the universe.

• Study of the high temperature regions around black holes

Page 39:

HA-PACS Application (3):Bioscience

DNA-protein complex (macroscale MD)

Reaction mechanisms (QM/MM-MD), QM region > 100 atoms

GPU acceleration - Direct Coulomb (Gromacs, NAMD, Amber) - two-electron integrals

Page 40:

HA-PACS Application (4)

Other advanced research in the HPC Division at CCS

− XcalableMP-dev (XMP-dev): an easy and simple programming language to support distributed memory & GPU accelerated computing for large scale computational sciences

− G8 NuFuSE (Nuclear Fusion Simulation for Exascale) project: platform for porting Plasma Simulation Code with GPU technology

− Climate simulation, especially LES (Large Eddy Simulation) at cloud-level resolution for city-model size simulations

− Any other collaboration ...

Page 41:

HA-PACS: TCA (Tightly Coupled Accelerator)

TCA: Tightly Coupled Accelerator

− Direct connection between accelerators (GPUs)

− Using PCIe as a communication device between accelerators

Most acceleration devices and other I/O devices are connected by PCIe as PCIe end-points (slave devices)

An intelligent PCIe device logically enables an end-point device to directly communicate with other end-point devices

PEARL: PCI Express Adaptive and Reliable Link

− We already developed such a PCIe device (PEACH, PCI Express Adaptive Communication Hub) in the JST-CREST project “low power and dependable network for embedded system”

− It enables direct connection between nodes by PCIe Gen2 x4 link

⇒ Improving PEACH for HPC to realize TCA

Page 42:

PEACH

PEACH: PCI-Express Adaptive Communication Hub

An intelligent PCI-Express communication switch to use PCIe link directly for node-to-node interconnection

The edge of a PEACH PCIe link can be connected to any peripheral device, including a GPU

Prototype PEACH chip − 4-port PCI-E gen.2 with x4 lane / port

− PCI-E link edge control feature: “root complex” and “end points” are automatically switched (flipped) according to the connection handling

− Other fault-tolerance (reliability) functions are implemented: “flip network link” to tolerate a single link fault

In HA-PACS/TCA prototype development, we will enhance the current PEACH chip ⇒ PEACH2

Page 43:

HA-PACS/TCA (Tightly Coupled Accelerator)

Enhanced version of PEACH ⇒ PEACH2

− x4 lanes -> x8 lanes

− hardwired on main data path and PCIe interface fabric

[Block diagram: two nodes, each with CPUs, memory, GPUs, an IB HCA, and a PEACH2 chip on PCIe; the PEACH2 chips link the nodes directly over PCIe, in parallel with the InfiniBand switch path.]

True GPU-direct

current GPU clusters require 3-hop communication (3-5 times memory copy)

For strong scaling, Inter-GPU direct communication protocol is needed for lower latency and higher throughput

Page 44:

Implementation of PEACH2: ASIC⇒FPGA

FPGA based implementation − today's advanced FPGAs allow implementing a PCIe hub with multiple ports

− currently gen2 x 8 lanes x 4 ports are available ⇒ soon gen3 will be available (?)

− easy modification and enhancement

− fits to standard (full-size) PCIe board

− an internal multi-core general purpose CPU with programmability is available ⇒ the hardwired/firmware partitioning of the control layer can be split at a suitable level

Controlling PEACH2 for GPU communication protocol

− collaboration with NVIDIA for information sharing and discussion

− based on CUDA4.0 device to device direct memory copy protocol
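
For orientation only (this is not the PEACH2 or PEARL interface), the CUDA 4.x intra-node peer-to-peer copy that this protocol builds on looks roughly like the following sketch; the function name and device numbering are assumed:

    // Direct device-to-device copy between two GPUs in one node (CUDA 4.x peer access)
    #include <cuda_runtime.h>

    void copy_gpu0_to_gpu1(void *dst_on_gpu1, const void *src_on_gpu0, size_t bytes)
    {
        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, 1, 0);   // can device 1 reach device 0 directly?
        if (can_access) {
            cudaSetDevice(1);
            cudaDeviceEnablePeerAccess(0, 0);         // enable direct peer access once per pair
        }
        // Copies device 0 memory to device 1 memory; without peer access the runtime
        // stages the transfer through host memory instead.
        cudaMemcpyPeer(dst_on_gpu1, 1, src_on_gpu0, 0, bytes);
    }

TCA/PEACH2 aims to extend this kind of direct device-to-device path across nodes over PCIe.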

Page 45:

HA-PACS/TCA Node Cluster = NC

[Diagram: a Node Cluster (NC) of nodes, each node holding 2 CPUs (C x 2), 4 GPUs (G x 4), and a PEACH2 chip, joined by a PEARL ring network within the NC; InfiniBand links and an InfiniBand network connect the Node Clusters.]

Node Cluster with 16 nodes • GPU x64 (G) • CPU x32 (C) • GPU comm with PCIe • IB link / node • CPU: Xeon E5 • GPU: Kepler

4 NC with 16 nodes, or 8 NC with 8 nodes = 360 TFLOPS extension to base cluster

• High speed GPU-GPU comm. by PEACH within NC (PCI-E gen2 x8 = 5 GB/s/link)
• Infiniband QDR (x2) for NC-NC comm. (4 GB/s/link)

Page 46:

PEARL/PEACH2 variation (1)

[Block diagram, Option 1: two CPU sockets (QPI), four GPUs on PCIe Gen3 x16 links; PEACH2 (Gen2 x8) and the IB HCA sit behind a PCIe switch on Gen3 x8 links.]

Option 1:

Performance of IB and PEARL can be compared on an even footing

Additional latency is introduced by the PCIe switch

Page 47:

PEARL/PEACH2 variation (2)

[Block diagram, Option 2: two CPU sockets (QPI), four GPUs; PEACH2 (Gen2 x8), the IB HCA (Gen3 x8), and a PCIe switch with a mix of Gen3 x16 and Gen3 x8 links.]

Option 2:

− Requires only 72 lanes in total

− asymmetric connection among 3 blocks of GPUs

Page 48:

PEACH2 prototype board for TCA


[Board photo callouts: FPGA (Altera Stratix IV GX530); PCIe external link connector x2 (one more on daughter board); PCIe edge connector (to host server); daughter board connector; power regulators for FPGA]

Page 49:

Summary

HA-PACS consists of two elements: the HA-PACS base cluster for application development and HA-PACS/TCA for elementary study of advanced technology for direct communication among accelerating devices (GPUs)

The HA-PACS base cluster started operation in Feb. 2012 with 802 TFLOPS peak performance (Linpack results will come in June 2012; a good score on the Green500 is also expected)

The FPGA implementation of PEACH2 was finished for the prototype version in Mar. 2012 and will be enhanced into the final version over the following 6 months

HA-PACS/TCA with at least 300 TFLOPS additional performance will be installed around Mar. 2013
