Transcript of "Trends in High Performance Computing Today and Tomorrow"

Page 1:

Mohsin Ahmed Shaikh, Supercomputing Applications Specialist

Trends in High Performance Computing

Today and Tomorrow

Page 2:

Typically HPC is used to achieve:

• Throughput: in a certain amount of time, do
  • More iterations of the same instructions/operations
  • More scenarios on the same data
  • More independent (same or different) tasks on the same or different data

• Capability: solve a bigger problem
  • e.g. larger-scale model systems to understand emergent properties
  • Denser problems, e.g. higher resolution or more detail to understand a mechanism (deep dive)

Motivations for using HPC


Page 5:

• An application can be:

  • Compute intensive/bound
    • e.g. spends most of the simulation time doing FLOPs
  • Memory intensive/bound
    • e.g. spends most of the simulation time moving data between memory and caches
  • I/O intensive/bound
    • e.g. spends most of the simulation time reading/writing data to disk

So, what is Performance?
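As a rough illustration (not from the slides), the following C sketch contrasts the two CPU-side cases; which regime a real code falls into depends on the compiler, cache sizes and memory bandwidth of the machine.

    #include <stddef.h>

    /* Memory-bound: roughly one addition per element streamed from DRAM,
       so the loop is limited by memory bandwidth. */
    double stream_sum(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    /* Compute-bound: many floating-point operations per element that stays
       in registers/cache, so the loop is limited by the core's FLOP rate. */
    void iterate_in_place(double *a, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            for (int k = 0; k < 1000; k++)
                a[i] = a[i] * 1.0000001 + 1.0e-9;
    }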

Page 6:

Parallelism

• Can I break my program into tasks and execute them in parallel?
• Tasks may have load imbalance
• Tasks may have dependencies
  • Some may need to run before all others

Page 7:

Types of Parallelism

• Data parallelism (domain decomposition)
  [Figure: a vector operation C = A + B is decomposed across processes P1-P3, each computing its own block of the arrays]

• Task parallelism (a sketch of this pattern in OpenMP follows below)
  [Figure: a task pool maintained by a master holds Tasks A-F; worker processes P1-P4 pull tasks from the pool as they become free]
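A minimal sketch of the task-pool pattern using OpenMP tasks (not from the slides; do_work is a hypothetical per-task routine):

    void do_work(int task_id);        /* hypothetical: whatever one task does */

    void run_task_pool(int ntasks)
    {
        #pragma omp parallel
        #pragma omp single            /* one thread populates the task pool... */
        for (int i = 0; i < ntasks; i++)
        {
            #pragma omp task firstprivate(i)
            do_work(i);               /* ...idle threads pick tasks up and run them */
        }
    }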

Page 8:

Granularity of Parallelism

Coarse-grained parallelism (high level)
• May require a code refactor
• Even distribution of work
• Load balancing is the key
• Greater autonomy = less synchronization
• More scalable

Fine-grained parallelism (low level)
• Easier to implement (incremental)
• More synchronization overhead
• Easier to load balance
• Scalability is generally limited

[Figure: timelines contrasting coarse-grained and fine-grained division of the same work]

Page 9:

Amdahl's Law

For a fixed problem size, the scalability of a program is limited by its serial fraction:

    S = 1 / ((1 - P) + P/N + O_N)

where
  S    speedup
  P    parallel fraction of the program
  1-P  serial fraction of the program
  N    number of workers
  O_N  parallel overhead for N workers

[Figure: speedup vs # CPUs (up to 500) under Amdahl's Law for parallel fractions of 85%, 90% and 98%; the curves flatten out as CPUs are added, staying below a speedup of 50]

Page 10:

Gustafson's Law

At large scale and for a big enough problem size, the scalability of a program may not be limited by its serial fraction:

    S = (1 - P) + P * N - O_N

where
  S    speedup
  P    parallel fraction of the program
  1-P  serial fraction of the program
  N    number of workers
  O_N  parallel overhead for N workers

[Figure: scaled speedup vs # CPUs (up to 500) under Gustafson's Law for parallel fractions of 50%, 85%, 90% and 98%; speedup continues to grow roughly linearly with the number of CPUs]
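As a worked illustration of the two laws (not from the slides, and ignoring the overhead term O_N for simplicity), a small C helper that prints the predicted speedups:

    #include <stdio.h>

    /* Predicted speedup for parallel fraction p on n workers. */
    double amdahl(double p, double n)    { return 1.0 / ((1.0 - p) + p / n); }
    double gustafson(double p, double n) { return (1.0 - p) + p * n; }

    int main(void)
    {
        double p = 0.98;                              /* 98% parallel fraction */
        for (int n = 1; n <= 512; n *= 2)
            printf("N=%4d  Amdahl: %7.2f  Gustafson: %7.2f\n",
                   n, amdahl(p, n), gustafson(p, n));
        return 0;
    }

At N = 512, Amdahl's prediction is already close to its ceiling of 1/(1-P) = 50, while Gustafson's scaled speedup is around 500.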

Page 11:

42 years of microprocessor trends

[Figure: historical microprocessor trend data; transistor counts keep rising while clock frequency, single-thread performance and power have levelled off, and core counts continue to grow]

Page 12:

Multicore architecture

• Multiple cores on a single die to scale out, because it is power efficient
• Dedicated execution resources per core
  • e.g. registers, ALU, FPU and vector units, L1 & L2 cache (SRAM), etc.
• Hardware threads per core
• Shared last-level cache
  • Cache coherent
• Uncore units

[Figure: single-socket die of an Intel Xeon CPU showing its cores]

Page 13:

Memory hierarchy

• Memory latency increases further away from the CPU
• Capacity increases further away from the CPU
• Bandwidth decreases further away from the CPU

Mem level     Latency        Capacity         Volatility
L1 cache      ~1 ns          32 KB            volatile
L2 cache      ~2.5 ns        256 KB           volatile
LLC cache     ~10 ns         10^1-10^2 MB     volatile
DRAM          ~60 ns         10^1-10^3 GB     volatile
NVDIMMs       ?              ~6 TB            non-volatile / persistent
NVRAM         ~600 ns        10^1-10^3 GB     non-volatile / persistent
FLASH (R/W)   50/500 usec    10^1-10^3 GB     non-volatile / persistent
HDD (R/W)     5/0.5 msec     10^1-10^3 TB     non-volatile / persistent
Tape          ~50 sec        10^1-10^3 TB     non-volatile / persistent

Page 14:

Beyond multicore

• Adding unlimited cores to a single silicon die is not possible
  • Memory does not scale with increasing CPU counts
• Solution? Scaling out
  • Multiple multicore sockets
  • Cores see a single pool of memory (global address space)
  • Non-Uniform Memory Access (NUMA), see the first-touch sketch below
  • Shared memory model

[Figure: a two-socket node with NUMA domains NUMA 0 and NUMA 1, each socket attached to its own local memory]
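On a NUMA node, where memory pages are physically placed matters. A common technique (a sketch, not from the slides) is "first touch": initialize the data with the same OpenMP thread layout that will later compute on it, so each page lands in the memory closest to the thread that uses it.

    #include <stdlib.h>

    /* First-touch initialization: each thread touches, and therefore places,
       the pages of the chunk of the array it will later work on. */
    double *numa_aware_alloc(size_t n)
    {
        double *a = malloc(n * sizeof *a);

        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            a[i] = 0.0;

        return a;
    }

Later compute loops should use the same schedule(static) distribution so threads keep working on the pages they placed.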

Page 15:

Shared memory programming model

• OpenMP: an API for shared memory programming
• Both task and data parallelism can be implemented
• Uses a fork-join model
• The API consists of compiler directives, e.g.

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

• Variable scoping to control race conditions
• Bindings in C, C++ and Fortran

• Thread-based parallelism for shared memory systems
• Explicit parallelism (parallel regions)
• Fork/join model
• Based mostly on inserting compiler directives in the code

Parallelism in OpenMP

[Figure: OpenMP fork-join model: a master thread forks threads 0-2 for Parallel Task 1, joins, forks again for Parallel Task 2, and joins again before the end of the program]
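A self-contained version of the vector-add directive shown above (a minimal sketch; compile with an OpenMP-capable compiler, e.g. gcc -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double A[N], B[N], C[N];

        /* Fork a team of threads; loop iterations are divided among them. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
        {
            A[i] = i;
            B[i] = 2.0 * i;
            C[i] = A[i] + B[i];
        }

        printf("max threads: %d, C[N-1] = %f\n", omp_get_max_threads(), C[N - 1]);
        return 0;
    }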

Each compute node has one or more CPUs:

• Each CPU has multiple cores

• Each CPU has memory attached to it

Each node has an external network connection

Some systems have accelerators (e.g. GPUs)

Inside a Compute Node

Page 16:

Accelerators -- GPGPUs

[Figure: NVIDIA Pascal P100 die and a single Streaming Multiprocessor (SM); there are 56 SMs on the Pascal P100]

Page 17:

NVIDIA Tesla P100 (Pascal)

• Basic execution unit: the Streaming Multiprocessor (SM)
• > 3000 SP cores
  • Basic cores for uncomplicated tasks
  • Low clock frequency
• Large number of threads per core
• Limited registers per thread
• L1 cache local to each SM, L2 shared
• No synchronization between warps
• Limited main memory (16 GB)
• Meant for throughput
  • Fine-grained parallelism with 1000s of threads

Page 18:

Programming GPGPUs

• OpenACC
  • Simple compiler hints
  • Compiler generates threaded code
  • API for C, C++, Fortran
• CUDA
  • API/framework by NVIDIA to program GPUs
  • CUDA Toolkit
    • Dev tools: C/C++ compiler, debugger, profiler
    • Accelerated libraries: drop-in interfaces
  • Bindings in C/C++ and Fortran
  • PyCUDA: call CUDA from Python
• Unified Memory model mitigates the PCIe bottleneck

[Figure: a multicore CPU with host memory (RAM / main memory, 10^2-10^3 GB, ~90 GB/s) connected over a PCIe 3.0 bus (<10 GB/s) to a GPU with many CUDA cores and its own device memory (HBM2, 8-16 GB, ~720 GB/s)]
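A minimal sketch of the OpenACC "compiler hints" approach listed above, assuming an OpenACC-capable compiler (e.g. NVIDIA nvc with -acc); the directive asks the compiler to offload the loop and describes the data traffic across the PCIe bus:

    /* Offload a vector add to the GPU using OpenACC directives. */
    void vec_add_acc(int n, const float *restrict a,
                     const float *restrict b, float *restrict c)
    {
        #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }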

Page 19:

Scale out further

• Cluster of multi-socket nodes
  • Connected via a high-speed network (HSN)
  • Tight coupling
  • Multi-layer network topology
  • Electrical + optical connections
• Distributed memory model
  • Local memory address space per node
  • Hybrid nodes possible
• Data or task parallelism
• Communication only over the HSN
  • Higher latency than node-local resources
  • Lower bandwidth than node-local resources
• Each node has its own local memory, and data is transferred across the nodes through the network

Parallel architectures – distributed memory

[Figure: distributed memory architecture: several nodes, each a CPU with its own memory, connected by a network]

• Each node has a hybrid design with accelerators (e.g. GPUs) with their own local high-bandwidth memory space
• Multiple levels of parallelism

Parallel architectures – hybrid systems

[Figure: hybrid system: nodes containing two CPUs, each with its own memory, plus GPUs with their own device memory, all connected by a network]

Page 20:

Distributed Memory Programming

• Message Passing Interface (MPI): the de facto standard (300+ functions)
• Several libraries
  • MPICH, OpenMPI, MVAPICH
  • Vendor specific: Intel, Cray, SGI
• Send/receive data over the HSN
• Communication patterns
  • Point to point
  • Collectives: one to many, many to one, many to many
• Blocking / non-blocking calls
• One-sided communication

• Message Passing Interface
  • A standard defining how CPUs send and receive data
  • Vendor-specific implementations adhere to the standard
  • Allows CPUs to "talk" to each other, i.e. read and write memory

Parallelism in MPI

[Figure: CPU 0 sends Send_data from its memory across the network into Recv_data in the memory of CPU 1; processes 0-2 each run local serial computation, exchange messages in an MPI section, then resume local serial computation until the end of the program]
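A minimal sketch of the point-to-point pattern in the figure (not from the slides), using standard MPI_Send/MPI_Recv; build with an MPI wrapper such as mpicc and launch with mpirun -n 2:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                                        /* Send_data */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                       /* Recv_data */
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }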


Page 21:

Partitioned Global Address Space (PGAS) model

• Motivation: ease of use
• Emulates shared memory
• Global Address Space:
  • Data is shared in this space
  • Threads may read/write remote data without distinction of locality
• Partitioned:
  • The user designates data as local or global
• One-sided MPI under the hood (MPI over Remote Direct Memory Access)
• Language extensions: UPC, CAF
• New languages: Chapel, X10, Fortress

[Figure: global address space across threads 0-3: each thread has a private variable x, while the shared array y[0..7] is partitioned across the threads in the shared part of the address space]

Page 22:

• Data locality is the key
  • Lay out data in memory to match your access pattern (see the sketch below)
  • This helps compilers generate good code for both CPU and GPU

Single Instruction Multiple Data (SIMD)
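A small illustration of access-pattern-friendly layout (a sketch, not from the slides): C stores 2-D arrays row-major, so looping with the rightmost index innermost gives stride-1 accesses that caches and SIMD units handle well.

    #define N 1024

    /* Row-major traversal: consecutive iterations touch adjacent memory. */
    double sum_row_major(const double a[N][N])
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];        /* good: stride-1 accesses */
        return sum;
    }

    /* Swapping the loops gives stride-N accesses that defeat the caches. */
    double sum_col_major(const double a[N][N])
    {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];        /* bad: each access jumps N*8 bytes */
        return sum;
    }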

Page 23:

Scalar vs Vector Ops

Solving C[i] = A[i] + B[i], supposing a vector length of 4:

Scalar: four separate operations
  A0 + B0 = C0
  A1 + B1 = C1
  A2 + B2 = C2
  A3 + B3 = C3

Vector: one operation on four elements at once
  [A0 A1 A2 A3] + [B0 B1 B2 B3] = [C0 C1 C2 C3]

Scalar loop (one element per iteration):

    for (i = 0; i < n; i++)
        C[i] = A[i] + B[i];

Vector loop (four elements per iteration, pseudocode):

    for (i = 0; i < n; i += 4)
        C[i:i+3] = A[i:i+3] + B[i:i+3];
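In real C code the vector loop is usually produced by the compiler rather than written by hand; a sketch (not from the slides) using an OpenMP simd hint:

    #include <stddef.h>

    /* The simd directive asks the compiler to vectorize the loop, and
       restrict promises the arrays do not overlap, so it is safe to do so. */
    void vec_add_simd(size_t n, const float *restrict a,
                      const float *restrict b, float *restrict c)
    {
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }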

Page 24:

Parallel programming layers

How to get the maximum out of the modern HPC architecture?

• MPI across the nodes

• Multithreading on the node – OpenMP

• Vectorization employed by each thread

Source: Colfax
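A hedged sketch of how the three layers can combine in one code (not from the slides): MPI ranks across nodes, OpenMP threads within each rank, SIMD inside each thread. halo_exchange is a hypothetical helper that would use MPI point-to-point calls.

    #include <mpi.h>

    void halo_exchange(double *u, int n, MPI_Comm comm);   /* hypothetical */

    /* One step of a 1-D stencil update. */
    void hybrid_step(double *restrict u, double *restrict u_new,
                     const double *restrict f, int n, MPI_Comm comm)
    {
        halo_exchange(u, n, comm);             /* MPI layer: across nodes */

        #pragma omp parallel for simd          /* OpenMP threads + SIMD lanes */
        for (int i = 1; i < n - 1; i++)
            u_new[i] = 0.5 * (u[i - 1] + u[i + 1]) + f[i];
    }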

Page 25:

An abstract supercomputer

• Supercomputers are expensive scientific instruments
• Access is shared
• A scheduler provides access to the compute nodes
• High performance storage hides I/O latency

[Figure: abstract supercomputer: login nodes, a scheduler, compute nodes, data movers and high performance storage]

Page 26:

High Performance Storage
Parallel file system: Lustre FS

• Compute nodes do I/O via a dedicated high-speed interconnect
• The MDS (metadata server) controls the state of files
• The OSSes (object storage servers) maintain consistency
• OSTs (object storage targets) = disk pools
• Performance expectations
  • Parallel I/O patterns
  • Large files striped over OSTs
  • Hides latency through increased bandwidth
• Use high performance I/O libraries (see the MPI-IO sketch after the figure)
  • MPI-IO, HDF5, NetCDF, ADIOS
• Data redundancy provided

[Figure: Lustre architecture: clients connect over an InfiniBand interconnect to an MDS (backed by an MDT) and several OSSes, each serving a set of OSTs]
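A minimal sketch of parallel I/O with MPI-IO, one of the libraries listed above (not from the slides): each rank writes its own contiguous block of a shared file at an offset computed from its rank.

    #include <mpi.h>

    /* Each rank writes 'count' doubles from 'buf' into a shared file,
       at a rank-dependent offset so that no two ranks overlap. */
    void write_shared_file(const char *path, const double *buf, int count)
    {
        int rank;
        MPI_File fh;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_File_open(MPI_COMM_WORLD, path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);        /* collective write */

        MPI_File_close(&fh);
    }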

Page 27:

• If you can trade away data redundancy:
  • Local SSD pools can improve IOPS (expensive today, but looking better tomorrow)
  • Non-Volatile Memory Express (NVMe) pools offer higher capacity
  • In-memory persistent storage for high throughput (large DRAM with a volatile RAM disk)
  • NVDIMMs?? (not out yet)
• Test your workload on the various solutions
• If possible, it is best to use high performance I/O libraries

High Performance Storage
Local storage pools

Page 28:

Thank you - questions welcome

[email protected]

Documentation and Training Material: http://support.pawsey.org.au