Trends in High Performance Computing Today and Tomorrow
Mohsin Ahmed Shaikh, Supercomputing Applications Specialist
Motivations for using HPC

Typically HPC is used to achieve:
• Throughput: in a certain amount of time, do
  • More iterations of the same instructions/operations
  • More scenarios on the same data
  • More independent (same or different) tasks on the same or different data
• Capability: solve bigger problems
  • e.g. larger-scale model systems, to understand emergent properties
  • Denser problems: e.g. higher resolution or more detail, to understand a mechanism (deep dive)
So, what is Performance?

An application can be:
• Compute intensive/bound
  • e.g. spends most of the simulation time doing FLOPs
• Memory intensive/bound
  • e.g. spends most of the simulation time moving data between memory and caches
• I/O intensive/bound
  • e.g. spends most of the simulation time reading/writing data to disk
Parallelism

• Can I break my program into tasks and execute them in parallel?
• Tasks may have load imbalance
• Tasks may have dependencies
  • Some may need to run before all others
Types of Parallelism

• Data parallelism (domain decomposition): the same operation (e.g. C = A + B) is applied by workers P1…P4, each to its own partition of the data.
• Task parallelism: a task pool, maintained by a master, from which workers pull independent tasks (Task A…Task F) as they become free.
Granularity of Parallelism

Coarse-grained parallelism (high level):
• May require code refactoring
• Even distribution of work; load balancing is the key
• Greater autonomy = less synchronization
• More scalable

Fine-grained parallelism (low level):
• Easier to implement (incrementally)
• More synchronization overhead
• Easier to load balance
• Scalability is generally limited
Amdahl’s Law

For a fixed problem size, the scalability of a program is limited by its serial fraction.

S: Speedup
P: Parallel fraction of the program
1-P: Serial fraction of the program
N: Number of workers
O_N: Parallel overhead for N workers

[Figure: speedup vs. number of CPUs under Amdahl's Law for parallel fractions of 85%, 90% and 98%; all curves flatten out well before 500 CPUs.]
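Written out with the quantities defined in the legend above (the exact placement of the overhead term O_N is an assumption here; the classical statement of the law omits it):

```latex
S(N) = \frac{1}{(1 - P) + \dfrac{P}{N} + O_N}
```

As N grows, P/N vanishes and S approaches 1/(1-P), which is why the speedup curves plateau: even at P = 98%, the speedup can never exceed 50.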
Gustafson’s Law

At large scale and big enough problem size, the scalability of a program may not be limited by its serial fraction.

S: Speedup
P: Parallel fraction of the program
1-P: Serial fraction of the program
N: Number of workers
O_N: Parallel overhead for N workers

[Figure: scaled speedup vs. number of CPUs under Gustafson's Law for parallel fractions of 50%, 85%, 90% and 98%; the curves grow nearly linearly up to 500 CPUs.]
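With the same symbols, Gustafson's scaled speedup (again with an overhead term O_N added as an assumption, following the legend) reads:

```latex
S(N) = (1 - P) + P \cdot N - O_N
```

Because the problem grows with N, the parallel part contributes a term proportional to N, so speedup keeps rising almost linearly instead of saturating as in Amdahl's fixed-size case.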
42 years of Microprocessor trend
Multicore architecture

• Multiple cores on a single die to scale out, because it is power efficient
• Dedicated execution resources per core, e.g. registers, ALU, FPU and vector units, L1 & L2 cache (SRAM), etc.
• Hardware threads per core
• Shared last-level cache (cache coherent)
• Uncore units

[Figure: single-socket die of an Intel Xeon, showing the individual cores]
Memory hierarchy

• Latency increases further away from the CPU
• Capacity increases further away from the CPU
• Bandwidth decreases further away from the CPU

| Memory level | Latency | Capacity | Volatility |
| --- | --- | --- | --- |
| L1 cache | ~1 ns | 32 KB | volatile |
| L2 cache | ~2.5 ns | 256 KB | volatile |
| LLC cache | ~10 ns | 10¹–10² MB | volatile |
| DRAM | ~60 ns | 10¹–10³ GB | volatile |
| NVDIMMs | ? | ~6 TB | non-volatile / persistent |
| NVRAM | ~600 ns | 10¹–10³ GB | non-volatile / persistent |
| FLASH (R/W) | 50/500 µs | 10¹–10³ GB | non-volatile / persistent |
| HDD (R/W) | 5/0.5 ms | 10¹–10³ TB | non-volatile / persistent |
| Tape | ~50 s | 10¹–10³ TB | non-volatile / persistent |
Beyond multicore

• Adding unlimited cores to a single silicon die is not possible
  • Memory does not scale with increasing CPU count
• Solution? Scaling out:
  • Multiple multicore sockets
  • Cores see a single pool of memory (global address space)
  • Non-Uniform Memory Access (NUMA)
  • Shared memory model

[Figure: two sockets forming two NUMA domains, NUMA 0 and NUMA 1]
Parallelism in OpenMP

Shared memory programming model:
• OpenMP: an API for shared memory programming
• Both task and data parallelism can be implemented
• Thread-based parallelism for shared memory systems, using a fork/join model
• Explicit parallelism (parallel regions)
• The API consists mostly of compiler directives inserted in the code, e.g.:

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

• Variable scoping to control race conditions
• Bindings in C, C++ and Fortran

[Figure: OpenMP fork/join model: a master thread forks threads 0–2 for Parallel Task 1, joins, forks again for Parallel Task 2, then joins before the end of the program.]

Inside a Compute Node

Each compute node has one or more CPUs:
• Each CPU has multiple cores
• Each CPU has memory attached to it
Each node has an external network connection.
Some systems have accelerators (e.g. GPUs).
Accelerators: GPGPUs

• NVIDIA Pascal P100

[Figure: a Streaming Multiprocessor (SM); there are 56 SMs on the Pascal P100]
NVIDIA Tesla P100 (Pascal)

• Basic execution unit: the Streaming Multiprocessor (SM)
• > 3000 single-precision (SP) cores
• Basic cores, suited to uncomplicated tasks
  • Low clock frequency
• Large number of threads per core
  • Limited registers per thread
• L1 cache local to each SM; L2 shared
• No synchronization between warps
• Limited main memory (16 GB)
• Meant for throughput:
  • fine-grained parallelism with 1000s of threads
Programming GPGPUs

• OpenACC
  • Simple compiler hints
  • Compiler generates threaded code
  • API for C, C++, Fortran
• CUDA
  • API/framework by NVIDIA to program GPUs
  • CUDA Toolkit
    • Dev tools: C/C++ compiler, debugger, profiler
    • Accelerated libraries: drop-in interfaces
  • Bindings in C/C++ and Fortran
  • PyCUDA: call CUDA from Python
• Unified Memory model mitigates the PCIe bottleneck

[Figure: a multicore CPU with host memory (RAM, 10²–10³ GB, ~90 GB/s) connected over a PCIe 3.0 bus (< 10 GB/s) to a GPU with many CUDA cores and device memory (HBM2, 8–16 GB, ~720 GB/s)]
Scale out further

Parallel architectures: distributed memory
• Cluster of multisocket nodes connected via a high-speed network (HSN)
  • Tight coupling
  • Multi-layer network topology
  • Electrical + optical connections
• Distributed memory model: each node has its own local memory address space
  • Hybrid nodes possible
• Data or task parallelism
• Communication happens only over the HSN
  • Higher latency than node-local resources
  • Less bandwidth than node-local resources
• Each node has its own local memory, and data is transferred across the nodes through the network

[Figure: several nodes, each a CPU with its own memory, connected by a network]

Parallel architectures: hybrid systems
• Each node has a hybrid design with accelerators (e.g. GPUs) that have their own local high-bandwidth memory space
• Multiple levels of parallelism

[Figure: networked nodes, each containing two CPUs (each with its own memory) and two GPUs (each with its own memory)]
Distributed Memory Programming

• Message Passing Interface (MPI): the de facto standard (300+ functions)
  • A standard defining how CPUs send and receive data
  • Allows CPUs to “talk” to each other, i.e. read and write each other's memory
  • Vendor-specific implementations adhere to the standard
• Several libraries: MPICH, OpenMPI, MVAPICH
  • Vendor-specific: Intel, Cray, SGI
• Send/Recv data over the HSN
• Communication patterns:
  • Point to point
  • Collectives: one-to-many, many-to-one, many-to-many
• Blocking / non-blocking calls
• One-sided communication

Parallelism in MPI

[Figure: CPU 0 sends Send_data from its memory across the network to CPU 1, which receives it into Recv_data. Each process alternates local serial computation with MPI sections until the end of the program.]
Partitioned Global Address Space (PGAS) model

• Motivation: ease of use
  • Presents the illusion of shared memory
• Global address space:
  • Data is shared in this space
  • Threads may read/write remote data without distinction of locality
• Partitioned:
  • The user designates data as local or global
• One-sided MPI under the hood (MPI over Remote Direct Memory Access)
• Language extensions: UPC, CAF
• New languages: Chapel, X10, Fortress

[Figure: with 4 threads, each thread holds a private variable x, while a shared array y[0..7] lives in the global address space, its elements distributed across the threads.]
Single Instruction Multiple Data (SIMD)

• Data locality is the key
• Lay out data in memory to match your access pattern
  • This helps compilers generate efficient code for both CPU and GPU
Scalar vs Vector Ops

Solving C[i] = A[i] + B[i], supposing vector length = 4.

Scalar: four separate additions:

    A0 + B0 = C0
    A1 + B1 = C1
    A2 + B2 = C2
    A3 + B3 = C3

Vector: one instruction adds all four lanes at once:

    [A0 A1 A2 A3] + [B0 B1 B2 B3] = [C0 C1 C2 C3]

Scalar loop, one element per iteration:

    for (i = 0; i < n; i++)
        C[i] = A[i] + B[i];

Vectorized loop, one vector instruction covering 4 elements per iteration:

    for (i = 0; i < n; i += 4)
        C[i..i+3] = A[i..i+3] + B[i..i+3];   /* pseudocode: 4 lanes at once */
Parallel programming layers

How to get the maximum out of a modern HPC architecture?
• MPI across the nodes
• Multithreading on the node (OpenMP)
• Vectorization employed by each thread

(Source: Colfax)
An abstract supercomputer

• Supercomputers are expensive scientific instruments
• Access is shared
• A scheduler provides access to the compute nodes
• High performance storage hides I/O latency

[Figure: abstract supercomputer: login nodes, scheduler, compute nodes, data movers, and high performance storage]
High Performance Storage: the Lustre parallel file system

• Compute nodes do I/O via a dedicated high-speed interconnect
• The MDS (Metadata Server) controls the state of files
• OSSs (Object Storage Servers) maintain consistency
• OSTs (Object Storage Targets) = disk pools
• Performance expectations:
  • Parallel I/O patterns
  • Large files are striped over OSTs
  • Latency is hidden by increased bandwidth
• Use high performance I/O libraries: MPI-IO, HDF5, NetCDF, ADIOS
• Data redundancy provided

[Figure: Lustre architecture: clients connect over an InfiniBand interconnect to an MDS (backed by an MDT) and several OSSs, each serving its OSTs]
High Performance Storage: local storage pools

• If you can trade away data redundancy:
  • Local SSD pools can improve IOPS (expensive today, but looking better tomorrow)
  • Non-Volatile Memory Express (NVMe) pools offer higher capacity
  • In-memory persistent storage for high throughput: large DRAM with a volatile RAM disk
  • NVDIMMs?? (not out yet)
• Test your workload on the various solutions
• If possible, it is best to use high performance I/O libraries
Thank you - questions welcome

Documentation and training material: http://support.pawsey.org.au