Optimisation of Parallel Scientific Applications on Highly

22
Optimisation of Parallel Scientific Applications on Highly Heterogeneous Modern Multicore and Multi-GPU Computing Nodes Vladimir Rychkov Ziming Zhong Heterogeneous Computing Laboratory School of Computer Science and Informatics University College Dublin May 11, 2012 Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 1 / 16

Transcript of Optimisation of Parallel Scientific Applications on Highly

Page 1: Optimisation of Parallel Scientific Applications on Highly

Optimisation of Parallel Scientific Applicationson Highly Heterogeneous Modern Multicore and

Multi-GPU Computing Nodes

Vladimir Rychkov Ziming Zhong

Heterogeneous Computing LaboratorySchool of Computer Science and Informatics

University College Dublin

May 11, 2012

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 1 / 16

Page 2: Optimisation of Parallel Scientific Applications on Highly

Introduction

Introduction

Hybrid multi-CPU/GPU architectures in HPCHow to utilise highly heterogeneous hardware and software stack?

Target platforms and applicationsDedicated multicore and multi-GPU nodes and clustersData-parallel applications dependent on data locality

Performance modeling and model-based optimisationRealistic models of processors executing data-parallel applicationAccurate measurement of performance on each processorOptimal data partitioning based on the models

Functional performance modelsHeterogeneous uniprocessor clusters

How to model performance of devices on a hybrid platform?Different programming models and softwareShared resources of different speed and capacity

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 2 / 16

Page 3: Optimisation of Parallel Scientific Applications on Highly

Introduction

Introduction

Hybrid multi-CPU/GPU architectures in HPCHow to utilise highly heterogeneous hardware and software stack?

Target platforms and applicationsDedicated multicore and multi-GPU nodes and clustersData-parallel applications dependent on data locality

Performance modeling and model-based optimisationRealistic models of processors executing data-parallel applicationAccurate measurement of performance on each processorOptimal data partitioning based on the models

Functional performance modelsHeterogeneous uniprocessor clusters

How to model performance of devices on a hybrid platform?Different programming models and softwareShared resources of different speed and capacity

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 2 / 16

Page 4: Optimisation of Parallel Scientific Applications on Highly

Introduction

Introduction

Hybrid multi-CPU/GPU architectures in HPCHow to utilise highly heterogeneous hardware and software stack?

Target platforms and applicationsDedicated multicore and multi-GPU nodes and clustersData-parallel applications dependent on data locality

Performance modeling and model-based optimisationRealistic models of processors executing data-parallel applicationAccurate measurement of performance on each processorOptimal data partitioning based on the models

Functional performance modelsHeterogeneous uniprocessor clusters

How to model performance of devices on a hybrid platform?Different programming models and softwareShared resources of different speed and capacity

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 2 / 16

Page 5: Optimisation of Parallel Scientific Applications on Highly

Introduction

Hybrid HPC Platform

Device Memory

GPU

PCIe

Interface

PCIe

Interface

Multicore Host

...

Main Memory Device Memory

GPU

core 1

cache cache

core n

shared cache

Highly heterogeneous devices

Memory: shared and distributed

Resource contention: memory, PCI Express

Can be modelled as a heterogeneous distributed-memory systemFunctional performance models of devicesIntra-node data partitioning between devices

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 3 / 16

Page 6: Optimisation of Parallel Scientific Applications on Highly

Introduction

Hybrid HPC Platform

Device Memory

GPU

PCIe

Interface

PCIe

Interface

Multicore Host

...

Main Memory Device Memory

GPU

core 1

cache cache

core n

shared cache

Highly heterogeneous devices

Memory: shared and distributed

Resource contention: memory, PCI Express

Can be modelled as a heterogeneous distributed-memory systemFunctional performance models of devicesIntra-node data partitioning between devices

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 3 / 16

Page 7: Optimisation of Parallel Scientific Applications on Highly

Introduction

Outline

Functional performance models of multiple cores and GPUs

Accurate measurement of performance of devices

Optimisation of computational kernels

Intra-node data partitioning based on FPMs of devices

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 4 / 16

Page 8: Optimisation of Parallel Scientific Applications on Highly

Functional Performance Models of Multiple Cores and GPUs

Heterogeneous Data-Parallel Application

Heterogeneous matrix multiplicationA B n x bCb

b

n x b

Beaumont, O., Boudet, V., Rastello, F., Robert, Y.: Matrix Multiplication onHeterogeneous Platforms. IEEE Trans. Parallel Distrib. Syst. 12(10), 1033-1051 (2001)

Representative kernel, GEMM: Ci+ = A(b) × B(b)

+=bX

b

m x b

n x b

i m x bi

i

n x bi

A B C i(b) (b)

Computation is proportional to the area of submatrix Ci

The same memory access pattern as the whole application

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 5 / 16

Page 9: Optimisation of Parallel Scientific Applications on Highly

Functional Performance Models of Multiple Cores and GPUs

Functional Performance Models of Devices

Optimised GEMM routines: MKL/ACML(CPU), CUBLAS(GPU)

CPU cores compete with each other within a socketsc(x) - speed of a socket executing CPU GEMM on c cores,with submatrices x/c

GPU depends on a host process, which handles data transfers

GPU performance can be measured only within some rangeg(x) - combined speed of a GPU and its dedicated core executing CUBLAS,can be defined at [0,∞) for out-of-core GEMM

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 6 / 16

Page 10: Optimisation of Parallel Scientific Applications on Highly

Functional Performance Models of Multiple Cores and GPUs

Functional Performance Models of Devices

Optimised GEMM routines: MKL/ACML(CPU), CUBLAS(GPU)

CPU cores compete with each other within a socketsc(x) - speed of a socket executing CPU GEMM on c cores,with submatrices x/c

GPU depends on a host process, which handles data transfers

GPU performance can be measured only within some rangeg(x) - combined speed of a GPU and its dedicated core executing CUBLAS,can be defined at [0,∞) for out-of-core GEMM

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 6 / 16

Page 11: Optimisation of Parallel Scientific Applications on Highly

Functional Performance Models of Multiple Cores and GPUs

Functional Performance Models of Devices

Optimised GEMM routines: MKL/ACML(CPU), CUBLAS(GPU)

CPU cores compete with each other within a socketsc(x) - speed of a socket executing CPU GEMM on c cores,with submatrices x/c

GPU depends on a host process, which handles data transfers

GPU performance can be measured only within some rangeg(x) - combined speed of a GPU and its dedicated core executing CUBLAS,can be defined at [0,∞) for out-of-core GEMM

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 6 / 16

Page 12: Optimisation of Parallel Scientific Applications on Highly

Performance Measurement on Hybrid Platforms

Performance Measurement on Hybrid Platforms

Bind processes to cores- to avoid potential performance degradation resulting fromautomatic rearranging of processes by operating system

Synchronise processes- to ensure resources are shared between processes

Repeat experiments multiple times- until the results are proved statistically reliable

Performance of multiple cores and GPUs can be measured separately

Simultaneous benchmark of multiple cores on a socket

Combined benchmark for a GPU and its dedicated core

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 7 / 16

Page 13: Optimisation of Parallel Scientific Applications on Highly

Performance Measurement on Hybrid Platforms

Performance Measurement on Hybrid Platforms

Bind processes to cores- to avoid potential performance degradation resulting fromautomatic rearranging of processes by operating system

Synchronise processes- to ensure resources are shared between processes

Repeat experiments multiple times- until the results are proved statistically reliable

Performance of multiple cores and GPUs can be measured separately

Simultaneous benchmark of multiple cores on a socket

Combined benchmark for a GPU and its dedicated core

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 7 / 16

Page 14: Optimisation of Parallel Scientific Applications on Highly

Performance Measurement on Hybrid Platforms

Experimental Platform

Multicore multi-GPU node ig.eecs.utk.eduInnovative Computing Laboratory, University of Tennessee, USA

CPU (AMD) GPUs (NVIDIA)

Architecture Opteron 8439SE GeForce GTX480 Tesla C870Core Clock 2.8 GHz 700 MHz 600 MHzNumber of Cores 8×6 cores 480 cores 128 coresMemory Size 64 GB 1536 MB 1536 MBMemory Bandwidth 177.4 GB/s 76.8 GB/s

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 8 / 16

Page 15: Optimisation of Parallel Scientific Applications on Highly

Functional Performance Models of Multiple Cores

Functional Performance Models of Multiple Cores

Speed functions of multiple cores

60

80

100

120

0 1e+08 2e+08 3e+08 4e+08

Sp

ee

d (

GF

lop

s)

Matrix area (d)

socket

5 cores, b 640 6 cores, b 640

s5(x), s6(x) - speed functions for5 and 6 cores on a socket

Resource contention with GPU

60

80

100

120

0 1e+08 2e+08 3e+08 4e+08 5e+08

Sp

ee

d (

GF

lop

s)

Matrix area (d)

5 cores

CPU-onlycores:GPU = 1:3

cores:GPU = 1:8

s5(x) remains stable when sharingresource with GPU

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 9 / 16

Page 16: Optimisation of Parallel Scientific Applications on Highly

Functional Performance Models of GPUs

Functional Performance Models of GPUs

Speed functions of GeForce GTX480

0

200

400

600

800

0 4e+08 8e+08 1.2e+09 1.6e+09

Sp

ee

d (

GF

lop

s)

Matrix area (d)

memory limit

b 640 b 640, overlap

Out-of-core implementationextends the range of problem sizes

Overlapping improves performance

Resource contention

0

200

400

600

800

0 4e+08 8e+08 1.2e+09 1.6e+09

Sp

ee

d (

GF

lop

s)

Matrix area (d)

GPU-onlycores:GPU = 1:3

cores:GPU = 1:8

Performance decreases around 10%from resource contention

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 10 / 16

Page 17: Optimisation of Parallel Scientific Applications on Highly

Functional Performance Models of GPUs

Optimisation of the Kernel

Out-of-core computations

Overlap data transfers and computation

A0 B0 C0

A1

GEMM

GEMM C1

A0 C0GEMM

Stream 0

Stream 1

........ ........ ........ ........

time

C0

C1

C0Stream 2

...

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 11 / 16

Page 18: Optimisation of Parallel Scientific Applications on Highly

FPM-based Data Partitioning on Hybrid Platform

FPM-based Data Partitioning on Hybrid Platform

Execution time of the parallel matrix multiplication applicationon different configurations on the hybrid server

Execution time (sec)Matrix size CPUs GPUs CPUs+GPUs

12800× 12800 14.6 10.5 5.819200× 19200 43.4 32.5 16.225600× 25600 99.8 147.9 38.232000× 32000 189.2 265.3 114.1

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 12 / 16

Page 19: Optimisation of Parallel Scientific Applications on Highly

Ongoing and Future Study

Ongoing and Future Study

Target platform: multi-GPU servers - more resource contention

Accurate measurement technique

FPM of multiple GPUs on the same server

Design of computational kernel for multi-GPU servers

Experiments on data partitioning with different models

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 13 / 16

Page 20: Optimisation of Parallel Scientific Applications on Highly

Output

Publications

Zhong, Z., Rychkov, V., Lastovetsky, A. ”Data Partitioning onHeterogeneous Multicore Platforms”, Cluster 2011, Austin, Texas,USA, IEEE Computer Society, pp. 580-584 (2011).

Zhong, Z., Rychkov, V., Lastovetsky, A. ”Functional PerformanceModels of Scientific Applications for Heterogeneous Multicore andMulti-GPU Systems” (submitted to Euro-Par 2012, Rhodes Island,Greece).

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 14 / 16

Page 21: Optimisation of Parallel Scientific Applications on Highly

Output

Output

Project web page: http://hcl.ucd.ie/project/fupermod

Software

Implemented in the FuPerMod package, developed at HCL

Based on system and mathematical software: C/C++, MPI,Autotools, GNU Scientific Library, Boost C++ libraries, BLAS,CUDA Toolkit

Team

1 postdoctoral researcher: Vladimir Rychkov

2 PhD students: David Clarke, Ziming Zhong

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 15 / 16

Page 22: Optimisation of Parallel Scientific Applications on Highly

Output

CollaborationHardware

Multicore multi-GPU servers (Innovative Computing Laboratory, Universityof Tennessee, USA)

Multicore multi-GPU cluster (High Performance Computing & Architecturesgroup, University Jaume I, Spain)

Grid’5000 (INRIA, CNRS, RENATER, France)

Collaboration

Leonel Sousa, Alexandar Ilic (Institute of System Engineering, ComputerResearch and Development, Portugal)

Enrique Quintana Orti (High Performance Computing & Architectures,University Jaume I, Spain)

Financial Support

Science Foundation UCD CSI China Scholarship Complex HPC actionIreland Council COST

Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 16 / 16