Transcript of High Performance Computing: Concepts, Methods & Means – Performance 3: Measurement

Page 1: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

High Performance Computing: Concepts, Methods & Means

Performance 3 : Measurement

Prof. Thomas SterlingDepartment of Computer Science

Louisiana State University

February 27th, 2007

Page 2: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

Term Projects

• Graduate students only
• Due date: April 19th, 2007
• Total time: approx. 40 hours
• 20% of final grade
• 4 categories of projects:
  – A) Technology evolution report (20-page report)
  – B) Fixed application code execution scaling (7-page report)
  – C) Synthetic code parametric studies (7-page report)
  – D) Parallel application development (7-page report)
• 1-paragraph abstract due March 9th
  – Email to Chirag by COB Friday

Page 3: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

Term Projects – Technology Evolution

• In-depth survey of an enabling technology
• Report on capability with respect to time and contributing factors
• Two general classes of technology:
  – Device technology
    • Main memory
    • Secondary storage
    • System network
    • Logic
  – Architecture
    • SIMD
    • Vector
    • Systolic
    • Dataflow

Page 4: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

Term Project – Fixed Application Scaling

• Select an application code
  – Need not be one of those used in class
  – Must be a parallel code
  – You need not write it yourself
• Select two or more system parameters to scale
  – # processors
  – # nodes
  – Network bandwidth and/or latency
  – Data block partition size
• Use performance measurement and profiling tools
  – Describe measured trends
  – Diagnose reasons for observed results

Page 5: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

Term Project – Synthetic Parametric Study

• Write a code expressly to exercise one or more system functions
  – Parallelism
  – Network bandwidth
  – Memory bandwidth
• Allow at least one dimension to be independent and adjustable
  – Message insert rate
  – Message packet size
  – Overhead time
• Show system operation with respect to the parameter

Page 6: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

Term Project – Roll your own

• Write a small parallel application program
  – Preferably not one we’ve done in class
  – Can be something you’ve done in another class or research project, modified for MPI or OpenMP
  – Please! Do this yourself!!
  – Libraries are permitted
• Use profiling tools to determine where most of the work is being done
• Demonstrate scaling with respect to # processors

Page 7: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

7

Topics

• Introduction
• Performance Characteristics & Models
• Performance Models: LogP
• Performance Models: LogGP
• Benchmarks: b_eff
• MPI Tracing: PMPI
• TAU & MPI
• Summary – Materials for Test

Page 8: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

8

Topics

• Introduction
• Performance Characteristics & Models
• Performance Models: LogP
• Performance Models: LogGP
• Benchmarks: b_eff
• MPI Tracing: PMPI
• TAU & MPI
• Summary – Materials for Test

Page 9: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

9

Please understand when to use the following and what they mean:

• API Elements:
  – MPI_Init(), MPI_Finalize()
  – MPI_Comm_size(), MPI_Comm_rank()
  – MPI_COMM_WORLD
  – Error checking using MPI_SUCCESS
  – MPI basic data types (slide 27)
  – Blocking: MPI_Send(), MPI_Recv()
  – Non-blocking: MPI_Isend(), MPI_Irecv(), MPI_Wait()
  – Collective calls: MPI_Barrier(), MPI_Bcast(), MPI_Gather(), MPI_Scatter(), MPI_Reduce()
• Commands:
  – Running MPI programs: mpirun
  – Compile: mpicc
  – Compile: mpif77
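As a refresher, here is a minimal sketch (not from the original slides) that exercises several of these elements: initialization, rank/size queries, a collective MPI_Reduce, and return-code checks against MPI_SUCCESS. Note that MPI aborts on error by default, so the checks are mainly illustrative.

#include <stdio.h>
#include <mpi.h>

/* Illustrative sketch: each rank contributes its rank number and
 * rank 0 prints the sum, checking return codes against MPI_SUCCESS. */
int main(int argc, char **argv) {
    if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
        fprintf(stderr, "MPI_Init failed\n");
        return 1;
    }

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int value = rank, sum = 0;
    /* Collective call: reduce all ranks' values onto rank 0 */
    if (MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD) != MPI_SUCCESS)
        fprintf(stderr, "MPI_Reduce failed on rank %d\n", rank);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, sum);

    MPI_Finalize();
    return 0;
}

Compile with mpicc and launch with mpirun, as listed under Commands above.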

Page 10: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

10

Where are we?

• Three classes of parallel computing
  – Capacity
  – Cooperative
  – Capability
• Three execution models
  – Throughput
  – Shared-memory multithreaded
  – Communicating sequential processes (message passing)
• Three programming formalisms
  – Condor
  – OpenMP
  – MPI
• More performance modeling and measurement
  – For cooperative / message passing / MPI

Page 11: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

11

Topics

• Introduction

• Performance Characteristics & Models
• Performance Models: LogP
• Performance Models: LogGP
• Benchmarks: b_eff
• MPI Tracing: PMPI
• TAU & MPI
• Summary – Materials for Test

Page 12: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

12

What has changed? SMP to MPP

• SMP – symmetric multiprocessor
  – Shared memory
    • UMA – uniform memory access with cache coherence
  – Multithreaded parallelism
  – Communication through main memory
  – Not scalable
  – Programming in OpenMP
  – DSM and PGAS provide alternative shared-memory structures
    • DSM – distributed shared memory (with cache coherence)
    • PGAS – partitioned global address space (without cache coherence)
    • Both are NUMA
• MPP – massively parallel processor
  – Distributed memory
    • NUMA – non-uniform memory access
  – Communicating sequential processes parallelism
  – Communication through messages between nodes
  – Scalable
  – Programming in MPI
  – Same for commodity clusters, but usually with weaker networks

Page 13: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

13

MPI Performance Characteristics

• Latency
  – Time to send the first bits of data across the link to a remote node
  – Does not include overhead
• Bandwidth
  – Rate of data transfer across the link to a remote node
• Buffers
  – System or user buffers take time to manage and consume capacity, etc.
• Blocking versus asynchronous
  – Forced ordering of computation and communication (a non-blocking call can relax this; see the sketch below)
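To make the blocking-versus-asynchronous distinction concrete, here is a hedged sketch (not from the slides): a non-blocking MPI_Isend/MPI_Irecv pair lets computation proceed while the message is in flight, and ordering is only enforced at MPI_Wait. The buffer size and the placeholder work are arbitrary choices.

#include <stdio.h>
#include <mpi.h>

/* Hedged sketch: rank 0 posts a non-blocking send, rank 1 a non-blocking
 * receive; both can compute before MPI_Wait. Needs at least 2 ranks. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double buf[1024];
    MPI_Request req = MPI_REQUEST_NULL;   /* null request so extra ranks just fall through */
    MPI_Status st;

    if (size >= 2 && rank == 0) {
        for (int i = 0; i < 1024; i++) buf[i] = i;
        MPI_Isend(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (size >= 2 && rank == 1) {
        MPI_Irecv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    /* ... computation that does not touch buf can proceed here,
       overlapping with the message transfer ... */

    MPI_Wait(&req, &st);   /* blocking happens only at this point */

    if (rank == 1) printf("rank 1 received, buf[10]=%g\n", buf[10]);

    MPI_Finalize();
    return 0;
}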

Page 14: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

14

Granularities of Time Measurements

Timer              | Usage                       | Wallclock / CPU | Resolution     | Languages      | Portable?
time               | shell / script              | both            | 1/100th second | any            | yes
timex              | shell / script              | both            | 1/100th second | any            | yes
gettimeofday       | subroutine                  | wallclock       | microsecond    | C/C++          | yes
read_real_time     | subroutine                  | wallclock       | nanosecond     | C/C++          | no
rtc                | subroutine                  | wallclock       | microsecond    | Fortran        | no
irtc               | subroutine                  | wallclock       | nanosecond     | Fortran        | no
dtime_             | subroutine                  | CPU             | 1/100th second | Fortran        | no
etime_             | subroutine                  | CPU             | 1/100th second | Fortran        | no
mclock             | subroutine                  | CPU             | 1/100th second | Fortran        | no
timef              | subroutine                  | wallclock       | millisecond    | Fortran        | no
MPI_Wtime          | subroutine                  | wallclock       | microsecond    | C/C++, Fortran | yes
AIX Trace Facility | shell / script / subroutine | wallclock       | microsecond    | any            | no
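As a usage sketch for two of the timers in the table, the fragment below brackets an arbitrary loop with gettimeofday (wallclock, microsecond resolution) and MPI_Wtime (portable wallclock seconds); the timed loop is only a placeholder.

#include <stdio.h>
#include <sys/time.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* gettimeofday: wallclock time with microsecond resolution */
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);

    /* MPI_Wtime: portable wallclock timer returning seconds as a double */
    double w0 = MPI_Wtime();

    volatile double s = 0.0;                      /* placeholder work being timed */
    for (int i = 0; i < 10000000; i++) s += i * 1e-9;

    double w1 = MPI_Wtime();
    gettimeofday(&t1, NULL);

    double gtd = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    printf("gettimeofday: %g s, MPI_Wtime: %g s (tick = %g s)\n",
           gtd, w1 - w0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}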

Page 15: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

15

Performance Factors

• Platform / Architecture Related:
  – CPU - clock speed, number of CPUs
  – Memory subsystem - memory and cache configuration, memory-cache-CPU bandwidth, memory copy bandwidth
  – Network adapters - type, latency and bandwidth characteristics
  – Operating system characteristics - many
• Network Related:
  – Hardware - Ethernet, FDDI, switch, intermediate hardware (routers)
  – Protocols - TCP/IP, UDP/IP, other
  – Configuration, routing, etc.
  – Network tuning options (the "no" command)
  – Network contention / saturation

Source: http://www.llnl.gov/computing/tutorials/mpi_performance/

Page 16: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

16

Performance Factors (2)

• Application Related:
  – Algorithm efficiency and scalability
  – Communication-to-computation ratios
  – Load balance
  – Memory usage patterns
  – I/O
  – Message size used
  – Types of MPI routines used - blocking, non-blocking, point-to-point, collective communications
• MPI Implementation Related:
  – Message buffering
  – Message passing protocols - eager, rendezvous, other
  – Sender-receiver synchronization - polling, interrupt
  – Routine internals - efficiency of the algorithm used to implement a given routine

Source: http://www.llnl.gov/computing/tutorials/mpi_performance/

Page 17: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

17

Performance Impact of Message Sizes

• Message size can be a very significant contributor to MPI application performance. In most cases, increasing the message size will yield better performance.

• For communication intensive applications, algorithm modifications that take advantage of message size "economies of scale" may be worth the effort. Performance can often improve significantly within a relatively small range of message sizes.

• The following three graphs demonstrate how increasing the message size can improve bandwidth over different message-size ranges.

Page 18: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

18

MPI Performance Models

• Hockney: point-to-point
  – Time to send: t = t0 + m/r_inf
  – t0: fixed cost per message (startup cost)
  – m: message length
  – r_inf: bandwidth for very large messages
• Xu/Hwang: collective
  – Time to send: t = t0(n) + m/r_inf(n)
  – Same parameters, but now functions of n, the number of nodes in the communication
• Source: http://wwwinfo.deis.unical.it/~talia/hpcn98.ps
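A hedged numerical sketch of the Hockney model follows; the parameter values (t0 = 50 microseconds, r_inf = 100 MB/s) are invented for illustration, not measurements. It also prints the half-performance message length n_1/2 = t0 * r_inf, the size at which half of r_inf is achieved.

#include <stdio.h>

/* Hockney point-to-point model: t(m) = t0 + m / r_inf.
 * Parameter values here are illustrative assumptions, not measurements. */
int main(void) {
    double t0    = 50e-6;     /* startup cost per message: 50 microseconds (assumed) */
    double r_inf = 100e6;     /* asymptotic bandwidth: 100 MB/s (assumed) */

    long sizes[] = {1, 1024, 65536, 1048576, 16777216};
    for (int i = 0; i < 5; i++) {
        double m  = (double)sizes[i];
        double t  = t0 + m / r_inf;          /* predicted transfer time */
        double bw = m / t;                   /* effective bandwidth */
        printf("m = %8ld B  t = %10.6f s  BW = %6.1f MB/s\n", sizes[i], t, bw / 1e6);
    }
    /* half-performance message length: BW reaches r_inf/2 at m = t0 * r_inf */
    printf("n_1/2 = %.0f bytes\n", t0 * r_inf);
    return 0;
}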

Page 19: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

19

Topics

• Introduction
• Performance Characteristics & Models
• Performance Models: LogP
• Performance Models: LogGP
• Benchmarks: b_eff
• MPI Tracing: PMPI
• TAU & MPI
• Summary – Materials for Test

Page 20: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

20

MPI Performance Models

• LogP (fixed message size)
  – Time to send = L + 2*o
  – L: Latency, min send/recv time
  – o: Overhead, time the processor is busy sending or receiving
  – g: Gap, min time between successive sends or recvs; includes the message transmission time
  – P: Number of processors
  – L/g: Max number of simultaneous messages
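As a worked illustration of the model above (not from the slides): a single message costs L + 2*o, and a burst of k back-to-back messages from one sender is commonly estimated as L + 2*o + (k-1)*g when g >= o. The parameter values below are invented.

#include <stdio.h>

/* LogP cost sketch with invented parameters (microseconds). */
int main(void) {
    double L = 10.0;   /* latency (assumed) */
    double o = 2.0;    /* per-message send/recv overhead (assumed) */
    double g = 4.0;    /* gap between successive messages (assumed) */

    printf("single message: L + 2*o = %.1f us\n", L + 2 * o);

    for (int k = 1; k <= 16; k *= 2) {
        /* k pipelined messages: the last one is injected after (k-1)*g,
           then takes L + 2*o to be delivered and absorbed */
        double t = (k - 1) * g + L + 2 * o;
        printf("k = %2d messages: %.1f us\n", k, t);
    }
    printf("at most L/g = %.1f messages can be in flight at once\n", L / g);
    return 0;
}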

Page 21: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

21

Measuring LogP Parameters

• Finding g (implementation dependent)
  – Proc 0: MPI_Isend() x N
  – Proc 1: MPI_Recv() x N
  – g = total time / N
• Finding L+2*o
  – Proc 0: (MPI_Send() then MPI_Recv()) x N
  – Proc 1: (MPI_Recv() then MPI_Send()) x N
  – Each iteration is a full round trip, so L+2*o = (total time / N) / 2
• Finding o
  – Proc 0: (MPI_Send() then some_work then MPI_Recv()) x N
  – Proc 1: (MPI_Recv() then MPI_Send() then some_work) x N
  – o = ((total time / N) – time(some_work)) / 2
  – Requires time(some_work) > 2*L+2*o

Page 22: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

22

Measuring LogP Parameters

• Finding L+2*o
  – Proc 0: (MPI_Send() then MPI_Recv()) x N
  – Proc 1: (MPI_Recv() then MPI_Send()) x N
  – Each iteration is a full round trip, so L+2*o = (total time / N) / 2

Figure 1: Time diagram for benchmark 1. (a) Time diagram of processor 0; (b) time diagram of processor 1.

Page 23: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

23

Measuring LogP Parameters

• Finding o
  – Proc 0: (MPI_Send() then some_work then MPI_Recv()) x N
  – Proc 1: (MPI_Recv() then MPI_Send() then some_work) x N
  – o = ((total time / N) – time(some_work)) / 2
  – Requires time(some_work) > 2*L+2*o (see the code sketch below)

Figure 2: Time diagram for benchmark 2 with X > 2*L+Or+Os. (a) Time diagram of processor 1; (b) time diagram of processor 2.
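The MPI_Pcontrol demo on page 36 implements the g and L+2*o benchmarks; the fragment below is a hedged sketch of what this "finding o" benchmark could look like. The iteration count, the busy-work routine, and its size are illustrative assumptions and would need tuning so that time(some_work) exceeds the round trip.

#include <stdio.h>
#include <mpi.h>

/* Hedged sketch of the "finding o" benchmark described above; run with 2 ranks.
 * The busy loop and iteration counts are illustrative assumptions. */
static double some_work(int n) {
    volatile double s = 0.0;
    for (int i = 0; i < n; i++) s += i * 1e-9;
    return s;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, data = 27, N = 10000, work = 200000;
    MPI_Status st;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Time some_work alone so it can be subtracted out */
    double w0 = MPI_Wtime();
    for (int i = 0; i < N; i++) some_work(work);
    double x = (MPI_Wtime() - w0) / N;

    double t0 = MPI_Wtime();
    for (int i = 0; i < N; i++) {
        if (rank == 0) {
            MPI_Send(&data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
            some_work(work);                                   /* reply arrives during this */
            MPI_Recv(&data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &st);
        } else {
            MPI_Recv(&data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &st);
            MPI_Send(&data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
            some_work(work);
        }
    }
    double per_iter = (MPI_Wtime() - t0) / N;

    if (rank == 0)   /* o = ((total time / N) - time(some_work)) / 2 */
        printf("o ~= %g sec (requires time(some_work) > 2*L+2*o)\n", (per_iter - x) / 2);

    MPI_Finalize();
    return 0;
}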

Page 24: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

24

Demo

• Measure LogP parameters

Page 25: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

25

Topics

• Introduction
• Performance Characteristics & Models
• Performance Models: LogP
• Performance Models: LogGP
• Benchmarks: b_eff
• MPI Tracing: PMPI
• TAU & MPI
• Summary – Materials for Test

Page 26: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

26

MPI Performance Models

• LogGP (variable message size)
  – Time to send = L + 2*o + (m-1)*G
  – L: Latency, min send/recv time
  – o: Overhead, time the processor is busy sending or receiving
  – g: Gap, min time between successive sends or recvs
  – G: Gap per byte = 1 / bandwidth
  – P: Number of processors
  – L/g: Max number of simultaneous messages
  – Reference: http://citeseer.ist.psu.edu/cache/papers/cs/756/http:zSzzSzwww.cs.berkeley.eduzSz~cullerzSzpaperszSzsort.pdf/dusseau96fast.pdf

Page 27: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

Effective Bandwidth (LogGP)

Toy calculation:
– BW = m / (L + 2*o + G*(m-1))
– Let L + 2*o - G = 5
– Let G = 3
– Asymptotically approaches a bandwidth of 1/G for very large messages.
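A small sketch that simply evaluates the toy numbers above (so the denominator is 5 + 3*m in the slide's unspecified time unit) and shows BW(m) approaching 1/G:

#include <stdio.h>

/* Toy LogGP effective-bandwidth calculation from the slide:
 * BW(m) = m / (L + 2*o + G*(m-1)), with L + 2*o - G = 5 and G = 3. */
int main(void) {
    double G = 3.0, L2o_minus_G = 5.0;
    for (long m = 1; m <= 1000000; m *= 10) {
        double bw = m / (L2o_minus_G + G * m);
        printf("m = %7ld  BW = %.6f  (1/G = %.6f)\n", m, bw, 1.0 / G);
    }
    return 0;
}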

Page 28: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

28

Topics

• Introduction
• Performance Characteristics & Models
• Performance Models: LogP
• Performance Models: LogGP
• Benchmarks: b_eff
• MPI Tracing: PMPI
• TAU & MPI
• Summary – Materials for Test

Page 29: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

29

HPC Challenge Benchmarks

• HPC Challenge: http://icl.cs.utk.edu/hpcc/
  – See the Results tab
  – The b_eff benchmark is part of this larger database
  – More information than just HPL!

Page 30: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

30

b_eff

• Standard benchmark – part of HPC Challenge
  – Provides effective bandwidth and latency
• Averages over a variety of message sizes and communication patterns
• Determines an effective latency and bandwidth
• b_eff depends on:
  – Hardware: interconnect, memory
  – Software: MPI implementation
  – Tunable parameters of the OS: buffers
  – etc.

See: http://www.hlrs.de/organization/par/services/models/mpi/b_eff/
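b_eff itself averages many message sizes and patterns (slide 32 shows ring and random examples); the hedged sketch below times just one such pattern, a ring exchange using MPI_Sendrecv, and is an illustration rather than the actual b_eff code. Message size and repetition count are arbitrary choices.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Hedged illustration of one pattern b_eff averages over: each rank exchanges
 * a message with its ring neighbours and the per-rank bandwidth is reported. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, reps = 100, bytes = 1 << 20;   /* 1 MB per message (arbitrary) */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *sbuf = malloc(bytes), *rbuf = malloc(bytes);
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        /* send to the right neighbour, receive from the left one */
        MPI_Sendrecv(sbuf, bytes, MPI_BYTE, right, 0,
                     rbuf, bytes, MPI_BYTE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        printf("ring pattern: %.1f MB/s per rank\n", reps * (double)bytes / t / 1e6);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}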

Page 31: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

31

Effective Bandwidth Benchmark

Page 32: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

32

Example: Send/Recv, ring & random

Page 33: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

33

Demo

• Running b_eff

Page 34: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

34

Topics

• Introduction
• Performance Characteristics & Models
• Performance Models: LogP
• Performance Models: LogGP
• Benchmarks: b_eff
• MPI Tracing: PMPI
• TAU & MPI
• Summary – Materials for Test

Page 35: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

35

Portable MPI Tracing: PMPI

• An API to MPI for tracing, debugging, and performance measurement of MPI applications
• Every MPI_<command>() has a matching PMPI_<command>() entry point, so a tool can supply its own MPI_<command>() wrapper that calls PMPI_<command>()
• MPI_Pcontrol(int)
  – 0: tracing disabled
  – 1: tracing enabled (default level)
  – 2: flush trace buffers

Page 36: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

36

Demo : MPI_Pcontrol…

#include <stdio.h>
#include <time.h>
#include <mpi.h>

/* PMPI demo: the wrappers at the bottom intercept MPI_Send and MPI_Pcontrol,
 * then forward to the PMPI_ entry points. Run with exactly 2 ranks. */

int sends = 0;        /* number of MPI_Send calls seen while tracing is enabled */
int pcontrol = 1;     /* current trace level (1 = enabled, the default) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int imax = 10000000;             /* large count: time() only has 1-second resolution */
    int rank;
    int data = 27;
    MPI_Status st;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    time_t start, end;
    double fac = 1.0 / imax;
    double g, lp2o;

    // Find g: rank 0 streams imax one-way messages to rank 1
    time(&start);
    for (int i = 0; i < imax; i++) {
        if (rank == 0) {
            MPI_Send(&data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
        } else {
            MPI_Recv(&data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &st);
        }
    }
    time(&end);
    if (rank == 0) {
        g = fac * (end - start);
        printf("gap=%g sec\n", g);
    }

    // Find L+2*o: ping-pong; each iteration is a full round trip, 2*(L+2*o)
    time(&start);
    const int step = 5;
    for (int i = 0; i < imax; i += step) {
        if (rank == 0) {
            MPI_Send(&data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
            MPI_Recv(&data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &st);
        } else {
            MPI_Recv(&data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &st);
            MPI_Send(&data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }
    }
    time(&end);
    if (rank == 0) {
        lp2o = 0.5 * step * fac * (end - start);   /* half of the per-iteration time */
        printf("L+2*o=%g sec\n", lp2o);
        if (sends > 0) printf("sends = %d\n", sends);
    }

    MPI_Finalize();
    return 0;
}

/* Profiling wrappers: the signatures must match the prototypes in your mpi.h
 * (newer MPI standards declare MPI_Send with a const void *buf). */
int MPI_Pcontrol(const int n, ...) {
    pcontrol = n;
    return PMPI_Pcontrol(n);
}

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm) {
    if (pcontrol >= 1) sends++;        /* count sends only while tracing is enabled */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

Page 37: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

37

Demo

• MPI tracing, custom implementation

Page 38: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

38

Topics

• Introduction
• Performance Characteristics & Models
• Performance Models: LogP
• Performance Models: LogGP
• Benchmarks: b_eff
• MPI Tracing: PMPI
• TAU & MPI
• Summary – Materials for Test

Page 39: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

39

TAU and MPI

• TAU uses the PMPI interface to track MPI calls
• Jumpshot is used as the viewer
  – Shows subroutine calls and MPI calls

Page 40: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

40

TAU Performance System Architecture

[Architecture diagram: labels visible in the figure include EPILOG and Paraver as trace output targets.]

Page 41: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

41

TAU Measurement Options

• Parallel profiling
  – Function-level, block-level, statement-level
  – Supports user-defined events
  – TAU parallel profile data stored during execution
  – Hardware counter values
  – Support for multiple counters
  – Support for callpath profiling
• Tracing
  – All profile-level events
  – Inter-process communication events
  – Timestamp synchronization
  – Trace merging and format conversion

Page 42: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

42

How To Use TAU?

• Instrumentation
  – Application code and libraries
  – Selective instrumentation
• Install, compile, and link with the TAU measurement library
  – % configure; make clean install
  – Multiple configurations for different measurement options
  – Does not require changes to the instrumentation
  – Selective measurement control
• Execute “experiments” to produce performance data
  – Performance data generated at the end of or during execution
• Use analysis tools to look at performance results

Page 43: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

43

Using TAU

• Set up the environment:
  – source /home/packages/Tau/gcc-papi-mpi-slog2/env.sh
  – export COUNTER1=GET_TIME_OF_DAY
• Use tau_cc.sh, tau_f90.sh, etc. to compile
• Run with mpirun
• Post-process:
  – tau_treemerge.pl
  – tau2slog2 tau.trc tau.edf -o tau.slog2
• Run: http://www.cct.lsu.edu/~sbrandt/perf_vis.html

Page 44: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

44

Demo

• TAU and Jumpshot

Page 45: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

45

Topics

• Introduction
• Performance Characteristics & Models
• Performance Models: LogP
• Performance Models: LogGP
• Benchmarks: b_eff
• MPI Tracing: PMPI
• TAU & MPI
• Summary – Materials for Test

Page 46: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

46

Summary – Material for the Test

• Essential MPI - Slide 9
• Performance Models - Slides 12, 15, 16, 18 (Hockney)
• LogP - Slides 20–23
• Effective Bandwidth - Slide 30
• TAU/MPI - Slides 41, 43

Page 47: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

47

Sources

• http://www.cs.uoregon.edu/research/tau/docs.php (TAU)
• http://www.llnl.gov/computing/tutorials/mpi_performance/
• http://www.netlib.org/utk/papers/mpi-book/node182.html (MPI profiling interface)
• http://www-unix.mcs.anl.gov/mpi/tutorial/perf/index.html (Gropp course)
• http://www.ecs.umass.edu/ece/ssa/papers/jpdcmpi.ps (LogP paper with figures)
• http://www.netlib.org/utk/people/JackDongarra/PAPERS/coll-perf-analysis-cluster-2005.pdf (more LogP material)
• http://www.hlrs.de/organization/par/services/models/mpi/b_eff/ (b_eff benchmark)
• http://icl.cs.utk.edu/hpcc/ (HPC Challenge)

Page 48: High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement

48