XC V3.0 HP-MPI Update


Page 1: XC V3.0   HP-MPI Update

© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

XC V3.0 HP-MPI Update

HP-MPI HPCD r1.3, January 2006

Page 2: XC V3.0   HP-MPI Update

2

HP-MPI 2.2 and XC 3.0
• Usability
− XC jobs, srun, Lustre, ssh, 32-bit mode
• Debuggability and Profiling
− Message profiling
− Message validation library
• Communication and Cluster Health
− MPI communication
− Interconnect health check
• Scaleout
− Rank-to-core binding
− Startup, message buffers, licensing
• Performance Improvements
− InfiniBand, Ethernet

Page 3: XC V3.0   HP-MPI Update

3

HP-MPI Usability

Page 4: XC V3.0   HP-MPI Update

4

XC Job Control

LSF, SLURM, and HP-MPI are tightly coupled and are built to interact with a remote login program.

LSF - Determines WHEN the job will run. LSF talks with SLURM to determine WHICH resources will be used.

SLURM - Determines WHERE the job runs. It controls things like which host each rank runs on, and it starts the executables on each host as requested by HP-MPI's mpirun.

HP-MPI - Determines HOW the job runs. It is part of the application and handles communication. It can also pinpoint the processor on which each rank runs.

ssh/rsh - The KEY that opens up remote hosts.

[Diagram: LSF (job queueing system) decides WHEN; SLURM (cluster management and job scheduling) decides WHERE; HP-MPI (message passing) decides HOW; ssh (remote login) is the key.]

Page 5: XC V3.0   HP-MPI Update

5

HP-MPI mpirun

Useful options:
-prot             - Print the communication protocol
-np #             - Number of processes to use
-h host           - Set host to use
-e <var>[=<val>]  - Set environment variable
-d                - Debug mode
-v                - Verbose
-i file           - Write profile of MPI functions
-T                - Print user and system times for each MPI rank
-srun             - Use SLURM
-mpi32            - Use 32-bit interconnect libraries on x86-64
-mpi64            - Use 64-bit interconnect libraries on x86-64 (default)
-f appfile        - Parallelism directed from instructions in appfile

Page 6: XC V3.0   HP-MPI Update

6

mpirun: appfile mode
• appfile mode does not propagate the environment
− Need to pass LD_LIBRARY_PATH
− Need to pass application environment variables
• appfile mode with XC V2.0 or the Elan interconnect requires:
− export MPI_USESRUN=1
− All lines in the appfile need to do the same thing, but on different hosts
• ssh with appfiles requires: export MPI_REMSH=/usr/bin/ssh
• Find host names with either:
− /opt/hptc/bin/expandnodes "$SLURM_NODELIST"
− $LSB_HOSTS

Page 7: XC V3.0   HP-MPI Update

7

SLURM srun utility

srun - SLURM utility to run parallel jobs

srun usage on XC:
− Implied mode is not available with HP-MPI versions before V2.1.1 (XC V2.1) or with the Elan interconnect
− HP-MPI -srun option
• Use as: mpirun -srun <options> exe args
− HP-MPI implied srun mode
• Use as: export MPI_USESRUN=1
• Set options via: export MPI_SRUNOPTIONS=<options>
− Stand-alone executable
• srun <options> exe args

Page 8: XC V3.0   HP-MPI Update

8

Implied srun mode
• The implied srun mode allows the user to omit the -srun argument from the mpirun command line by setting the environment variable MPI_USESRUN.
− XC systems only
• Set the environment variable: % setenv MPI_USESRUN 1
• HP-MPI will insert the -srun argument.
• The following arguments are considered to be srun arguments:
− -n -N -m -w -x
− Any argument that starts with -- and is not followed by a space
− -np will be translated to -n
− -srun will be accepted without warning.
• The implied srun mode allows the use of HP-MPI appfiles.
• An appfile must be homogeneous in its arguments, with the exception of -h and -np. The -h and -np arguments within the appfile are discarded. All other arguments are promoted to the mpirun command line. Arguments following -- are also processed.

Additional environment variables provided:
• MPI_SRUNOPTIONS
− Allows additional srun options to be specified, such as --label.
• MPI_USESRUN_IGNORE_ARGS
− Provides an easy way to modify the arguments contained in an appfile by supplying a list of space-separated arguments that mpirun should ignore.

See the Release Notes for more options to the srun command.

Page 9: XC V3.0   HP-MPI Update

9

Lustre support for SFS for XC
• Lustre allows individual files to be striped over multiple OSTs (Object Storage Targets) to improve overall throughput.
• "striping_unit" = <value>
− Specifies the number of consecutive bytes of a file that are stored on a particular I/O device as part of a stripe set
• "striping_factor" = <value>
− Specifies the number of I/O devices over which the file is striped. Cannot exceed the maximum defined by the system administrator
• "start_iodevice" = <value>
− Specifies the I/O device from which striping will begin

Page 10: XC V3.0   HP-MPI Update

10

Lustre support for SFS for XC - cont.
• These hints need to be set prior to file creation so that the call to MPI_File_open can access them:

/* Set new info values before creating the file. */
MPI_Info info;
MPI_File fh;
char *value;

value = randomize_start();   /* helper that returns the starting I/O device as a string */
MPI_Info_create(&info);
MPI_Info_set(info, "striping_factor", "16");
MPI_Info_set(info, "striping_unit", "131072");
MPI_Info_set(info, "start_iodevice", value);

/* Open the file with the new info. */
MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE | MPI_MODE_RDWR,
              info, &fh);

Page 11: XC V3.0   HP-MPI Update

11

Improved ssh support
• By default, HP-MPI attempts to use rsh (remsh on HP-UX). An alternate remote execution tool can be used by setting the environment variable MPI_REMSH.
• % export MPI_REMSH=ssh
or
% $MPI_ROOT/bin/mpirun -e MPI_REMSH="ssh -x" <options> -f <appfile>
• HP is considering making ssh the default on Linux in the next release.

Page 12: XC V3.0   HP-MPI Update

12

Remote login - ssh

To disable password prompting:
• $ ssh-keygen -t dsa    (just press return for each password prompt)
• $ cd ~/.ssh
• $ cat id_dsa.pub >> authorized_keys
• $ chmod go-rwx authorized_keys
• $ chmod go-w ~ ~/.ssh
• You may need to repeat the above with rsa rather than dsa
• $ cp /etc/ssh/ssh_config $HOME/.ssh/config
• $ echo "CheckHostIP no" >> $HOME/.ssh/config
• $ echo "StrictHostKeyChecking no" >> $HOME/.ssh/config

To specify remote execution over ssh:
• Set MPI_REMSH to /usr/bin/ssh

Page 13: XC V3.0   HP-MPI Update

13

32- and 64-bit selection
• Options have been added to indicate the bitness of the application so the proper interconnect library can be invoked.
• Use -mpi32 or -mpi64 on the mpirun command line for AMD64 and EM64T.
• Default is -mpi64.
• Mellanox only provides a 64-bit IB driver.
− 32-bit apps are not supported for IB on AMD64 and EM64T systems.

Page 14: XC V3.0   HP-MPI Update

14

HP-MPI Parallel Compiler Options

Useful options:
-mpi32 - build 32-bit

Useful environment variables:
setenv MPI_CC cc    - set C compiler
setenv MPI_CXX C++  - set C++ compiler
setenv MPI_F90 f90  - set Fortran compiler
setenv MPI_ROOT dir - useful when MPI is not installed in /opt/[hpmpi|mpi]

Page 15: XC V3.0   HP-MPI Update

15

Problematic Compiler Options

Intel      PGI        Description
-static    -Bstatic   Static link - does not allow HP-MPI to determine the interconnect
-i8        -i8        If you compile with this, be sure to link with it. Intel and AMD math libraries do not support INTEGER*8.

Page 16: XC V3.0   HP-MPI Update

16

Debugging

Page 17: XC V3.0   HP-MPI Update

17

Debugging Scripts: Use the hello_world Test Case

#include <stdio.h>
#include <mpi.h>

main(argc, argv)
int   argc;
char *argv[];
{
    int  rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello world! I'm %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    exit(0);
}

Page 18: XC V3.0   HP-MPI Update

18

How to debug HP-MPI applications with a single-process debugger
• export MPI_DEBUG_CONT=1
• Set the MPI_FLAGS environment variable to choose a debugger. Values are:
− eadb - Start under adb
− exdb - Start under xdb
− edde - Start under dde
− ewdb - Start under wdb
− egdb - Start under gdb
• Set DISPLAY to point to your console.
• Use xhost to allow remote hosts to redirect their windows to your console.
• Run the application.

• export DISPLAY=hostname:1
• xhost +hostname:1
• mpirun -e MPI_FLAGS=egdb -np 2 ./a.out

Page 19: XC V3.0   HP-MPI Update

19

Attaching Debuggers to HP-MPI Applications
• HP-MPI conceptually creates its processes in MPI_Init, and each process instantiates a debugger session.
• Each debugger session in turn attaches to the process that created it.
• HP-MPI provides MPI_DEBUG_CONT to control the point at which debugger attachment occurs, via a breakpoint.
• MPI_DEBUG_CONT is a variable on which HP-MPI temporarily spins the processes until the user allows execution to proceed via debugger commands.
• By default, MPI_DEBUG_CONT is set to 0; you must set it to 1 to allow the debug session to continue past this ‘spin barrier’ in MPI_Init.
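To make the ‘spin barrier’ idea concrete, here is a generic sketch of the spin-variable pattern the bullets above describe. It is not HP-MPI's internal code; debug_cont is a made-up stand-in for the role MPI_DEBUG_CONT plays inside MPI_Init.

/* Generic sketch of a debugger "spin barrier" (illustration only; not
 * HP-MPI source). The process loops until an attached debugger sets
 * debug_cont to 1, e.g. in gdb: set var debug_cont = 1; continue */
#include <unistd.h>

volatile int debug_cont = 0;   /* stand-in for MPI_DEBUG_CONT */

static void wait_for_debugger(void)
{
    while (!debug_cont)
        sleep(1);              /* spin until the debugger releases us */
}

int main(void)
{
    wait_for_debugger();       /* attach here, set debug_cont, continue */
    return 0;
}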

Page 20: XC V3.0   HP-MPI Update

20

Debugging HP-MPI apps cont.:
The following procedure outlines the steps to follow when you use a single-process debugger:

Step 1. Set the eadb, exdb, edde, ewdb, or egdb option in the MPI_FLAGS environment variable to use the ADB, XDB, DDE, WDB, or GDB debugger.

Step 2. On remote hosts, set DISPLAY to point to your console. In addition, use xhost to allow remote hosts to redirect their windows to your console.

Step 3. Run your application:
export DISPLAY=hostname:1
mpirun -e MPI_FLAGS=egdb -np 2 ./a.out

Page 21: XC V3.0   HP-MPI Update

21

Debugging HP-MPI apps cont.:

Page 22: XC V3.0   HP-MPI Update

22

Debugging HP-MPI apps cont.:

Page 23: XC V3.0   HP-MPI Update

23

Profiling

Page 24: XC V3.0   HP-MPI Update

24

Profiling
• Instrumentation
− Lightweight method for cumulative runtime statistics
− Profiles for applications linked with the standard HP-MPI library
− Profiles for applications linked with the thread-compliant library

Page 25: XC V3.0   HP-MPI Update

25

HP-MPI instrumentation profile:

-i <myfile>[:opt] - produces a rank-by-rank summary of where MPI spends its time and places the result in the file myfile.trace

bsub -I -n4 mpirun -i myfile -srun ./a.out

Application Summary by Rank (seconds):

Rank   Proc CPU Time   User Portion         System Portion
-----------------------------------------------------------------
   0        0.040000   0.030000( 75.00%)    0.010000( 25.00%)
   1        0.050000   0.040000( 80.00%)    0.010000( 20.00%)
   2        0.050000   0.040000( 80.00%)    0.010000( 20.00%)
   3        0.050000   0.040000( 80.00%)    0.010000( 20.00%)

Page 26: XC V3.0   HP-MPI Update

26

HP-MPI instrumentation continued
• Routine Summary by Rank:

Rank  Routine        Statistic  Calls  Overhead(ms)  Blocking(ms)
-----------------------------------------------------------------
   0  MPI_Bcast                     4      7.127285      0.000000
                     min                   0.033140      0.000000
                     max                   5.244017      0.000000
                     avg                   1.781821      0.000000
      MPI_Finalize                  1      0.034094      0.000000
      MPI_Init                      1   1080.793858      0.000000
      MPI_Recv                   2010      3.236055      0.000000

Page 27: XC V3.0   HP-MPI Update

27

HP-MPI instrumentation continued
• Message Summary by Rank Pair:

SRank  DRank  Messages  (minsize,maxsize)/[bin]  Totalbytes
-----------------------------------------------------------------
    0      1      1005  (0, 0)                            0
                  1005  [0..64]                           0
           3      1005  (0, 0)                            0
                  1005  [0..64]                           0

Page 28: XC V3.0   HP-MPI Update

28

OPROFILE Profiling example
• oprofile is configured in XC, but not enabled
• You need to be root to enable it on a node

# opcontrol --no-vmlinux
# opcontrol --start
Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1
Using 2.6+ OProfile kernel interface.
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.

Clear out old performance data:
# opcontrol --reset
Signalling daemon... done

Page 29: XC V3.0   HP-MPI Update

29

OPROFILE Profiling example cont.
• Run your application
# bsub -I -n4 -ext "SLURM[nodelist=xcg14]" ./run_linux_amd_intel 4 121 test

• Find the name of your executable
# opreport --long-filenames

• Generate a report for that executable image
# opreport -l /mlibscratch/lieb/mpi2005.kit23/benchspec/MPI2005/121.pop2/run/run_base_test_intel.0001/pop2_base.intel | more

Page 30: XC V3.0   HP-MPI Update

30

OPROFILE Profiling example cont.

Page 31: XC V3.0   HP-MPI Update

31

OPROFILE Profiling kernel symbols
The actual version of the RPM may change.

• The vmlinux file is contained in the kernel debug RPM:
− kernel-debuginfo-2.6.9-11.4hp.XC.x86_64.rpm
• The kernel symbols file is installed in:
− /usr/lib/debug/lib/modules/2.6.9-11.4hp.XCsmp/vmlinux
• opcontrol --vmlinux=/usr/lib/debug/lib/modules/2.6.9-11.4hp.XCsmp/vmlinux

Page 32: XC V3.0   HP-MPI Update

32

Diagnostic Library
− Advanced run-time error checking and analysis
− Message signature analysis detects type mismatches
− Object-space corruption checking detects attempts to write into objects
− Detects operations that cause MPI to write to a user buffer more than once
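As an illustration of the kind of defect the message-signature analysis is meant to flag, here is a small hypothetical test program (not taken from the slides): rank 0 sends integers while rank 1 receives them as doubles, a type-signature mismatch that runs silently without the diagnostic library.

/* Hypothetical mismatch test: send MPI_INT, receive MPI_DOUBLE.
 * The message fits in the receive buffer, so plain MPI runs silently;
 * the diagnostic library's signature analysis is designed to flag it. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int    rank;
    int    idata[4] = {1, 2, 3, 4};
    double ddata[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(idata, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(ddata, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}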

Page 33: XC V3.0   HP-MPI Update

33

HP-MPI Diagnostic Library
• Link with -ldmpi to enable the diagnostic library, or
• Use LD_PRELOAD on an existing pre-linked application (shared libs)
− This dynamically inserts the diagnostic library:
• mpirun -e LD_PRELOAD=libdmpi.so:libmpi.so -srun ./a.out
• This can also dump message formats (the output could be REALLY large):
• mpirun -e LD_PRELOAD=libdmpi.so:libmpi.so -e MPI_DLIB_FLAGS=dump:foof -srun ./a.out
• See "MPI_DLIB_FLAGS" on page 46 of the User's Guide, or man mpienv, for more information on controlling features.

Page 34: XC V3.0   HP-MPI Update

34

HP-MPI Communication

Page 35: XC V3.0   HP-MPI Update

35

HP-MPI Communication

Movement of data depends on the relative location of the destination and on the interconnect. The paths are:
• Communication within a node (shared memory)
• Communication from node to node over TCP/IP
• Communication from node to node over high-speed interconnects: InfiniBand, Quadrics, Myrinet

Page 36: XC V3.0   HP-MPI Update

36

HP-MPI Communication within a Node

To send data from Core 1 to Core 4:
Core 1 -> Core 1 local memory
Core 1 local memory* -> system shared memory**
System shared memory -> Core 4 local memory
Core 4 local memory -> Core 4

*The operating system makes local memory available to a single process.
**The operating system makes shared memory available to multiple processes.

[Diagram: two memory/core pairs (Core 1/Core 2 and Core 3/Core 4) connected by a bus; data moves between them through shared memory.]

Page 37: XC V3.0   HP-MPI Update

37

HP-MPI Communication to another Node via TCP/IP

To send data from Core 1, Node 1 to Core 1, Node 2:
Core 1, Node 1 -> Core 1, Node 1 local memory
Core 1, Node 1 local memory -> Node 1 shared memory
Node 1 shared memory -> interconnect
Interconnect -> Node 2 shared memory
Node 2 shared memory -> Core 1, Node 2 local memory
Core 1, Node 2 local memory -> Core 1, Node 2

The core is used to send data to the TCP/IP interconnect.

[Diagram: two nodes, each with memory/core pairs on a bus, connected by TCP/IP; a core on each node drives the data across the interconnect.]

Page 38: XC V3.0   HP-MPI Update

38

HP-MPI Communication to another Node via other Interconnects

To send data from Core 1, Node 1 to Core 1, Node 2:
Core 1, Node 1 -> Core 1, Node 1 local memory
Core 1, Node 1 local memory -> Node 1 shared memory
Node 1 shared memory -> interconnect
Interconnect -> Node 2 shared memory
Node 2 shared memory -> Core 1, Node 2 local memory
Core 1, Node 2 local memory -> Core 1, Node 2

[Diagram: two nodes connected by a high-speed interconnect; the RDMA engine on each node moves the data rather than a core.]

Page 39: XC V3.0   HP-MPI Update

39

HP-MPI options to set interconnect

mpirun options (default order comes from MPI_IC_ORDER):
-vapi/-VAPI   - InfiniBand with the VAPI protocol
-udapl/-UDAPL - InfiniBand with the uDAPL protocol
-itapi/-ITAPI - InfiniBand with the IT-API protocol (deprecated on Linux)
-elan/-ELAN   - Elan (Quadrics)
-mx/-MX       - Myrinet MX
-gm/-GM       - Myrinet GM
-TCP          - TCP/IP

Lowercase requests the interconnect (tries in the order given by MPI_IC_ORDER).
Uppercase demands the interconnect (failure to meet the demand is an error).

Page 40: XC V3.0   HP-MPI Update

40

X86-64: 32-bit versus 64-bit Interconnect Support

• Supported 64-bit interconnects:

• TCP/IP

• GigE

• InfiniBand

• Elan

• Myrinet

• Supported 32-bit interconnects:

• TCP/IP

• Myrinet

• InfiniBand (but not 32 bit mode on 64 bit architectures)

Page 41: XC V3.0   HP-MPI Update

41

Cluster Interconnect Status Check

Page 42: XC V3.0   HP-MPI Update

42

Cluster Interconnect Status

• ‘-prot’ displays the protocol in use
− Possibilities: VAPI SHM UDPL GM MX IT ELAN
− mpirun -prot -srun ./hello.x
• Measure bandwidth between pairs of nodes using ping_pong_ring.c
− Compile the copy shipped in /opt/hpmpi/help/ping_pong_ring.c, e.g. with $MPI_ROOT/bin/mpicc -o ppring.x
− bsub -I -n12 -ext "SLURM[nodes=12]" /opt/hpmpi/bin/mpirun -srun ./ppring.x 300000
• Exclude "suspect" nodes explicitly
− bsub -ext "SLURM[nodes=12;exclude=n[1-4]]"
• Include "suspect" nodes explicitly
− bsub -ext "SLURM[nodes=12;include=n[1-4]]"

Page 43: XC V3.0   HP-MPI Update

43

Cluster Interconnect Status

bsub -I -n4 -ext "SLURM[nodes=4]" /opt/hpmpi/bin/mpirun -prot -srun ./a.out 300000
...
 host | 0     1     2     3
======|=====================
   0  : SHM   VAPI  VAPI  VAPI
   1  : VAPI  SHM   VAPI  VAPI
   2  : VAPI  VAPI  SHM   VAPI
   3  : VAPI  VAPI  VAPI  SHM

[0:xcg1] ping-pong 300000 bytes ...
300000 bytes: 345.70 usec/msg
300000 bytes: 867.80 MB/sec
[1:xcg2] ping-pong 300000 bytes ...
300000 bytes: 700.44 usec/msg
300000 bytes: 434.46 MB/sec
[2:xcg3] ping-pong 300000 bytes ...
300000 bytes: 690.76 usec/msg
300000 bytes: 434.64 MB/sec
[3:xcg4] ping-pong 300000 bytes ...
300000 bytes: 345.79 usec/msg
300000 bytes: 867.59 MB/sec

Communication from xcg2 to xcg3 is off by 50%, as is communication from xcg3 to xcg4. xcg3 likely has a bad IB cable.

Page 44: XC V3.0   HP-MPI Update

44

Cluster Interconnect Status
• Expected performance by interconnect, and recommended message size to use:

Interconnect               Expected Perf      Msg size
− InfiniBand PCI-X         650-700 MB/sec     200000
− InfiniBand PCI-Express   850-900 MB/sec     400000
− Quadrics Elan4           800-820 MB/sec     200000
− Myrinet GM rev E         420-450 MB/sec     100000
− Myrinet GM rev D/F       220-250 MB/sec     100000
− GigE Ethernet            100-110 MB/sec      30000
− 100BaseT Ethernet        10-12 MB/sec         5000
− 10BaseT Ethernet         1-2 MB/sec            500

• (who would want to test 10BaseT?)

Page 45: XC V3.0   HP-MPI Update

45

Cluster Interconnect Status
• The following failure signature indicates that node n611 has a bad HCA or driver:

p.x: Rank 0:455: MPI_Init: EVAPI_get_hca_hndl() failed
p.x: Rank 0:455: MPI_Init: didn't find active interface/port
p.x: Rank 0:455: MPI_Init: Can't initialize RDMA device
p.x: Rank 0:455: MPI_Init: MPI BUG: Cannot initialize RDMA protocol
srun: error: n611: task455: Exited with exit code 1

• Rerun and exclude the node in question; also report the suspect node to your sysadmin.
− bsub -ext "SLURM[exclude=n611]" mpirun ...

Page 46: XC V3.0   HP-MPI Update

46

HP-MPI CPU Affinity control

Page 47: XC V3.0   HP-MPI Update

47

HP-MPI support for Process binding
• srun distributes ranks across nodes; -cpu_bind binds them to cores within each node
− mpirun -cpu_bind=[v,][policy[:maplist]] -srun a.out
− [v] requests info on what binding is performed
• Policy is one of:
− LL | RANK | LDOM | RR | RR_LL | CYCLIC | FILL | FILL_LL |
− BLOCK | MAP_CPU | MAP_LDOM | PACKED | HELP
− MAP_CPU and MAP_LDOM take a list of cpu numbers
• Example: bsub -I -n8 mpirun -cpu_bind=v,MAP_CPU:0,2,1,3 -srun ./a.out

... This is the map info for the 2nd node:
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 4 pid 7156 on host dlcore1.rsn.hp.com to cpu 0
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 5 pid 7159 on host dlcore1.rsn.hp.com to cpu 2
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 6 pid 7157 on host dlcore1.rsn.hp.com to cpu 1
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 7 pid 7158 on host dlcore1.rsn.hp.com to cpu 3
...
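To see what -cpu_bind actually did, a small hypothetical test program like the following can be run under mpirun; each rank prints the CPUs its affinity mask allows (Linux-specific, not part of HP-MPI):

/* Hypothetical binding check: print each rank's allowed CPUs.
 * Uses the Linux sched_getaffinity() call; compile with an MPI C compiler. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, cpu;
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        printf("rank %d allowed cpus:", rank);
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &mask))
                printf(" %d", cpu);
        printf("\n");
    }
    MPI_Finalize();
    return 0;
}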

Page 48: XC V3.0   HP-MPI Update

48

HP-MPI support for Process binding
• $MPI_ROOT/bin/mpirun -cpu_bind=help ./a.out

-cpu_binding help info
cpu binding methods available:
  rank     - schedule ranks on cpus according to packed rank id
  map_cpu  - schedule ranks on cpus in cycle thru MAP variable
  mask_cpu - schedule ranks on cpu masks in cycle thru MAP variable
  ll       - bind each rank to cpu each is currently running on
for numa based systems the following are also available:
  ldom     - schedule ranks on ldoms according to packed rank id
  cyclic   - cyclic dist on each ldom according to packed rank id
  block    - block dist on each ldom according to packed rank id
  rr       - same as cyclic, but consider ldom load avg.
  fill     - same as block, but consider ldom load avg.
  packed   - bind all ranks to the same ldom as lowest rank
  slurm    - slurm binding
  ll       - bind each rank to ldom each is currently running on
  map_ldom - schedule ranks on ldoms in cycle thru MAP variable

Page 49: XC V3.0   HP-MPI Update

49

Memory Models

Examples of NUMA or NUMA-like systems:
• A dual-core Opteron has (in effect) local and remote memories and is considered a NUMA system.
• A single-core Opteron with its own memory controller is considered a NUMA-like system.
• A cell-based Itanium SMP system is considered a NUMA system.

[Diagram: NUMA vs. NUMA-like layouts; each LDOM (local memory) serves its attached cores.]

Page 50: XC V3.0   HP-MPI Update

50

Example of Rank and LDOM distributions

mpirun -np 8 -srun -m=cyclic
causes ranks and Packed Rank IDs to be distributed across two 4-core hosts as:

HOST 1
  LDOM 0: Rank 0 (Packed Rank ID 0), Rank 4 (Packed Rank ID 2)
  LDOM 1: Rank 2 (Packed Rank ID 1), Rank 6 (Packed Rank ID 3)

HOST 2
  LDOM 0: Rank 1 (Packed Rank ID 0), Rank 5 (Packed Rank ID 2)
  LDOM 1: Rank 3 (Packed Rank ID 1), Rank 7 (Packed Rank ID 3)

Page 51: XC V3.0   HP-MPI Update

51

Another Example of Rank and LDOM distributions

mpirun -np 8 -srun -m=block
causes ranks and Packed Rank IDs to be distributed across two 4-core hosts as:

HOST 1
  LDOM 0: Rank 0 (Packed Rank ID 0), Rank 2 (Packed Rank ID 2)
  LDOM 1: Rank 1 (Packed Rank ID 1), Rank 3 (Packed Rank ID 3)

HOST 2
  LDOM 0: Rank 4 (Packed Rank ID 0), Rank 6 (Packed Rank ID 2)
  LDOM 1: Rank 5 (Packed Rank ID 1), Rank 7 (Packed Rank ID 3)

Page 52: XC V3.0   HP-MPI Update

52

Options for NUMA or NUMA-like systems

rank - Assign MPI rank N to CPU N (default).
map_cpu:MAP - Schedule ranks on cores, cycling through MAP. MAP is a comma-separated list giving the order in which to use cores on a machine.
mask_cpu:MAP - Schedule ranks on core masks, cycling through MAP. MAP is a comma-separated list of masks; (mask value AND processor number) determines the group of cores to use.
ll - The MPI process spins before binding; the OS moves the process to the least-loaded processor, and MPI keeps it where it was moved.
v - Verbose.

Page 53: XC V3.0   HP-MPI Update

53

Options for NUMA systems

These options choose the cores to bind to based on the core's ldom (local memory):
ldom - schedule ranks on ldoms according to packed rank id
cyclic - round-robin distribution on each ldom according to packed rank id
block - block distribution on each ldom according to packed rank id (default)

Starting from the least-loaded core:
rr - round-robin distribution on each ldom according to packed rank id
fill - block distribution on each ldom according to packed rank id
packed - bind all MPI processes to the chosen ldom
map_ldom:MAP - schedule ranks on ldoms, cycling through a comma-separated list

Page 54: XC V3.0   HP-MPI Update

54

ccNUMA and I/O buffer-cache Interaction

• On Opteron systems, memory can be either 100% interleaved among processors or 100% processor-local.
− For best performance, we use processor-local memory.
• Linux can use all available memory for I/O buffering.
• When a user process requests local memory and the local memory is in use for I/O buffering, Linux assigns memory on another processor, giving worst-case latency.
• Given user demand for local memory, Linux frees the I/O buffers over time, at which point the best runtime is achieved.

[Diagram: DL585 4-processor/8-core system; each LDOM (local memory) serves two cores.]

Page 55: XC V3.0   HP-MPI Update

55

HP-MPI Scaleout

Page 56: XC V3.0   HP-MPI Update

56

HP-MPI Scaleout Challenges

• Scalable process startup
− Reducing the number of open sockets
− Tree structure of MPI daemons
− Now handles > 256 MPI ranks (srun and appfile)
• Scalable teardown of processes
• Scalable licensing
− Rank 0 checks for an N-rank license.
• Scalable setup data
− Reduced Init4 message size by 96%
• Managing IB buffer requirements
− Physical memory pinning
• 1-sided lock/unlock now goes over IB if using VAPI

Page 57: XC V3.0   HP-MPI Update

57

Managing IB Buffer requirements

• Two modes: RDMA and Shared-Receive-Queue (SRQ)
• The amount of memory pinned (locked in physical memory) has two parts:
− 1) memory that is always pinned (base)
− 2) memory that may be pinned depending on communication (dynamic)
• maximum_dynamic_pinned_memory = min(2 * max_messages * chunksize, (physical_memory / local_ranks) * pin_percentage)
− max_messages is 3 * remote connections, and chunksize varies depending on the protocol:
• for IB it is 4MB and for GM it is 1MB.
− maximum_dynamic_pinned_memory <= MPI_PIN_PERCENTAGE of the rank's portion of physical memory. For large clusters, the limit will generally be based on the pin percentage, since 2*max_messages*chunksize gets large for even moderate clusters.
− MPI_PIN_PERCENTAGE is 20% by default, but can be changed by the user.
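A back-of-the-envelope sketch of that min() formula follows; the input numbers are illustrative assumptions, and the real limit is computed inside HP-MPI.

/* Estimate the dynamic pinned-memory cap from the formula above.
 * All inputs are illustrative assumptions, not HP-MPI defaults. */
#include <stdio.h>

static double max_dynamic_pinned(double max_messages, double chunksize,
                                 double physical_memory, int local_ranks,
                                 double pin_percentage)
{
    double by_messages = 2.0 * max_messages * chunksize;
    double by_memory   = (physical_memory / local_ranks) * pin_percentage;
    return by_messages < by_memory ? by_messages : by_memory;
}

int main(void)
{
    /* 64 remote connections -> max_messages = 3 * 64; IB chunksize 4MB;
     * 2 GB node, 2 local ranks, 20% pin percentage */
    double cap = max_dynamic_pinned(3 * 64, 4.0 * 1024 * 1024,
                                    2.0 * 1024 * 1024 * 1024, 2, 0.20);
    printf("dynamic pinned-memory cap: %.0f bytes (~205 MB)\n", cap);
    return 0;
}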

Page 58: XC V3.0   HP-MPI Update

58

Managing IB Buffer reqs cont.

• Default is -rdma from 1 to 1024 ranks.
• Default is -srq mode for 1025 ranks or larger.
• "Base" memory is based on the number of off-host connections (N below).
• Without -srq (aka -rdma):
− base_pinned_memory = envelopes * 2 * shortlen * N
• With -srq:
− base_pinned_memory = min(N * 8, 2048) * 2 * shortlen
• envelopes = number of envelopes for each connection; the default is 8 (can be changed by the user)
• shortlen = short message length; the default is 16K for InfiniBand (uDAPL and VAPI)

Page 59: XC V3.0   HP-MPI Update

59

Managing IB Buffer reqs cont.

• For a 2048-CPU job (memory per rank):
  8 * 2 * 16K * 2047 = 524,032K  (WITHOUT srq)
  2048 * 2 * 16K     =  65,536K  (WITH srq)
• If we have two ranks on a node, the total pre-pinned memory will be around 1GB without srq and 128MB with srq.
• For 4 ranks per node (still 2048 CPUs total):
− 2048 ranks --> roughly 2GB without SRQ and 256MB with SRQ.

Page 60: XC V3.0   HP-MPI Update

60

Shared-Receive-Queue model for Dynamic Message Buffers

• HP-MPI default mode for more than 1024 ranks
• Also triggered with the -srq option to mpirun
• Shared-Receive-Queue
− A single shared-memory communication queue on each node
• Other processes write directly to this buffer.
• The buffer is in shared memory.
− The size of the queue grows with the number of ranks in the job, up to a maximum size at 1024 ranks:

  SRQ_dynamic_memory = min(Nranks, 1024) * 4 * shortlen * RanksPerNode

− shortlen = short message length, determined by the interconnect
− Nranks = number of MPI ranks in the job
− RanksPerNode = number of ranks per node
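A minimal sketch of that formula in C follows (helper name and units are assumptions); with 2048 ranks and 2 ranks per node it reproduces the 131,072K figure worked on the next slide.

/* Estimate SRQ dynamic message memory per node from the formula above.
 * Units are KB; shortlen_kb = 16 corresponds to the 16K InfiniBand value. */
#include <stdio.h>

static long srq_dynamic_memory_kb(long nranks, long ranks_per_node,
                                  long shortlen_kb)
{
    long capped = nranks < 1024 ? nranks : 1024;   /* min(Nranks, 1024) */
    return capped * 4 * shortlen_kb * ranks_per_node;
}

int main(void)
{
    /* 2048 ranks, 2 ranks per node, InfiniBand shortlen of 16K */
    printf("SRQ memory per node: %ldK\n",
           srq_dynamic_memory_kb(2048, 2, 16));    /* prints 131072K */
    return 0;
}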

Page 61: XC V3.0   HP-MPI Update

61

Comparison of Dynamic Memory Requirements for Various Jobs

• shortlen = 16K for InfiniBand (uDAPL and VAPI)
• envelopes = 8 (default)

• Dynamic memory buffer size per node for a 2048-rank job
− 1024 nodes, 2 ranks/node. Memory per node:
  RDMA: envelopes * 2 * shortlen * (Nranks - 1) * RanksPerNode
        8 * 2 * 16K * 2047 * 2 = 1,048,064K
  SRQ:  min(Nranks, 1024) * 4 * shortlen * RanksPerNode
        min(1024, 2048) * 4 * 16K * 2 = 131,072K
  With 2 ranks/node, the dynamic pre-pinned memory is therefore:
  RDMA: ~1 GB    SRQ: ~128MB
− 512 nodes, 4 ranks/node. Memory per node:
  RDMA: 8 * 2 * 16K * 2047 * 4 = 2,096,128K
  SRQ:  min(1024, 2048) * 4 * 16K * 4 = 262,144K
  With 4 ranks/node, the dynamic pre-pinned memory is therefore:
  RDMA: ~2 GB    SRQ: ~256MB

Page 62: XC V3.0   HP-MPI Update

62

Effect of PIN Percentage on Buffer Memory

Change the pin percentage to increase the amount of usable base memory.

Problem:
• a.out: Rank 0:23: MPI_Init: ERROR: The total amount of memory that may be pinned (210583540 bytes), is insufficient to support even minimal rdma network transfers. This value was derived by taking 20% of physical memory (2105835520 bytes) and dividing by the number of local ranks (2). A minimum of 253882484 bytes must be able to be pinned.

Solution:
• These values can be changed by setting the environment variables
− MPI_PIN_PERCENTAGE
− MPI_PHYSICAL_MEMORY (Mbytes)
• In this case, 210583540 bytes is about 83% of the 253882484 bytes required.
• Increasing MPI_PIN_PERCENTAGE from the default of 20% to 24% is sufficient to allow the application to run. Here is how to set it to 30%:
$MPI_ROOT/bin/mpirun -e MPI_PIN_PERCENTAGE=30 -srun ./a.out
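The arithmetic behind that error message and the 24% conclusion can be checked with a small sketch (the numbers are copied from the message above):

/* Reproduce the pin-percentage arithmetic from the error message above. */
#include <stdio.h>

int main(void)
{
    double physical_memory = 2105835520.0;   /* bytes, from the message */
    double required        = 253882484.0;    /* bytes that must be pinnable */
    int    local_ranks     = 2;

    double pinnable_at_20  = physical_memory * 0.20 / local_ranks;
    double needed_percent  = 100.0 * required * local_ranks / physical_memory;

    printf("pinnable at 20%%:  %.0f bytes\n", pinnable_at_20);  /* ~210583552 */
    printf("percentage needed: %.1f%%\n", needed_percent);      /* ~24.1      */
    return 0;
}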

Page 63: XC V3.0   HP-MPI Update

63

Managing InfiniBand Message Buffer Example

1200 ranks over InfiniBand used for this example, RDMA mode.
Memory footprint measured with 'top' (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME).

MPI_RDMA_NENVELOPE value   Memory footprint (MB)   CPU time (sec)
 2                         201                     BAD IDEA!
 4                         279                     27
 6                         356                     25
 8                         432                     21.4
10                         508                     25.4

MPI_RDMA_NENVELOPE=8 gives optimum performance at a reasonable memory footprint.

Page 64: XC V3.0   HP-MPI Update

64

Managing IB Buffer reqs cont.

• Latency and bandwidth for RDMA vs. SRQ:

                      rdma      srq
0-byte latency:       3.97us    7.09us
4M bandwidth (MB/s):  903.61    902.63

Page 65: XC V3.0   HP-MPI Update

65

SRQ Dynamic Message Memory Projections

Examples using InfiniBand:
• 1024 nodes with 4096 ranks (4 sockets)
  Memory per rank = min(1024, 1024) * 4 * 16K = 64M
  Memory per node = 64M * 4 = 256M
• 1024 nodes with 8192 ranks (dual-core, 4 sockets)
  Memory per rank = min(1024, 1024) * 4 * 16K = 64M
  Memory per node = 64M * 8 = 512M
• 2048 nodes with 8192 ranks (4 sockets)
  Memory per rank = min(2048, 1024) * 4 * 16K = 64M
  Memory per node = 64M * 4 = 256M
• 2048 nodes with 16384 ranks (dual-core, 4 sockets)
  Memory per rank = min(2048, 1024) * 4 * 16K = 64M
  Memory per node = 64M * 8 = 512M

Page 66: XC V3.0   HP-MPI Update

66

Managing InfiniBand Buffer reqs cont.
• Example of memory use for a 1536-rank run:
− Using -srq (the default at this scale):
• mpirun -srun ./xhpl
• top gives a footprint of 406m
− Using -rdma and MPI_PIN_PERCENTAGE=25:
• mpirun -e MPI_PIN_PERCENTAGE=25 -rdma -srun ./xhpl
• top gives a footprint of 737m

Page 67: XC V3.0   HP-MPI Update

67

TCP Scaleout
• TCP presents challenges in terms of the number of open sockets.
• Default: each rank opens a socket to every other rank off host.
− For a 64-node 4-core cluster, each rank opens 64*4 sockets.
− A node with 4 ranks would require 4*64*4 sockets.
• HP-MPI provides a communication daemon
− Invoked with the mpirun option -commd
− Provides a communication proxy from the ranks on the same host to all other (off-host) ranks
− Reduces the number of open sockets per commd to the number of nodes

Page 68: XC V3.0   HP-MPI Update

68

HP-MPI Performance Improvements

Page 69: XC V3.0   HP-MPI Update

69

Startup Performance Data

[Chart: startup time (0-12 seconds) for job sizes from 32 to 1300 ranks, comparing srun and appfile startup. The time is broken down into: time to rdma_connect; time to get init4 broadcast (estimated); time receiving init3 messages; time to broadcast init2 and get first init3 back; mpids connect to mpirun; waiting for first mpid to connect back.]

Page 70: XC V3.0   HP-MPI Update

70

HP-MPI V2.2 vs. HP-MPI V2.1, Pallas ping-pong bandwidth
V3.0 x86_64 using InfiniBand interconnect, XC

[Chart: ping-pong bandwidth (MB/sec, 0-1000) vs. message size, for 25 message sizes.
mpi2.1.0-8: 0 0.18 0.35 0.7 1.38 2.73 5.39 9 17.5 33.1 59.8 102 179 277 382 473 539 703 796 852 883 900 909 913 915
mpi2.1.1:   0 0.24 0.47 0.92 1.81 3.56 6.98 13.6 25.3 46.6 76.4 126 208 310 410 494 618 740 818 865 890 904 910 914 915]

Page 71: XC V3.0   HP-MPI Update

71

Pingpong Latency

HP-MPI V2.2 vs. HP-MPI V2.1, Pallas PingPong latency
V3.0 x86_64 using InfiniBand interconnect, XC

[Chart: latency (µsec, log scale 1-100) vs. message size, from 0 bytes to 64KB.
mpi2.1.0-8: 5 5 5 5 6 6 6 7 7 7 8 10 11 14 20 33 58 89
mpi2.1.1:   4 4 4 4 4 4 4 4 5 5 6 8 9 13 19 32 51 84]

Page 72: XC V3.0   HP-MPI Update

72

alltoall performance Improvements• HP-MPI V2.2 includes an implementation of MPI_Alltoall

and MPI_Alltoallv which has been shown to perform better than prior releases for TCP/IP, ITAPI and InfiniBand.

• The improvement avoids switch congestion by limiting the number of ranks that may send to a single rank at once.

• The improvement has been shown to improve the performance for most message sizes, but particularly those greater than 16KB in length.

• Measured message transmission time vs. message size, for message sizes ranging from 1 byte to 1.5MB.

• Each test was run using 3 cluster sizes: 16, 32, and 64 nodes. For each cluster size, the test was run for 2 variations – using 1 MPI process/node and using 2 MPI processes/node (1 per CPU).
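For reference, a minimal MPI_Alltoall timing sketch in the same spirit is shown below; it is an illustration only, not the benchmark actually used for the measurements above.

/* Time a single MPI_Alltoall; illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int    rank, size;
    int    count = 16384;                    /* bytes sent to each rank */
    char  *sendbuf, *recvbuf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc((size_t)count * size);
    recvbuf = malloc((size_t)count * size);
    memset(sendbuf, rank, (size_t)count * size);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Alltoall(sendbuf, count, MPI_BYTE, recvbuf, count, MPI_BYTE,
                 MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("alltoall, %d bytes per rank pair: %.1f usec\n",
               count, (t1 - t0) * 1e6);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}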

Page 73: XC V3.0   HP-MPI Update

73

MPI_Alltoall performance
V3.0 x86_64 using InfiniBand interconnect, XC

[Chart: HP-MPI MPI_Alltoall latency; time (usec) vs. message size (0 to 1,600,000 bytes), comparing the new algorithm against the old algorithm.]

Page 74: XC V3.0   HP-MPI Update

74

Improved socket progression for TCP/IP
• An environment variable is now available to improve the performance of many applications running on TCP/IP:
− MPI_SOCKBUFSIZE
• It specifies the amount of system space used for buffering within sockets.
• Using a value larger than the typical system default has been shown to improve progression, and thereby overall performance, for many communication patterns on TCP/IP.

Page 75: XC V3.0   HP-MPI Update

75

Additional References

Page 76: XC V3.0   HP-MPI Update

76

References
• HP-MPI User's Guide
• XC User's Guide
• XC How To Guide - currently in development; email requests to [email protected]

Page 77: XC V3.0   HP-MPI Update

77

Page 78: XC V3.0   HP-MPI Update

78

HP-MPI ISV and Application Support

Page 79: XC V3.0   HP-MPI Update

79

HP-MPI Object Compatibility

HP-MPI V2.1 and later is object compatible with MPICH V1.2.5 and later.

[Diagram: an MPI-1 application built shared against MPICH V1.2.5 runs with the MPICH-compatible MPI-1 interface of HP-MPI V2.1 (MPI-1/MPI-2) on Linux Itanium, Linux x86, and XC V2.0.]

Compatibility is documented in the HP-MPI V2.1 and later Release Notes.

Page 80: XC V3.0   HP-MPI Update

80

Current Applications for HP-MPI

• Currently working with ISVs across multiple segments on integrating support for HP-MPI; signed up for Linux.
• Commitments to date:

Developer            Application
Ansys                Ansys
Abaqus               Standard; Explicit
Adina                Adina
LMS                  Sysnoise
ESI                  PamCrash; PamFlow
Mecalog              Radioss
AVL                  FireV8/Swift
Acusim               Acusolve
EXA                  PowerFLOW
Accelrys             CASTEP, DMol3, MesoDyn, ONESTEP
LSTC                 MPP-Dyna
MSC                  Nastran
Fluent               Fluent
SCM                  ADF
UCSF                 Amber
Ames Lab             Gamess-US
Scripps              CHARMM
UIUC                 NAMD
LANL                 MPIblast
Univ of VA           FASTA
CPMD                 CPMD
U Karlsruhe          Turbomole
U Birmingham         MOLPRO
CD-Adapco            Star-CD
USG Nastran          Nastran
University of Texas  AMLS

Page 81: XC V3.0   HP-MPI Update

81

Backup Slides