XC V3.0 HP-MPI Update


Page 1: XC V3.0   HP-MPI Update

© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

XC V3.0 HP-MPI Update

HP-MPI HPCD r1.3, January 2006

Page 2: XC V3.0   HP-MPI Update

2

HP-MPI 2.2 and XC 3.0
• Usability
− XC jobs, srun, Lustre, ssh, 32-bit mode
• Debuggability and Profiling
− Message profiling
− Message validation library
• Communication and Cluster Health
− MPI communication
− Interconnect health check
• Scaleout
− Rank-to-core binding
− Startup, message buffers, licensing
• Performance Improvements
− InfiniBand, Ethernet

Page 3: XC V3.0   HP-MPI Update

3

HP-MPI Usability

Page 4: XC V3.0   HP-MPI Update

4

XC Job Control

LSF, SLURM, and HP-MPI are tightly coupled and are built to interact with a remote login program.

LSF - Determines WHEN the job will run. LSF talks with SLURM to determine WHICH resources will be used.

SLURM - Determines WHERE the job runs. It controls things like which host each rank runs on, and it starts the executables on each host as requested by HP-MPI's mpirun.

HP-MPI - Determines HOW the job runs. It is part of the application and handles communication. It can also pinpoint the processor on which each rank runs.

ssh/rsh - The KEY that opens up remote hosts.

[Diagram: LSF (job queueing system) decides WHEN; SLURM (cluster management and job scheduling) decides WHERE; HP-MPI (message passing) decides HOW; ssh (remote login) is the key.]

Page 5: XC V3.0   HP-MPI Update

5

HP-MPI mpirun

Useful options:
-prot             - Print the communication protocol
-np #             - Number of processes to use
-h host           - Set host to use
-e <var>[=<val>]  - Set environment variable
-d                - Debug mode
-v                - Verbose
-i file           - Write profile of MPI functions
-T                - Print user and system times for each MPI rank
-srun             - Use SLURM
-mpi32            - Use 32-bit interconnect libraries on x86-64
-mpi64            - Use 64-bit interconnect libraries on x86-64 (default)
-f appfile        - Parallelism directed from instructions in appfile

Page 6: XC V3.0   HP-MPI Update

6

mpirun: appfile mode
• appfile mode does not propagate the environment
− Need to pass LD_LIBRARY_PATH
− Need to pass application environment variables
• appfile mode with XC V2.0 or the Elan interconnect requires:
− export MPI_USESRUN=1
− All lines in the appfile need to do the same thing, but on different hosts
• ssh with appfiles requires: export MPI_REMSH=/usr/bin/ssh
• Find host names with either:
− /opt/hptc/bin/expandnodes "$SLURM_NODELIST"
− $LSB_HOSTS

Page 7: XC V3.0   HP-MPI Update

7

SLURM srun utility

srun - SLURM utility to run parallel jobs

srun usage on XC:
− Implied mode is not available with HP-MPI versions before V2.1.1 (XC V2.1) or with the Elan interconnect
− HP-MPI -srun option
• Use as: mpirun -srun <options> exe args
− HP-MPI implied srun mode
• Use as: export MPI_USESRUN=1
• Set options via: export MPI_SRUNOPTIONS=<options>
− Stand-alone executable
• srun <options> exe args

Page 8: XC V3.0   HP-MPI Update

8

Implied srun mode
• The implied srun mode allows the user to omit the -srun argument from the mpirun command line by setting the environment variable MPI_USESRUN.
− XC systems only
• Set the environment variable: % setenv MPI_USESRUN 1
• HP-MPI will insert the -srun argument.
• The following arguments are considered to be srun arguments:
− -n -N -m -w -x
− Any argument that starts with -- and is not followed by a space
− -np will be translated to -n
− -srun will be accepted without warning.
• The implied srun mode allows the use of HP-MPI appfiles.
• An appfile must be homogeneous in its arguments, with the exception of -h and -np. The -h and -np arguments within the appfile are discarded. All other arguments are promoted to the mpirun command line. Arguments following -- are also processed.

Additional environment variables provided:
• MPI_SRUNOPTIONS
− Allows additional srun options to be specified, such as --label.
• MPI_USESRUN_IGNORE_ARGS
− Provides an easy way to modify the arguments contained in an appfile by supplying a list of space-separated arguments that mpirun should ignore.

See the Release Notes for more options to the srun command.

Page 9: XC V3.0   HP-MPI Update

9

Lustre support for SFS for XC
• Lustre allows individual files to be striped over multiple OSTs (Object Storage Targets) to improve overall throughput.
• "striping_unit" = <value>
− Specifies the number of consecutive bytes of a file that are stored on a particular I/O device as part of a stripe set
• "striping_factor" = <value>
− Specifies the number of I/O devices over which the file is striped. Cannot exceed the maximum defined by the system administrator
• "start_iodevice" = <value>
− Specifies the I/O device from which striping will begin

Page 10: XC V3.0   HP-MPI Update

10

Lustre support for SFS for XC - cont.
• These hints need to be set prior to file creation so that the call to MPI_File_open can access them:

/* Set new info values before creating the file. */
MPI_Info info;
MPI_File fh;
char *value;

value = randomize_start();   /* helper that returns the starting I/O device as a string */
MPI_Info_create(&info);
MPI_Info_set(info, "striping_factor", "16");
MPI_Info_set(info, "striping_unit", "131072");
MPI_Info_set(info, "start_iodevice", value);

/* Open the file with the new info. */
MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE | MPI_MODE_RDWR,
              info, &fh);

Page 11: XC V3.0   HP-MPI Update

11

Improved ssh support
• By default, HP-MPI attempts to use rsh (remsh on HP-UX). An alternate remote execution tool can be used by setting the environment variable MPI_REMSH.
• % export MPI_REMSH=ssh
or
% $MPI_ROOT/bin/mpirun -e MPI_REMSH="ssh -x" <options> -f <appfile>
• HP is considering making ssh the default on Linux in the next release.

Page 12: XC V3.0   HP-MPI Update

12

Remote login - ssh

To disable password prompting:
• $ ssh-keygen -t dsa    (just press return for each password prompt)
• $ cd ~/.ssh
• $ cat id_dsa.pub >> authorized_keys
• $ chmod go-rwx authorized_keys
• $ chmod go-w ~ ~/.ssh
• You may need to repeat the above with rsa rather than dsa
• $ cp /etc/ssh/ssh_config $HOME/.ssh/config
• $ echo "CheckHostIP no" >> $HOME/.ssh/config
• $ echo "StrictHostKeyChecking no" >> $HOME/.ssh/config

To specify remote execution over ssh:
• Set MPI_REMSH to /usr/bin/ssh

Page 13: XC V3.0   HP-MPI Update

13

32- and 64-bit selection
• Options have been added to indicate the bitness of the application so the proper interconnect library can be invoked.
• Use -mpi32 or -mpi64 on the mpirun command line for AMD64 and EM64T.
• Default is -mpi64.
• Mellanox only provides a 64-bit IB driver.
− 32-bit apps are not supported for IB on AMD64 and EM64T systems.

Page 14: XC V3.0   HP-MPI Update

14

HP-MPI Parallel Compiler Options

Useful options:
-mpi32 - build 32-bit

Useful environment variables:
setenv MPI_CC cc    - set C compiler
setenv MPI_CXX C++  - set C++ compiler
setenv MPI_F90 f90  - set Fortran compiler
setenv MPI_ROOT dir - useful when MPI is not installed in /opt/[hpmpi|mpi]

Page 15: XC V3.0   HP-MPI Update

15

Problematic Compiler Options

Intel      PGI        Description
-static    -Bstatic   Static link - does not allow HP-MPI to determine the interconnect
-i8        -i8        If you compile with this, be sure to link with it. Intel and AMD math libraries do not support INTEGER*8.

Page 16: XC V3.0   HP-MPI Update

16

Debugging

Page 17: XC V3.0   HP-MPI Update

17

Debugging Scripts: Use the hello_world Test Case

#include <stdio.h>
#include <mpi.h>

main(argc, argv)
int   argc;
char *argv[];
{
    int  rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello world! I'm %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    exit(0);
}

Page 18: XC V3.0   HP-MPI Update

18

How to debug HP-MPI applications with a single-process debugger
• export MPI_DEBUG_CONT=1
• Set the MPI_FLAGS environment variable to choose a debugger. Values are:
− eadb - Start under adb
− exdb - Start under xdb
− edde - Start under dde
− ewdb - Start under wdb
− egdb - Start under gdb
• Set DISPLAY to point to your console.
• Use xhost to allow remote hosts to redirect their windows to your console.
• Run the application.

• export DISPLAY=hostname:1
• xhost +hostname:1
• mpirun -e MPI_FLAGS=egdb -np 2 ./a.out

Page 19: XC V3.0   HP-MPI Update

19

Attaching Debuggers to HP-MPI Applications
• HP-MPI conceptually creates its processes in MPI_Init, and each process instantiates a debugger session.
• Each debugger session in turn attaches to the process that created it.
• HP-MPI provides MPI_DEBUG_CONT to control the point at which debugger attachment occurs, via a breakpoint.
• MPI_DEBUG_CONT is a variable on which HP-MPI temporarily spins the processes until the user allows execution to proceed via debugger commands.
• By default, MPI_DEBUG_CONT is set to 0; you must set it to 1 to allow the debug session to continue past this ‘spin barrier’ in MPI_Init.
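To make the ‘spin barrier’ idea concrete, here is a generic sketch of the spin-variable pattern the bullets above describe. It is not HP-MPI's internal code; debug_cont is a made-up stand-in for the role MPI_DEBUG_CONT plays inside MPI_Init.

/* Generic sketch of a debugger "spin barrier" (illustration only; not
 * HP-MPI source). The process loops until an attached debugger sets
 * debug_cont to 1, e.g. in gdb: set var debug_cont = 1; continue */
#include <unistd.h>

volatile int debug_cont = 0;   /* stand-in for MPI_DEBUG_CONT */

static void wait_for_debugger(void)
{
    while (!debug_cont)
        sleep(1);              /* spin until the debugger releases us */
}

int main(void)
{
    wait_for_debugger();       /* attach here, set debug_cont, continue */
    return 0;
}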

Page 20: XC V3.0   HP-MPI Update

20

Debugging HP-MPI apps cont.:
The following procedure outlines the steps to follow when you use a single-process debugger:

Step 1. Set the eadb, exdb, edde, ewdb, or egdb option in the MPI_FLAGS environment variable to use the ADB, XDB, DDE, WDB, or GDB debugger.

Step 2. On remote hosts, set DISPLAY to point to your console. In addition, use xhost to allow remote hosts to redirect their windows to your console.

Step 3. Run your application:
export DISPLAY=hostname:1
mpirun -e MPI_FLAGS=egdb -np 2 ./a.out

Page 21: XC V3.0   HP-MPI Update

21

Debugging HP-MPI apps cont.:

Page 22: XC V3.0   HP-MPI Update

22

Debugging HP-MPI apps cont.:

Page 23: XC V3.0   HP-MPI Update

23

Profiling

Page 24: XC V3.0   HP-MPI Update

24

Profiling
• Instrumentation
− Lightweight method for cumulative runtime statistics
− Profiles for applications linked with the standard HP-MPI library
− Profiles for applications linked with the thread-compliant library

Page 25: XC V3.0   HP-MPI Update

25

HP-MPI instrumentation profile:

-i <myfile>[:opt] - produces a rank-by-rank summary of where MPI spends its time and places the result in the file myfile.trace

bsub -I -n4 mpirun -i myfile -srun ./a.out

Application Summary by Rank (seconds):

Rank   Proc CPU Time   User Portion         System Portion
-----------------------------------------------------------------
   0        0.040000   0.030000( 75.00%)    0.010000( 25.00%)
   1        0.050000   0.040000( 80.00%)    0.010000( 20.00%)
   2        0.050000   0.040000( 80.00%)    0.010000( 20.00%)
   3        0.050000   0.040000( 80.00%)    0.010000( 20.00%)

Page 26: XC V3.0   HP-MPI Update

26

HP-MPI instrumentation continued
• Routine Summary by Rank:

Rank  Routine        Statistic  Calls  Overhead(ms)  Blocking(ms)
-----------------------------------------------------------------
   0  MPI_Bcast                     4      7.127285      0.000000
                     min                   0.033140      0.000000
                     max                   5.244017      0.000000
                     avg                   1.781821      0.000000
      MPI_Finalize                  1      0.034094      0.000000
      MPI_Init                      1   1080.793858      0.000000
      MPI_Recv                   2010      3.236055      0.000000

Page 27: XC V3.0   HP-MPI Update

27

HP-MPI instrumentation continued
• Message Summary by Rank Pair:

SRank  DRank  Messages  (minsize,maxsize)/[bin]  Totalbytes
-----------------------------------------------------------------
    0      1      1005  (0, 0)                            0
                  1005  [0..64]                           0
           3      1005  (0, 0)                            0
                  1005  [0..64]                           0

Page 28: XC V3.0   HP-MPI Update

28

OPROFILE Profiling example
• oprofile is configured in XC, but not enabled
• You need to be root to enable it on a node

# opcontrol --no-vmlinux
# opcontrol --start
Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1
Using 2.6+ OProfile kernel interface.
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.

Clear out old performance data:
# opcontrol --reset
Signalling daemon... done

Page 29: XC V3.0   HP-MPI Update

29

OPROFILE Profiling example cont.
• Run your application
# bsub -I -n4 -ext "SLURM[nodelist=xcg14]" ./run_linux_amd_intel 4 121 test

• Find the name of your executable
# opreport --long-filenames

• Generate a report for that executable image
# opreport -l /mlibscratch/lieb/mpi2005.kit23/benchspec/MPI2005/121.pop2/run/run_base_test_intel.0001/pop2_base.intel | more

Page 30: XC V3.0   HP-MPI Update

30

OPROFILE Profiling example cont.

Page 31: XC V3.0   HP-MPI Update

31

OPROFILE Profiling kernel symbols
The actual version of the RPM may change.

• The vmlinux file is contained in the kernel debug RPM:
− kernel-debuginfo-2.6.9-11.4hp.XC.x86_64.rpm
• The kernel symbols file is installed in:
− /usr/lib/debug/lib/modules/2.6.9-11.4hp.XCsmp/vmlinux
• opcontrol --vmlinux=/usr/lib/debug/lib/modules/2.6.9-11.4hp.XCsmp/vmlinux

Page 32: XC V3.0   HP-MPI Update

32

Diagnostic Library
− Advanced run-time error checking and analysis
− Message signature analysis detects type mismatches
− Object-space corruption checking detects attempts to write into objects
− Detects operations that cause MPI to write to a user buffer more than once
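As an illustration of the kind of defect the message-signature analysis is meant to flag, here is a small hypothetical test program (not taken from the slides): rank 0 sends integers while rank 1 receives them as doubles, a type-signature mismatch that runs silently without the diagnostic library.

/* Hypothetical mismatch test: send MPI_INT, receive MPI_DOUBLE.
 * The message fits in the receive buffer, so plain MPI runs silently;
 * the diagnostic library's signature analysis is designed to flag it. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int    rank;
    int    idata[4] = {1, 2, 3, 4};
    double ddata[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(idata, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(ddata, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}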

Page 33: XC V3.0   HP-MPI Update

33

HP-MPI Diagnostic Library
• Link with -ldmpi to enable the diagnostic library, or
• Use LD_PRELOAD on an existing pre-linked application (shared libs)
− This dynamically inserts the diagnostic library:
• mpirun -e LD_PRELOAD=libdmpi.so:libmpi.so -srun ./a.out
• This can also dump message formats (the output could be REALLY large):
• mpirun -e LD_PRELOAD=libdmpi.so:libmpi.so -e MPI_DLIB_FLAGS=dump:foof -srun ./a.out
• See "MPI_DLIB_FLAGS" on page 46 of the User's Guide, or man mpienv, for more information on controlling features.

Page 34: XC V3.0   HP-MPI Update

34

HP-MPI Communication

Page 35: XC V3.0   HP-MPI Update

35

HP-MPI Communication

Movement of data depends on the relative location of the destination and on the interconnect. The paths are:
• Communication within a node (shared memory)
• Communication from node to node over TCP/IP
• Communication from node to node over high-speed interconnects: InfiniBand, Quadrics, Myrinet

Page 36: XC V3.0   HP-MPI Update

36

HP-MPI Communication within a Node

To send data from Core 1 to Core 4:
Core 1 -> Core 1 local memory
Core 1 local memory* -> system shared memory**
System shared memory -> Core 4 local memory
Core 4 local memory -> Core 4

*The operating system makes local memory available to a single process.
**The operating system makes shared memory available to multiple processes.

[Diagram: two memory/core pairs (Core 1/Core 2 and Core 3/Core 4) connected by a bus; data moves between them through shared memory.]

Page 37: XC V3.0   HP-MPI Update

37

HP-MPI Communication to another Node via TCP/IP

To send data from Core 1, Node 1 to Core 1, Node 2:
Core 1, Node 1 -> Core 1, Node 1 local memory
Core 1, Node 1 local memory -> Node 1 shared memory
Node 1 shared memory -> interconnect
Interconnect -> Node 2 shared memory
Node 2 shared memory -> Core 1, Node 2 local memory
Core 1, Node 2 local memory -> Core 1, Node 2

The core is used to send data to the TCP/IP interconnect.

[Diagram: two nodes, each with memory/core pairs on a bus, connected by TCP/IP; a core on each node drives the data across the interconnect.]

Page 38: XC V3.0   HP-MPI Update

38

HP-MPI Communication to another Node via other Interconnects

To send data from Core 1, Node 1 to Core 1, Node 2:
Core 1, Node 1 -> Core 1, Node 1 local memory
Core 1, Node 1 local memory -> Node 1 shared memory
Node 1 shared memory -> interconnect
Interconnect -> Node 2 shared memory
Node 2 shared memory -> Core 1, Node 2 local memory
Core 1, Node 2 local memory -> Core 1, Node 2

[Diagram: two nodes connected by a high-speed interconnect; the RDMA engine on each node moves the data rather than a core.]

Page 39: XC V3.0   HP-MPI Update

39

HP-MPI options to set interconnect

mpirun options (default order comes from MPI_IC_ORDER):
-vapi/-VAPI   - InfiniBand with the VAPI protocol
-udapl/-UDAPL - InfiniBand with the uDAPL protocol
-itapi/-ITAPI - InfiniBand with the IT-API protocol (deprecated on Linux)
-elan/-ELAN   - Elan (Quadrics)
-mx/-MX       - Myrinet MX
-gm/-GM       - Myrinet GM
-TCP          - TCP/IP

Lowercase requests the interconnect (tries in the order given by MPI_IC_ORDER).
Uppercase demands the interconnect (failure to meet the demand is an error).

Page 40: XC V3.0   HP-MPI Update

40

X86-64: 32-bit versus 64-bit Interconnect Support

• Supported 64-bit interconnects:

• TCP/IP

• GigE

• InfiniBand

• Elan

• Myrinet

• Supported 32-bit interconnects:

• TCP/IP

• Myrinet

• InfiniBand (but not 32 bit mode on 64 bit architectures)

Page 41: XC V3.0   HP-MPI Update

41

Cluster Interconnect Status Check

Page 42: XC V3.0   HP-MPI Update

42

Cluster Interconnect Status

• ‘-prot’ displays the protocol in use
− Possibilities: VAPI SHM UDPL GM MX IT ELAN
− mpirun -prot -srun ./hello.x
• Measure bandwidth between pairs of nodes using ping_pong_ring.c
− Compile the copy shipped in /opt/hpmpi/help/ping_pong_ring.c, e.g. with $MPI_ROOT/bin/mpicc -o ppring.x
− bsub -I -n12 -ext "SLURM[nodes=12]" /opt/hpmpi/bin/mpirun -srun ./ppring.x 300000
• Exclude "suspect" nodes explicitly
− bsub -ext "SLURM[nodes=12;exclude=n[1-4]]"
• Include "suspect" nodes explicitly
− bsub -ext "SLURM[nodes=12;include=n[1-4]]"

Page 43: XC V3.0   HP-MPI Update

43

Cluster Interconnect Status

bsub -I -n4 -ext "SLURM[nodes=4]" /opt/hpmpi/bin/mpirun -prot -srun ./a.out 300000
...
 host | 0     1     2     3
======|=====================
   0  : SHM   VAPI  VAPI  VAPI
   1  : VAPI  SHM   VAPI  VAPI
   2  : VAPI  VAPI  SHM   VAPI
   3  : VAPI  VAPI  VAPI  SHM

[0:xcg1] ping-pong 300000 bytes ...
300000 bytes: 345.70 usec/msg
300000 bytes: 867.80 MB/sec
[1:xcg2] ping-pong 300000 bytes ...
300000 bytes: 700.44 usec/msg
300000 bytes: 434.46 MB/sec
[2:xcg3] ping-pong 300000 bytes ...
300000 bytes: 690.76 usec/msg
300000 bytes: 434.64 MB/sec
[3:xcg4] ping-pong 300000 bytes ...
300000 bytes: 345.79 usec/msg
300000 bytes: 867.59 MB/sec

Communication from xcg2 to xcg3 is off by 50%, as is communication from xcg3 to xcg4. xcg3 likely has a bad IB cable.

Page 44: XC V3.0   HP-MPI Update

44

Cluster Interconnect Status
• Expected performance by interconnect, and recommended message size to use:

Interconnect               Expected Perf      Msg size
− InfiniBand PCI-X         650-700 MB/sec     200000
− InfiniBand PCI-Express   850-900 MB/sec     400000
− Quadrics Elan4           800-820 MB/sec     200000
− Myrinet GM rev E         420-450 MB/sec     100000
− Myrinet GM rev D/F       220-250 MB/sec     100000
− GigE Ethernet            100-110 MB/sec      30000
− 100BaseT Ethernet        10-12 MB/sec         5000
− 10BaseT Ethernet         1-2 MB/sec            500

• (who would want to test 10BaseT?)

Page 45: XC V3.0   HP-MPI Update

45

Cluster Interconnect Status
• The following failure signature indicates that node n611 has a bad HCA or driver:

p.x: Rank 0:455: MPI_Init: EVAPI_get_hca_hndl() failed
p.x: Rank 0:455: MPI_Init: didn't find active interface/port
p.x: Rank 0:455: MPI_Init: Can't initialize RDMA device
p.x: Rank 0:455: MPI_Init: MPI BUG: Cannot initialize RDMA protocol
srun: error: n611: task455: Exited with exit code 1

• Rerun and exclude the node in question; also report the suspect node to your sysadmin.
− bsub -ext "SLURM[exclude=n611]" mpirun ...

Page 46: XC V3.0   HP-MPI Update

46

HP-MPI CPU Affinity control

Page 47: XC V3.0   HP-MPI Update

47

HP-MPI support for Process binding
• srun distributes ranks across nodes; -cpu_bind binds them to cores within each node
− mpirun -cpu_bind=[v,][policy[:maplist]] -srun a.out
− [v] requests info on what binding is performed
• Policy is one of:
− LL | RANK | LDOM | RR | RR_LL | CYCLIC | FILL | FILL_LL |
− BLOCK | MAP_CPU | MAP_LDOM | PACKED | HELP
− MAP_CPU and MAP_LDOM take a list of cpu numbers
• Example: bsub -I -n8 mpirun -cpu_bind=v,MAP_CPU:0,2,1,3 -srun ./a.out

... This is the map info for the 2nd node:
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 4 pid 7156 on host dlcore1.rsn.hp.com to cpu 0
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 5 pid 7159 on host dlcore1.rsn.hp.com to cpu 2
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 6 pid 7157 on host dlcore1.rsn.hp.com to cpu 1
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 7 pid 7158 on host dlcore1.rsn.hp.com to cpu 3
...
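To see what -cpu_bind actually did, a small hypothetical test program like the following can be run under mpirun; each rank prints the CPUs its affinity mask allows (Linux-specific, not part of HP-MPI):

/* Hypothetical binding check: print each rank's allowed CPUs.
 * Uses the Linux sched_getaffinity() call; compile with an MPI C compiler. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, cpu;
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        printf("rank %d allowed cpus:", rank);
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &mask))
                printf(" %d", cpu);
        printf("\n");
    }
    MPI_Finalize();
    return 0;
}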

Page 48: XC V3.0   HP-MPI Update

48

HP-MPI support for Process binding
• $MPI_ROOT/bin/mpirun -cpu_bind=help ./a.out

-cpu_binding help info
cpu binding methods available:
  rank     - schedule ranks on cpus according to packed rank id
  map_cpu  - schedule ranks on cpus in cycle thru MAP variable
  mask_cpu - schedule ranks on cpu masks in cycle thru MAP variable
  ll       - bind each rank to cpu each is currently running on
for numa based systems the following are also available:
  ldom     - schedule ranks on ldoms according to packed rank id
  cyclic   - cyclic dist on each ldom according to packed rank id
  block    - block dist on each ldom according to packed rank id
  rr       - same as cyclic, but consider ldom load avg.
  fill     - same as block, but consider ldom load avg.
  packed   - bind all ranks to the same ldom as lowest rank
  slurm    - slurm binding
  ll       - bind each rank to ldom each is currently running on
  map_ldom - schedule ranks on ldoms in cycle thru MAP variable

Page 49: XC V3.0   HP-MPI Update

49

Memory Models

Examples of NUMA or NUMA-like systems:
• A dual-core Opteron has (in effect) local and remote memories and is considered a NUMA system.
• A single-core Opteron with its own memory controller is considered a NUMA-like system.
• A cell-based Itanium SMP system is considered a NUMA system.

[Diagram: NUMA vs. NUMA-like layouts; each LDOM (local memory) serves its attached cores.]

Page 50: XC V3.0   HP-MPI Update

50

Example of Rank and LDOM distributions

mpirun -np 8 -srun -m=cyclic
causes ranks and Packed Rank IDs to be distributed across two 4-core hosts as:

HOST 1
  LDOM 0: Rank 0 (Packed Rank ID 0), Rank 4 (Packed Rank ID 2)
  LDOM 1: Rank 2 (Packed Rank ID 1), Rank 6 (Packed Rank ID 3)

HOST 2
  LDOM 0: Rank 1 (Packed Rank ID 0), Rank 5 (Packed Rank ID 2)
  LDOM 1: Rank 3 (Packed Rank ID 1), Rank 7 (Packed Rank ID 3)

Page 51: XC V3.0   HP-MPI Update

51

Another Example of Rank and LDOM distributions

mpirun -np 8 -srun -m=block
causes ranks and Packed Rank IDs to be distributed across two 4-core hosts as:

HOST 1
  LDOM 0: Rank 0 (Packed Rank ID 0), Rank 2 (Packed Rank ID 2)
  LDOM 1: Rank 1 (Packed Rank ID 1), Rank 3 (Packed Rank ID 3)

HOST 2
  LDOM 0: Rank 4 (Packed Rank ID 0), Rank 6 (Packed Rank ID 2)
  LDOM 1: Rank 5 (Packed Rank ID 1), Rank 7 (Packed Rank ID 3)

Page 52: XC V3.0   HP-MPI Update

52

Options for NUMA or NUMA-like systems

rank - Assign MPI rank N to CPU N (default).
map_cpu:MAP - Schedule ranks on cores, cycling through MAP. MAP is a comma-separated list giving the order in which to use cores on a machine.
mask_cpu:MAP - Schedule ranks on core masks, cycling through MAP. MAP is a comma-separated list of masks; (mask value AND processor number) determines the group of cores to use.
ll - The MPI process spins before binding; the OS moves the process to the least-loaded processor, and MPI keeps it where it was moved.
v - Verbose.

Page 53: XC V3.0   HP-MPI Update

53

Options for NUMA systems

These options choose the cores to bind to based on the core's ldom (local memory):
ldom - schedule ranks on ldoms according to packed rank id
cyclic - round-robin distribution on each ldom according to packed rank id
block - block distribution on each ldom according to packed rank id (default)

Starting from the least-loaded core:
rr - round-robin distribution on each ldom according to packed rank id
fill - block distribution on each ldom according to packed rank id
packed - bind all MPI processes to the chosen ldom
map_ldom:MAP - schedule ranks on ldoms, cycling through a comma-separated list

Page 54: XC V3.0   HP-MPI Update

54

ccNUMA and I/O buffer-cache Interaction

• On Opteron systems, memory can be either 100% interleaved among processors or 100% processor-local.
− For best performance, we use processor-local memory.
• Linux can use all available memory for I/O buffering.
• When a user process requests local memory and the local memory is in use for I/O buffering, Linux assigns memory on another processor, giving worst-case latency.
• Given user demand for local memory, Linux frees the I/O buffers over time, at which point the best runtime is achieved.

[Diagram: DL585 4-processor/8-core system; each LDOM (local memory) serves two cores.]

Page 55: XC V3.0   HP-MPI Update

55

HP-MPI Scaleout

Page 56: XC V3.0   HP-MPI Update

56

HP-MPI Scaleout Challenges

• Scalable process startup
− Reducing the number of open sockets
− Tree structure of MPI daemons
− Now handles > 256 MPI ranks (srun and appfile)
• Scalable teardown of processes
• Scalable licensing
− Rank 0 checks for an N-rank license.
• Scalable setup data
− Reduced Init4 message size by 96%
• Managing IB buffer requirements
− Physical memory pinning
• 1-sided lock/unlock now goes over IB if using VAPI

Page 57: XC V3.0   HP-MPI Update

57

Managing IB Buffer requirements

• Two modes: RDMA and Shared-Receive-Queue (SRQ)
• The amount of memory pinned (locked in physical memory) has two parts:
− 1) memory that is always pinned (base)
− 2) memory that may be pinned depending on communication (dynamic)
• maximum_dynamic_pinned_memory = min(2 * max_messages * chunksize, (physical_memory / local_ranks) * pin_percentage)
− max_messages is 3 * remote connections, and chunksize varies depending on the protocol:
• for IB it is 4MB and for GM it is 1MB.
− maximum_dynamic_pinned_memory <= MPI_PIN_PERCENTAGE of the rank's portion of physical memory. For large clusters, the limit will generally be based on the pin percentage, since 2*max_messages*chunksize gets large for even moderate clusters.
− MPI_PIN_PERCENTAGE is 20% by default, but can be changed by the user.
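A back-of-the-envelope sketch of that min() formula follows; the input numbers are illustrative assumptions, and the real limit is computed inside HP-MPI.

/* Estimate the dynamic pinned-memory cap from the formula above.
 * All inputs are illustrative assumptions, not HP-MPI defaults. */
#include <stdio.h>

static double max_dynamic_pinned(double max_messages, double chunksize,
                                 double physical_memory, int local_ranks,
                                 double pin_percentage)
{
    double by_messages = 2.0 * max_messages * chunksize;
    double by_memory   = (physical_memory / local_ranks) * pin_percentage;
    return by_messages < by_memory ? by_messages : by_memory;
}

int main(void)
{
    /* 64 remote connections -> max_messages = 3 * 64; IB chunksize 4MB;
     * 2 GB node, 2 local ranks, 20% pin percentage */
    double cap = max_dynamic_pinned(3 * 64, 4.0 * 1024 * 1024,
                                    2.0 * 1024 * 1024 * 1024, 2, 0.20);
    printf("dynamic pinned-memory cap: %.0f bytes (~205 MB)\n", cap);
    return 0;
}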

Page 58: XC V3.0   HP-MPI Update

58

Managing IB Buffer reqs cont.

• Default is -rdma from 1 to 1024 ranks.
• Default is -srq mode for 1025 ranks or larger.
• "Base" memory is based on the number of off-host connections (N below).
• Without -srq (aka -rdma):
− base_pinned_memory = envelopes * 2 * shortlen * N
• With -srq:
− base_pinned_memory = min(N * 8, 2048) * 2 * shortlen
• envelopes = number of envelopes for each connection; the default is 8 (can be changed by the user)
• shortlen = short message length; the default is 16K for InfiniBand (uDAPL and VAPI)

Page 59: XC V3.0   HP-MPI Update

59

Managing IB Buffer reqs cont.

• For a 2048-CPU job (memory per rank):
  8 * 2 * 16K * 2047 = 524,032K  (WITHOUT srq)
  2048 * 2 * 16K     =  65,536K  (WITH srq)
• If we have two ranks on a node, the total pre-pinned memory will be around 1GB without srq and 128MB with srq.
• For 4 ranks per node (still 2048 CPUs total):
− 2048 ranks --> roughly 2GB without SRQ and 256MB with SRQ.

Page 60: XC V3.0   HP-MPI Update

60

Shared-Receive-Queue model for Dynamic Message Buffers

• HP-MPI default mode for more than 1024 ranks
• Also triggered with the -srq option to mpirun
• Shared-Receive-Queue
− A single shared-memory communication queue on each node
• Other processes write directly to this buffer.
• The buffer is in shared memory.
− The size of the queue grows with the number of ranks in the job, up to a maximum size at 1024 ranks:

  SRQ_dynamic_memory = min(Nranks, 1024) * 4 * shortlen * RanksPerNode

− shortlen = short message length, determined by the interconnect
− Nranks = number of MPI ranks in the job
− RanksPerNode = number of ranks per node
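A minimal sketch of that formula in C follows (helper name and units are assumptions); with 2048 ranks and 2 ranks per node it reproduces the 131,072K figure worked on the next slide.

/* Estimate SRQ dynamic message memory per node from the formula above.
 * Units are KB; shortlen_kb = 16 corresponds to the 16K InfiniBand value. */
#include <stdio.h>

static long srq_dynamic_memory_kb(long nranks, long ranks_per_node,
                                  long shortlen_kb)
{
    long capped = nranks < 1024 ? nranks : 1024;   /* min(Nranks, 1024) */
    return capped * 4 * shortlen_kb * ranks_per_node;
}

int main(void)
{
    /* 2048 ranks, 2 ranks per node, InfiniBand shortlen of 16K */
    printf("SRQ memory per node: %ldK\n",
           srq_dynamic_memory_kb(2048, 2, 16));    /* prints 131072K */
    return 0;
}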

Page 61: XC V3.0   HP-MPI Update

61

Comparison of Dynamic Memory Requirements for Various Jobs

• shortlen = 16K for InfiniBand (uDAPL and VAPI)
• envelopes = 8 (default)

• Dynamic memory buffer size per node for a 2048-rank job
− 1024 nodes, 2 ranks/node. Memory per node:
  RDMA: envelopes * 2 * shortlen * (Nranks - 1) * RanksPerNode
        8 * 2 * 16K * 2047 * 2 = 1,048,064K
  SRQ:  min(Nranks, 1024) * 4 * shortlen * RanksPerNode
        min(1024, 2048) * 4 * 16K * 2 = 131,072K
  With 2 ranks/node, the dynamic pre-pinned memory is therefore:
  RDMA: ~1 GB    SRQ: ~128MB
− 512 nodes, 4 ranks/node. Memory per node:
  RDMA: 8 * 2 * 16K * 2047 * 4 = 2,096,128K
  SRQ:  min(1024, 2048) * 4 * 16K * 4 = 262,144K
  With 4 ranks/node, the dynamic pre-pinned memory is therefore:
  RDMA: ~2 GB    SRQ: ~256MB

Page 62: XC V3.0   HP-MPI Update

62

Effect of PIN Percentage on Buffer Memory

Change the pin percentage to increase the amount of usable base memory.

Problem:
• a.out: Rank 0:23: MPI_Init: ERROR: The total amount of memory that may be pinned (210583540 bytes), is insufficient to support even minimal rdma network transfers. This value was derived by taking 20% of physical memory (2105835520 bytes) and dividing by the number of local ranks (2). A minimum of 253882484 bytes must be able to be pinned.

Solution:
• These values can be changed by setting the environment variables
− MPI_PIN_PERCENTAGE
− MPI_PHYSICAL_MEMORY (Mbytes)
• In this case, 210583540 bytes is about 83% of the 253882484 bytes required.
• Increasing MPI_PIN_PERCENTAGE from the default of 20% to 24% is sufficient to allow the application to run. Here is how to set it to 30%:
$MPI_ROOT/bin/mpirun -e MPI_PIN_PERCENTAGE=30 -srun ./a.out
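The arithmetic behind that error message and the 24% conclusion can be checked with a small sketch (the numbers are copied from the message above):

/* Reproduce the pin-percentage arithmetic from the error message above. */
#include <stdio.h>

int main(void)
{
    double physical_memory = 2105835520.0;   /* bytes, from the message */
    double required        = 253882484.0;    /* bytes that must be pinnable */
    int    local_ranks     = 2;

    double pinnable_at_20  = physical_memory * 0.20 / local_ranks;
    double needed_percent  = 100.0 * required * local_ranks / physical_memory;

    printf("pinnable at 20%%:  %.0f bytes\n", pinnable_at_20);  /* ~210583552 */
    printf("percentage needed: %.1f%%\n", needed_percent);      /* ~24.1      */
    return 0;
}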

Page 63: XC V3.0   HP-MPI Update

63

Managing InfiniBand Message Buffer Example

1200 ranks over InfiniBand used for this example, RDMA mode.
Memory footprint measured with 'top' (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME).

MPI_RDMA_NENVELOPE value   Memory footprint (MB)   CPU time (sec)
 2                         201                     BAD IDEA!
 4                         279                     27
 6                         356                     25
 8                         432                     21.4
10                         508                     25.4

MPI_RDMA_NENVELOPE=8 gives optimum performance at a reasonable memory footprint.

Page 64: XC V3.0   HP-MPI Update

64

Managing IB Buffer reqs cont.

• Latency and bandwidth for RDMA vs. SRQ:

                      rdma      srq
0-byte latency:       3.97us    7.09us
4M bandwidth (MB/s):  903.61    902.63

Page 65: XC V3.0   HP-MPI Update

65

SRQ Dynamic Message Memory Projections

Examples using InfiniBand:
• 1024 nodes with 4096 ranks (4 sockets)
  Memory per rank = min(1024, 1024) * 4 * 16K = 64M
  Memory per node = 64M * 4 = 256M
• 1024 nodes with 8192 ranks (dual-core, 4 sockets)
  Memory per rank = min(1024, 1024) * 4 * 16K = 64M
  Memory per node = 64M * 8 = 512M
• 2048 nodes with 8192 ranks (4 sockets)
  Memory per rank = min(2048, 1024) * 4 * 16K = 64M
  Memory per node = 64M * 4 = 256M
• 2048 nodes with 16384 ranks (dual-core, 4 sockets)
  Memory per rank = min(2048, 1024) * 4 * 16K = 64M
  Memory per node = 64M * 8 = 512M

Page 66: XC V3.0   HP-MPI Update

66

Managing InfiniBand Buffer reqs cont.
• Example of memory use for a 1536-rank run:
− Using -srq (the default at this scale):
• mpirun -srun ./xhpl
• top gives a footprint of 406m
− Using -rdma and MPI_PIN_PERCENTAGE=25:
• mpirun -e MPI_PIN_PERCENTAGE=25 -rdma -srun ./xhpl
• top gives a footprint of 737m

Page 67: XC V3.0   HP-MPI Update

67

TCP Scaleout
• TCP presents challenges in terms of the number of open sockets.
• Default: each rank opens a socket to every other rank off host.
− For a 64-node 4-core cluster, each rank opens 64*4 sockets.
− A node with 4 ranks would require 4*64*4 sockets.
• HP-MPI provides a communication daemon
− Invoked with the mpirun option -commd
− Provides a communication proxy from the ranks on the same host to all other (off-host) ranks
− Reduces the number of open sockets per commd to the number of nodes

Page 68: XC V3.0   HP-MPI Update

68

HP-MPI Performance Improvements

Page 69: XC V3.0   HP-MPI Update

69

Startup Performance Data

[Chart: startup time (0-12 seconds) for job sizes from 32 to 1300 ranks, comparing srun and appfile startup. The time is broken down into: time to rdma_connect; time to get init4 broadcast (estimated); time receiving init3 messages; time to broadcast init2 and get first init3 back; mpids connect to mpirun; waiting for first mpid to connect back.]

Page 70: XC V3.0   HP-MPI Update

70

HP-MPI V2.2 vs. HP-MPI V2.1, Pallas ping-pong bandwidth
V3.0 x86_64 using InfiniBand interconnect, XC

[Chart: ping-pong bandwidth (MB/sec, 0-1000) vs. message size, for 25 message sizes.
mpi2.1.0-8: 0 0.18 0.35 0.7 1.38 2.73 5.39 9 17.5 33.1 59.8 102 179 277 382 473 539 703 796 852 883 900 909 913 915
mpi2.1.1:   0 0.24 0.47 0.92 1.81 3.56 6.98 13.6 25.3 46.6 76.4 126 208 310 410 494 618 740 818 865 890 904 910 914 915]

Page 71: XC V3.0   HP-MPI Update

71

Pingpong Latency

HP-MPI V2.2 vs. HP-MPI V2.1, Pallas PingPong latency
V3.0 x86_64 using InfiniBand interconnect, XC

[Chart: latency (µsec, log scale 1-100) vs. message size, from 0 bytes to 64KB.
mpi2.1.0-8: 5 5 5 5 6 6 6 7 7 7 8 10 11 14 20 33 58 89
mpi2.1.1:   4 4 4 4 4 4 4 4 5 5 6 8 9 13 19 32 51 84]

Page 72: XC V3.0   HP-MPI Update

72

alltoall performance Improvements• HP-MPI V2.2 includes an implementation of MPI_Alltoall

and MPI_Alltoallv which has been shown to perform better than prior releases for TCP/IP, ITAPI and InfiniBand.

• The improvement avoids switch congestion by limiting the number of ranks that may send to a single rank at once.

• The improvement has been shown to improve the performance for most message sizes, but particularly those greater than 16KB in length.

• Measured message transmission time vs. message size, for message sizes ranging from 1 byte to 1.5MB.

• Each test was run using 3 cluster sizes: 16, 32, and 64 nodes. For each cluster size, the test was run for 2 variations – using 1 MPI process/node and using 2 MPI processes/node (1 per CPU).
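For reference, a minimal MPI_Alltoall timing sketch in the same spirit is shown below; it is an illustration only, not the benchmark actually used for the measurements above.

/* Time a single MPI_Alltoall; illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int    rank, size;
    int    count = 16384;                    /* bytes sent to each rank */
    char  *sendbuf, *recvbuf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc((size_t)count * size);
    recvbuf = malloc((size_t)count * size);
    memset(sendbuf, rank, (size_t)count * size);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Alltoall(sendbuf, count, MPI_BYTE, recvbuf, count, MPI_BYTE,
                 MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("alltoall, %d bytes per rank pair: %.1f usec\n",
               count, (t1 - t0) * 1e6);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}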

Page 73: XC V3.0   HP-MPI Update

73

MPI_Alltoall performance
V3.0 x86_64 using InfiniBand interconnect, XC

[Chart: HP-MPI MPI_Alltoall latency; time (usec) vs. message size (0 to 1,600,000 bytes), comparing the new algorithm against the old algorithm.]

Page 74: XC V3.0   HP-MPI Update

74

Improved socket progression for TCP/IP
• An environment variable is now available to improve the performance of many applications running on TCP/IP:
− MPI_SOCKBUFSIZE
• It specifies the amount of system space used for buffering within sockets.
• Using a value larger than the typical system default has been shown to improve progression, and thereby overall performance, for many communication patterns on TCP/IP.

Page 75: XC V3.0   HP-MPI Update

75

Additional References

Page 76: XC V3.0   HP-MPI Update

76

References
• HP-MPI User's Guide
• XC User's Guide
• XC How To Guide - currently in development; email requests to [email protected]

Page 77: XC V3.0   HP-MPI Update

77

Page 78: XC V3.0   HP-MPI Update

78

HP-MPI ISV and Application Support

Page 79: XC V3.0   HP-MPI Update

79

HP-MPI Object Compatibility

HP-MPI V2.1 and later is object compatible with MPICH V1.2.5 and later.

[Diagram: an MPI-1 application built shared against MPICH V1.2.5 runs with the MPICH-compatible MPI-1 interface of HP-MPI V2.1 (MPI-1/MPI-2) on Linux Itanium, Linux x86, and XC V2.0.]

Compatibility is documented in the HP-MPI V2.1 and later Release Notes.

Page 80: XC V3.0   HP-MPI Update

80

Current Applications for HP-MPI

• Currently working with ISVs across multiple segments on integrating support for HP-MPI; signed up for Linux.
• Commitments to date:

Developer            Application
Ansys                Ansys
Abaqus               Standard; Explicit
Adina                Adina
LMS                  Sysnoise
ESI                  PamCrash; PamFlow
Mecalog              Radioss
AVL                  FireV8/Swift
Acusim               Acusolve
EXA                  PowerFLOW
Accelrys             CASTEP, DMol3, MesoDyn, ONESTEP
LSTC                 MPP-Dyna
MSC                  Nastran
Fluent               Fluent
SCM                  ADF
UCSF                 Amber
Ames Lab             Gamess-US
Scripps              CHARMM
UIUC                 NAMD
LANL                 MPIblast
Univ of VA           FASTA
CPMD                 CPMD
U Karlsruhe          Turbomole
U Birmingham         MOLPRO
CD-Adapco            Star-CD
USG Nastran          Nastran
University of Texas  AMLS

Page 81: XC V3.0   HP-MPI Update

81

Backup Slides