Optimizing Threaded MPI Execution on SMP Clusters
Hong Tang and Tao Yang
Department of Computer Science
University of California, Santa Barbara
Parallel Computation on SMP Clusters
- Massively Parallel Machines → SMP Clusters
- Commodity components: off-the-shelf processors + fast network (Myrinet, Fast/Gigabit Ethernet)
- Parallel programming models for SMP clusters:
  - MPI: portability, performance, legacy programs
  - MPI + variations: MPI + multithreading, MPI + OpenMP
Threaded MPI Execution
- MPI paradigm: separated address spaces for different MPI nodes.
- Natural solution: MPI nodes → processes.
- What if we map MPI nodes to threads?
  - Faster synchronization among MPI nodes running on the same machine.
  - Demonstrated in previous work [PPoPP '99] for a single shared memory machine (developed techniques to safely execute MPI programs using threads).
- Threaded MPI execution on SMP clusters:
  - Intra-machine communication through shared memory.
  - Inter-machine communication through the network.
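For concreteness, here is a minimal sketch of the thread mapping (not TMPI's actual code; node_main and the rank-passing scheme are illustrative): each MPI node becomes a pthread inside one process, so all nodes on a machine share an address space.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_NODES 4

    /* Illustrative per-node entry point; a threaded MPI runtime would
       run the transformed user program here with the given rank. */
    static void *node_main(void *arg) {
        long rank = (long)arg;
        printf("MPI node %ld running as a thread\n", rank);
        /* ... user code; intra-machine messages can be handed over
           through shared memory instead of inter-process IPC ... */
        return NULL;
    }

    int main(void) {
        pthread_t nodes[NUM_NODES];
        for (long i = 0; i < NUM_NODES; i++)
            pthread_create(&nodes[i], NULL, node_main, (void *)i);
        for (int i = 0; i < NUM_NODES; i++)
            pthread_join(nodes[i], NULL);
        return 0;
    }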
Threaded MPI Execution Benefits Inter-Machine Communication
- Common intuition: inter-machine communication cost is dominated by network delay, so the advantage of executing MPI nodes as threads diminishes.
- Our findings: using threads can significantly reduce the buffering and orchestration overhead for inter-machine communications.
Related Work
- MPI on network clusters:
  - MPICH: a portable MPI implementation.
  - LAM/MPI: communication through a standalone RPI server.
- Collective communication optimization:
  - SUN-MPI and MPI-StarT: modify the MPICH ADI layer; target SMP clusters.
  - MagPIe: targets SMP clusters connected through a WAN.
- Lower communication layer optimization: MPI-FM and MPI-AM.
- Threaded execution of message passing programs: MPI-Lite, LPVM, TPVM.
Background: MPICH Design
[Figure: MPICH layered design. MPI Collective is implemented over MPI Point-to-Point, which sits on the Abstract Device Interface (ADI). The ADI binds to devices through the Chameleon interface: T3D, SGI, others, and P4, which itself runs over TCP or shared memory (shmem).]
MPICH Communication Structure
[Figure: MPICH communication structure across cluster nodes (WS), without shared memory (left) and with shared memory (right). Legend: WS = a cluster node; MPI node = process; one MPICH daemon process per cluster node; links = inter-process pipes, shared memory, and TCP connections.]
TMPI Communication Structure
[Figure: TMPI communication structure. Legend: WS = a cluster node; MPI node = thread; one TMPI daemon thread per cluster node; links = TCP connections between nodes, plus direct memory access and thread synchronization within a node.]
Comparison of TMPI and MPICH
- Drawbacks of MPICH w/ shared memory:
  - Intra-node communication limited by shared memory size.
  - Busy polling to check messages from either the daemon or a local peer.
  - Cannot do automatic resource clean-up.
- Drawbacks of MPICH w/o shared memory:
  - Big overhead for intra-node communication.
  - Too many daemon processes and open connections.
- Drawback of both MPICH systems:
  - Extra data copying for inter-machine communication.
TMPI Communication Design
[Figure: TMPI communication layers. The MPI communication layer is split into inter-machine (INTER) and intra-machine (INTRA) modules; these rest on an abstract network and thread synchronization interface (NETD and THREAD), which maps to OS facilities: TCP and others for the network, pthreads and other thread implementations for threads.]
Separation of Point-to-Point and Collective Communication Channels
- Observation: MPI point-to-point and collective communication semantics are different (see the table and sketch below).
- Separate channels for point-to-point and collective communication:
  - Eliminates daemon intervention for collective communication.
  - Less effective for MPICH: no sharing of ports among processes.

Point-to-point                        Collective
Unknown source (MPI_ANY_SOURCE)       Determined source (ancestor in the spanning tree)
Out-of-order (message tag)            In-order delivery
Asynchronous (non-blocking receive)   Synchronous
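A sketch of per-node channel state makes the separation concrete (the struct and field names are assumptions for illustration, not TMPI's API): collective messages arrive on their own in-order channel from a known tree ancestor, so they bypass tag matching and MPI_ANY_SOURCE handling entirely.

    struct msg_queue;                 /* opaque queue of messages */

    struct pt2pt_channel {            /* unknown source, out-of-order */
        struct msg_queue *pending;    /* matched by (source, tag) */
    };

    struct coll_channel {             /* source known: tree ancestor */
        struct msg_queue *fifo;       /* strictly in-order delivery */
    };

    struct mpi_node {
        struct pt2pt_channel pt2pt;   /* serves MPI_Send/MPI_Recv */
        struct coll_channel  coll;    /* serves MPI_Bcast/MPI_Reduce */
    };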
Hierarchy-Aware Collective Communication
- Observation: two-level communication hierarchy.
  - Inside an SMP node: shared memory (~10^-8 sec).
  - Between SMP nodes: network (~10^-6 sec).
- Idea: build the communication spanning tree in two steps (sketched in the code below).
  - First, choose a root MPI node on each cluster node and build a spanning tree among all the cluster nodes.
  - Second, all other MPI nodes connect to the local root node.
[Figure: spanning trees for an MPI program with 9 nodes on three cluster nodes; the cluster nodes contain MPI nodes 0-2, 3-5, and 6-8 respectively. Thick edges are network edges. Panels: MPICH (balanced binary tree), MPICH (hypercube), TMPI.]
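The two-step construction can be sketched as follows (a simplified illustration, assuming the node-to-machine mapping is given and using a binary tree among the machine roots; TMPI's actual tree shapes may differ):

    #include <stdio.h>

    #define P 9   /* MPI nodes */
    #define M 3   /* cluster nodes (machines) */
    static const int machine_of[P] = {0,0,0, 1,1,1, 2,2,2};

    int main(void) {
        int root_of[M];   /* root MPI node chosen on each machine */
        int parent[P];    /* resulting spanning tree */

        /* Step 1: pick a root per machine (here: its lowest-ranked
           node) and link the roots into a tree over the machines;
           these root-to-root edges are the only network edges. */
        for (int m = 0; m < M; m++) root_of[m] = -1;
        for (int i = 0; i < P; i++)
            if (root_of[machine_of[i]] < 0) root_of[machine_of[i]] = i;
        for (int m = 0; m < M; m++)
            parent[root_of[m]] = (m == 0) ? -1 : root_of[(m - 1) / 2];

        /* Step 2: every other MPI node attaches to its local root,
           so these edges stay inside shared memory. */
        for (int i = 0; i < P; i++)
            if (i != root_of[machine_of[i]])
                parent[i] = root_of[machine_of[i]];

        for (int i = 0; i < P; i++)
            printf("node %d -> parent %d\n", i, parent[i]);
        return 0;
    }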
Adaptive Buffer Management
- Question: how do we manage temporary buffering of message data when the remote receiver is not ready to accept the data?
- Choices:
  - Send the data with the request: eager push.
  - Send the request only, and send the data when the receiver is ready: three-phase protocol.
  - TMPI: adapt between both methods (see the sketch below).
[Figure: protocol time lines. One-step eager-push protocol (remote node can buffer the message): the sender ships the request and data together; the receiver buffers them and acknowledges. Three-phase protocol (remote node cannot buffer the message): the sender sends the request; the receiver replies when it is ready; the sender then ships the data. A third time line shows graceful degradation from eager-push to the three-phase protocol.]
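The sender-side choice can be sketched as follows (function names and the buffer-space bookkeeping are assumptions for illustration; TMPI's exact adaptation policy is not spelled out on this slide):

    #include <stddef.h>

    /* Hypothetical transport hooks. */
    void send_request(int dst, int tag, size_t len);
    void send_request_with_data(int dst, int tag, const void *buf, size_t len);
    void wait_receiver_ready(int dst, int tag);
    void send_data(int dst, int tag, const void *buf, size_t len);

    /* Sender's running estimate of free buffer space at dst,
       e.g. maintained from acknowledgments (an assumption). */
    size_t remote_buffer_estimate(int dst);

    void adaptive_send(int dst, int tag, const void *buf, size_t len) {
        if (len <= remote_buffer_estimate(dst)) {
            /* Eager push: one step; the receiver buffers the message
               until a matching receive is posted. */
            send_request_with_data(dst, tag, buf, len);
        } else {
            /* Three-phase: request, wait until the receiver is ready,
               then ship the data; no remote buffering is needed. */
            send_request(dst, tag, len);
            wait_receiver_ready(dst, tag);
            send_data(dst, tag, buf, len);
        }
    }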
Experimental Study
- Goal: illustrate the advantage of threaded MPI execution on SMP clusters.
- Hardware setting: a cluster of 6 quad-Xeon 500 MHz SMPs, with 1 GB main memory and 2 Fast Ethernet cards per machine.
- Software setting:
  - OS: RedHat Linux 6.0, kernel version 2.2.15 w/ channel bonding enabled.
  - Process-based MPI system: MPICH 1.2.
  - Thread-based MPI system: TMPI (45 functions in the MPI 1.1 standard).
Inter-Cluster-Node Point-to-Point
Ping-pong, TMPI vs. MPICH w/ shared memory.
[Figure: (a) Ping-Pong Short Message: round-trip time (us) vs. message size (0-1000 bytes); (b) Ping-Pong Long Message: transfer rate (MB/s) vs. message size (0-1000 KB). Curves: TMPI and MPICH.]
Intra-Cluster-Node Point-to-Point
Ping-pong, TMPI vs. MPICH1 (MPICH w/ shared memory) and MPICH2 (MPICH w/o shared memory).
[Figure: (a) Ping-Pong Short Message: round-trip time (us) vs. message size (0-1000 bytes); (b) Ping-Pong Long Message: transfer rate (MB/s) vs. message size (0-1000 KB). Curves: TMPI, MPICH1, MPICH2.]
Collective Communication
- Reduce, Bcast, Allreduce.
- Each entry: TMPI / MPICH_SHM / MPICH_NOSHM, in microseconds.
- Three node distributions (4x1, 1x4, 4x4), three root node settings (same, rotate, combo).

(us)   root    Reduce          Bcast            Allreduce
4x1    same    9/121/4384      10/137/7913      160/175/627
       rotate  33/81/3699      129/91/4238
       combo   25/102/3436     17/32/966
1x4    same    28/1999/1844    21/1610/1551     571/675/775
       rotate  146/1944/1878   164/1774/1834
       combo   167/1977/1854   43/409/392
4x4    same    39/2532/4809    56/2792/10246    736/1412/19914
       rotate  161/1718/8566   216/2204/8036
       combo   141/2242/8515   62/489/2054

(Allreduce takes no root argument, so each distribution has a single Allreduce entry.)

1) MPICH w/o shared memory performs the worst.
2) TMPI is 70+ times faster than MPICH w/ shared memory for MPI_Bcast and MPI_Reduce.
3) For TMPI, the performance of the 4x4 cases is roughly the sum of that of the 4x1 cases and that of the 1x4 cases.
Macro-Benchmark Performance
[Figure: (a) Matrix Multiplication and (b) Gaussian Elimination: MFLOP rate vs. number of MPI nodes. Curves: TMPI and MPICH.]
Conclusions
http://www.cs.ucsb.edu/projects/tmpi/
- Great advantage of threaded MPI execution on SMP clusters:
  - Micro-benchmark: 70+ times faster than MPICH.
  - Macro-benchmark: 100% faster than MPICH.
- Optimization techniques:
  - Separated collective and point-to-point communication channels.
  - Adaptive buffer management.
  - Hierarchy-aware communications.
Background: Safe Execution of MPI Programs Using Threads
- Program transformation: eliminate global and static variables (called permanent variables).
- Thread-specific data (TSD): each thread can associate a pointer-sized data variable with a commonly defined key value (an integer). With the same key, different threads can set/get the values of their own copies of the data variable.
- TSD-based transformation:
  - Each permanent variable declaration is replaced with a key declaration.
  - Each node associates its private copy of the permanent variable with the corresponding key.
  - Where a permanent variable is referenced, the key retrieves the per-thread copy of the variable.
Program Transformation – An Example

Source program:

    int X = 1;

    int f() {
        return X++;
    }

Program after transformation:

    int kX = 0;

    void main_init() {
        if (kX == 0) kX = key_create();
    }

    void user_init() {
        int *pX = malloc(sizeof(int));
        *pX = 1;
        setval(kX, pX);
    }

    int f() {
        int *pX = getval(kX);
        return (*pX)++;
    }
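For readers who want to try the transformation outside TMPI's runtime (key_create/setval/getval above are TMPI-internal), the same example can be written with standard pthreads thread-specific data; this is an equivalent sketch, not the talk's code:

    #include <pthread.h>
    #include <stdlib.h>
    #include <stdio.h>

    static pthread_key_t kX;

    static void main_init(void) {          /* once per program */
        pthread_key_create(&kX, free);     /* free() reclaims copies */
    }

    static void user_init(void) {          /* once per node thread */
        int *pX = malloc(sizeof(int));
        *pX = 1;
        pthread_setspecific(kX, pX);
    }

    static int f(void) {                   /* was: return X++; */
        int *pX = pthread_getspecific(kX);
        return (*pX)++;
    }

    static void *node_main(void *arg) {
        (void)arg;
        user_init();
        int a = f();                       /* each thread sees 1 ... */
        int b = f();                       /* ... then 2 */
        printf("%d %d\n", a, b);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        main_init();
        pthread_create(&t1, NULL, node_main, NULL);
        pthread_create(&t2, NULL, node_main, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }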