Optimizing Threaded MPI Execution on SMP Clusters
Hong Tang and Tao Yang
Department of Computer Science
University of California, Santa Barbara
Parallel Computation on SMP Clusters
- Massively Parallel Machines → SMP Clusters
- Commodity components: off-the-shelf processors + fast network (Myrinet, Fast/Gigabit Ethernet)
- Parallel programming models for SMP clusters:
  - MPI: portability, performance, legacy programs
  - MPI + variations: MPI + multithreading, MPI + OpenMP
Threaded MPI Execution
- MPI paradigm: separated address spaces for different MPI nodes.
- Natural solution: MPI nodes → processes.
- What if we map MPI nodes to threads?
  - Faster synchronization among MPI nodes running on the same machine.
  - Demonstrated in previous work [PPoPP '99] for a single shared memory machine (developed techniques to safely execute MPI programs using threads).
- Threaded MPI execution on SMP clusters:
  - Intra-machine communication through shared memory.
  - Inter-machine communication through the network.
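For concreteness, here is a minimal sketch of the thread mapping (not TMPI's actual code; node_main and the rank-passing scheme are illustrative): each MPI node becomes a pthread inside one process, so all nodes on a machine share an address space.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_NODES 4

    /* Illustrative per-node entry point; a threaded MPI runtime would
       run the transformed user program here with the given rank. */
    static void *node_main(void *arg) {
        long rank = (long)arg;
        printf("MPI node %ld running as a thread\n", rank);
        /* ... user code; intra-machine messages can be handed over
           through shared memory instead of inter-process IPC ... */
        return NULL;
    }

    int main(void) {
        pthread_t nodes[NUM_NODES];
        for (long i = 0; i < NUM_NODES; i++)
            pthread_create(&nodes[i], NULL, node_main, (void *)i);
        for (int i = 0; i < NUM_NODES; i++)
            pthread_join(nodes[i], NULL);
        return 0;
    }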
Threaded MPI Execution Benefits Inter-Machine Communication
- Common intuition: inter-machine communication cost is dominated by network delay, so the advantage of executing MPI nodes as threads diminishes.
- Our findings: using threads can significantly reduce the buffering and orchestration overhead for inter-machine communications.
Related Work
- MPI on network clusters:
  - MPICH: a portable MPI implementation.
  - LAM/MPI: communication through a standalone RPI server.
- Collective communication optimization:
  - SUN-MPI and MPI-StarT: modify the MPICH ADI layer; target SMP clusters.
  - MagPIe: targets SMP clusters connected through a WAN.
- Lower communication layer optimization: MPI-FM and MPI-AM.
- Threaded execution of message passing programs: MPI-Lite, LPVM, TPVM.
Background: MPICH Design
[Figure: MPICH layered design. MPI Collective is implemented over MPI Point-to-Point, which sits on the Abstract Device Interface (ADI). The ADI binds to devices through the Chameleon interface: T3D, SGI, others, and P4, which itself runs over TCP or shared memory (shmem).]
MPICH Communication Structure
[Figure: MPICH communication structure across cluster nodes (WS), without shared memory (left) and with shared memory (right). Legend: WS = a cluster node; MPI node = process; one MPICH daemon process per cluster node; links = inter-process pipes, shared memory, and TCP connections.]
TMPI Communication Structure
[Figure: TMPI communication structure. Legend: WS = a cluster node; MPI node = thread; one TMPI daemon thread per cluster node; links = TCP connections between nodes, plus direct memory access and thread synchronization within a node.]
Comparison of TMPI and MPICH
- Drawbacks of MPICH w/ shared memory:
  - Intra-node communication limited by shared memory size.
  - Busy polling to check messages from either the daemon or a local peer.
  - Cannot do automatic resource clean-up.
- Drawbacks of MPICH w/o shared memory:
  - Big overhead for intra-node communication.
  - Too many daemon processes and open connections.
- Drawback of both MPICH systems:
  - Extra data copying for inter-machine communication.
TMPI Communication Design
[Figure: TMPI communication layers. The MPI communication layer is split into inter-machine (INTER) and intra-machine (INTRA) modules; these rest on an abstract network and thread synchronization interface (NETD and THREAD), which maps to OS facilities: TCP and others for the network, pthreads and other thread implementations for threads.]
Separation of Point-to-Point and Collective Communication Channels
- Observation: MPI point-to-point and collective communication semantics are different (see the table and sketch below).
- Separate channels for point-to-point and collective communication:
  - Eliminates daemon intervention for collective communication.
  - Less effective for MPICH: no sharing of ports among processes.

Point-to-point                        Collective
Unknown source (MPI_ANY_SOURCE)       Determined source (ancestor in the spanning tree)
Out-of-order (message tag)            In-order delivery
Asynchronous (non-blocking receive)   Synchronous
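A sketch of per-node channel state makes the separation concrete (the struct and field names are assumptions for illustration, not TMPI's API): collective messages arrive on their own in-order channel from a known tree ancestor, so they bypass tag matching and MPI_ANY_SOURCE handling entirely.

    struct msg_queue;                 /* opaque queue of messages */

    struct pt2pt_channel {            /* unknown source, out-of-order */
        struct msg_queue *pending;    /* matched by (source, tag) */
    };

    struct coll_channel {             /* source known: tree ancestor */
        struct msg_queue *fifo;       /* strictly in-order delivery */
    };

    struct mpi_node {
        struct pt2pt_channel pt2pt;   /* serves MPI_Send/MPI_Recv */
        struct coll_channel  coll;    /* serves MPI_Bcast/MPI_Reduce */
    };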
Hierarchy-Aware Collective Communication
- Observation: two-level communication hierarchy.
  - Inside an SMP node: shared memory (~10^-8 sec).
  - Between SMP nodes: network (~10^-6 sec).
- Idea: build the communication spanning tree in two steps (sketched in the code below).
  - First, choose a root MPI node on each cluster node and build a spanning tree among all the cluster nodes.
  - Second, all other MPI nodes connect to the local root node.
[Figure: spanning trees for an MPI program with 9 nodes on three cluster nodes; the cluster nodes contain MPI nodes 0-2, 3-5, and 6-8 respectively. Thick edges are network edges. Panels: MPICH (balanced binary tree), MPICH (hypercube), TMPI.]
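The two-step construction can be sketched as follows (a simplified illustration, assuming the node-to-machine mapping is given and using a binary tree among the machine roots; TMPI's actual tree shapes may differ):

    #include <stdio.h>

    #define P 9   /* MPI nodes */
    #define M 3   /* cluster nodes (machines) */
    static const int machine_of[P] = {0,0,0, 1,1,1, 2,2,2};

    int main(void) {
        int root_of[M];   /* root MPI node chosen on each machine */
        int parent[P];    /* resulting spanning tree */

        /* Step 1: pick a root per machine (here: its lowest-ranked
           node) and link the roots into a tree over the machines;
           these root-to-root edges are the only network edges. */
        for (int m = 0; m < M; m++) root_of[m] = -1;
        for (int i = 0; i < P; i++)
            if (root_of[machine_of[i]] < 0) root_of[machine_of[i]] = i;
        for (int m = 0; m < M; m++)
            parent[root_of[m]] = (m == 0) ? -1 : root_of[(m - 1) / 2];

        /* Step 2: every other MPI node attaches to its local root,
           so these edges stay inside shared memory. */
        for (int i = 0; i < P; i++)
            if (i != root_of[machine_of[i]])
                parent[i] = root_of[machine_of[i]];

        for (int i = 0; i < P; i++)
            printf("node %d -> parent %d\n", i, parent[i]);
        return 0;
    }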
Adaptive Buffer Management
- Question: how do we manage temporary buffering of message data when the remote receiver is not ready to accept the data?
- Choices:
  - Send the data with the request: eager push.
  - Send the request only, and send the data when the receiver is ready: three-phase protocol.
  - TMPI: adapt between both methods (see the sketch below).
[Figure: protocol time lines. One-step eager-push protocol (remote node can buffer the message): the sender ships the request and data together; the receiver buffers them and acknowledges. Three-phase protocol (remote node cannot buffer the message): the sender sends the request; the receiver replies when it is ready; the sender then ships the data. A third time line shows graceful degradation from eager-push to the three-phase protocol.]
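The sender-side choice can be sketched as follows (function names and the buffer-space bookkeeping are assumptions for illustration; TMPI's exact adaptation policy is not spelled out on this slide):

    #include <stddef.h>

    /* Hypothetical transport hooks. */
    void send_request(int dst, int tag, size_t len);
    void send_request_with_data(int dst, int tag, const void *buf, size_t len);
    void wait_receiver_ready(int dst, int tag);
    void send_data(int dst, int tag, const void *buf, size_t len);

    /* Sender's running estimate of free buffer space at dst,
       e.g. maintained from acknowledgments (an assumption). */
    size_t remote_buffer_estimate(int dst);

    void adaptive_send(int dst, int tag, const void *buf, size_t len) {
        if (len <= remote_buffer_estimate(dst)) {
            /* Eager push: one step; the receiver buffers the message
               until a matching receive is posted. */
            send_request_with_data(dst, tag, buf, len);
        } else {
            /* Three-phase: request, wait until the receiver is ready,
               then ship the data; no remote buffering is needed. */
            send_request(dst, tag, len);
            wait_receiver_ready(dst, tag);
            send_data(dst, tag, buf, len);
        }
    }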
Experimental Study
- Goal: illustrate the advantage of threaded MPI execution on SMP clusters.
- Hardware setting: a cluster of 6 quad-Xeon 500 MHz SMPs, with 1 GB main memory and 2 Fast Ethernet cards per machine.
- Software setting:
  - OS: RedHat Linux 6.0, kernel version 2.2.15 w/ channel bonding enabled.
  - Process-based MPI system: MPICH 1.2.
  - Thread-based MPI system: TMPI (45 functions in the MPI 1.1 standard).
Inter-Cluster-Node Point-to-Point
Ping-pong, TMPI vs. MPICH w/ shared memory.
[Figure: (a) Ping-Pong Short Message: round-trip time (us) vs. message size (0-1000 bytes); (b) Ping-Pong Long Message: transfer rate (MB/s) vs. message size (0-1000 KB). Curves: TMPI and MPICH.]
Intra-Cluster-Node Point-to-Point
Ping-pong, TMPI vs. MPICH1 (MPICH w/ shared memory) and MPICH2 (MPICH w/o shared memory).
[Figure: (a) Ping-Pong Short Message: round-trip time (us) vs. message size (0-1000 bytes); (b) Ping-Pong Long Message: transfer rate (MB/s) vs. message size (0-1000 KB). Curves: TMPI, MPICH1, MPICH2.]
Collective Communication
- Reduce, Bcast, Allreduce.
- Each entry: TMPI / MPICH_SHM / MPICH_NOSHM, in microseconds.
- Three node distributions (4x1, 1x4, 4x4), three root node settings (same, rotate, combo).

(us)   root    Reduce          Bcast            Allreduce
4x1    same    9/121/4384      10/137/7913      160/175/627
       rotate  33/81/3699      129/91/4238
       combo   25/102/3436     17/32/966
1x4    same    28/1999/1844    21/1610/1551     571/675/775
       rotate  146/1944/1878   164/1774/1834
       combo   167/1977/1854   43/409/392
4x4    same    39/2532/4809    56/2792/10246    736/1412/19914
       rotate  161/1718/8566   216/2204/8036
       combo   141/2242/8515   62/489/2054

(Allreduce takes no root argument, so each distribution has a single Allreduce entry.)

1) MPICH w/o shared memory performs the worst.
2) TMPI is 70+ times faster than MPICH w/ shared memory for MPI_Bcast and MPI_Reduce.
3) For TMPI, the performance of the 4x4 cases is roughly the sum of that of the 4x1 cases and that of the 1x4 cases.
Macro-Benchmark Performance
[Figure: (a) Matrix Multiplication and (b) Gaussian Elimination: MFLOP rate vs. number of MPI nodes. Curves: TMPI and MPICH.]
Conclusions
http://www.cs.ucsb.edu/projects/tmpi/
- Great advantage of threaded MPI execution on SMP clusters:
  - Micro-benchmark: 70+ times faster than MPICH.
  - Macro-benchmark: 100% faster than MPICH.
- Optimization techniques:
  - Separated collective and point-to-point communication channels.
  - Adaptive buffer management.
  - Hierarchy-aware communications.
Background: Safe Execution of MPI Programs Using Threads
- Program transformation: eliminate global and static variables (called permanent variables).
- Thread-specific data (TSD): each thread can associate a pointer-sized data variable with a commonly defined key value (an integer). With the same key, different threads can set/get the values of their own copies of the data variable.
- TSD-based transformation:
  - Each permanent variable declaration is replaced with a key declaration.
  - Each node associates its private copy of the permanent variable with the corresponding key.
  - Where a permanent variable is referenced, the key retrieves the per-thread copy of the variable.
Program Transformation – An Example

Source program:

    int X = 1;

    int f() {
        return X++;
    }

Program after transformation:

    int kX = 0;

    void main_init() {
        if (kX == 0) kX = key_create();
    }

    void user_init() {
        int *pX = malloc(sizeof(int));
        *pX = 1;
        setval(kX, pX);
    }

    int f() {
        int *pX = getval(kX);
        return (*pX)++;
    }
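For readers who want to try the transformation outside TMPI's runtime (key_create/setval/getval above are TMPI-internal), the same example can be written with standard pthreads thread-specific data; this is an equivalent sketch, not the talk's code:

    #include <pthread.h>
    #include <stdlib.h>
    #include <stdio.h>

    static pthread_key_t kX;

    static void main_init(void) {          /* once per program */
        pthread_key_create(&kX, free);     /* free() reclaims copies */
    }

    static void user_init(void) {          /* once per node thread */
        int *pX = malloc(sizeof(int));
        *pX = 1;
        pthread_setspecific(kX, pX);
    }

    static int f(void) {                   /* was: return X++; */
        int *pX = pthread_getspecific(kX);
        return (*pX)++;
    }

    static void *node_main(void *arg) {
        (void)arg;
        user_init();
        int a = f();                       /* each thread sees 1 ... */
        int b = f();                       /* ... then 2 */
        printf("%d %d\n", a, b);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        main_init();
        pthread_create(&t1, NULL, node_main, NULL);
        pthread_create(&t2, NULL, node_main, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }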