Zero Copy MPI Derived Datatype Communication Over InfiniBand
Gopalakrishnan Santhanaraman, Jiesheng Wu, D. K. Panda
Network Based Computing Lab, The Ohio State University

Presentation Layout
• Introduction
• Background and Existing Approaches
• Motivation for New Scatter/Gather (SGRS) Approach
• Design and Implementation Issues
• Performance Evaluation
• Conclusions and Future Work

Introduction
• Non-contiguous data communication is common in scientific applications
  • Decomposition of multi-dimensional volumes, FFT, finite element codes
  • NAS benchmarks, LINPACK
• MPI provides a derived datatype interface to facilitate this kind of data movement
• Current implementations of derived datatypes are not very efficient

Related Work
• Improving datatype processing
• Optimized packing and unpacking procedures
• Taking advantage of network features to improve non-contiguous datatype communication

InfiniBand Overview
• Emerging interconnect based on open standards
• Provides low latency and high bandwidth
• Several novel features
  • RDMA
  • Scatter/Gather
  • Atomic operations
• VAPI – low-level interface (API) over InfiniBand

Our Previous Work
• Different approaches
  • Pack/unpack-based approach
    • Copy on both sides
    • Pipeline packing, network communication, and unpacking
  • Reduced copy
    • RDMA write with gather on sender side
    • RDMA read with scatter on receiver side
  • Zero copy
    • Multiple RDMA writes on sender side (Multi-W scheme)
Jiesheng Wu, Pete Wyckoff, and Dhabaleswar K. Panda. High Performance Implementation of MPI Datatype Communication over InfiniBand. In Int'l Parallel and Distributed Processing Symposium (IPDPS 04), April 2004.

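The difference between the pack/unpack approach and the Multi-W scheme can be illustrated with a small toy model. The sketch below is not MVAPICH code: memory is a Python `bytearray`, a layout is a list of `(offset, length)` segments, and the function names and the returned counters are assumptions made for illustration. It only shows where the host copies happen and how many network operations each approach issues.

```python
# Toy model (hypothetical names): a layout is a list of (offset, length) segments.

def pack_unpack_transfer(src, src_layout, dst, dst_layout):
    """Pack/unpack: copy into a staging buffer, send once, copy out again."""
    # Sender-side host copy: gather the segments into one contiguous buffer.
    packed = b"".join(bytes(src[o:o + n]) for o, n in src_layout)
    # One contiguous network transfer, modelled as handing over the bytes.
    wire = packed
    # Receiver-side host copy: scatter the bytes to their final locations.
    pos = 0
    for o, n in dst_layout:
        dst[o:o + n] = wire[pos:pos + n]
        pos += n
    return {"host_copies": 2, "network_ops": 1}


def multi_w_transfer(src, src_layout, dst, dst_layout):
    """Multi-W: one RDMA-write-like operation per contiguous piece, no staging."""
    ops = 0
    dst_iter = iter(dst_layout)
    d_off, d_len = next(dst_iter)
    for s_off, s_len in src_layout:
        while s_len > 0:
            if d_len == 0:
                d_off, d_len = next(dst_iter)
            # A single write can only cover bytes contiguous on BOTH sides.
            n = min(s_len, d_len)
            dst[d_off:d_off + n] = src[s_off:s_off + n]
            ops += 1
            s_off += n
            s_len -= n
            d_off += n
            d_len -= n
    return {"host_copies": 0, "network_ops": ops}


src = bytearray(range(16))
src_layout = [(0, 2), (4, 2), (8, 2), (12, 2)]   # four 2-byte segments
dst_layout = [(1, 4), (9, 4)]                    # two 4-byte segments

dst_a, dst_b = bytearray(16), bytearray(16)
stats_pack = pack_unpack_transfer(src, src_layout, dst_a, dst_layout)
stats_multi_w = multi_w_transfer(src, src_layout, dst_b, dst_layout)
print(stats_pack, stats_multi_w)
```

Both functions leave the same bytes in the destination. The model makes the trade-off visible: pack/unpack pays two host copies for a single transfer, while Multi-W avoids the copies but issues one operation per contiguous piece.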
Conclusions of Previous Work
• For small messages sent with the eager protocol, segment pack/unpack is best
• For messages in the rendezvous protocol range, zero-copy schemes are beneficial
  • The Multi-W zero-copy scheme was proposed

Limitations of Earlier Approaches
• RDMA write/gather, RDMA read/scatter
  • Need a copy in order to handle non-contiguity on both sides
• Multi-W
  • Performance degrades for a large number of small segments
    • Overhead of a large number of RDMA operations
    • Poor network utilization
• Motivation to explore other zero-copy schemes
• Problem statement: How can we utilize the advanced features provided by modern interconnects like InfiniBand to handle non-contiguous data communication efficiently and overcome the above limitations?

Semantics of send/gather, receive/scatter feature
• Based on send/receive channel semantics
• Handles non-contiguity on both the send and receive sides, which is the most general case
• Implementing datatypes with this feature requires a synchronization phase, so it applies to messages that fall under the rendezvous protocol

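The channel semantics described above can be sketched with a toy model. `ToyQueuePair` below is a hypothetical class written for illustration, not the VAPI or any real verbs interface. It assumes that a send consumes the oldest posted receive descriptor, that the sender's gather list and the receiver's scatter list may differ, and that a send fails if no receive has been posted, which is why a synchronization phase is needed.

```python
from collections import deque


class ToyQueuePair:
    """Toy model of one direction of a send/receive channel (hypothetical)."""

    def __init__(self, recv_memory):
        self.recv_memory = recv_memory
        self.posted_recvs = deque()

    def post_recv(self, scatter_list):
        """Receiver posts a descriptor naming where incoming bytes should land."""
        self.posted_recvs.append(list(scatter_list))

    def post_send(self, send_memory, gather_list):
        """Sender posts a descriptor naming which bytes to send, in order."""
        if not self.posted_recvs:
            # Channel semantics: a matching receive must already be posted.
            raise RuntimeError("no receive descriptor posted")
        scatter_list = self.posted_recvs.popleft()
        # Gather on the send side (done by the adapter, not by a host copy).
        stream = b"".join(bytes(send_memory[o:o + n]) for o, n in gather_list)
        if len(stream) > sum(n for _, n in scatter_list):
            raise ValueError("message longer than the posted scatter list")
        # Scatter on the receive side, following the receiver's own layout.
        pos = 0
        for o, n in scatter_list:
            chunk = stream[pos:pos + n]
            self.recv_memory[o:o + len(chunk)] = chunk
            pos += len(chunk)
        return len(stream)


src = bytearray(b"AAbbCCddEE")
dst = bytearray(b"." * 12)
qp = ToyQueuePair(dst)

qp.post_recv([(1, 3), (6, 3)])                       # receiver: two 3-byte holes
sent = qp.post_send(src, [(0, 2), (4, 2), (8, 2)])   # sender: three 2-byte pieces
print(sent, bytes(dst))
```

One send matched with one receive moves all six bytes even though the two sides use different layouts, which is the non-contiguous-on-both-sides case the slide calls the most general.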
VAPI-Level Comparison: Multi-W vs. SGRS
Observations
• For a fixed number of segments, the SGRS approach outperforms the Multi-W approach across message sizes
• For a fixed message size with increasing degree of non-contiguity:
  • SGRS scheme degradation is negligible
  • Multi-W degradation is significant

MVAPICH Overview
• High-performance implementation of MPI over InfiniBand
• Design based on MPICH and MVICH
• Eager protocol for small messages
• Rendezvous protocol for large messages
• Datatype implementation currently uses the generic packing and unpacking scheme
  • Small datatype messages are packed/unpacked
  • For large datatype messages, both sides allocate pack/unpack buffers dynamically

MVAPICH Software Distribution
• Open source (current version is 0.9.4, released last week)
• Directly downloaded by more than 119 organizations and industry users
• Available in the software stack distributions of IBA vendors

MVAPICH Users
[List of user organizations, spanning two slides; the entries are not recoverable from the extracted text.]

Larger IBA Clusters Using MVAPICH and Top500 Rankings
• 1105-node cluster at Virginia Tech
  • 3rd in Nov. ’03 ranking
• 192-node cluster at Mississippi State University
  • 150th in June ’04 ranking
• 128-node cluster at Sandia/Livermore
  • 111th in Nov. ’03 ranking and 211th in June ’04 ranking
• 256-node cluster at Los Alamos
  • 116th in Nov. ’03 ranking and 218th in June ’04 ranking
• 128-node cluster at Ohio Supercomputer Center (OSC)
  • 272nd in June ’04 ranking
• More are being installed

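Framework for Handling Datatypes
[Figure: layered diagram between the MPI interface and the InfiniBand layer]
• MPI interface
  • Eager protocol (small messages): Pack
  • Rendezvous protocol (large messages): Pipeline, Reduced Copy, Zero Copy
• InfiniBand layer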
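
The mapping in the framework can be written down as a small lookup. This is a minimal sketch under stated assumptions: the threshold value and the function name are hypothetical, and the slides do not say how one scheme is chosen among the rendezvous candidates, so the function only returns the candidate set.

```python
# Hypothetical threshold; the slides do not give the eager/rendezvous cutoff.
EAGER_THRESHOLD_BYTES = 8 * 1024


def candidate_schemes(message_bytes, eager_threshold=EAGER_THRESHOLD_BYTES):
    """Return (protocol, candidate datatype schemes) for a message size."""
    if message_bytes <= eager_threshold:
        # Small messages: eager protocol with packing.
        return "eager", ["pack"]
    # Large messages: rendezvous protocol with several possible schemes.
    return "rendezvous", ["pipeline", "reduced copy", "zero copy"]


for size in (1024, 64 * 1024):
    print(size, candidate_schemes(size))
```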

Basic Idea
[Figure: diagram of the basic idea; labels are not recoverable from the extracted text.]

Design Issues
• Exchanging layout information
  • MPI datatype has only local semantics
  • Optimizing layout exchange
• Layout matching decision needs to be conveyed
• Registration and deregistration of user datatype message buffers
  • Unique issue due to non-contiguity in buffers
• Posting descriptors
  • Upper limit on the number of scatter/gather descriptors
  • Needs a secondary connection for transmitting non-contiguous data

SGRS Communication Protocol
[Figure: message exchange between sender and receiver]
• Sender to receiver: request control message + layout (primary connection)
• Receiver: post scatter
• Receiver to sender: reply control message + decision info (primary connection)
• Sender: post gather
• Sender to receiver: data (second connection)

Layout Exchange and Matching Decision
• Take advantage of the handshake messages in the rendezvous protocol
  • Sender’s datatype layout is appended to the rendezvous start control message
  • The matching decision is conveyed in the rendezvous reply (clear-to-send) message
• A layout cache mechanism is implemented to reduce the overhead of layout transfer
  • Datatype information is exchanged only once
  • Only the index needs to be sent for future messages
  • Datatype cache mechanism proposed by Traff et al.

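The layout cache idea (send the full layout once, then only an index) can be sketched as follows. The class names, the message dictionaries, and the per-peer scope of the cache are all assumptions made for illustration; the slides do not describe the actual data structures.

```python
class SenderLayoutCache:
    """Hypothetical per-peer cache on the sending side."""

    def __init__(self):
        self._index_of = {}

    def start_message(self, layout):
        """Build the rendezvous start payload for a datatype layout."""
        key = tuple(layout)
        if key in self._index_of:
            # Layout already known to the peer: send only its index.
            return {"layout_index": self._index_of[key]}
        index = len(self._index_of)
        self._index_of[key] = index
        # First use: send the index together with the full layout.
        return {"layout_index": index, "layout": list(layout)}


class ReceiverLayoutCache:
    """Hypothetical per-peer cache on the receiving side."""

    def __init__(self):
        self._layouts = {}

    def on_start_message(self, msg):
        """Return the sender's layout, storing it if it arrived in full."""
        if "layout" in msg:
            self._layouts[msg["layout_index"]] = msg["layout"]
        return self._layouts[msg["layout_index"]]


layout = [(0, 4), (16, 4), (32, 4)]
sender, receiver = SenderLayoutCache(), ReceiverLayoutCache()

first = sender.start_message(layout)    # carries the full layout
second = sender.start_message(layout)   # carries only the index
print(first, second)
print(receiver.on_start_message(first) == receiver.on_start_message(second))
```

A real implementation would also have to handle datatypes that are freed and indices that are reused, which this sketch ignores.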
Registration
• Registration and deregistration of user datatype message buffers
  • Common issue in both zero-copy schemes
  • Unique issue due to non-contiguity in buffers
• Use the Optimistic Group Registration scheme
J. Wu, P. Wyckoff, and D. K. Panda. “Supporting Efficient Noncontiguous Access in PVFS over InfiniBand.” IEEE Cluster Computing 2003, Dec. 2003.

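Registering every small segment of a non-contiguous buffer separately would require many registration calls. The slide does not describe how Optimistic Group Registration works, so the sketch below is not that algorithm. It only illustrates the general idea of grouping: registering a few covering regions, gaps included, instead of one region per segment. The function name and the `max_gap` parameter are hypothetical.

```python
def group_regions(segments, max_gap):
    """Merge (offset, length) segments into covering regions.

    Two neighbouring segments end up in the same region when the gap
    between them is at most max_gap bytes (an illustrative policy).
    """
    regions = []
    for off, length in sorted(segments):
        end = off + length
        if regions and off - regions[-1][1] <= max_gap:
            # Close enough: extend the current region across the gap.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([off, end])
    return [(start, end - start) for start, end in regions]


segments = [(0, 4), (16, 4), (32, 4), (4096, 4)]
print(group_regions(segments, max_gap=64))   # fewer, larger regions
print(group_regions(segments, max_gap=0))    # one region per segment
```

Grouping trades fewer registration operations for registering memory in the gaps that is never transferred.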
Posting Descriptors
• Needs a separate queue pair connection
  • Ordering
  • Scalability
• Upper limit on the number of gather/scatter descriptors
  • A message might need to be chopped into multiple gather/scatter descriptors
  • The number of posted gather descriptors must equal the number of posted scatter descriptors
  • Needs a negotiation phase

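The chopping constraint above can be made concrete. The slides do not give the negotiation algorithm, so the following is only one possible way to do it, with hypothetical names. It assumes every segment has a positive length and that each gather descriptor must carry exactly as many bytes as the scatter descriptor it pairs with. Both layouts are cut at common byte positions so that neither side exceeds `max_entries` entries per descriptor.

```python
from bisect import bisect_right
from itertools import accumulate


def _ends(layout):
    """Cumulative end position of each segment in the packed byte stream."""
    return list(accumulate(n for _, n in layout))


def _reach(ends, start, max_entries):
    """Furthest stream position coverable with max_entries entries from start."""
    first = bisect_right(ends, start)
    last = min(first + max_entries, len(ends)) - 1
    return ends[last]


def _slice(layout, ends, start, stop):
    """Entries of layout that cover stream bytes [start, stop)."""
    out = []
    i = bisect_right(ends, start)
    while start < stop:
        off, n = layout[i]
        skip = start - (ends[i] - n)        # bytes of segment i already used
        take = min(n - skip, stop - start)
        out.append((off + skip, take))
        start += take
        i += 1
    return out


def paired_descriptors(gather, scatter, max_entries):
    """Chop both layouts into equally many descriptors of matching byte size."""
    g_ends, s_ends = _ends(gather), _ends(scatter)
    if not g_ends or not s_ends or g_ends[-1] != s_ends[-1]:
        raise ValueError("layouts must describe the same number of bytes")
    pairs, pos, total = [], 0, g_ends[-1]
    while pos < total:
        stop = min(_reach(g_ends, pos, max_entries),
                   _reach(s_ends, pos, max_entries))
        pairs.append((_slice(gather, g_ends, pos, stop),
                      _slice(scatter, s_ends, pos, stop)))
        pos = stop
    return pairs


gather = [(0, 4), (8, 4), (16, 4), (24, 4), (32, 4), (40, 4)]
scatter = [(100, 12), (200, 12)]
pairs = paired_descriptors(gather, scatter, max_entries=2)
for g_desc, s_desc in pairs:
    print(g_desc, "->", s_desc)
```

With a limit of two entries, the six sender segments force three descriptors, so the receiver's two segments are split to produce three matching scatter descriptors.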
Experimental Evaluation
• Experimental test bed
  • Cluster of 8 Supermicro nodes
    • Dual Xeon 3.0 GHz processors
    • 512 KB L2 cache, PCI-X 64-bit 133 MHz bus
    • InfiniHost SDK version 3.0.1
    • 1 GB DDR-SDRAM physical memory
• Experiments conducted
  • Latency and bandwidth with vector datatype
  • Collective latency (MPI_Alltoall)
  • CPU overhead tests
  • Impact of layout cache

Vector Datatype Test
• A vector test: multiple columns of a 64x4096 integer array

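The memory layout behind this test can be worked out with a short sketch. It assumes row-major (C-style) storage and 4-byte integers, neither of which the slide states, and the function name is hypothetical. Under those assumptions, sending `num_cols` adjacent columns yields one contiguous block per row, which corresponds to what `MPI_Type_vector` would describe with a count of 64, a block length of `num_cols`, and a stride of 4096 elements.

```python
INT_BYTES = 4            # assumption: 4-byte integers
ROWS, COLS = 64, 4096    # array shape from the slide


def column_block_layout(first_col, num_cols, rows=ROWS, cols=COLS, elem=INT_BYTES):
    """(byte offset, byte length) of each row's piece, row-major storage."""
    return [((r * cols + first_col) * elem, num_cols * elem) for r in range(rows)]


for num_cols in (8, 2048):
    layout = column_block_layout(0, num_cols)
    total = sum(length for _, length in layout)
    stride = layout[1][0] - layout[0][0]
    print(f"{num_cols} columns: {len(layout)} segments of {layout[0][1]} bytes, "
          f"stride {stride} bytes, message size {total} bytes")
```

Under these assumptions, 8 columns give a 2 KB message and 2048 columns give a 512 KB message, each split into 64 segments.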
MPI Level Vector Latency
[Figure, two panels: latency (usec) vs. message size (2k to 512k bytes). One panel compares SGRS-128, Multi-W-128, Generic-128, and Contiguous; the other compares SGRS-64, MultiW-64, Generic-64, and Contiguous.]
• SGRS scheme reduces latency by up to 62% compared to Multi-W

MPI Level Vector Bandwidth
[Figure: bandwidth (megabytes/sec) vs. message size (2k to 512k bytes) for SGRS-64, SGRS-128, MultiW-64, MultiW-128, Contiguous, Generic-128, and Generic-64.]
• SGRS scheme gives the best performance
• For large messages, the bandwidth is close to that of contiguous transfers

CPU Overhead
[Figure, two panels (sender-side overhead and receiver-side overhead): CPU overhead (usec) vs. message size (2k to 512k bytes) for MultiW-64, MultiW-128, SGRS-64, and SGRS-128 segments.]
• The CPU overhead associated with the SGRS protocol is relatively low

MPI_Alltoall Latency
• The Alltoall latency test shows significant improvement for the SGRS approach
[Figure: latency (usec) vs. message size (4k to 512k bytes) for Multi-W-64, Multi-W-128, SGRS-64, and SGRS-128.]

Synthetic Benchmark to Measure Impact of Layout Caching
• Need to transfer the two diagonals of a square matrix
• Diagonal elements are actually blocks
• A significant layout size is needed to describe them

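A short sketch shows why this datatype needs a large layout description. It assumes a row-major matrix of fixed-size blocks and a hypothetical flat encoding of 8 bytes per layout entry (for example, an offset and a length). The actual encoding used in the benchmark is not given in the slides and may be more compact.

```python
ENTRY_BYTES = 8  # hypothetical size of one (offset, length) layout entry


def two_diagonals_layout(n, block_bytes):
    """(byte offset, byte length) of every block on both diagonals."""
    segments = []
    for row in range(n):
        # Main diagonal and anti-diagonal; they coincide at the centre of odd n.
        for col in sorted({row, n - 1 - row}):
            segments.append(((row * n + col) * block_bytes, block_bytes))
    return segments


def layout_and_data_bytes(n, block_bytes, entry_bytes=ENTRY_BYTES):
    """Bytes needed to describe the layout vs. bytes of actual data."""
    layout = two_diagonals_layout(n, block_bytes)
    data_bytes = sum(length for _, length in layout)
    return len(layout) * entry_bytes, data_bytes


for block_bytes in (4, 8, 16):
    layout_bytes, data_bytes = layout_and_data_bytes(500, block_bytes)
    print(f"block size {block_bytes}: layout {layout_bytes} bytes, "
          f"data {data_bytes} bytes")
```

Because no two diagonal blocks are adjacent in memory, every block needs its own entry. With small blocks, the description under this encoding is as large as or larger than the data, which is the situation where sending the layout only once pays off.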
Effect of Layout Cache
[Figure: percentage of overhead vs. number of blocks (500 to 2000) for block sizes of 4, 8, and 16 bytes.]
• Layout cache shows benefits in certain scenarios
• The layout itself is contiguous, unlike the data that it describes

Conclusions and Future Work
• Provided a new zero-copy scheme for datatype communication over InfiniBand
• The new scheme outperforms the existing schemes
  • Latency can be improved by up to 62%
  • Bandwidth can be increased by up to 400%
  • Collective communication such as Alltoall can potentially benefit
  • Layout cache is shown to be beneficial in some scenarios
• Future work
  • Evaluate the effectiveness of this scheme at the application level
  • Provide a comprehensive solution that internally uses multiple schemes to achieve the best performance
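
The two headline percentages can be read as ratios. The arithmetic below assumes the conventional definitions of relative latency reduction and relative bandwidth increase. The latency baseline is Multi-W, as stated on the latency slide; the slides do not name the baseline for the bandwidth figure, so the symbols $B_{\text{old}}$ and $B_{\text{new}}$ are placeholders.

```latex
\[
\frac{T_{\text{Multi-W}} - T_{\text{SGRS}}}{T_{\text{Multi-W}}} = 0.62
\;\Longrightarrow\;
T_{\text{SGRS}} = 0.38\,T_{\text{Multi-W}},
\qquad
\frac{T_{\text{Multi-W}}}{T_{\text{SGRS}}} = \frac{1}{0.38} \approx 2.6
\]
\[
\frac{B_{\text{new}} - B_{\text{old}}}{B_{\text{old}}} = 4.00
\;\Longrightarrow\;
B_{\text{new}} = 5\,B_{\text{old}}
\]
```

Under these definitions, a 62% latency reduction is roughly a 2.6x speedup, and a 400% bandwidth increase is a fivefold bandwidth.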

Thank You!
NBC Home Page

BACKUP SLIDES

VAPI-Level Bandwidth Comparison: SGRS vs. Multi-W
• SGRS scheme consistently outperforms Multi-W
[Figure: bandwidth (megabytes/sec) vs. message size (2k to 512k bytes) for Multi-W and SGRS.]

Effect of Degree of Non-Contiguity
• SGRS scheme fares better with increased non-contiguity
[Figure: bandwidth (megabytes/sec) vs. number of blocks (4 to 64) for Multi-W and SGRS at message sizes of 128k, 256k, and 512k.]