The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes
Daniel L. Ly1, Manuel Saldaña2 and Paul Chow1
1 Department of Electrical and Computer Engineering, University of Toronto
2 Arches Computing Systems, Toronto, Canada
Outline
• Background and Motivation
• Embedded Processor-Based Optimizations
• Hardware Engine-Based Optimizations
• Conclusions and Future Work
Ly D, Saldaña M, Chow P. FPT 2009
Motivation
• Message Passing Interface (MPI) is a programming model for distributed memory systems
• Popular in high performance computing (HPC), cluster-based systems
Problem: sum of the numbers from 1 to 100
[Diagram: Processor 1 and Processor 2, each with its own local memory]

for (i = 1; i <= 100; i++)
    sum += i;
Processor 1:
    sum1 = 0;
    for (i = 1; i <= 50; i++)
        sum1 += i;
    MPI_Recv(sum2, ...);
    sum = sum1 + sum2;

Processor 2:
    sum1 = 0;
    for (i = 51; i <= 100; i++)
        sum1 += i;
    MPI_Send(sum1, ...);
Motivation
• Strong interest in adapting MPI for embedded designs:
    – Increasingly difficult to interface heterogeneous resources as FPGA chip size increases
• MPI provides key benefits:
    – Unified protocol
    – Low weight and overhead
    – Abstraction of end points (ranks)
    – Easy prototyping
Motivation

Property               HPC Cluster           Embedded FPGA
Processor
  Clock rate           2-3 GHz               100-200 MHz
Memory
  Size per node        > 1 GB                1-20 MB
Interconnect
  Protocol robustness  High                  None
  Latency              10 μs (20k cycles)    100 ns (10 cycles)
  Bandwidth            125 MB/s              400-800 MB/s
Components
  Processing nodes     Homogeneous           Heterogeneous
Motivation
• Interaction classes arising from heterogeneous designs:
    – Class I: Software-software interactions
        • Collections of embedded processors
        • Thoroughly investigated; will not be discussed
    – Class II: Software-hardware interactions
        • Embedded processors with hardware engines
        • Large variety in processing speed
    – Class III: Hardware-hardware interactions
        • Collections of hardware engines
        • Hardware engines are capable of significant concurrency compared to processors
Background
• Work builds on TMD-MPI [1]
    – Subset implementation of the MPI standard
    – Allows hardware engines to be part of the message-passing network
    – Ported to Amirix PCI, BEE2, BEE3, Xilinx ACP
    – Software libraries for MicroBlaze, PowerPC, Intel x86
[1] M. Saldaña et al., “MPI as an abstraction for software-hardware interaction for HPRCs,” HPRCTA, Nov. 2008.
Class II: Processor-based Optimizations
• Background
• Direct Memory Access MPI Hardware Engine
• Non-Interrupting, Non-Blocking Functions
• Series of MPI Messages
• Results and Analysis
Class II: Processor-based Optimizations
Background
• Problem 1
    – Standard message paradigm for HPC systems
        • Plentiful memory but high message latency
        • Favours combining data into a few large messages, which are stored in memory and retrieved as needed
    – Embedded designs offer a different trade-off
        • Little memory but short message latency
        • A 'just-in-time' paradigm is preferred: send just enough data for one unit of computation, on demand
• Problem 2
    – Homogeneity of HPC systems
        • Each rank has similar processing capabilities
    – Heterogeneity of FPGA systems
        • Hardware engines are tailored for a specific set of functions: extremely fast processing
        • Embedded processors play the vital role of control and memory distribution: little processing
• 'Just-in-time' + heterogeneity = producer-consumer model
    – Processors produce messages for hardware engines to consume
    – Generally, the message production rate of the processor is the limiting factor
Class II: Processor-based Optimizations
Direct Memory Access MPI Engine
• Typical MPI implementations use only software
• A DMA engine offloads the time-consuming part of messaging: memory transfers
    – Frees the processor to continue execution
    – Can implement burst memory transactions
    – Time required to prepare a message is independent of message length
    – Allows messages to be queued
MPI_Send(...)
1. Processor writes 4 words:
    • destination rank
    • address of data buffer
    • message size
    • message tag
2. PLB_MPE decodes the message header
3. PLB_MPE transfers the data from memory
MPI_Recv(...)
1. Processor writes 4 words:
    • source rank
    • address of data buffer
    • message size
    • message tag
2. PLB_MPE decodes the message header
3. PLB_MPE transfers the data to memory
4. PLB_MPE notifies the processor
• The DMA engine is completely transparent to the user
    – Exactly the same MPI functions are called
    – DMA setup is handled by the implementation
Class II: Processor-based Optimizations
Non-Interrupting, Non-Blocking Functions
• Two types of MPI message functions
    – Blocking functions: return only when the buffer can be safely reused
    – Non-blocking functions: return immediately
        • A request handle is required so the message status can be checked later
• Non-blocking functions are used to overlap communication and computation
• Typical HPC non-blocking use case:

MPI_Request request;
...
MPI_Isend(..., &request);
prepare_computation();
MPI_Wait(&request, ...);
finish_computation();
• Class II interactions have a different use case
    – Hardware engines are responsible for computation
    – Embedded processors only need to send messages as fast as possible
• DMA hardware allows messages to be queued
• 'Fire-and-forget' message model
    – Message status is not important
    – Request handles are serviced by expensive interrupts
• The standard MPI protocol provides a mechanism for 'fire-and-forget':

MPI_Request request_dummy;
...
MPI_Isend(..., &request_dummy);
MPI_Request_free(&request_dummy);
• The standard implementation still incurs overhead:
    – Setting up the interrupt
    – Removing the interrupt
    – Extra function-call overhead
    – Memory space for the MPI_Request data structure
• For the 'just-in-time' message model on embedded processors, these overheads create a bottleneck
• Proposed modification to the MPI protocol:

#define MPI_REQUEST_NULL NULL
...
MPI_Isend(..., MPI_REQUEST_NULL);

• Non-blocking functions check that the request pointer is valid before setting up interrupts
• Circumvents the overhead
• Not standard, but a minor modification that works well for embedded processors with DMA
Class II: Processor-based Optimizations
Series of Messages – MPI_Coalesce()
• MPI message without DMA
[Timeline: MPI_Send() call → transfer of the data words (many, for a long message) → return]
Legend: Non-MPI Code | Function Preamble/Postamble | MPI Function Code
• MPI message with DMA
[Timeline: MPI_Send() call → transfer of four words, regardless of message length → return]
[Time breakdown with DMA: 55.6% non-MPI code, 28.7% function preamble/postamble, 15.6% MPI function code; messaging overhead 28.7% + 15.6% = 44.3%]
• MPI message with DMA: message queueing
[Timeline: msg 1, msg 2, msg 3 queued back-to-back]
• Inline all MPI functions?
    – Increases program length!
• Standard MPI functions:

void *msg_buf;
int msg_size;
...
MPI_Isend(msg_buf, msg_size, ...);
MPI_Irecv(msg_buf, msg_size, ...);
void MPI_Coalesce(
    // MPI_Coalesce-specific arguments
    MPI_Function *mpi_fn,
    int mpi_fn_count,
    // Arrays of point-to-point MPI function arguments
    void **msg_buf,
    int *msg_size,
    ...
) {
    for (int i = 0; i < mpi_fn_count; i++) {
        if (mpi_fn[i] == MPI_Isend)
            inline MPI_Isend(msg_buf[i], msg_size[i], ...);
        else if (mpi_fn[i] == MPI_Irecv)
            inline MPI_Irecv(msg_buf[i], msg_size[i], ...);
    }
}
• MPI_Coalesce
[Timeline: a single preamble/postamble wraps a for loop that issues msg 1, msg 2, msg 3]
• MPI_Coalesce is not part of the MPI Standard
• Its behaviour can be easily reproduced, even when source code is not available
• Maintains compatibility with MPI code
Class II: Processor-based Optimizations
Results
• Application: Restricted Boltzmann Machines [2]
    – Neural-network FPGA implementation
    – Platform: Berkeley Emulation Engine 2 (BEE2)
        • Five Xilinx Virtex-II Pro XC2VP70 FPGAs
        • Inter-FPGA communication:
            – Latency: 6 cycles
            – Bandwidth: 1.73 GB/s

[2] D. Ly et al., "A Multi-FPGA Architecture for Restricted Boltzmann Machines," FPL, Sept. 2009.
Message #   Source   Destination   Size [# of words]
1           R0       R1            0
2           R0       R1            3
3           R0       R6            0
4           R0       R6            3
5           R0       R11           0
6           R0       R11           3
7           R0       R16           0
8           R0       R16           3
9           R0       R1            4
10          R0       R6            4
11          R0       R11           4
12          R0       R16           4
[Chart: speedups of 2.33x, 3.94x, and 5.32x as the successive optimizations are applied]
Class III: Hardware-based Optimizations
• Background
• Dataflow Message Passing Model
    – Case Study: Vector Addition
Class III: Hardware-based Optimizations
Background
• Processor-based software model
    – Function calls are atomic
    – Program flow is quantized in message-function units
    – Cannot execute communication and computation simultaneously
• Hardware engines
    – Significantly more parallelism
    – Communication and computation can be simultaneous
Class III: Hardware-based Optimizations
Dataflow Message Passing Model
• Standard message-processing model:

MPI_Recv(...);
compute();
MPI_Send(...);

• Hardware uses a dataflow model
Class III: Hardware-based Optimizations
Case Study: Vector Addition
• Vector addition: vc = va + vb (element-wise: vc,i = va,i + vb,i)
• va comes from Rank 1, vb comes from Rank 2
• Compute vc and send the result back to Ranks 1 and 2
• Software model:

int va[N], vb[N], vc[N];

MPI_Recv(va, N, MPI_INT, rank1, ...);
MPI_Recv(vb, N, MPI_INT, rank2, ...);

for (int i = 0; i < N; i++)
    vc[i] = va[i] + vb[i];

MPI_Send(vc, N, MPI_INT, rank1, ...);
MPI_Send(vc, N, MPI_INT, rank2, ...);
• Message transfers are atomic
    – Serializes computation and communication
• Vector addition has great data locality
    – The entire message is not required for computation
    – Only one element of each vector is required
• Finer granularity is required
    – A hardware dataflow approach would use pipelined computation
• Natural extension of MPI for hardware designers
    – Increased granularity → increased performance
    – Supports pipelining
• A single processing element represents multiple ranks
    – Capable of transferring data from multiple sources
    – Supports data streaming
        • Full-duplex data transfer
Conclusion and Future Work
• MPI can be very effective for FPGA designs
    – FPGAs have different trade-offs than HPC systems
• Considerations for MPI on FPGAs
    – Class II: DMA, non-interrupting non-blocking functions, MPI_Coalesce()
    – Class III: Dataflow Message Passing Model
• Attempts to maintain compatibility with the MPI standard
    – Some incremental optimizations do not comply
    – But they can be reduced to legitimate MPI code
• Shows the limit of where the current MPI standard applies
• Future work: message passing using fine-grained parallelism
Thank you
Hardware Debugging Interfaces
• Background
• Tee Cores
• Message Watchdog Timers
Hardware Debugging Interfaces
Background
• Code compatibility allows traditional, software-only MPI debugging
• Porting to FPGA designs can still produce errors
    – Improper on-chip network setup
    – Message-passing flaws in hardware cores
• Hardware has limited visibility
    – No debuggers
    – No standard output/printf()
Hardware Debugging Interfaces
Tee Cores
• Networks typically consist of point-to-point FIFOs between MPI cores
• Tee Cores:
[Diagram: a Tee core inserted on the FIFO link between two MPI cores, with a tap feeding a debug processor]
• Transparent: does not affect original network performance
• Allows direct tracing of the data-link layer
    – Simple communication protocols
    – Easy to follow message transmissions
Hardware Debugging Interfaces
Message Watchdog Timers
• Unresponsive embedded systems cannot otherwise be recovered
• Message watchdog timers are integrated with the MPI implementation source code
    – Snoop incoming messages in a transparent manner
    – If there is no activity and the timer expires, the processor is interrupted and control is returned
• Excellent for post-mortem analysis
    – Connect with Tee Cores for a terse debugging report