The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes
Daniel L. Ly1, Manuel Saldaña2 and Paul Chow1
1 Department of Electrical and Computer Engineering, University of Toronto
2 Arches Computing Systems, Toronto, Canada
Outline
• Background and Motivation
• Embedded Processor-Based Optimizations
• Hardware Engine-Based Optimizations
• Conclusions and Future Work
Ly D, Saldaña M, Chow P. FPT 2009
Motivation
• Message Passing Interface (MPI) is a programming model for distributed memory systems
• Popular in high performance computing (HPC), cluster-based systems
Problem: sum of the numbers from 1 to 100
[Diagram: Processor 1 and Processor 2, each with its own local memory]

for (i = 1; i <= 100; i++)
    sum += i;
Processor 1:
    sum1 = 0;
    for (i = 1; i <= 50; i++)
        sum1 += i;
    MPI_Recv(sum2, ...);
    sum = sum1 + sum2;

Processor 2:
    sum1 = 0;
    for (i = 51; i <= 100; i++)
        sum1 += i;
    MPI_Send(sum1, ...);
Motivation
• Strong interest in adapting MPI for embedded designs:
    – Increasingly difficult to interface heterogeneous resources as FPGA chip size increases
• MPI provides key benefits:
    – Unified protocol
    – Low weight and overhead
    – Abstraction of end points (ranks)
    – Easy prototyping
Motivation

Property               HPC Cluster           Embedded FPGA
Processor
  Clock rate           2-3 GHz               100-200 MHz
Memory
  Size per node        > 1 GB                1-20 MB
Interconnect
  Protocol robustness  High                  None
  Latency              10 μs (20k cycles)    100 ns (10 cycles)
  Bandwidth            125 MB/s              400-800 MB/s
Components
  Processing nodes     Homogeneous           Heterogeneous
Motivation
• Interaction classes arising from heterogeneous designs:
    – Class I: Software-software interactions
        • Collections of embedded processors
        • Thoroughly investigated; will not be discussed
    – Class II: Software-hardware interactions
        • Embedded processors with hardware engines
        • Large variety in processing speed
    – Class III: Hardware-hardware interactions
        • Collections of hardware engines
        • Hardware engines are capable of significant concurrency compared to processors
Background
• Work builds on TMD-MPI [1]
    – Subset implementation of the MPI standard
    – Allows hardware engines to be part of the message-passing network
    – Ported to Amirix PCI, BEE2, BEE3, Xilinx ACP
    – Software libraries for MicroBlaze, PowerPC, Intel x86
[1] M. Saldaña et al., “MPI as an abstraction for software-hardware interaction for HPRCs,” HPRCTA, Nov. 2008.
Class II: Processor-based Optimizations
• Background
• Direct Memory Access MPI Hardware Engine
• Non-Interrupting, Non-Blocking Functions
• Series of MPI Messages
• Results and Analysis
Class II: Processor-based Optimizations
Background
• Problem 1
    – Standard message paradigm for HPC systems
        • Plentiful memory but high message latency
        • Favours combining data into a few large messages, which are stored in memory and retrieved as needed
    – Embedded designs offer a different trade-off
        • Little memory but short message latency
        • A 'just-in-time' paradigm is preferred: send just enough data for one unit of computation, on demand
• Problem 2
    – Homogeneity of HPC systems
        • Each rank has similar processing capabilities
    – Heterogeneity of FPGA systems
        • Hardware engines are tailored for a specific set of functions: extremely fast processing
        • Embedded processors play the vital role of control and memory distribution: little processing
• 'Just-in-time' + heterogeneity = producer-consumer model
    – Processors produce messages for hardware engines to consume
    – Generally, the message production rate of the processor is the limiting factor
Class II: Processor-based Optimizations
Direct Memory Access MPI Engine
• Typical MPI implementations use only software
• A DMA engine offloads the time-consuming part of messaging: memory transfers
    – Frees the processor to continue execution
    – Can implement burst memory transactions
    – Time required to prepare a message is independent of message length
    – Allows messages to be queued
MPI_Send(...)
1. Processor writes 4 words:
    • destination rank
    • address of data buffer
    • message size
    • message tag
2. PLB_MPE decodes the message header
3. PLB_MPE transfers the data from memory
MPI_Recv(...)
1. Processor writes 4 words:
    • source rank
    • address of data buffer
    • message size
    • message tag
2. PLB_MPE decodes the message header
3. PLB_MPE transfers the data to memory
4. PLB_MPE notifies the processor
• The DMA engine is completely transparent to the user
    – Exactly the same MPI functions are called
    – DMA setup is handled by the implementation
Class II: Processor-based Optimizations
Non-Interrupting, Non-Blocking Functions
• Two types of MPI message functions
    – Blocking functions: return only when the buffer can be safely reused
    – Non-blocking functions: return immediately
        • A request handle is required so the message status can be checked later
• Non-blocking functions are used to overlap communication and computation
• Typical HPC non-blocking use case:

MPI_Request request;
...
MPI_Isend(..., &request);
prepare_computation();
MPI_Wait(&request, ...);
finish_computation();
• Class II interactions have a different use case
    – Hardware engines are responsible for computation
    – Embedded processors only need to send messages as fast as possible
• DMA hardware allows messages to be queued
• 'Fire-and-forget' message model
    – Message status is not important
    – Request handles are serviced by expensive interrupts
• The standard MPI protocol provides a mechanism for 'fire-and-forget':

MPI_Request request_dummy;
...
MPI_Isend(..., &request_dummy);
MPI_Request_free(&request_dummy);
• The standard implementation still incurs overhead:
    – Setting up the interrupt
    – Removing the interrupt
    – Extra function-call overhead
    – Memory space for the MPI_Request data structure
• For the 'just-in-time' message model on embedded processors, these overheads create a bottleneck
• Proposed modification to the MPI protocol:

#define MPI_REQUEST_NULL NULL
...
MPI_Isend(..., MPI_REQUEST_NULL);

• Non-blocking functions check that the request pointer is valid before setting up interrupts
• Circumvents the overhead
• Not standard, but a minor modification that works well for embedded processors with DMA
Class II: Processor-based Optimizations
Series of Messages – MPI_Coalesce()
• MPI message without DMA
[Timeline: MPI_Send() call → transfer of the data words (many, for a long message) → return]
Legend: Non-MPI Code | Function Preamble/Postamble | MPI Function Code
• MPI message with DMA
[Timeline: MPI_Send() call → transfer of four words, regardless of message length → return]
[Time breakdown with DMA: 55.6% non-MPI code, 28.7% function preamble/postamble, 15.6% MPI function code; messaging overhead 28.7% + 15.6% = 44.3%]
• MPI message with DMA: message queueing
[Timeline: msg 1, msg 2, msg 3 queued back-to-back]
• Inline all MPI functions?
    – Increases program length!
• Standard MPI functions:

void *msg_buf;
int msg_size;
...
MPI_Isend(msg_buf, msg_size, ...);
MPI_Irecv(msg_buf, msg_size, ...);
void MPI_Coalesce(
    // MPI_Coalesce-specific arguments
    MPI_Function *mpi_fn,
    int mpi_fn_count,
    // Arrays of point-to-point MPI function arguments
    void **msg_buf,
    int *msg_size,
    ...
) {
    for (int i = 0; i < mpi_fn_count; i++) {
        if (mpi_fn[i] == MPI_Isend)
            inline MPI_Isend(msg_buf[i], msg_size[i], ...);
        else if (mpi_fn[i] == MPI_Irecv)
            inline MPI_Irecv(msg_buf[i], msg_size[i], ...);
    }
}
• MPI_Coalesce
[Timeline: a single preamble/postamble wraps a for loop that issues msg 1, msg 2, msg 3]
• MPI_Coalesce is not part of the MPI Standard
• Its behaviour can be easily reproduced, even when source code is not available
• Maintains compatibility with MPI code
Class II: Processor-based Optimizations
Results
• Application: Restricted Boltzmann Machines [2]
    – Neural-network FPGA implementation
    – Platform: Berkeley Emulation Engine 2 (BEE2)
        • Five Xilinx Virtex-II Pro XC2VP70 FPGAs
        • Inter-FPGA communication:
            – Latency: 6 cycles
            – Bandwidth: 1.73 GB/s

[2] D. Ly et al., "A Multi-FPGA Architecture for Restricted Boltzmann Machines," FPL, Sept. 2009.
Message #   Source   Destination   Size [# of words]
1           R0       R1            0
2           R0       R1            3
3           R0       R6            0
4           R0       R6            3
5           R0       R11           0
6           R0       R11           3
7           R0       R16           0
8           R0       R16           3
9           R0       R1            4
10          R0       R6            4
11          R0       R11           4
12          R0       R16           4
[Chart: speedups of 2.33x, 3.94x, and 5.32x as the successive optimizations are applied]
Class III: Hardware-based Optimizations
• Background
• Dataflow Message Passing Model
    – Case Study: Vector Addition
Class III: Hardware-based Optimizations
Background
• Processor-based software model
    – Function calls are atomic
    – Program flow is quantized in message-function units
    – Cannot execute communication and computation simultaneously
• Hardware engines
    – Significantly more parallelism
    – Communication and computation can be simultaneous
Class III: Hardware-based Optimizations
Dataflow Message Passing Model
• Standard message-processing model:

MPI_Recv(...);
compute();
MPI_Send(...);

• Hardware uses a dataflow model
Class III: Hardware-based Optimizations
Case Study: Vector Addition
• Vector addition: vc = va + vb (element-wise: vc,i = va,i + vb,i)
• va comes from Rank 1, vb comes from Rank 2
• Compute vc and send the result back to Ranks 1 and 2
• Software model:

int va[N], vb[N], vc[N];

MPI_Recv(va, N, MPI_INT, rank1, ...);
MPI_Recv(vb, N, MPI_INT, rank2, ...);

for (int i = 0; i < N; i++)
    vc[i] = va[i] + vb[i];

MPI_Send(vc, N, MPI_INT, rank1, ...);
MPI_Send(vc, N, MPI_INT, rank2, ...);
• Message transfers are atomic
    – Serializes computation and communication
• Vector addition has great data locality
    – The entire message is not required for computation
    – Only one element of each vector is required
• Finer granularity is required
    – A hardware dataflow approach would use pipelined computation
• Natural extension of MPI for hardware designers
    – Increased granularity → increased performance
    – Supports pipelining
• A single processing element represents multiple ranks
    – Capable of transferring data from multiple sources
    – Supports data streaming
        • Full-duplex data transfer
Conclusion and Future Work
• MPI can be very effective for FPGA designs
    – FPGAs have different trade-offs than HPC systems
• Considerations for MPI on FPGAs
    – Class II: DMA, non-interrupting non-blocking functions, MPI_Coalesce()
    – Class III: Dataflow Message Passing Model
• Attempts to maintain compatibility with the MPI standard
    – Some incremental optimizations do not comply
    – But they can be reduced to legitimate MPI code
• Shows the limit of where the current MPI standard applies
• Future work: message passing using fine-grained parallelism
Thank you
Hardware Debugging Interfaces
• Background
• Tee Cores
• Message Watchdog Timers
Hardware Debugging Interfaces
Background
• Code compatibility allows traditional, software-only MPI debugging
• Porting to FPGA designs can still produce errors
    – Improper on-chip network setup
    – Message-passing flaws in hardware cores
• Hardware has limited visibility
    – No debuggers
    – No standard output/printf()
Hardware Debugging Interfaces
Tee Cores
• Networks typically consist of point-to-point FIFOs between MPI cores
• Tee Cores:
[Diagram: a Tee core inserted on the FIFO link between two MPI cores, with a tap feeding a debug processor]
• Transparent: does not affect original network performance
• Allows direct tracing of the data-link layer
    – Simple communication protocols
    – Easy to follow message transmissions
Hardware Debugging Interfaces
Message Watchdog Timers
• Unresponsive embedded systems cannot otherwise be recovered
• Message watchdog timers are integrated with the MPI implementation source code
    – Snoop incoming messages in a transparent manner
    – If there is no activity and the timer expires, the processor is interrupted and control is returned
• Excellent for post-mortem analysis
    – Connect with Tee Cores for a terse debugging report