The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

Page 1: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

Daniel L. Ly¹, Manuel Saldaña² and Paul Chow¹

¹ Department of Electrical and Computer Engineering, University of Toronto
² Arches Computing Systems, Toronto, Canada

Page 2: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

2

Outline

• Background and Motivation
• Embedded Processor-Based Optimizations
• Hardware Engine-Based Optimizations
• Conclusions and Future Work

Page 3: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

3-8

Motivation

• Message Passing Interface (MPI) is a programming model for distributed memory systems
• Popular in high performance computing (HPC) and cluster-based systems

Problem: sum of the numbers from 1 to 100

[Figure: Processor 1 with its own memory and Processor 2 with its own memory, connected by a message-passing network]

Single processor:

    for (i = 1; i <= 100; i++)
        sum += i;

Split across two processors (Processor 1 sums 1 to 50 and combines the partial sums; Processor 2 sums 51 to 100 and sends its partial sum):

    /* Processor 1 */
    sum1 = 0;
    for (i = 1; i <= 50; i++)
        sum1 += i;
    MPI_Recv(sum2, ...);
    sum = sum1 + sum2;

    /* Processor 2 */
    sum1 = 0;
    for (i = 51; i <= 100; i++)
        sum1 += i;
    MPI_Send(sum1, ...);
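For reference, a complete two-rank version of this example might look as follows; the rank numbers, tag and communicator are filled in here purely for illustration, since the slides only show fragments:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i, sum1 = 0, sum2 = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                        /* "Processor 1" */
            for (i = 1; i <= 50; i++) sum1 += i;
            MPI_Recv(&sum2, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("sum = %d\n", sum1 + sum2);  /* prints 5050 */
        } else if (rank == 1) {                 /* "Processor 2" */
            for (i = 51; i <= 100; i++) sum1 += i;
            MPI_Send(&sum1, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }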

Page 9: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

9

Motivation

• Strong interest in adapting MPI for embedded designs:
  – Increasingly difficult to interface heterogeneous resources as FPGA chip sizes increase
• MPI provides key benefits:
  – Unified protocol
  – Low weight and overhead
  – Abstraction of end points (ranks)
  – Easy prototyping

Page 10: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

10

Motivation

Property                            HPC Cluster           Embedded FPGA
Processor: clock rate               2-3 GHz               100-200 MHz
Memory: size per node               > 1 GB                1-20 MB
Interconnect: protocol robustness   High                  None
Interconnect: latency               10 μs (20k cycles)    100 ns (10 cycles)
Interconnect: bandwidth             125 MB/s              400-800 MB/s
Components: processing nodes        Homogeneous           Heterogeneous

Page 11: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

11

Motivation

• Interaction classes arising from heterogeneous designs:
  – Class I: Software-software interactions
    • Collections of embedded processors
    • Thoroughly investigated; will not be discussed
  – Class II: Software-hardware interactions
    • Embedded processors with hardware engines
    • Large variety in processing speed
  – Class III: Hardware-hardware interactions
    • Collections of hardware engines
    • Hardware engines are capable of significant concurrency compared to processors

Page 12: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

12

Background

• Work builds on TMD-MPI [1]
  – Subset implementation of the MPI standard
  – Allows hardware engines to be part of the message passing network
  – Ported to Amirix PCI, BEE2, BEE3, Xilinx ACP
  – Software libraries for MicroBlaze, PowerPC, Intel X86

[1] M. Saldaña et al., "MPI as an abstraction for software-hardware interaction for HPRCs," HPRCTA, Nov. 2008.

Page 13: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

13

Class II: Processor-based Optimizations

• Background
• Direct Memory Access MPI Hardware Engine
• Non-Interrupting, Non-Blocking Functions
• Series of MPI Messages
• Results and Analysis

Page 14: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

14

Class II: Processor-based Optimizations
Background

• Problem 1
  – Standard message paradigm for HPC systems
    • Plentiful memory but high message latency
    • Favours combining data into a few large messages, which are stored in memory and retrieved as needed
  – Embedded designs provide a different trade-off
    • Little memory but short message latency
    • 'Just-in-time' paradigm is preferred
      – Sending just enough data for one unit of computation, on demand

Page 15: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

15

Class II: Processor-based Optimizations
Background

• Problem 2
  – Homogeneity of HPC systems
    • Each rank has similar processing capabilities
  – Heterogeneity of FPGA systems
    • Hardware engines are tailored for a specific set of functions – extremely fast processing
    • Embedded processors play the vital role of control and memory distribution – little processing

Page 16: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

16

Class II: Processor-based Optimizations
Background

• 'Just-in-time' + heterogeneity = producer-consumer model
  – Processors produce messages for hardware engines to consume
  – Generally, the message production rate of the processor is the limiting factor

Page 17: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

17

Class II: Processor-based Optimizations
Direct Memory Access MPI Engine

• Typical MPI implementations use only software
• A DMA engine offloads the time-consuming messaging task: memory transfers
  – Frees the processor to continue execution
  – Can implement burst memory transactions
  – Time required to prepare a message is independent of message length
  – Allows messages to be queued

Page 18: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

18-31

Class II: Processor-based Optimizations
Direct Memory Access MPI Engine

[Figure: block diagram of the processor, memory and PLB_MPE (the DMA MPI engine), animated over several slides to show an MPI_Send]

MPI_Send(...)
1. Processor writes 4 words:
   • destination rank
   • address of data buffer
   • message size
   • message tag
2. PLB_MPE decodes the message header
3. PLB_MPE transfers the data from memory
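As a rough sketch of step 1, the processor side of such a send can be pictured as four 32-bit writes to a memory-mapped command port of the MPE; the base address, the single-port layout and the helper below are assumptions for illustration, not the actual PLB_MPE interface:

    #include <stdint.h>

    /* Hypothetical memory-mapped command port of the DMA MPI engine. */
    #define MPE_CMD_PORT (*(volatile uint32_t *)0xC0000000)

    static void mpe_queue_send(uint32_t dest_rank, const void *buf,
                               uint32_t size, uint32_t tag)
    {
        MPE_CMD_PORT = dest_rank;                 /* 1: destination rank       */
        MPE_CMD_PORT = (uint32_t)(uintptr_t)buf;  /* 2: address of data buffer */
        MPE_CMD_PORT = size;                      /* 3: message size           */
        MPE_CMD_PORT = tag;                       /* 4: message tag            */
        /* The processor returns immediately; the MPE decodes the header and
           performs burst reads of the payload from memory on its own. */
    }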

Page 32: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

32-45

Class II: Processor-based Optimizations
Direct Memory Access MPI Engine

[Figure: the same block diagram, animated over several slides to show an MPI_Recv]

MPI_Recv(...)
1. Processor writes 4 words:
   • source rank
   • address of data buffer
   • message size
   • message tag
2. PLB_MPE decodes the message header
3. PLB_MPE transfers the data to memory
4. PLB_MPE notifies the processor

Page 46: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

46

Class II: Processor-based Optimizations
Direct Memory Access MPI Engine

• The DMA engine is completely transparent to the user
  – Exactly the same MPI functions are called
  – DMA setup is handled by the implementation

Page 47: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

47

Class II: Processor-based Optimizations
Non-Interrupting, Non-Blocking Functions

• Two types of MPI message functions
  – Blocking functions: return only when the buffer can be safely reused
  – Non-blocking functions: return immediately
    • A request handle is required so the message status can be checked later
    • Non-blocking functions are used to overlap communication and computation

Page 48: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

48

Class II: Processor-based Optimizations
Non-Interrupting, Non-Blocking Functions

• Typical HPC non-blocking use case:

    MPI_Request request;
    ...
    MPI_Isend(..., &request);
    prepare_computation();
    MPI_Wait(&request, ...);
    finish_computation();

Page 49: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

49

Class II: Processor-based Optimizations
Non-Interrupting, Non-Blocking Functions

• Class II interactions have a different use case
  – Hardware engines are responsible for computation
  – Embedded processors only need to send messages as fast as possible
    • DMA hardware allows messages to be queued
    • 'Fire-and-forget' message model
      – Message status is not important
      – Request handles are serviced by expensive interrupts

Page 50: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

50

Class II: Processor-based Optimizations
Non-Interrupting, Non-Blocking Functions

• The standard MPI protocol provides a mechanism for 'fire-and-forget':

    MPI_Request request_dummy;
    ...
    MPI_Isend(..., &request_dummy);
    MPI_Request_free(&request_dummy);

Page 51: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

51

Class II: Processor-based Optimizations
Non-Interrupting, Non-Blocking Functions

• The standard implementation still incurs overhead:
  – Setting up the interrupt
  – Removing the interrupt
  – Extra function call overhead
  – Memory space for the MPI_Request data structure
• For the 'just-in-time' message model on embedded processors, these overheads create a bottleneck

Page 52: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

52

Class II: Processor-based Optimizations
Non-Interrupting, Non-Blocking Functions

• Proposed modification to the MPI protocol:

    #define MPI_REQUEST_NULL NULL
    ...
    MPI_Isend(..., MPI_REQUEST_NULL);

• Non-blocking functions check that the request pointer is valid before setting up interrupts
• Circumvents the overhead
• Not standard, but a minor modification that works well for embedded processors with DMA
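A minimal sketch of how the library side of this check might look; the helper functions and the body are illustrative assumptions, not the TMD-MPI source:

    #define MPI_REQUEST_NULL NULL

    int MPI_Isend(void *buf, int count, MPI_Datatype type, int dest,
                  int tag, MPI_Comm comm, MPI_Request *request)
    {
        /* Queue the 4-word send header for the DMA engine (hypothetical helper). */
        mpe_queue_send_header(dest, buf, count, tag);

        if (request != MPI_REQUEST_NULL) {
            /* Only pay for the handle and the completion interrupt
               when the caller actually wants to track the message. */
            *request = mpe_allocate_request();          /* hypothetical helper */
            mpe_enable_completion_interrupt(*request);   /* hypothetical helper */
        }
        return MPI_SUCCESS;
    }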

Page 53: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

53-60

Class II: Processor-based Optimizations
Series of messages – MPI_Coalesce()

Legend: Non-MPI Code | Function Preamble/Postamble | MPI Function Code

• MPI message without DMA
  [Figure: timeline of a single MPI_Send(), built up over several slides: function preamble, transfer of the data words (potentially lots of them) by the processor, then return]

Page 61: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

61

Class II: Processor-based Optimizations
Series of messages – MPI_Coalesce()

• MPI message with DMA
  [Figure: timeline of a single MPI_Send(): function preamble, transfer of only four header words regardless of message length, then return]

Legend: Non-MPI Code | Function Preamble/Postamble | MPI Function Code

Page 62: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

62-63

Class II: Processor-based Optimizations
Series of messages – MPI_Coalesce()

• MPI message with DMA
  [Figure: time breakdown of the DMA-based MPI_Send, annotated 55.6%, 28.7% and 15.6%; the latter two sum to 44.3%]

Legend: Non-MPI Code | Function Preamble/Postamble | MPI Function Code

Page 64: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

64

Class II: Processor-based Optimizations
Series of messages – MPI_Coalesce()

• MPI message with DMA
  – Message queueing
  [Figure: timeline of three queued messages (msg 1, msg 2, msg 3) issued back to back]

Legend: Non-MPI Code | Function Preamble/Postamble | MPI Function Code
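Combined with the non-interrupting, non-blocking convention from the earlier slides, queueing a series of messages then reduces to issuing the calls back to back; the buffers, counts and tags here are illustrative:

    /* Each call only writes a 4-word header to the DMA engine, so the
       processor moves on to the next message almost immediately. */
    MPI_Isend(buf_a, COUNT_A, MPI_INT, dest, TAG_A, MPI_COMM_WORLD, MPI_REQUEST_NULL);
    MPI_Isend(buf_b, COUNT_B, MPI_INT, dest, TAG_B, MPI_COMM_WORLD, MPI_REQUEST_NULL);
    MPI_Isend(buf_c, COUNT_C, MPI_INT, dest, TAG_C, MPI_COMM_WORLD, MPI_REQUEST_NULL);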

Page 65: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

65-68

Class II: Processor-based Optimizations
Series of messages – MPI_Coalesce()

• Inline all MPI functions?
  – Increases program length!
  [Figure: timeline of msg 1, msg 2 and msg 3 with the MPI function code inlined at each call site]

Legend: Non-MPI Code | Function Preamble/Postamble | MPI Function Code

Page 69: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

69

Class II: Processor-based Optimizations
Series of messages – MPI_Coalesce()

• Standard MPI Functions:

    void *msg_buf;
    int msg_size;
    ...
    MPI_Isend(msg_buf, msg_size, ...);
    MPI_Irecv(msg_buf, msg_size, ...);

Page 70: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

70-75

Class II: Processor-based Optimizations
Series of messages – MPI_Coalesce()

    void MPI_Coalesce (
        // MPI_Coalesce-specific arguments
        MPI_Function *mpi_fn,
        int mpi_fn_count,
        // Array of point-to-point MPI function arguments
        void **msg_buf,
        int *msg_size,
        ...
    )
    {
        for (int i = 0; i < mpi_fn_count; i++) {
            if (mpi_fn[i] == MPI_Isend)
                inline MPI_Isend(msg_buf[i], msg_size[i], ...);
            else if (mpi_fn[i] == MPI_Irecv)
                inline MPI_Irecv(msg_buf[i], msg_size[i], ...);
        }
    }
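A possible call site, mirroring the argument layout shown above; the buffers, sizes and the number of messages are illustrative, and the remaining point-to-point arguments are elided exactly as on the slide:

    MPI_Function fn[3]    = { MPI_Isend, MPI_Isend, MPI_Irecv };
    void        *bufs[3]  = { out_a, out_b, in_c };
    int          sizes[3] = { SIZE_A, SIZE_B, SIZE_C };

    /* One call replaces three non-blocking calls, so the function
       preamble/postamble is paid only once for the whole series. */
    MPI_Coalesce(fn, 3, bufs, sizes, ...);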

Page 76: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

76-79

Class II: Processor-based Optimizations
Series of messages – MPI_Coalesce()

• MPI_Coalesce
  [Figure: timeline of msg 1, msg 2 and msg 3 issued from the for loop inside a single MPI_Coalesce call]

Legend: Non-MPI Code | Function Preamble/Postamble | MPI Function Code

Page 80: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

80

Class II: Processor-based Optimizations
Series of messages – MPI_Coalesce()

• MPI_Coalesce is not part of the MPI Standard
• Its behaviour can be easily reproduced, even when source code is not available
• Maintains compatibility with MPI code
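Because of this, code written against MPI_Coalesce can always be rewritten as plain standard MPI, for example (buffers, counts, ranks and tags are illustrative):

    /* Standard-MPI equivalent of a three-message coalesced call. */
    MPI_Request req[3];
    MPI_Isend(out_a, COUNT_A, MPI_INT, rank_a, TAG_A, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(out_b, COUNT_B, MPI_INT, rank_b, TAG_B, MPI_COMM_WORLD, &req[1]);
    MPI_Irecv(in_c,  COUNT_C, MPI_INT, rank_c, TAG_C, MPI_COMM_WORLD, &req[2]);
    MPI_Waitall(3, req, MPI_STATUSES_IGNORE);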

Page 81: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

81

Class II: Processor-based Optimizations
Results

• Application: Restricted Boltzmann Machines [2]
  – Neural network FPGA implementation
  – Platform: Berkeley Emulation Engine 2 (BEE2)
    • Five Xilinx Virtex-II Pro XC2VP70 FPGAs
    • Inter-FPGA communication:
      – Latency: 6 cycles
      – Bandwidth: 1.73 GB/s

[2] D. Ly et al., "A Multi-FPGA Architecture for Restricted Boltzmann Machines," FPL, Sept. 2009.


Page 83: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

83

Class II: Processor-based Optimizations
Results

Message #   Source   Destination   Size [# of words]
1           R0       R1            0
2           R0       R1            3
3           R0       R6            0
4           R0       R6            3
5           R0       R11           0
6           R0       R11           3
7           R0       R16           0
8           R0       R16           3
9           R0       R1            4
10          R0       R6            4
11          R0       R11           4
12          R0       R16           4

Page 84: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

84-90

Class II: Processor-based Optimizations
Results

[Figure: measured results, built up over several slides, annotated with speedups of 2.33x, 3.94x and 5.32x]

Page 91: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

91

Class III: Hardware-based Optimizations

• Background
• Dataflow Message Passing Model
  – Case Study: Vector Addition

Page 92: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

92

Class III: Hardware-based Optimizations
Background

• Processor-based, software model
  – Function calls are atomic
  – Program flow is quantized in message-function units
  – Cannot execute communication and computation simultaneously
• Hardware engines
  – Significantly more parallelism
  – Communication and computation can be simultaneous

Page 93: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

93

Class III: Hardware-based Optimizations
Dataflow Message Passing Model

• Standard message processing model:

    MPI_Recv(...);
    compute();
    MPI_Send(...);

• Hardware uses a dataflow model
  [Figure: a logic block with incoming and outgoing message streams]

Page 94: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

94

Class III: Hardware-based Optimizations
Case Study: Vector Addition

• Vector addition: v_c = v_a + v_b, i.e. v_{c,i} = v_{a,i} + v_{b,i}
• v_a comes from Rank 1, v_b comes from Rank 2
• Compute v_c, send the result back to Ranks 1 and 2

Page 95: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

95

Class III: Hardware-based Optimizations
Case Study: Vector Addition

• Software model:

    int va[N], vb[N], vc[N];

    MPI_Recv(va, N, MPI_INT, rank1, ...);
    MPI_Recv(vb, N, MPI_INT, rank2, ...);

    for (int i = 0; i < N; i++)
        vc[i] = va[i] + vb[i];

    MPI_Send(vc, N, MPI_INT, rank1, ...);
    MPI_Send(vc, N, MPI_INT, rank2, ...);

Page 96: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

96-105

Class III: Hardware-based Optimizations
Case Study: Vector Addition

[Figure: animated message-passing diagram for the vector addition case study, built up over several slides]

Page 106: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

106

Class III: Hardware-based Optimizations
Case Study: Vector Addition

• Message transfers are atomic
  – Serializes computation and communication
• Vector addition has great data locality
  – The entire message is not required for computation
  – Only one element of each vector is required
• Higher granularity is required
  – A hardware dataflow approach would use pipelined computation
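To make the granularity point concrete, a software analogue of the dataflow approach would interleave communication and computation at element granularity rather than whole-vector granularity; this is only a sketch of the idea, since a real hardware engine would implement it with FIFOs and a pipelined adder rather than per-element MPI calls:

    int a_i, b_i, c_i;
    for (int i = 0; i < N; i++) {
        /* One element of each operand is enough to produce one result. */
        MPI_Recv(&a_i, 1, MPI_INT, rank1, TAG_A, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&b_i, 1, MPI_INT, rank2, TAG_B, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        c_i = a_i + b_i;
        MPI_Send(&c_i, 1, MPI_INT, rank1, TAG_C, MPI_COMM_WORLD);
        MPI_Send(&c_i, 1, MPI_INT, rank2, TAG_C, MPI_COMM_WORLD);
    }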

Page 107: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

107-112

Class III: Hardware-based Optimizations
Case Study: Vector Addition

[Figure: animated diagram of the dataflow, element-granularity version of the vector addition, built up over several slides]

Page 113: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

113

Class III: Hardware-based Optimizations
Dataflow Message Passing Model

• Natural extension of MPI for hardware designers
  – Increased granularity yields increased performance
  – Supports pipelining
• A single processing element represents multiple ranks
  – Capable of transferring data from multiple sources
  – Supports data streaming
    • Full-duplex data transfer

Page 114: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

114

Conclusion and Future Work

• MPI can be very effective for FPGA designs
  – FPGAs have different trade-offs than HPC
• Considerations for dealing with MPI on FPGAs
  – Class II: DMA, non-interrupting non-blocking functions, MPI_Coalesce()
  – Class III: Dataflow Message Passing Model
• Attempts to maintain compatibility with the MPI standard
  – Some incremental optimizations do not comply
  – But they can be reduced to legitimate MPI code
• Shows the limit of where the current MPI standard applies
• Future work: message passing using fine-grain parallelism

Page 115: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

115

Thank you

• Special thanks to:


Page 116: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

116

Hardware Debugging Interfaces

• Background
• Tee Cores
• Message Watchdog Timers

Page 117: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

117

Hardware Debugging Interfaces
Background

• Code compatibility allows traditional, software-only MPI debugging
• Porting to FPGA designs can still produce errors
  – Improper on-chip network setup
  – Message passing flaws in hardware cores
• Hardware has limited visibility
  – No debuggers
  – No standard output / printf()

Page 118: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

118-120

Hardware Debugging Interfaces
Tee Cores

• Networks typically consist of point-to-point FIFOs
• Tee Cores:
  [Figure: two MPI cores connected by point-to-point FIFOs; over three slides, tee cores are inserted on the links and connected to a monitoring processor]

Page 121: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

121-122

Hardware Debugging Interfaces
Tee Cores

• Transparent: does not affect the original network's performance
• Allows direct tracing of the data link layer
  – Simple communication protocols
  – Easy to follow message transmissions

[Figure: traced link between Rank 1 and Rank n]

Page 123: The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes

123

Hardware Debugging Interfaces
Message Watchdog Timers

• Unresponsive embedded systems cannot otherwise be recovered
• Message watchdog timers are integrated with the MPI implementation source code
  – Snoop incoming messages in a transparent manner
  – If there is no activity before the timer expires, the processor is interrupted and control is returned
• Excellent for post-mortem analysis
  – Connect with the tee cores for a terse debugging report
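A rough software-side sketch of the idea: the timer is rearmed on message activity, and its interrupt handler hands control back with enough state for a post-mortem report. The register accesses and helper functions below are assumptions for illustration, not the actual implementation:

    volatile int watchdog_expired = 0;

    /* Rearm the watchdog whenever a message header is seen. */
    static void on_message_activity(void)
    {
        watchdog_rearm(TIMEOUT_CYCLES);   /* hypothetical register write */
    }

    /* Fires only if no message activity occurs before the timeout. */
    void watchdog_isr(void)
    {
        watchdog_expired = 1;             /* main loop can now dump a report, e.g.
                                             the link traffic captured by the tee cores */
        watchdog_clear_interrupt();       /* hypothetical register write */
    }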