A MODULAR HETEROGENEOUS COMMUNICATION LAYER FOR A
CLUSTER OF FPGAS AND CPUS
by
Nariman Eskandari
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2018 by Nariman Eskandari
Abstract
A Modular Heterogeneous Communication Layer for a Cluster of FPGAs and CPUs
Nariman Eskandari
Master of Applied Science
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
2018
A key infrastructure required to make heterogeneous clusters easier to use is a standard
communication mechanism between computing nodes. Without this infrastructure, application
developers of multi-FPGA applications are forced to develop their own custom communication cores before their hardware kernels can communicate. This problem is significantly worse at a
data center scale. This thesis shows how introducing modularity into an existing orchestration
platform makes it possible to automatically and interchangeably support different underlying
network protocols and multiple communication models for a multi-FPGA platform. We show
that it is straightforward to connect the communication models implemented in the FPGAs to
compatible software implementations of the same models so that it is easy to build heteroge-
neous applications using CPUs and FPGAs. Our results show that the heterogeneous communication layer introduces very little area overhead in the FPGAs and still performs at line rate, bottlenecked solely by the network protocols.
Acknowledgements
I would like to thank my supervisor, Professor Paul Chow. The completion of this thesis and
the many valuable life skills I have acquired in the process are because of his guidance and his patience in working with me.
I would like to thank my parents Neda Mashhadi and Hamid Reza Eskandari. Without their
support and patience I would not be here.
I would like to thank my excellent colleague Naif Tarafdar for all the support and help, and
my other colleagues in PT477.
I would like to thank my wonderful friends. They have been like family to me over the past
two years.
Contents

Acknowledgements

1 Introduction

2 Background
   2.1 Field Programmable Gate Array
   2.2 High-Level Synthesis
   2.3 Data Center Architecture
   2.4 Network Stack
   2.5 AXI Bus
   2.6 Partial Reconfiguration
   2.7 Xilinx SDAccel Platform
   2.8 MicroBlaze Soft Processor

3 Related Work
   3.1 Current Data Centers/Clouds with FPGAs
       3.1.1 FPGA in the Cloud: Booting Virtualized Hardware Accelerators with OpenStack
       3.1.2 Catapult v1
       3.1.3 Catapult v2
       3.1.4 Enabling FPGAs in the Hyperscale Data Center
   3.2 Base Infrastructure
   3.3 Communication Models
       3.3.1 Streaming Model
       3.3.2 Message Passing Model
   3.4 Other Heterogeneous Communication Layers
       3.4.1 TMD-MPI
       3.4.2 Novo-G#

4 Galapagos
   4.1 Hypervisor
       4.1.1 Base Infrastructure Hypervisor
       4.1.2 Galapagos Hypervisor
   4.2 Application Region
       4.2.1 Base Infrastructure Application Region
       4.2.2 Galapagos Application Region

5 HUMboldt Communication Layer
   5.1 TMD-MPI
   5.2 MPI Communication Layer
   5.3 Protocol
   5.4 Packetization
   5.5 Reliability Layer
   5.6 HUMboldt Software

6 System Interface and Tool Flow
   6.1 Base Infrastructure System Interface
       6.1.1 Logical File
       6.1.2 Mapping File
       6.1.3 MAC Address File
   6.2 Galapagos System Interface
       6.2.1 Logical File
       6.2.2 Mapping File
   6.3 HUMboldt
   6.4 HUMboldt Tool Flow
       6.4.1 Software Kernels
       6.4.2 Hardware Kernels

7 Evaluation
   7.1 Resource Utilization of Galapagos and HUMboldt
   7.2 Latency and Throughput
   7.3 Scalability and Heterogeneity

8 Conclusion and Future Work

Bibliography
Chapter 1
Introduction
The integration of accelerators in the data center has been shown to be beneficial [1, 2] but using
heterogeneity can be difficult for data center application developers and system administrators.
This heterogeneity is especially difficult for application developers when integrating FPGAs
as this usually requires the user to design the application and management circuitry, includ-
ing the network stack and memory management. For both application developers and system
administrators, the lack of a common communication standard among clusters of accelerators
and CPUs raises challenges in creating communication links between different devices, and
supporting the network connections between these devices. Another challenge for application
developers is the lack of functional portability of a distributed application across different de-
vices, which usually requires a developer to tailor an application specifically for each device
and manage its communication. These challenges become even more daunting at the data
center scale where we can have potentially thousands of nodes interacting. We believe the
challenges can be addressed with a portable standard communication layer across both CPUs
and FPGAs, which we address in this thesis.
The communication layer that a user needs for their distributed application is very specific
to the traffic patterns and programming models of their application. For example, some dis-
tributed applications are more suited to a streaming model of communication, such as video
processing applications. Other applications that have many nodes sending and receiving data
to each other are more amenable to a message passing model, such as many scientific applica-
tions. In software, these communication models are implemented as application network layers
built on top of the OSI network stack [3]. This gives the user the option to choose a commu-
nication layer suitable to their application and the underlying network layers used to support
the chosen communication layer. There is no equivalent stack in hardware, where typically,
each hardware application has to build its own custom communication layer. We would like a
standard FPGA/CPU communication layer to fit as a module on top of an implementation on
different layers of the OSI network stack in hardware.
The main contributions of this thesis are as follows. First, we build on top of an open source
scalable FPGA orchestration tool [4]. This tool allows us to easily map streaming FPGA ker-
nels and connect them across many FPGAs in a data center by changing a configuration file.
We first address scalability limitations in the tool, and we also re-implement the networking stack within it, making the network portion modular so that we can support multiple implementations of different layers of the networking stack. We demonstrate this by having
the same application communicate over both Ethernet and TCP without changing the appli-
cation. We call our rebuilt infrastructure Galapagos. Then we implement a message passing
communication layer on top of Galapagos, which we call HUMboldt. This is implemented
as a high-level synthesis (HLS) library and a software library, allowing an application developed with these libraries to be functionally portable across both CPUs and FPGAs. The functional
portability is important because it enables application development in a pure software environ-
ment. Once correct functionality has been achieved, parts of the code can be ported to run as
hardware without modifying the code. A standard communication protocol is used allowing
heterogeneous devices (CPUs and FPGAs) to communicate at a data center scale.
The remainder of the thesis is organized as follows. Chapter 2 provides information that
is necessary for better understanding this thesis. Chapter 3 explores related works in clusters
with FPGAs and cloud systems as well as other communication layers specifically on multi-
FPGA clusters. Chapter 4 explores our modular rebuild of an FPGA cluster generator, and
the implementation details of our communication layer are described in Chapter 5. The details on how to interface with our system and the tool flow are provided in Chapter 6. Chapter 7 shows our results with microbenchmarks measuring the performance of our communication layer and infrastructure between FPGAs and CPUs. Lastly, the conclusions and future work of this thesis are presented in Chapter 8.
Chapter 2
Background
This chapter presents necessary information for better understanding this work.
2.1 Field Programmable Gate Array
Field Programmable Gate Arrays (FPGAs) are components that have configurable logic that
can be used to implement different digital logic functions [5]. The main components of the
FPGAs are Logic Blocks, Block RAMs (BRAMs), and Digital Signal Processing units (DSPs), and some FPGAs also include an ARM processor. The logic blocks contain several look-up tables and flip-flops, and they are used to implement most of the logic. There is also a sophisticated
programmable routing network that is used to interconnect all of the logic components and
connect to the inputs and outputs of the logic function.
2.2 High-Level Synthesis
High-Level Synthesis (HLS) tools provide a way for developers and hardware designers to build hardware cores using high-level programming languages such as C and C++ rather than Hardware Description Languages (HDLs) such as Verilog or VHDL. HLS tools generate functionally equivalent, FPGA-synthesizable HDL from these higher-level languages, much as compilers for software languages generate machine-specific assembly
code [6]. One of the most notable advantages of HLS is that a hardware designer can ex-
plore design space choices in a much shorter amount of time. This helps to achieve better
results in performance, area, and power trade-offs.
There are various implementations of HLS tools. Some of these tools are academic. For
example, LegUp [7] is an open-source HLS tool. Some industrial versions of these tools are
Vivado HLS [8] and the Intel HLS Compiler [9]. The work in this thesis uses Vivado HLS that
targets Xilinx FPGAs.
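As a concrete illustration, the following is a minimal sketch of an HLS-synthesizable C++ kernel in the style accepted by Vivado HLS; the kernel, its interfaces, and the pragma choices are illustrative assumptions rather than code from this thesis.

    #include <hls_stream.h>
    #include <ap_int.h>

    // A simple streaming accumulator. Vivado HLS infers a pipelined datapath
    // from the loop; the pragmas request AXI4-Stream interfaces and a
    // pipeline initiation interval of one.
    void mac_kernel(hls::stream<ap_uint<32> > &in, hls::stream<ap_uint<32> > &out) {
    #pragma HLS INTERFACE axis port=in
    #pragma HLS INTERFACE axis port=out
    #pragma HLS INTERFACE ap_ctrl_none port=return
        ap_uint<32> acc = 0;
        for (int i = 0; i < 16; i++) {
    #pragma HLS PIPELINE II=1
            acc += in.read();
        }
        out.write(acc);
    }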
2.3 Data Center Architecture
Figure 2.1 shows the architecture of our testbed platform. All processing elements in our plat-
form, CPUs and FPGAs, have direct connections to the network. This allows any processing
element to communicate directly with any other processing element on the network. Some of
the FPGAs are hosted in a server so there would also be a link over PCIe between the FPGA
and the CPU. Other FPGAs are part of a System on Chip (SoC), where there is an ARM proces-
sor on the same chip as the FPGA, so in this case there is also a connection between the ARM
processor and the FPGA fabric. All of our network links use either 10G or 100G Ethernet.
2.4 Network Stack
The standard OSI network stack is shown in Figure 2.2. It uses a layered approach where
the lower layers provide services to the layers above [10]. The Physical layer is the actual connection between nodes in a network, such as optical fiber or Ethernet cable. The Data Link
layer is responsible for delivering packets within local hops using some addressing scheme. In
this work, we are using Ethernet-based communications as the Data Link layer so we will also
refer to this as the Ethernet layer. This layer does not provide reliability, so there is a chance
for a packet to be lost or corrupted. The layer above the Data link layer is the Network layer,
CHAPTER 2. BACKGROUND 6
Figure 2.1: Data center architecture
which in our case is concerned with routing among networks or within a network using IP
addresses. The Network Layer is an end-to-end communication layer that does not care about
the path that a packet may take. The Transport Layer provides end-to-end communications with more services such as reliable communication or fragmentation. The most popular Transport
protocols are TCP and UDP.
TCP guarantees the recipient will receive the packets in order by numbering them. The
recipient sends messages back to the sender saying it received the messages. If the sender
does not get a correct response, it will resend the packets to ensure the recipient received them.
Packets are also checked for errors. TCP is all about this reliability, so packets sent with TCP
are tracked so no data is lost or corrupted in transit.
When using UDP, packets are just sent to the recipient. The sender will not wait to make
sure the recipient received the packet, and it will just continue sending the next packets. If
the recipient misses some UDP packets, it cannot ask for those packets again. There is no guarantee that the receiver gets all the packets, and no way to request a retransmission if a packet is lost, but removing this overhead means the nodes can communicate more quickly.
In this thesis we support up to the Transport layer of the network stack, and the user is
able to choose between using the raw Ethernet protocol (Data Link layer) and the TCP Protocol
(Transport layer) for the communication between nodes of an application using the network.
Figure 2.2: Network Stack
2.5 AXI Bus
AXI is part of the AMBA [11] bus specification, which Xilinx uses for interconnection between different circuit blocks (IPs) in a large circuit system. The current generation of AXI is AXI4, which comes in three variants [12].
• AXI4: for high-performance memory-mapped requirements.
• AXI4-Lite: for simple, low-throughput memory-mapped communication (for example,
to and from control and status registers).
• AXI4-Stream: for high-speed streaming data.
Most of the interconnections between the components in this work are AXI4-Stream. The
AXI4-Stream protocol defines a channel for transmission of streaming data. Unlike AXI4,
AXI4-Stream interfaces can burst an unlimited amount of data. AXI4-Stream also provides a back-pressure signal to the upstream producer IP, indicating that the consumer is not ready to receive data.
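To make this concrete, the sketch below shows a hypothetical HLS kernel that forwards AXI4-Stream beats and sets the dest side-channel identifying a destination kernel, as used for routing later in this thesis; the 64-bit width, field widths, and kernel ID are assumptions.

    #include <hls_stream.h>
    #include <ap_axi_sdata.h>

    // One AXI4-Stream beat: 64 data bits plus side channels, including an
    // 8-bit dest field (ap_axiu also carries keep, last, and so on).
    typedef ap_axiu<64, 1, 1, 8> axis_word;

    void forward(hls::stream<axis_word> &in, hls::stream<axis_word> &out) {
    #pragma HLS INTERFACE axis port=in
    #pragma HLS INTERFACE axis port=out
    #pragma HLS INTERFACE ap_ctrl_none port=return
        axis_word w = in.read();   // blocks until the producer asserts valid
        w.dest = 3;                // address the beat to kernel 3 (hypothetical ID)
        out.write(w);              // blocks while the consumer applies back-pressure
    }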
2.6 Partial Reconfiguration
By using Partial Reconfiguration, a user is able to program a part of an FPGA without inter-
rupting the remaining logic [13]. This capability allows the user to change the functionality on
the fly, and it enhances the flexibility of the FPGA devices. It also makes the reconfiguration
time shorter, and could be used to reduce the power consumption of the FPGA. We also point
out a further advantage of using Partial reconfiguration in Section 4.1.2.
2.7 Xilinx SDAccel Platform
The Xilinx SDAccel Platform [14] provides several programming interfaces (APIs) by which FPGAs can be used as accelerators for a host. A host is a CPU that can have several accelerator
devices connected to it, and it manages the devices. Using these APIs the user is able to pro-
gram the FPGA, send data to it, and read back the result of the computation on the FPGA. The
FPGAs in SDAccel are connected through a PCIe interface. The Xilinx SDAccel platform is
based on the OpenCL language. OpenCL is a C-like language that is both a programming language for parallel and heterogeneous devices and a host-side API for managing OpenCL devices. Xilinx has introduced its own OpenCL driver called
the Hardware Abstraction Layer (HAL) that provides functions by which it is possible to send
and receive data to and from the FPGA, and program the FPGA [15].
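As an illustration of this host-side flow, here is a minimal sketch using the standard OpenCL C API that SDAccel builds on; the binary file name, kernel name, and buffer sizes are assumptions, and error checking is omitted for brevity.

    #include <CL/cl.h>
    #include <fstream>
    #include <iterator>
    #include <vector>
    #include <cstdio>

    static std::vector<unsigned char> load_file(const char *path) {
        std::ifstream f(path, std::ios::binary);
        return std::vector<unsigned char>((std::istreambuf_iterator<char>(f)),
                                          std::istreambuf_iterator<char>());
    }

    int main() {
        cl_int err;
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

        // Program the FPGA with a precompiled binary container (hypothetical file).
        std::vector<unsigned char> bin = load_file("kernel.xclbin");
        const unsigned char *bin_ptr = bin.data();
        size_t bin_size = bin.size();
        cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &bin_size,
                                                    &bin_ptr, NULL, &err);
        cl_kernel krnl = clCreateKernel(prog, "doubler", &err); // assumed kernel name

        // Send data to the FPGA, run the kernel, and read back the result.
        int in[16], out[16];
        for (int i = 0; i < 16; i++) in[i] = i;
        cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof(in),  NULL, &err);
        cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(out), NULL, &err);
        clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, sizeof(in), in, 0, NULL, NULL);
        clSetKernelArg(krnl, 0, sizeof(cl_mem), &d_in);
        clSetKernelArg(krnl, 1, sizeof(cl_mem), &d_out);
        clEnqueueTask(q, krnl, 0, NULL, NULL); // single work-item, common for FPGAs
        clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, sizeof(out), out, 0, NULL, NULL);
        clFinish(q);
        printf("out[0] = %d\n", out[0]);
        return 0;
    }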
2.8 MicroBlaze Soft Processor
MicroBlaze is a 32-bit soft core processor that is optimized for Xilinx FPGAs [16]. Soft core
means that the processor is realized using the available resources in FPGAs rather than being
fabricated as an ASIC core. Using this core, a user can design a system that contains a software
part. This core can connect to hardware cores of the system through the different types of AXI
bus.
Chapter 3
Related Work
Recently, cloud systems have deployed FPGAs and provide services that include FPGAs as computation nodes. The reason behind this incorporation is that FPGAs have been shown to be excellent accelerators for many applications, and their power consumption is relatively low in comparison with other cloud resources, e.g., GPUs and CPUs. However, current tool flows for deploying FPGAs in the cloud are hard to use. This is even more cumbersome when the system is scaled up to multiple FPGAs, or when the FPGA is deployed in a heterogeneous system that has other cloud resources it needs to communicate with. Therefore, an easy-to-use tool flow with the following characteristics is necessary:
• Provisioning of resources (CPUs and FPGAs) from the cloud, including their connectivity.
• A heterogeneous communication layer for the provisioned cluster that makes communication high-performance and functionally portable.
There are different communication models such as streaming and message-passing, and in
this work we chose to implement a heterogeneous message-passing communication layer. A
communication layer for heterogeneous systems should be flexible enough to allow a user to
trade off application requirements against communication overheads. One dimension of this
flexibility is to allow the user to choose between different layers of the network stack, such as
using raw Ethernet or TCP to trade off reliability and scalability against network overhead. With
this context, the remainder of this chapter describes related work.
The base infrastructure used as the starting point for this work is presented in [4]; it orchestrates an FPGA cluster in the cloud by taking cluster description files as input (more details will be presented in Section 6.1) and returning to the user a handle to a network-connected multi-FPGA cluster. The layered architecture of the baseline infrastructure makes it easy to improve, and to add a modular and flexible layer for functionally portable and scalable communication between hardware and software.
3.1 Current Data Centers/Clouds with FPGAs
In this section, we explore some works that have deployed FPGAs in cloud systems. These works fall into two major categories:
1. Host/Slave: In this model, each FPGA is tightly coupled with a CPU, so control of and data transmission to the FPGA are done by the CPU, and there is no direct network
communication to another FPGA or CPU in the cluster.
2. Standalone: In this model, FPGAs have direct connections to the network, so there is a
potential for them to communicate to each other or CPUs. This approach is shown in
Figure 2.1.
The remainder of this section introduces some notable and representative works in this area and
describes their characteristics. There are many other works, but they do not introduce any other
features beyond the systems presented here.
3.1.1 FPGA in the cloud: Booting Virtualized Hardware Accelerators
with OpenStack
Byma et al. [17] introduced an approach for FPGAs to be used as accessible resources in
the SAVI testbed [18], which is an OpenStack-based research cloud system [19]. In this
OpenStack-based cloud, when a resource request arrives, OpenStack finds available resources
in the database, and finds which physical machine they are located on. The OpenStack con-
troller has been modified to communicate with the resource manager via a separate process
running beside the software hypervisor, called an Agent [17]. The Agent is a simple program
that can communicate with Nova, which is a component of OpenStack. In this system, a single
FPGA can be used as a resource for more than one cloud application. Each FPGA has four par-
tially reconfigurable regions called VFRs (Virtual FPGA Resources) that are directly connected
to the network over a Layer Two network (Ethernet). Since the FPGAs in this infrastructure
are directly connected to the network they could be considered as a standalone node. This in-
frastructure fits into Category 2. A load-balancing application was used to observe the impact
of using an FPGA in the cloud in terms of latency and throughput. The latency of the software
Virtual Machine (VM) load-balancer varies and increases when the injection rate increases. In
contrast, the VFR load balancer latency is constant across different injection rates. The VM load balancer starts to drop packets even at a 25 MB/s injection rate, whereas the VFRs never drop a single packet.
3.1.2 Catapult v1
Microsoft's Catapult v1 [1] was the most important pioneer in deploying FPGAs into large-scale data centers, demonstrating the advantage of using FPGAs in data centers. They used
Microsoft Bing’s page ranking as a case study. A PCIe card with a Stratix V FPGA was added
to every CPU. The FPGAs in one rack are connected by a secondary network, which has a
two-dimensional, 6x8 torus topology [1].
In Catapult v1, the inter-FPGA and host-to-FPGA interconnections are abstracted from the user's point of view by implementing a Shell in the FPGAs, which handles communication with the CPU via the PCIe link, the network, and DDR memory. The block diagram of this shell is shown in Figure 3.1. In our work, a new open-source and general shell is implemented, which is described in Section 4.1. Instead of Shell, we use the term Hypervisor as we intend to provide more functionality and services than just a hardware abstraction.
Figure 3.1: Microsoft Catapult Shell.
Using Catapult v1, Microsoft implemented a significant fraction of Bing’s page ranking en-
gine on FPGAs. In comparison with the pure software implementation, at the expense of 10%
more power consumption, this system achieves a 95% improvement in throughput at equivalent latency, and reduces tail latency by 29% [1].
Catapult v1 does not exactly fit either of the two categories described above. It is a hy-
brid in that the PCIe link is used to access and control the FPGA, while the torus network provides FPGA-to-FPGA communication within the same rack, but not beyond it. Therefore, this is not generally scalable. Furthermore, this infrastructure is not elastic
and flexible, which is desired for a cloud platform.
3.1.3 Catapult v2
In Catapult v1 [1], the FPGAs have network access only to the other FPGAs in the same rack (48 FPGAs). This limits elasticity and scalability. The architecture of Catapult v2 [2] is shown in Figure 3.2. In the new cloud architecture, the FPGA hardware plane is decoupled from the traditional software plane. Furthermore, instead of directly connecting the server NIC to the
top-of-rack (ToR) switch, the NIC is connected to the FPGA, and the FPGA is connected to
the ToR switch. The PCIe connection still exists. With this topology, it is possible to make a large pool of FPGA resources. Two key components are added to the shell: an Elastic Router
for intra-FPGA connections and a Light-weight Transport Layer Protocol Engine that enables
inter-FPGA connections. With the direct network connection for the FPGA, it is possible to do
what Microsoft calls Remote Acceleration where tasks can be accelerated using an FPGA that is
not attached to the CPU requiring the acceleration. Using this approach, the Bing ranking was
implemented and it was shown that the local acceleration and remote acceleration throughputs
are the same. Catapult v2 fits into the second category because the FPGAs can be orchestrated
as standalone nodes in the cluster.
3.1.4 Enabling FPGAs in the Hyperscale Data Center
In this work, the authors propose a method to decouple FPGAs from the CPUs in a cloud
environment [20]. In this cloud, the FPGA can be used as a standalone resource (Category 2).
In their system architecture, they introduce a high-level “shell” in FPGAs that can manage the
interconnections in all network layers. For example, with this shell, they support FPGAs that
are securely isolated in a multi-tenant environment by offering VLANs or OVNs in their cloud
structure. VLAN (Virtual LAN) and OVN (Open Virtual Network) provide a partitioned and
isolated network at the Ethernet layer (layer 2). In their cloud system, instead of using Nova
in OpenStack, which is responsible for provisioning resources in the cloud, they add their own service called the “Accelerator Service”. This service is responsible for employing FPGAs as a
standalone resource for a cloud application. Their cloud system network is a Software Defined Network (SDN).

Figure 3.2: (a) Microsoft Catapult v2 decoupled the hardware plane from the traditional software plane. (b) Server plus FPGA schematic [2].
3.2 Base Infrastructure
In this section, we describe the open source framework as it was before we upgraded it to
support our communication layer. This framework is described in [4] and further extended
in [21]. In these works, Tarafdar et al. introduce a multi-FPGA abstraction layer that maps a
graph of streaming IP blocks connected by a large logical switch onto a multi-FPGA network-
connected cluster that is provisioned from an elastic pool of cloud resources. At a high level, the user provides a collection of kernels, a logical file describing the entire cluster and
a mapping file of kernels to physical FPGAs. The user is returned a network handle for their
FPGA cluster. The logical view of the kernels and how they are mapped into multiple FPGAs
is shown in Figure 3.3. The details of these files, which are the user interface to the base
infrastructure, are described in Section 6.1.
Figure 3.3: Logical view of the streaming kernels and how they are mapped into physical FPGAs.
Each FPGA in the cluster has its physical resources abstracted with a Hypervisor. The Hy-
pervisor exposes a control interface through PCIe and a data interface through the 10G Ethernet
port, connecting to an application region. Within the application region, the framework places
an interconnect on each FPGA to interface with all kernels within the cluster, either connecting directly to kernels on the same FPGA or encapsulating the packet with network information to make a network hop. This direct connection to all kernels has an inherent limitation as the interconnect has at most 16 ports, thus limiting the total number of kernels within the entire cluster to 16. Furthermore, the modularity of this implementation is limited as the user is forced to use Ethernet packets between FPGAs, making the communication unreliable. More details about the data center infrastructure and network architecture can be found in [4], and more details on the FPGA Hypervisor can be found in [21].
3.3 Communication Models
The communication model is very application-specific as different applications exhibit differ-
ent communication traffic patterns. Two popular communication models are streaming and
message passing.
3.3.1 Streaming Model
In a streaming model, data is being sent continuously through a point-to-point channel. Some
implementations of streaming communication models include the Real-Time Streaming Protocol (RTSP) [22] and MPEG-DASH [23]. These protocols are typically built on top of network pro-
tocols like UDP, or even Ethernet as these protocols have better latency but lack reliability by
default.
3.3.2 Message Passing Model
In a message passing model, data can be transferred between arbitrary nodes. Generally, shorter
packets provide less latency while longer packets provide better throughput because message
overheads are minimized. Users must partition data into messages and include a destination
address when sending, unlike a stream where it is a continuous flow of data to one or more
preconfigured receivers. The most common implementation of a message passing model is
the Message Passing Interface (MPI) [24]. We have decided to implement our communication
layer as a subset of MPI because there has been prior work [25] in implementing MPI on an
FPGA cluster that was very successful. MPI is also a well-known standard API that is widely
used in various types of HPC applications. This helps us with our goals of heterogeneity,
functional portability and scalability.
Another messaging protocol that has a significant user community is ZeroMQ (0MQ) [26].
ZeroMQ uses a socket-like interface that supports multiple message patterns such as request-
reply and publish-subscribe. Its socket-like interface also makes it easy to use like a FIFO or streaming interface.
3.4 Other Heterogeneous Communication Layers
In this section we explore other implementations of heterogeneous communication layers. We
look at two in particular: the first being TMD-MPI [27] and the second being Novo-G# [28].
3.4.1 TMD-MPI
The work presented in [25] explores an implementation of MPI within a multi-FPGA environ-
ment called TMD-MPI. TMD-MPI implements a subset of the MPI protocol to allow hardware
or software processing engines on the FPGA to communicate amongst each other on the same
FPGA and across multiple FPGAs. The hardware versions of the MPI functions are imple-
mented in VHDL. In this thesis, these functions are described in HLS-synthesizable C code.
TMD-MPI was created to be portable across multiple physical platforms so it was imple-
mented in several layers, including layers that correspond to the physical setup of the network
connected FPGAs. In this thesis, we avoid the need to create an equivalent of these layers as
our communication layer modularly builds on an improved cluster generator tool that handles
the communication of blocks within an FPGA and across multiple FPGAs in a data center.
3.4.2 Novo-G#
The work presented in [28] is a heterogeneous environment with 24 CPU servers that are con-
nected via PCIe to FPGA boards that contain four Stratix V FPGAs. There are direct connections between the FPGAs on an individual board, forming a 3D torus, with a custom hardware net-
work stack to support these direct connections. If needed, communication between host nodes
can use MPI. The Novo-G# is a system that shows both the use of a custom hardware network
stack for FPGA-to-FPGA communication as well as a model where accelerators are connected
to host nodes and the host nodes can communicate using a standard software MPI library.
In this thesis, hardware and software components of the same application can communicate
with the same communication layer as peers, which makes it much easier to use hardware or
software for any computing kernel because the programming model is the same for either case.
Chapter 4
Galapagos
We rebuilt the open source framework described in [4] to improve scalability, modularity, and
reliability. These improvements allow the user to implement designs with different network
protocols (e.g., Ethernet or TCP) and communication layers by changing a configuration file
describing the heterogeneous cluster. The user can target any number of available devices
(FPGA and/or CPU) with a limit of 16 kernels per FPGA due to the number of ports on the
Xilinx switch IP core being used. We call this rebuilt framework Galapagos.
In this chapter we will describe how we improved the framework from [4] by introducing a new hypervisor that is openly available, portable to other boards, and more general. Furthermore, we
will describe how we addressed the limitations of the base infrastructure, and how we made
this platform reliable and more modular. These changes led to building a communication layer
on top of Galapagos that is described in Chapter 5.
4.1 Hypervisor
All the I/O interfaces on the FPGA (such as network, PCIe, and DDR memory) are abstracted away from the user by introducing the notion of the Hypervisor. Note that this could also be
called a Shell in Microsoft terminology, but our goal is to add much more than an abstraction
layer to the hardware and provide other services in the same manner as a software hypervisor
is used to support virtual machines (VMs). The FPGA Hypervisor we implemented has two
regions: a static region that provides standard interfaces for the user, such as network and memory, and an application region where user kernels are placed. The static region
needs to be programmed only once, and it will remain intact when the user programs the
application region.
The Hypervisor achieves one of the main goals of this work, which is ease of use. The user who develops the application only needs to deal with some AXI interfaces; all the infrastructure below is already implemented and ready to use, even for different types of FPGAs. An application developer can enjoy all the services provided by the hypervisor with little effort, as much of the difficulty is removed through abstraction.
4.1.1 Base Infrastructure Hypervisor
The first iteration of the Hypervisor that is used in the base infrastructure was introduced in [4].
That Hypervisor is a modified version of the SDAccel Hypervisor described in Section 2.7. A
part of the SDAccel platform provides some high-level functions to send and receive data
through the PCIe link connecting to the CPU. Tarafdar et al. [4] added a 1G Ethernet capability
to this Hypervisor. Furthermore, in [21] they improved the Hypervisor by upgrading the Ethernet speed to 10G. They also defined the application region as a partial reconfiguration
region. The block diagram of the base infrastructure hypervisor is shown in Figure 4.1, and the
notable features that the base infrastructure hypervisor provides are as follows:
• Network: This Hypervisor provides access to the network through a 10G Ethernet link. It interfaces with the Application Region through input and output AXI-Stream ports.
• PCIe: The SDAccel platform provides an easy way to use the PCIe link. Using the
SDAccel as a Hypervisor on the hardware side, and the HAL as a driver for PCIe (de-
scribed in Section 2.7), some high-level functions can be used to communicate with the
FPGA through the PCIe link. The interface of this block to the Application Region is
through an AXI-full slave.

Figure 4.1: Base infrastructure Hypervisor block diagram
• DDR Memory: The SDAccel platform has the needed infrastructure to give the user the
capability to use DDR memory. Using some high-level functions a CPU can send and
receive data to and from DDR memory through PCIe. The FPGA fabric can send and
receive data to and from DDR memory through an AXI-full master.
• Application Region Interface: The Application Region in this hypervisor has several
interfaces:
– AXI-Stream inputs and outputs connected to the network port of the FPGA.
– An AXI-full slave connection to the PCIe link.
– An AXI-full master connection to DDR memory.
In addition, the application region is a partial reconfiguration region, which is done for
several reasons. There is an advantage of shorter reconfiguration time because only part
of the FPGA is configured. Partial reconfiguration also solves another problem with the
PCIe interface.
After full reconfiguration of the FPGA, the PCIe device is no longer discoverable by the host. This is because some of the PCIe configuration information disappears during FPGA programming and can only be rescanned by the root complex during booting. This requires rebooting the system, which is not always an option. The application region of this hypervisor is a partial reconfiguration region, as described in Section 2.6. This means that the PCIe hardware is placed in the static region of the hypervisor, which remains intact during partial reconfiguration of the application region. This solves the PCIe problem, and the PCIe interface of the FPGA remains discoverable by the host.
Reconfiguration of a partial region should happen only when the network is not in the middle of sending a packet to the partial reconfiguration region. The gates in this hypervisor are responsible for ensuring that the current packet transmission has completed before programming of the partial bitstream starts.
4.1.2 Galapagos Hypervisor
The Hypervisor in [4] was built specifically for the Alpha Data 7v3 boards [29], and it had
proprietary IPs that were not publicly available. This thesis is intended to be an open-source
project. Therefore, as the first step, a general Hypervisor is built for the Alpha Data 8k5
boards [30] using cores that are available in the Xilinx IP catalog rather than proprietary cores.
Figure 4.2 shows the block diagram of the Hypervisor. The characteristics of this Hypervisor
are as follows:
• Network: Same as the base infrastructure, this Hypervisor provides access to the network
through a 10G Ethernet link. Furthermore, by adding the Xilinx Network Stack [31], it
supports higher levels of network communication such as an IP layer and a Transport layer (e.g., UDP and TCP). However, in this thesis Ethernet is chosen as the main interface to the
application region via AXI streams due to its compatibility with the base infrastructure.
The modified version of the base infrastructure application region is still able to give the
user the option of using higher layers of the network in an easy manner. This will be
described in Section 4.2.
In this Hypervisor, there is a monitoring module called Minitor (mini-monitor) that measures network load on the FPGA. This simple module counts the number of incoming and outgoing packets over specific time intervals, which can help the cloud orchestration tool perform load balancing; a counting sketch under stated assumptions appears at the end of this section. A MicroBlaze in the hypervisor has access to this information, and users may define their own use cases for it.
• PCIe: For the PCIe interface of this Hypervisor, the Xilinx DMA/Bridge Subsystem is
used. This module is available in the Xilinx IP catalog, and its driver and some sample
codes are provided as well. In the integration of this Hypervisor with the base infrastruc-
ture, the PCIe link is used as a control path for the system.
• DDR Memory: For the DDR memory I/O interface in this thesis, the Xilinx Memory
Interface Generator (MIG) is used. This module is available in the Xilinx IP catalog as
well. The memory management unit in this Hypervisor is a simple AXI interconnect
because this hypervisor does not support multiple concurrent users.
• Application Region Interface: The application region interface is exactly the same as the
base infrastructure Hypervisor.
This thesis is focused on heterogeneous platforms, in which different boards and hosts are able to communicate with each other. All the modules in this hypervisor are from the Xilinx IP catalog, so with some simple changes to configurations and constraint files, the Hypervisor
can be easily ported. The link to this work can be found at
https://github.com/eskandarinariman/8k5-shell.
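As referenced above, here is a minimal sketch of Minitor-style counting logic, assuming an AXI4-Stream pass-through with tlast-delimited packets; the beat format, window length, and register interface are all assumptions, not the actual Minitor implementation.

    #include <hls_stream.h>
    #include <ap_axi_sdata.h>
    #include <ap_int.h>

    typedef ap_axiu<64, 1, 1, 8> axis_word; // assumed beat format

    // Hypothetical Minitor-style counter: passes traffic through untouched and
    // counts completed packets (delimited by tlast) over a fixed window. The
    // count would be exposed to the MicroBlaze, e.g., through an AXI4-Lite
    // register; the window length is an arbitrary assumption.
    void minitor(hls::stream<axis_word> &in, hls::stream<axis_word> &out,
                 ap_uint<32> *pkts_in_window) {
    #pragma HLS INTERFACE axis port=in
    #pragma HLS INTERFACE axis port=out
    #pragma HLS INTERFACE s_axilite port=pkts_in_window
        ap_uint<32> count = 0;
        const ap_uint<32> WINDOW = 1000000; // beats per measurement window
        for (ap_uint<32> cycle = 0; cycle < WINDOW; cycle++) {
    #pragma HLS PIPELINE II=1
            if (!in.empty()) {
                axis_word w = in.read();
                out.write(w);
                if (w.last) count++; // one packet completed
            }
        }
        *pkts_in_window = count;     // publish this window's packet count
    }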
Figure 4.2: Hypervisor/Shell block diagram
4.2 Application Region
After building a new Hypervisor for the base infrastructure, some modifications are required for
making the application region of the base infrastructure more scalable, modular, and reliable.
These changes are applied to the application region and the automated flow that creates it. In
this chapter, we focus on the hardware design of the application region; the tool flow will be described in Chapter 6.
4.2.1 Base Infrastructure Application Region
This framework takes a description of a cluster composed of streaming kernels with a unique
ID and maps it to multiple FPGAs. Each streaming kernel uses the AXI stream protocol with a
dest field to specify which kernel the packet is destined for. As shown in Figure 3.3, a logical
view of this infrastructure is a large switch connecting all the kernels within the cluster. The
framework transforms this logical switch into two physical switches, with the first being an
AXI Stream switch on the FPGA and the second being a top-of-rack network switch. The
application region of the original base infrastructure is shown in Figure 4.3.
Figure 4.3: The original open-source framework from [4].
In the base infrastructure, Ethernet packets are transformed into AXI stream packets through
the use of an Input Bridge that matches the MAC address of the arriving packet with the FPGA
MAC address. This then connects to an Input Switch, which is a Xilinx AXI-stream switch.
This switch is connected to all kernels within the cluster, either directly if on the same FPGA,
or through a Packet-Formatter module that encapsulates the AXI stream packet with the ap-
propriate Ethernet headers, and places the AXI stream dest field in the Ethernet payload (one
packet formatter for each kernel outside the FPGA). These direct connections limit scalability
as the number of kernels in the cluster is limited by the number of ports in the switch (16).
4.2.2 Galapagos Application Region
The modified version of the base infrastructure application region is illustrated in Figure 4.4.
In Galapagos, we first address scalability by creating a general router. The block diagram of
the router is shown in Figure 4.5. The router on each FPGA includes a routing table indexed
by the unique ID of each kernel in the entire cluster (including kernels not on this FPGA) and
the network address (MAC or IP) of the FPGA that contains each kernel. All kernels output
their packets to the output switch of the router. Then a module called filter reads the dest field
of the packet, and looks up the network destination in the routing table. Afterwards, the packet
is either routed back to the input switch of the router or routed out to the network. The tool
flow of Galapagos specifies the number of ports of the router by reading the cluster description
files (described in Chapter 6) and it can be any number between 1 and 16. The number of ports
on this router is equal to the number of kernels on this FPGA only, since all packets leaving the FPGA share one channel. This limits us to at most 16 kernels on a particular FPGA, as opposed to 16 across the entire cluster as in the original design. The routing table is automatically
generated by our modifications to the cluster generator.
Figure 4.4: A high-level overview of Galapagos application region.
Figure 4.5: Router block diagram
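A minimal sketch of the filter's lookup step described above, under assumed types (an 8-bit kernel ID indexing a 256-entry table of network addresses plus a local flag); the real router is generated by the tool flow, and its table format is not reproduced here.

    #include <hls_stream.h>
    #include <ap_axi_sdata.h>
    #include <ap_int.h>

    typedef ap_axiu<64, 1, 1, 8> axis_word;

    struct route_entry {
        bool        local;  // true if the kernel is on this FPGA
        ap_uint<48> addr;   // network address of the hosting node (MAC here)
    };

    // Hypothetical filter logic: look up the dest side-channel of each beat in
    // the routing table and steer it back on-chip or out to the network bridge.
    // A full implementation would latch the decision until tlast; this sketch
    // routes beat by beat (all beats of a packet share one dest).
    void filter(hls::stream<axis_word> &from_output_switch,
                hls::stream<axis_word> &to_input_switch,
                hls::stream<axis_word> &to_network_bridge,
                const route_entry table[256]) {
    #pragma HLS INTERFACE axis port=from_output_switch
    #pragma HLS INTERFACE axis port=to_input_switch
    #pragma HLS INTERFACE axis port=to_network_bridge
        axis_word w = from_output_switch.read();
        if (table[w.dest].local)
            to_input_switch.write(w);   // kernel is on this FPGA: loop back
        else
            to_network_bridge.write(w); // bridge adds headers from table[w.dest].addr
    }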
The Network Bridge, which is shown in Figure 4.4, is responsible for converting network
packets into AXI-stream format and vice versa. For the Ethernet Network Bridge of Fig-
ure 4.4, this thesis combines the Input Bridge and a modified version of the Packet Formatter
of Figure 4.3 into a single module. The block diagram of this module is shown in Figure 4.6.
The modified packet formatter is equipped with a lookup table that has the MAC address for each destination kernel. Using the MAC address lookup table in the packet formatter decreases resource utilization, as it uses only one packet formatter with a small memory instead of one packet formatter per kernel outside the FPGA, and it helps Galapagos be more scalable.
In [4] the multi-FPGA communication uses the Ethernet protocol, which is not reliable. In
Galapagos, to address reliability, an optional TCP core [32] is integrated into the framework.
In this thesis an additional Network Bridge is created for TCP, allowing us to standardize the
interface between the Hypervisor and Application Region. The standardization of the interface
allows us to use both TCP and Ethernet interchangeably, thus addressing modularity. The block
diagram of the TCP Network Bridge is illustrated in Figure 4.7. Observe that the interfaces are
the same as for the Ethernet Network Bridge in Figure 4.6.
Figure 4.6: Ethernet Network Bridge.
Figure 4.7: TCP Network Bridge.
The Ethernet Network Bridge receives all the packets that are sent to the same VLAN. It
matches the FPGA MAC address with the destination MAC address of the incoming packet.
Afterwards, it either drops the packet or forwards it to the intra-FPGA network. The TCP
Network Bridge is more complex since it operates at a higher level of the network stack. In
this bridge there are notions of server and client. The server listens to a specific port, waiting
for a connection. A client can make a connection to the server using the specific port and IP
address of the server. When a connection is accepted by the server, a session will be created on
both sides of this communication. Afterwards, the client can send a request to the server, and
the server will respond to the request using the session.
The TCP Network Bridge developed in this thesis operates in the following manner. At
reset time, all the servers (FPGA and CPU nodes) of the entire cluster start listening to a specific
port. When a kernel sends data to another kernel that is located on another node, the client side
of the sender creates a connection, using the port and IP address of the destination kernel that is
available in a look up table called the IP table (the automated tool flow of Galapagos generates
this). When the destination server accepts the connection, a new session will be created for
both client and server. Each side then puts the session ID into another lookup table, indexed by destination.
This session will be used for all future communication between these two kernels in the cluster.
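As an illustration, here is a sketch of the session bookkeeping described above, with assumed types and table sizes; the connect call stands in for the TCP core's interface [32], which is not reproduced here.

    #include <ap_int.h>

    // Hypothetical connect call standing in for the TCP core's open interface.
    ap_uint<16> open_connection(ap_uint<32> ip, ap_uint<16> port);

    // One session entry per remote kernel ID (256 assumed as the table size).
    struct session_entry {
        bool        open;        // has a TCP session been established yet?
        ap_uint<16> session_id;  // session handle returned by the TCP core
    };
    static session_entry sessions[256];
    static ap_uint<32>   ip_table[256]; // kernel ID -> node IP, tool-generated

    // Return the session for a destination kernel, opening one on first use.
    // The session is then reused for all later messages to that kernel.
    ap_uint<16> get_session(ap_uint<8> dest_kernel, ap_uint<16> port) {
        if (!sessions[dest_kernel].open) {
            sessions[dest_kernel].session_id =
                open_connection(ip_table[dest_kernel], port);
            sessions[dest_kernel].open = true;
        }
        return sessions[dest_kernel].session_id;
    }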
A user may wish to create a communication layer on top of standard network layers (e.g.,
MPI). A communication bridge (Comm Bridge in Figure 4.4) is used to transform network
packets to communication layer compliant packets. In Chapter 5, the HUMboldt communi-
cation layer will be described as an example of a communication bridge. In Galapagos, we
further address modularity by allowing the user to configure the network protocol (Ethernet or
TCP) and communication bridge via the mapping configuration file. Details of the mapping
file can be found in Chapter 6.
Chapter 5
HUMboldt Communication Layer
In this chapter a heterogeneous communication layer called HUMboldt is described. The rea-
sons that this communication layer is called HUMboldt are as follows:
• In the Galapagos Islands, Humboldt is the name of an ocean current that flows between the islands.
• Humboldt is also the name of an explorer ship that carries passengers between the islands.
• The communication layer in this work is heterogeneous and based on a message-passing model, so “HUM” could stand for Heterogeneous Uniform Messaging.
Due to the modular design of Galapagos, the implementation of the communication layer can
be any communication model as long as it adheres to the AXI stream interface, and the appro-
priate communication bridge to convert AXI stream packets into communication layer specific
packets is provided. The communication layer we present here, HUMboldt, implements a minimal subset of MPI that is sufficient to enable basic message passing between kernels, as explained in Section 5.2. MPI is a standard API that defines signatures for functions such
as sending and receiving messages. These signatures must remain the same for all implemen-
tations of MPI. Even though the implementation for these functions varies according to the
platform (e.g., FPGA or CPU), the standardization of the HUMboldt protocol allows for
communication between heterogeneous platforms. In this chapter, we first provide more de-
tails about TMD-MPI, the predecessor to HUMboldt, followed by some background on MPI.
We then present the details of HUMboldt.
5.1 TMD-MPI
TMD-MPI [25] implements a heterogeneous MPI infrastructure that can be considered a baseline heterogeneous communication layer supporting CPUs and FPGAs. The high-level architecture of a TMD-MPI FPGA node is shown in Figure 5.1. The Off-chip Communication
Controller (OCCC) module handles the communication with off-chip nodes. The bridge trans-
lates the packets to a format that the OCCC can send outside the FPGA. The Network Interface
(NetIf) modules route packets to their destination internal to an FPGA.
In TMD-MPI the user has to deal with all the low-level infrastructure. However, in Gala-
pagos all the infrastructure is already implemented, and the user can choose between different
options for communication by creating the cluster description files, and the cluster will be gen-
erated automatically. Furthermore, in Galapagos, by introducing the communication bridge,
a user can define any protocol for communication. However, the communication protocol of
TMD-MPI is fixed and integrated within the whole infrastructure.
Figure 5.1: TMD-MPI high-level view.
The kernels that use TMD-MPI for communication should be connected to a module called
TMD-MPE (TMD-Message Passing Engine). This module implements functionality equivalent to the TMD-MPI software library. The user has to manage the scheduling of the TMD-MPE themselves, which is done through an MPE wrapper and an FSM, as shown in Figure 5.2. TMD-
MPI was developed before high-level synthesis tools existed so the kernels and the TMD-MPE
were all written in HDL. The user had to manually connect them.
The HUMboldt communication layer provides an HLS library that has equivalents of all the HUMboldt software functions. Therefore, a user can use the same code for both software and hardware, and just call HUMboldt library functions. In TMD-MPI, creating a hardware core is difficult because the user has to manually schedule all the MPI function calls and add the handler logic that works with the TMD-MPE, whereas in HUMboldt, HLS handles the scheduling of MPI calls.
Figure 5.2: TMD-MPI kernel.
5.2 MPI Communication Layer
This section explains MPI and the subset of the MPI library that we support, which we call
Heterogeneous Uniform Messaging (HUMboldt). MPI is a common communication and pro-
gramming layer in the High Performance Computing field. Using MPI, many parallel processes
called “ranks”, communicating via messages, can be run on multi-node platforms. In this work,
instead of “rank” we use “kernel” to be consistent with the base infrastructure terminology.
MPI has been mainly developed for multi-CPU network-connected platforms, which is the
standard high-performance computing architecture. However, it can also be used as a message
passing model and a programming layer for heterogeneous platforms.
In MPI software implementations such as MPICH [33] and OpenMPI [34], functions are
provided to transmit data among different kernels in various ways. The two main categories of
send and receive are blocking and non-blocking. In blocking send and receive, the functions
do not return until the data transmission is successfully done. In contrast, in non-blocking send
and receive, the functions immediately return, and continue the normal flow of the program. In
this thesis, only blocking send and receive are implemented. Other types of functions provided in
MPI include collective functions, e.g., MPI_Gather, MPI_Reduce, and MPI_Allreduce. These functions collect the data from different nodes at an optional root kernel, combining it with a user-specified operation, e.g., add or subtract. This thesis has not implemented any of these functions; however, implementing them using the send and receive functions as building blocks is straightforward.
The minimum subset of MPI functions needed for communication are:
1. MPI_Init: This function initializes the MPI environment and does the basic setup, such as network interface initialization.
2. MPI_Send and MPI_Recv: These two functions are the building blocks of the MPI programming model that enable data transmission among kernels. For every Send to a kernel, there must be a matching Receive on that kernel to get the data from the sender.
3. MPI_Finalize: Makes sure that all kernel processes are completed.
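To illustrate, here is a minimal sketch of a kernel written against this subset; the argument forms below are modeled on standard MPI and are assumptions, since the exact HUMboldt signatures are described in Chapter 6.

    #include "HUMboldt.h" // hypothetical header name

    // A kernel that receives 64 integers from kernel 0, doubles them, and
    // sends them back. The same source can be compiled for a CPU with the
    // software library or synthesized for an FPGA with the HLS library.
    void kernel_main() {
        int data[64];
        MPI_Init();
        MPI_Recv(data, 64, MPI_INT, 0 /* source kernel */);
        for (int i = 0; i < 64; i++)
            data[i] *= 2;
        MPI_Send(data, 64, MPI_INT, 0 /* destination kernel */);
        MPI_Finalize();
    }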
5.3 Protocol
In this section we describe the protocol by which all hardware and software kernels communi-
cate with each other in HUMboldt. The first step in data transmission is to send an envelope
as a send request. This envelope has information about routing, message size, and data type.
Once the receiver kernel is ready to receive the data, it sends a clear to send envelope to the
sender. After the sender receives the clear to send packet, it starts to send the data packets to
the receiver, and at the end, when the receiver has received all the data successfully, it sends a
packet to the sender indicating that the transmission is done. These steps are shown in Figure
5.3.
Figure 5.3: HUMboldt protocol
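A sketch of the sender side of this handshake, with packet-type constants taken from the description above (send request, clear to send, data, done); the encodings and transport hooks are assumptions.

    #include <cstdint>
    #include <cstddef>

    // Packet types from the protocol description (encodings are assumptions).
    enum PacketType : uint8_t { SEND_REQUEST = 0, CLEAR_TO_SEND = 1, DATA = 2, DONE = 3 };

    // Hypothetical transport hooks; in hardware these would be AXI-Stream
    // reads and writes through the communication bridge.
    void send_envelope(uint8_t dest, uint8_t src, PacketType type, uint32_t size);
    PacketType wait_for_envelope(uint8_t from);
    void send_data(uint8_t dest, const void *buf, size_t bytes);

    // Blocking send following the HUMboldt handshake.
    void hum_send(uint8_t my_id, uint8_t dest, const void *buf, uint32_t bytes) {
        send_envelope(dest, my_id, SEND_REQUEST, bytes); // 1. request to send
        while (wait_for_envelope(dest) != CLEAR_TO_SEND) // 2. receiver signals ready
            ;                                            //    (other envelopes buffered)
        send_data(dest, buf, bytes);                     // 3. stream the data packets
        while (wait_for_envelope(dest) != DONE)          // 4. receiver confirms
            ;
    }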
Buffering is necessary because there are scenarios when send or receive envelopes might
be sent out of order. For example, consider kernel A receiving data from kernel B when kernel
C sends an envelope to kernel A. To avoid losing the envelope, the envelope from kernel C
must be buffered by kernel A until the previous transaction with kernel B is completed.
In parallel processing platforms, synchronization is an important aspect since different processes run at different speeds. At some point, one node will need to wait for other nodes to
reach a specific point in the program. In HUMboldt, we currently only implement blocking
sends and receives, meaning that the kernels wait for these functions to finish before resuming
processing. This is sufficient for the synchronization we currently need, but when we add the
ability to do non-blocking sends and receives we will also need to add barriers into the next
version of the HUMboldt library.
There are two types of networks in our HUMboldt communication layer: the intra-FPGA AXI-Stream network is used for kernels that are located in the same physical FPGA, and inter-node communication between FPGAs and CPUs uses the data center network. The network com-
munication currently supports TCP or Ethernet but any network protocol with an AXI stream
interface can be used. We use the same underlying HUMboldt protocol for kernels communicating within the same node (FPGA or CPU) and between nodes, as our Communication Bridge described in Section 4.2.2 encapsulates a HUMboldt-compliant packet with the appropriate network header (e.g., TCP or Ethernet) to handle the inter-node communication.
Figure 5.4 shows the envelope that carries the HUMboldt messages between kernels. The first two bytes correspond to the destination and source kernels. The Packet Type is one of send request, clear to send, data, or done; these packet types implement the message passing protocol underlying HUMboldt. The next three bytes specify the size of the message being sent. In the standard implementation of MPI, the Tag lets the user attach optional metadata to transactions, and for compatibility purposes we keep this as a field in the HUMboldt envelope. Because HLS lacks a void pointer, the HMPI library provides separate Send and Receive functions for the different data types, and the Data Type field allows the different data types to be processed properly.
Figure 5.4: HUMboldt packet format
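In C terms, the envelope of Figure 5.4 corresponds roughly to the struct below. Only the one-byte destination and source fields and the three-byte size field are fixed by the description above; the widths assumed for the Packet Type, Tag, and Data Type fields are illustrative.

#include <stdint.h>

/* Approximate layout of the HUMboldt envelope in Figure 5.4.
 * The widths of pkt_type, tag, and data_type are assumptions. */
typedef struct {
    uint8_t dest;      /* destination kernel ID (byte 0)               */
    uint8_t src;       /* source kernel ID (byte 1)                    */
    uint8_t pkt_type;  /* send request, clear to send, data, or done   */
    uint8_t size[3];   /* message size, three bytes                    */
    uint8_t tag;       /* optional user metadata, kept for MPI compat. */
    uint8_t data_type; /* selects the typed Send/Recv variant          */
} hum_envelope;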
5.4 Packetization
For the sake of modularity, the HUMboldt library should be able to send messages of any size to other kernels. This is not a problem in the intra-FPGA network because there is no packet size limitation. However, when a packet goes through the inter-node network, it faces the 1518-byte packet size limitation, the maximum number of bytes allowed for an Ethernet frame. Therefore, packetization logic is implemented in the Communication Bridge, because every outgoing packet goes through this module first. The Ethernet Communication Bridge divides the message into Ethernet-frame-sized packets, adds a data envelope to each, and sends them to the Network Bridge. The TCP Network Bridge can handle larger packets. Nonetheless, in our first experiments with the TCP core, we saw that the TCP/IP core buffers all the data before starting to send it, and it is still limited to a 65 KB buffer. Furthermore, the core eventually divides a large TCP packet into small Ethernet frames anyway. The same packetization logic is therefore added to the TCP Communication Bridge to lower the latency of TCP communication (it does not wait to buffer the data), to address the TCP buffer size limitation, and to keep the TCP and Ethernet implementations of the Communication Layer consistent.
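The chunking step itself reduces to a loop like the one sketched below. The MAX_PAYLOAD value and the send_frame helper are assumptions: the real per-frame payload is whatever remains of the 1518-byte frame after the Ethernet header and the HUMboldt data envelope are accounted for.

#include <stddef.h>
#include <stdint.h>

/* Assumed usable payload per Ethernet frame: 1518 bytes minus the
 * Ethernet header/CRC and the HUMboldt data envelope. */
#define MAX_PAYLOAD 1464

/* Hypothetical helper that wraps one chunk in a data envelope
 * (carrying the sequence number) and hands it to the network bridge. */
void send_frame(uint32_t seq, const uint8_t *chunk, size_t len);

/* Split one HUMboldt message into Ethernet-frame-sized packets,
 * numbering each one for the reliability layer of Section 5.5. */
void packetize(const uint8_t *msg, size_t total)
{
    uint32_t seq = 0;
    for (size_t off = 0; off < total; off += MAX_PAYLOAD) {
        size_t len = total - off;
        if (len > MAX_PAYLOAD)
            len = MAX_PAYLOAD;
        send_frame(seq++, msg + off, len);
    }
}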
5.5 Reliability Layer
Given that Galapagos uses Layer 2 communication (Ethernet), packets can be dropped in the network or received out of order. There are two options for addressing these issues. The first is adding a reliable transport layer like TCP, which is described in Section 4.2.2. The other alternative is to add a simple and lightweight reliability layer on top of Layer 2. Reliability on top of Layer 2 can offer lower latency in network environments that are not busy and where the packet drop rate is low. The other advantage of a lightweight reliability layer is lower resource utilization.
As mentioned in Section 5.4, all the packets that leave the FPGA go through the Communication Bridge, which is responsible for packetizing the data and adding a data envelope to each network packet. It adds a sequence number to each data packet. The sequence number starts at 0 and increments by one per outgoing network packet, and the message size field of the data packet envelope is used to carry this extra information. This number is the key element of our reliability layer. The receiver keeps track of the sequence numbers from each node of the cluster, for each transaction. When a packet gets dropped or packets are received out of order, the sequence number does not match the one that the receiver is expecting. The receiver then sends an error envelope with the last correct sequence number it received. Once the sender receives the error envelope, it starts sending data from that sequence number again. The receiver keeps discarding data packets until it gets the correct one, and data transmission resumes in the normal way. We expect that this simple protocol will be sufficient because packets are transmitted over relatively short distances within a data center, as compared to large geographical distances, so packet loss and ordering issues should be infrequent. This lightweight reliability layer is implemented, but it has many corner cases that still need to be addressed. Therefore, completing a bug-free version of this reliability layer is left to future work.
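The receiver-side check amounts to the logic sketched below; the helper functions, the MAX_NODES constant, and the exact recovery details are assumptions for illustration.

#include <stdint.h>

#define MAX_NODES 256   /* assumed upper bound on cluster nodes */

/* Hypothetical helpers standing in for the real implementation. */
void deliver(const uint8_t *payload, int len);
void send_error_envelope(int src, uint32_t last_good_seq);

/* Next expected sequence number, tracked per sending node. */
static uint32_t expected_seq[MAX_NODES];

/* Called for every incoming data packet: deliver in-order packets;
 * on a gap or reordering, request retransmission from the last
 * correctly received sequence number and discard until it arrives. */
void on_data_packet(int src, uint32_t seq, const uint8_t *payload, int len)
{
    if (seq == expected_seq[src]) {
        deliver(payload, len);
        expected_seq[src]++;
    } else {
        send_error_envelope(src, expected_seq[src] - 1);
    }
}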
5.6 HUMboldt Software
A heterogeneous communication layer needs an implementation for each platform that communicates using the same protocol. For the software side of HUMboldt, standard socket programming libraries are used to implement functions that communicate using the same protocol as the hardware kernels. These libraries are available for many types of CPU platforms, which helps improve functional portability and heterogeneity. The Galapagos infrastructure, how it handles the different layers of the network, and how it delivers packets to the HUMboldt layer are described in Chapter 4, and the underlying protocol of the HUMboldt communication layer is described earlier in this chapter. The base infrastructure in [4] has no notion of software kernels because it focuses on the FPGA cluster. However, with the modifications made in Galapagos and the addition of HUMboldt as a communication layer, software kernels are now part of the platform.
Because different layers of the network stack are supported, different software functions are implemented that act as the software equivalents of the hardware network bridges (described in Section 4.2.2). Given the heterogeneity of the HUMboldt communication layer, it provides a higher level of abstraction that covers the different combinations of communication between software and hardware kernels. In its initialization phase, the HUMboldt software opens an Ethernet socket as well as starting a TCP server; every HUMboldt software kernel always has both Ethernet and TCP, whereas in hardware, because of the limited available resources, the user is restricted to a single protocol. Furthermore, the software parses the cluster description files to determine the type of each kernel (software or hardware) and the communication protocol (Ethernet or TCP) chosen for each node. The following communication scenarios can occur.
• Software and Software: MPICH is a mature and well-optimized software implementation that supports all the MPI features, so it is included as the part of the HUMboldt layer responsible for communication among software kernels. For software-to-software communication, the MPICH functions are called instead of our custom protocol and functions. The command that runs an MPICH program, called mpirun, has multiple options such as the number of kernels and the distribution of kernels across hosts, which is specified in a host file. Using this information, it creates a process for each kernel and distributes the processes across the hosts according to the host file. When HUMboldt has multiple kernels in software, the mpirun command should be used. All the information it needs, such as the number of software kernels and their hosts (IP addresses), is available in the cluster description files. In the current state of this work, the user runs this command manually; as future work, a simple HUMboldtrun command could be created that first parses those files and then sets the mpirun options.
• Software and Ethernet Hardware: In this case a HUMboldt function calls the set of functions that communicate with the hardware kernel through the Ethernet protocol. A receiving thread always listens on an Ethernet socket with a specific HUMboldt EtherType and stores all incoming packets into different FIFOs that the HUMboldt functions use when executing the protocol.
• Software and TCP Hardware: In this case the TCP set of functions is called. The software equivalent of the TCP Network Bridge forks a server thread that listens on the specific HUMboldt port. When a connection is accepted, a new thread is created to receive data from that session, and the session is stored in an array indexed by the kernel number for future use. When sending data, the main thread of the program searches the sessions array. If the session already exists, the main thread uses it to send data to the target kernel. Otherwise, it creates a new connection for sending, creates a new receiving thread for this session, and stores the session in the sessions array (a sketch of this session management follows this list). Similar to the Ethernet network bridge, all incoming packets are stored in a buffer and queued to be processed at the right time.
• Hardware and Hardware: The hardware kernels must communicate through the same protocol (Ethernet or TCP) because there is only one supported protocol per hardware node. However, the HUMboldt software implementation makes it possible to bridge between the different protocols. As mentioned above, all software kernels have both the Ethernet and TCP protocols, so, for example, an Ethernet hardware node can send data to a software kernel, which forwards the data to a TCP hardware node.
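As referenced in the TCP scenario above, the send-path session management can be sketched as follows. The helper functions and the MAX_KERNELS bound are hypothetical, but the structure mirrors the description: reuse a stored session if one exists, otherwise connect, spawn a receiver thread, and remember the session.

#include <pthread.h>

#define MAX_KERNELS 256  /* assumed bound on kernels in the cluster */

/* Hypothetical helpers standing in for the HUMboldt software bridge. */
int   open_connection(int kernel_id);           /* connect() to the kernel's host   */
void *recv_session_thread(void *session_sock);  /* queues incoming packets in FIFOs */
void  send_all(int sock, const void *buf, int len);

/* One TCP session per remote kernel, indexed by kernel number;
 * zero means no session has been opened yet. */
static int sessions[MAX_KERNELS];

void tcp_send_to_kernel(int kernel_id, const void *buf, int len)
{
    if (sessions[kernel_id] <= 0) {
        /* no existing session: connect and start a receiver thread */
        pthread_t t;
        sessions[kernel_id] = open_connection(kernel_id);
        pthread_create(&t, NULL, recv_session_thread, &sessions[kernel_id]);
    }
    send_all(sessions[kernel_id], buf, len);
}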
We have described the Galapagos infrastructure and the HUMboldt communication layer built on top of Galapagos, and we have shown how this platform is scalable, modular, and heterogeneous. How the user can easily take advantage of these characteristics of the platform is described in Chapter 6, which introduces the user interface and tool flow of Galapagos and HUMboldt.
Chapter 6
System Interface and Tool Flow
This chapter specifies how a user builds an application using the HUMboldt communication layer as well as the Galapagos infrastructure underneath it. The ease of use, heterogeneity, and scalability of this infrastructure are shown more clearly in this chapter.
6.1 Base Infrastructure System Interface
The tool flow of the base infrastructure [4] [21] takes several streaming IP blocks and the description files, which are the logical file, mapping file, and MAC address file in XML format (described in Section 3.2). These files are the user interface to the base infrastructure, and they are parsed by the tool flow of the base infrastructure to automatically connect the kernels in the user-defined cluster. Afterwards, the tool flow reserves the resources from the cloud, puts them on the same network, and gives the user the handle to program and send data to the created cluster.
6.1.1 Logical File
A sample Logical file is shown in Listing 6.1. The “kernel” field specifies the user hardware core, which has two AXI-Stream interfaces whose names are required by the automated tool flow. In the “interface” field of the logical file, the user can specify the name and the direction of each AXI-Stream port. The “num” field of the logical file is used in the mapping file as a unique identifier when the user maps kernels onto the physical FPGAs. The “rep” field specifies the number of replications of a kernel within the entire cluster. For example, in Listing 6.1, the “rep” field of hardware core A is 4, meaning that kernels numbered 0 to 3 are instances of hardware core A.
<?xml version="1.0" encoding="UTF-8"?>
<cluster>
  <kernel> hardware_core_name_A
    <num> 0 </num>
    <rep> 4 </rep>
    <interface>
      <direction> in </direction>
      <name> stream_in_V </name>
    </interface>
    <interface>
      <direction> out </direction>
      <name> stream_out_V </name>
    </interface>
  </kernel>
  <kernel> hardware_core_name_B
    <num> 4 </num>
    <rep> 4 </rep>
    <interface>
      <direction> in </direction>
      <name> stream_in_V </name>
    </interface>
    <interface>
      <direction> out </direction>
      <name> stream_out_V </name>
    </interface>
  </kernel>
</cluster>
Listing 6.1: Sample of base infrastructure Logical file
6.1.2 Mapping File
Once the logical view of the kernels is defined, the kernels must be mapped onto actual physical FPGAs. A sample of the mapping file is illustrated in Listing 6.2. In this file, the kernel numbers (defined in the Logical file) are used to map the kernels onto the FPGAs. In this example, kernels 0 to 3, which are hardware core A, are mapped to the first FPGA, and the rest of the kernels, which are hardware core B, are mapped to the second FPGA.
<?xml version="1.0" encoding="UTF-8"?>
<cluster>
  <FPGA>
    <kernel> 0 </kernel>
    <kernel> 1 </kernel>
    <kernel> 2 </kernel>
    <kernel> 3 </kernel>
  </FPGA>
  <FPGA>
    <kernel> 4 </kernel>
    <kernel> 5 </kernel>
    <kernel> 6 </kernel>
    <kernel> 7 </kernel>
  </FPGA>
</cluster>
Listing 6.2: Sample of base infrastructure Mapping file
6.1.3 MAC Address File
The base infrastructure needs another file that contains all the MAC addresses. For each FPGA defined in the mapping file, there must be a MAC address. The automated tool flow assigns these MAC addresses to the physical FPGAs. A sample of this file is shown in Listing 6.3.
fa:16:3e:55:ca:02
fa:16:3e:55:ca:03
Listing 6.3: Sample of base infrastructure MAC address file
6.2 Galapagos System Interface
Section 4.2.2 describes the application region of Galapagos, focusing on its hardware aspects. When the underlying design of an infrastructure changes, the system interface must change accordingly to support those changes. This section describes the Galapagos system interface that supports the underlying infrastructure.
The features added to the Galapagos system interface were developed in collaboration with Naif Tarafdar, the designer of the system interface of the base infrastructure. He modified the part of the tool flow that parses the cluster description files to build the Application Region of Galapagos automatically. These modifications were needed by the HUMboldt communication layer.
6.2.1 Logical File
Listing 6.4 shows a sample logical file. It is largely the same as the base infrastructure logical file in [4], with a few small changes, as follows. There are different naming conventions for the reset and clock ports in the Vivado environment, so in the modified logical file the user can specify the clock and reset port names. One feature that makes this complex heterogeneous system easier to use is the ability to debug it by monitoring signals. This thesis has added a debug capability to the system, by which the signals marked as debug (Lines 12 and 17 of Listing 6.4) are connected to a Xilinx Integrated Logic Analyzer (ILA) IP core [35], a logic analyzer that can be used to monitor the internal signals of a design running on an FPGA. Another capability added to the Galapagos system interface is the definition of a constant port that can be assigned automatically by the tool flow.
1 <?xml version="1.0" encoding="UTF-8"?>
2 <cluster>
3 <kernel> hardware_core_name
4 <num> 1 </num>
5 <rep> 96 </rep>
6 <clk> aclk </clk>
7 <aresetn> aresetn </aresetn>
8 <id_port> kernel_id </id_port>
9 <interface>
10 <direction> in </direction>
11 <name> stream_in_V </name>
12 <debug/>
13 </interface>
14 <interface>
15 <direction> out </direction>
16 <name> stream_out_V </name>
17 <debug/>
18 </interface>
19 <const>
20 <name> size </name>
21 <val> 4 </val>
22 <width> 16 </width>
23 </const>
24 </kernel>
25 <kernel> cpu
26 <num> 0 </num>
27 <rep> 1 </rep>
28 </kernel>
29 </cluster>
Listing 6.4: Sample of Logical File
6.2.2 Mapping File
Listing 6.5 shows how the kernel replications can be mapped onto one or more FPGAs; in lines 20 to 24 it can be seen that kernels 1 to 16 are mapped to a single FPGA. This shows how easy it is to scale the system by changing a few lines in the configuration files. The mapping file of the base infrastructure, described in Section 6.1.2, has the same capability, which Galapagos inherits. However, as mentioned in Section 4.2, the base infrastructure is limited to 16 kernels within the entire cluster, whereas Galapagos has no limit on the total number of kernels, only a limit of 16 kernels per FPGA.
Some additional features are added to the mapping file to support heterogeneity and modularity. For example, in Listing 6.5, lines 4 and 18 show two different types of nodes (software and hardware), which addresses heterogeneity and shows how easy it is to change a kernel from hardware to software, or vice versa. Furthermore, to address modularity, the user can specify a bridge for the communication layer, as shown in lines 10 to 16. If the user does not specify a bridge, it is assumed the kernels will communicate directly via AXI-Stream. Modularity within the network layer can be observed in line 19, where the user can specify the network protocol (e.g., TCP or Ethernet), with network addresses given in lines 6-7 and 25-26. In Galapagos the MAC address file is removed from the description files (only the logical file and the mapping file are needed to describe a cluster), and the user defines the addresses in the mapping file. The network addresses would be supplied by the manager of the data center, which could be either the user or a cloud resource manager.
1 <?xml version="1.0" encoding="UTF-8"?>
2 <cluster>
3 <node>
4 <type> sw </type>
5 <kernel> 0 </kernel>
6 <mac_addr> ac:c4:7a:88:c0:47 </mac_addr>
7 <ip_addr> 10.1.2.152 </ip_addr>
8 </node>
9 <node>
10 <appBridge>
11 <name> communication_bridge_eth_mpi </name>
12 <to_app> to_app_V </to_app>
13 <from_app> from_app_V </from_app>
14 <to_net> to_net_V </to_net>
15 <from_net> from_net_V </from_net>
16 </appBridge>
17 <board> adm-8k5-debug </board>
18 <type> hw </type>
19 <comm> eth </comm>
20 <kernel> 1 </kernel>
21 .
22 .
23 .
24 <kernel> 16 </kernel>
25 <mac_addr> fa:16:3e:55:ca:02 </mac_addr>
26 <ip_addr> 10.1.2.101 </ip_addr>
27 </node>
28 </cluster>
Listing 6.5: Sample of Map File
6.3 HUMboldt
The system interface of Galapagos is described in Section 6.2. In this section, the system interface of the HUMboldt communication layer, which is built on top of Galapagos, is described.
The HUMboldt communication layer is a higher layer on top of Galapagos, and its user interface is as high as the application layer: the user needs to provide application code as the input to the HUMboldt layer. Recall that HUMboldt code can be used for hardware or software kernels, so HUMboldt can be considered a programming layer for this heterogeneous platform. Listing 6.6 illustrates an example of a common model for High Performance Computing applications, in which one kernel is responsible for distributing data to several other kernels and gathers the processed data when all kernels are done.
1 #include "HUMboldt.h"
2 #define MAX_ITR 10
3 #define DATA_SIZE 1000
4 #define TAG 0
5
6 int main(int argc, char* argv[])
7 {
8 HUM_Init(&argc,&argv);
9 int data_array[DATA_SIZE];
10 int size = atoi(argv[1]);
11 int kernel_id = atoi(argv[2]);
12
13 for(int i = 0 ; i < MAX_ITR ; i++){
14 if(kernel_id == 0){
15 for(int r = 1; r < size ; r++)
16 HUM_Send(data_array,DATA_SIZE,MPI_FLOAT,r,TAG,MPI_COMM_WORLD);
17 }
18 else{
19 HUM_Recv(data_array,DATA_SIZE,MPI_FLOAT,0,TAG,MPI_COMM_WORLD);
20 }
21 /* process data*/
22 if(kernel_id == 0){
23 for(int r = 1 ; r < size ; r++)
24 HUM_Recv(data_array,DATA_SIZE,MPI_FLOAT,r,TAG,MPI_COMM_WORLD);
25 }
26 else
27 HUM_Send(data_array,DATA_SIZE,MPI_FLOAT,0,TAG,MPI_COMM_WORLD);
28 }
29 HUM_Finalize();
30 return 0;
31 }
Listing 6.6: HUMboldt sample code
The code in Listing 6.6 needs a few minor modifications to be synthesizable by Vivado HLS. For example, in line 6, instead of argc and argv to input the kernel ID and the size, we use two constant ports. These values are assigned automatically by the tool flow using the logical description file of the kernels, so lines 10 and 11 are not needed. These changes are shown in Listing 6.7. Furthermore, some HLS pragmas must be added, but only for the interfaces, which are always the same. These modifications can be made with a very simple script, and essentially the same code can be run as software or implemented as hardware in an FPGA. This code demonstrates that HUMboldt is heterogeneous and functionally portable across the different processing nodes in a cluster.
int main(const int size, const int kernel_id)
{
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS resource core=AXI4Stream variable=stream_out
#pragma HLS resource core=AXI4Stream variable=stream_in
#pragma HLS DATA_PACK variable=stream_out
#pragma HLS DATA_PACK variable=stream_in
Listing 6.7: Modifications needed to make HUMboldt code synthesizable
As mentioned in Section 5.6, HUMboldt is considered a higher-level communication layer. In Listing 6.6, all the send and receive functions start with “HUM”, meaning that the source and destination kernels of these functions can be in hardware or software. It is possible to port any MPI code to HUMboldt provided that all the MPI functions used in the code are supported by HUMboldt: by changing the “MPI” prefix at the beginning of the function names to “HUM”, the new code can be used as an input file for the HUMboldt tool flow. These changes can be made by a simple script run as a stage of the HUMboldt tool flow.
6.4 HUMboldt Tool Flow
To make HUMboldt work transparently and conveniently across a heterogeneous platform, a tool flow is required that takes the Galapagos description files and the HUMboldt code and creates the whole cluster automatically. Recall that one of the goals is to use identical code whether it is to run as a software kernel or as a hardware kernel. This means that the tool flow has two paths: one to create software executables and the other to build FPGA bitstreams for a user-defined platform. Figure 6.1 shows the flows for hardware and software kernels.

Figure 6.1: Software and Hardware Tool Flow
6.4.1 Software Kernels
Building software kernels is essentially the same as what is currently done for standard MPI software distributions. The first step is to link the HUMboldt software library into the user code. The HUMboldt software library includes the MPI library for software kernel communication. MPICC is used to compile the code because there are MPI functions in the HUMboldt library.
6.4.2 Hardware Kernels
Building the hardware kernels requires transforming the original source code into a form that can be used with high-level synthesis (HLS), which essentially creates a block of hardware from software code. The hardware path of the tool flow passes these hardware blocks, along with the cluster description files, to the Galapagos tool flow, which creates the bitstreams for the FPGAs in the cluster.
Chapter 7
Evaluation
In this chapter, the evaluation of our platform is presented. We consider the resource utilization, the latency and throughput, and the scaling and heterogeneity of our platform.
The testbed that we use to run our test scenarios is a cluster of servers with Intel Xeon E5-2650 CPUs running at 2.20 GHz, each with 12 physical cores, so via hyper-threading 24 software threads can be running. The FPGAs located on the same network as this cluster are Xilinx UltraScale XCKU115-2-FLVA1517E devices on Alpha Data ADM-PCIE-8K5 boards. All network connections are 10G Ethernet connected to a Dell Networking S4048-ON 10G switch. Our HUMboldt implementation is used for any configuration that includes a hardware node. To test the best software-to-software implementation, we simply use MPICH, which is a mature open-source MPI implementation.
7.1 Resource Utilization of Galapagos and HUMboldt
The resource utilization of the different layers of Galapagos, including the Hypervisor that we took from [4] plus the off-chip memory support that we added, the network bridges, the communication layer bridges, and the router within the application region, is shown in Table 7.1. The percentages, shown in parentheses, are relative to the KU115 FPGA. Observe that the resource utilization of Galapagos is about 20% when the user chooses TCP and about 15% when the user chooses Ethernet. The resources used here are not necessarily extra overhead, as a developer would also require resources to create a custom multi-FPGA interconnect.

Table 7.1: Resource Utilization of Galapagos

Galapagos Layer                                  LUTs             Flip-Flops       BRAMs
I) Hypervisor                                    95332 (14.4%)    120367 (9.1%)    255 (11.8%)
II) Network Bridge, TCP                          29146 (4.39%)    32582 (2.4%)     86 (4.0%)
III) Network Bridge, Ethernet                    582 (0.09%)      1087 (0.08%)     2 (0.09%)
IV) Communication Bridge, TCP to HUMboldt        1039 (0.1%)      1585 (0.1%)      1 (0.046%)
V) Communication Bridge, Ethernet to HUMboldt    729 (0.1%)       1332 (0.1%)      1 (0.046%)
VI) Router with 16 ports                         5067 (0.8%)      6310 (0.5%)      1 (0.046%)
Total TCP (I + II + IV + VI)                     130584 (19.7%)   160847 (12.1%)   343 (15.9%)
Total Ethernet (I + III + V + VI)                101710 (15.3%)   129096 (9.7%)    259 (12.0%)
The other resource utilization to consider is that of the HUMboldt kernels. Each kernel can use any of the functions defined in the HUMboldt communication layer library. Once a HUMboldt function is called in the user code, the module for that function is added to the hardware of that kernel; multiple calls to the same function do not increase the number of instantiated hardware modules. The resource utilization of each function is presented in Table 7.2.

Table 7.2: Resource Overhead of HUMboldt Communication Layer API Functions

HUMboldt Function   LUTs           Flip-Flops     BRAMs
HUM_Send            389 (0.06%)    372 (0.03%)    0 (0%)
HUM_Recv            1180 (0.18%)   1072 (0.08%)   0 (0%)
7.2 Latency and Throughput
We have created a microbenchmark to test the send and receive functionality of our system, with one kernel sending and another kernel receiving. We change the implementation of each kernel between hardware and software and test several configurations: software to hardware, hardware to hardware (on the same FPGA), hardware to hardware (on different FPGAs), and hardware to software. These configurations are tested with both TCP and Ethernet. Furthermore, we test the following software configurations with MPICH: software to software (on the same CPU) and software to software (on different CPUs). MPICH uses TCP for network communication. We test these using MPICH to compare the best software implementation of MPI against our HUMboldt communication layer. The measurements reported were averaged over many runs until the results converged; for MPICH that required close to a million runs, whereas our hardware results required close to 10 runs.
Figures 7.1 and 7.2 show the throughput of our benchmark. Our HUMboldt communication layer and the respective bridge transforming HUMboldt packets into network packets perform at line rate, and the throughput is limited by the 10G Ethernet core in the Hypervisor along with the respective packet headers required for TCP and Ethernet (hence the higher throughput for Ethernet than for TCP). Between hardware and software in Figure 7.1 we cannot scale past a 128 KB payload, after which we notice packet drops using Ethernet; this is because the software kernel cannot receive data as fast as the hardware can send it. We roughly achieve the maximum TCP core bandwidth mentioned in [32]. The curves show the expected shape, where the bandwidth improves with the payload size, and hardware to hardware works best when compared to links involving a software node. Note that the curve for hardware to hardware on the same FPGA is the same in both the Ethernet and TCP cases because the routing is done internally to the FPGA without needing to add the Ethernet or TCP headers. The hardware-to-software TCP throughput is extremely low; we would expect it to be much higher, but at the time of writing we have not determined the reason, which we believe to be an issue in our software.
We do not plot the latency and bandwidth between two software nodes on the same CPU because they would be difficult to show on the same graph as the others due to the scaling required. MPICH uses shared memory for this communication, much of which can fit in the cache, so the bandwidth can be very high and the latency very low: the observed bandwidth for a 512 KB payload is approximately 60 GB/s with a latency of 0.21 µs. It can also be seen that MPICH shows a very high throughput for small packets, which then drops off. We do not understand the inner workings of MPICH well enough to explain this behavior, but we suspect optimizations for small packets or possibly cache effects.
Figure 7.1: Ethernet Throughput.

Figure 7.2: TCP Throughput.
The latency is shown in Figures 7.3 and 7.4. We define latency as the time for a zero-payload transaction: sending an envelope, receiving a clear to send, sending a zero-payload packet, and receiving a done. We measure the cycle counts with a probe (Xilinx Integrated Logic Analyzer) running on the hardware with a 156.25 MHz clock. The latency for two kernels sending and receiving an entire transaction (all four packets) on the same FPGA is deterministic and takes 29 cycles (about 186 ns at 156.25 MHz). When the receiving kernel is on another node (a different FPGA or CPU), we incur additional latency to transform HUMboldt packets into network packets: each packet is processed through a communication bridge and a network bridge, whose per-packet latencies are shown in Table 7.3.

Table 7.3: Per Packet Additional Latency (cycles)

Component and Protocol          Send   Receive
TCP Communication Bridge        9      5
Ethernet Communication Bridge   6      5
TCP Network Bridge              177    199
Ethernet Network Bridge         7      12
On top of the internal FPGA latencies mentioned above, there is a non-deterministic network latency that depends on the network topology. It can be seen that whenever a communication uses a network link, the additional cycles for the bridges are very small except when TCP is used. Also, whenever a software node is involved, it is clear that handling protocols in software is much slower than in hardware.
As a sanity check of our numbers, we can make an approximate comparison to the latency measured for Microsoft Catapult [2], where they report an FPGA-to-FPGA round-trip latency of 2.88 µs when using their LTL communication layer over a 40G Ethernet link through a single top-of-rack switch. Figure 7.3 shows that the FPGA-to-FPGA latency when using the Ethernet network link (hw to hw diff) on our platform is about 6 µs for a one-way message, which would be about 12 µs for a round trip. In addition to Ethernet, Catapult uses UDP frame encapsulation and adds their LTL protocol [2], which we do not have. They do not report how many cycles LTL adds to their latency, but given that the protocol is lightweight, we would expect the number of cycles to be small. At their reported 156 MHz clock rate, each cycle is about 6.4 ns, so as a fraction of the overall latency LTL would be very small and almost negligible. The other main difference from our platform is that Catapult uses 40G links and we use 10G links. Again, with the crude approximation that the latency at 40G is four-fold lower than at 10G, the latency for Catapult assuming 10G links would be about 4 × 2.88 µs = 11.5 µs, which is close to our estimated 12 µs for a round trip.
Figure 7.3: Ethernet Latency.

Figure 7.4: TCP Latency.
7.3 Scalability and Heterogeneity
To test scalability and heterogeneity, we built the simple application proxy shown in Listing 6.6, which is representative of a common computing pattern: a central node forks tasks to many processes and gathers the results. The goal of this application is to exercise our platform and show how easy it is to scale an application to run with different numbers of nodes. The number of nodes within our system can scale to the depth of the routing table, which is currently set to 256 but can be enlarged as required. Furthermore, we have a limit of 16 kernels per FPGA. Within our experiment we easily implemented several working configurations of up to 96 kernels across 6 FPGAs (limited by the number of FPGAs available in our cluster) by making just a few changes in the configuration files, as described in Chapter 6. We also tested heterogeneity by implementing multiple kernels in both hardware and software.
Chapter 8
Conclusion and Future Work
A communication layer that supports heterogeneity and functional portability allows users to easily target CPU and FPGA clusters, as it handles the communication complexities that arise at a large scale. We improved the open-source framework in [4] and introduced a modular way to create a communication layer on top of it. Our rebuilt infrastructure is called Galapagos.
Our HUMboldt layer is a message passing communication layer that implements a subset of MPI. The HUMboldt and Galapagos hardware implementation works at line rate for 10G links, and the latency is quite small. We show that we can target both heterogeneity and scalability quite easily, as we can use multiple configurations that scale simply by changing two configuration files. We have also shown that the resources required to support a general and easy-to-use communication layer (one that is not application specific) are quite minimal.
The implementation of Galapagos and HUMboldt has shown that a layered approach to building infrastructure that supports heterogeneous computing works well and is important for making heterogeneous applications portable and easier to build.
We have built Galapagos and HUMboldt to make it easier to build multi-FPGA and heterogeneous systems. To show the true power of this infrastructure, we will build some showcase applications that can leverage such a platform.
The HUMboldt layer currently supports message passing using a minimal subset of the MPI standard. To support MPI more fully, more functions need to be implemented, which simply adds to the current library.
To support more types of applications, it would be good to add a streaming communication model to HUMboldt. Because of the modularity that we added to Galapagos, and the HLS implementation of HUMboldt, it will not be difficult to add streaming. Just as we have used MPI for message passing, it would be good to use a popular programming model such as ZeroMQ [26] to define the streaming interfaces.
This work is open-source and can be downloaded at
https://github.com/eskandarinariman/HMPI.
Bibliography

[1] Andrew Putnam, Adrian M Caulfield, Eric S Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, et al. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. ACM SIGARCH Computer Architecture News, 42(3):13–24, 2014.

[2] Adrian M Caulfield, Eric S Chung, Andrew Putnam, Hari Angepat, Daniel Firestone, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, et al. Configurable Clouds. IEEE Micro, 37(3):52–61, 2017.

[3] Hubert Zimmermann. OSI Reference Model–The ISO Model of Architecture for Open Systems Interconnection. IEEE Transactions on Communications, 28(4):425–432, 1980.

[4] Naif Tarafdar, Thomas Lin, Eric Fukuda, Hadi Bannazadeh, Alberto Leon-Garcia, and Paul Chow. Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 237–246. ACM, 2017.

[5] Scott Hauck and Andre DeHon. Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation, volume 1. Elsevier, 2010.

[6] Razvan Nane, Vlad-Mihai Sima, Christian Pilato, Jongsok Choi, Blair Fort, Andrew Canis, Yu Ting Chen, Hsuan Hsiao, Stephen Brown, Fabrizio Ferrandi, et al. A Survey and Evaluation of FPGA High-Level Synthesis Tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 35(10):1591–1604, 2016.

[7] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H Anderson, Stephen Brown, and Tomasz Czajkowski. LegUp: High-Level Synthesis for FPGA-Based Processor/Accelerator Systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 33–36. ACM, 2011.

[8] Xilinx. Vivado Design Suite User Guide: High-Level Synthesis, 2018.

[9] Intel. Product Brief: Intel HLS Compiler.

[10] Alberto Leon-Garcia and Indra Widjaja. Communication Networks. McGraw-Hill, Inc., 2003.

[11] Universal Serial Bus Specification, Rev. 2.0. ARM, http://www.arm.com, 2000.

[12] Xilinx. AXI Reference Guide, UG761 (v13.1). http://www.xilinx.com/support/documentation/ip_documentation/ug761_axi_reference_guide.pdf, 2011.

[13] Xilinx. Vivado Design Suite User Guide: Partial Reconfiguration, 2018.

[14] Xilinx Inc. SDAccel Environment Platform Development Guide, 2016.

[15] Naif Tarafdar. Building and Using Virtual FPGA Clusters in Datacenters, 2017.

[16] Xilinx. MicroBlaze Processor Reference Guide, UG081, 2009.

[17] Stuart Byma, J Gregory Steffan, Hadi Bannazadeh, Alberto Leon-Garcia, and Paul Chow. FPGAs in the Cloud: Booting Virtualized Hardware Accelerators with OpenStack. In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on, pages 109–116. IEEE, 2014.

[18] Joon-Myung Kang, Hadi Bannazadeh, Hesam Rahimi, Thomas Lin, Mohammad Faraji, and Alberto Leon-Garcia. Software-Defined Infrastructure and the Future Central Office. In Communications Workshops (ICC), 2013 IEEE International Conference on, pages 225–229. IEEE, 2013.

[19] Joon-Myung Kang, Hadi Bannazadeh, and Alberto Leon-Garcia. SAVI Testbed: Control and Management of Converged Virtual ICT Resources. In Integrated Network Management (IM 2013), 2013 IFIP/IEEE International Symposium on, pages 664–667. IEEE, 2013.

[20] Jagath Weerasinghe, Francois Abel, Christoph Hagleitner, and Andreas Herkersdorf. Enabling FPGAs in Hyperscale Data Centers. In Ubiquitous Intelligence and Computing, Autonomic and Trusted Computing, and Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom), 2015 IEEE Intl Conf on, pages 1078–1086. IEEE, 2015.

[21] Naif Tarafdar, Thomas Lin, Nariman Eskandari, David Lion, Alberto Leon-Garcia, and Paul Chow. Heterogeneous Virtualized Network Function Framework for the Data Center. In Field Programmable Logic and Applications (FPL), 2017 27th International Conference on, pages 1–8. IEEE, 2017.

[22] Henning Schulzrinne, Anup Rao, and Robert Lanphier. Real Time Streaming Protocol (RTSP). Technical report, 1998.

[23] Iraj Sodagar. The MPEG-DASH Standard for Multimedia Streaming over the Internet. IEEE MultiMedia, (4):62–67, 2011.

[24] Marc Snir. MPI–The Complete Reference: The MPI Core, volume 1. MIT Press, 1998.

[25] Manuel Saldana and Paul Chow. TMD-MPI: An MPI Implementation for Multiple Processors Across Multiple FPGAs. In Field Programmable Logic and Applications (FPL), 2006 International Conference on, pages 1–6. IEEE, 2006.

[26] Pieter Hintjens. ZeroMQ: Messaging for Many Applications. O'Reilly Media, Inc., 2013.

[27] Manuel Saldana, Arun Patel, Christopher Madill, Daniel Nunes, Danyao Wang, Paul Chow, Ralph Wittig, Henry Styles, and Andrew Putnam. MPI as a Programming Model for High-Performance Reconfigurable Computers. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 3(4):22, 2010.

[28] Alan D George, Martin C Herbordt, Herman Lam, Abhijeet G Lawande, Jiayi Sheng, and Chen Yang. Novo-G#: Large-Scale Reconfigurable Computing with Direct and Programmable Interconnects. In High Performance Extreme Computing Conference (HPEC), 2016 IEEE, pages 1–7. IEEE, 2016.

[29] Alpha Data. Alpha Data 7v3 boards. https://www.alpha-data.com/dcp/products.php?product=adm-pcie-7v3, 2016.

[30] Alpha Data. Alpha Data 8k5 boards. https://www.alpha-data.com/dcp/products.php?product=adm-pcie-8k5, 2017.

[31] David Sidler, Gustavo Alonso, Michaela Blott, Kimon Karras, Kees Vissers, and Raymond Carley. Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware. In Field-Programmable Custom Computing Machines (FCCM), 2015 IEEE 23rd Annual International Symposium on, pages 36–43. IEEE, 2015.

[32] D. Sidler et al. Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware. In 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, pages 36–43, May 2015.

[33] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22(6):789–828, 1996.

[34] Edgar Gabriel, Graham E Fagg, George Bosilca, Thara Angskun, Jack J Dongarra, Jeffrey M Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, pages 97–104. Springer, 2004.

[35] Xilinx Inc. LogiCORE IP Soft Error Mitigation Controller v3.3 (Product Guide), 2016.