A MODULAR HETEROGENEOUS COMMUNICATION LAYER FOR A
CLUSTER OF FPGAS AND CPUS
by
Nariman Eskandari
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2018 by Nariman Eskandari
Abstract
A Modular Heterogeneous Communication Layer for a Cluster of FPGAs and CPUs
Nariman Eskandari
Master of Applied Science
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
2018
A key infrastructure required to make heterogeneous clusters easier to use is a standard
communication mechanism between computing nodes. Without this infrastructure, application
developers of multi-FPGA applications are forced to develop their own custom communication cores before their hardware kernels can communicate. This problem is significantly worse at a
data center scale. This thesis shows how introducing modularity into an existing orchestration
platform makes it possible to automatically and interchangeably support different underlying
network protocols and multiple communication models for a multi-FPGA platform. We show
that it is straightforward to connect the communication models implemented in the FPGAs to
compatible software implementations of the same models so that it is easy to build heteroge-
neous applications using CPUs and FPGAs. Our results show that the heterogeneous communication layer introduces very little area overhead in the FPGAs and still performs at line rate, bottlenecked solely by the network protocols.
Acknowledgements
I would like to thank my supervisor, Professor Paul Chow. The completion of this thesis and
the many valuable life skills I have acquired in the process are because of his guidance and his patience in working with me.
I would like to thank my parents Neda Mashhadi and Hamid Reza Eskandari. Without their
support and patience I would not be here.
I would like to thank my excellent colleague Naif Tarafdar for all the support and help, and
my other colleagues in PT477.
I would like to thank my wonderful friends. They have been like family to me over the past
two years.
Contents

Acknowledgements

1 Introduction

2 Background
   2.1 Field Programmable Gate Array
   2.2 High-Level Synthesis
   2.3 Data Center Architecture
   2.4 Network Stack
   2.5 AXI Bus
   2.6 Partial Reconfiguration
   2.7 Xilinx SDAccel Platform
   2.8 MicroBlaze Soft Processor

3 Related Work
   3.1 Current Data Centers/Clouds with FPGAs
       3.1.1 FPGA in the Cloud: Booting Virtualized Hardware Accelerators with OpenStack
       3.1.2 Catapult v1
       3.1.3 Catapult v2
       3.1.4 Enabling FPGAs in the Hyperscale Data Center
   3.2 Base Infrastructure
   3.3 Communication Models
       3.3.1 Streaming Model
       3.3.2 Message Passing Model
   3.4 Other Heterogeneous Communication Layers
       3.4.1 TMD-MPI
       3.4.2 Novo-G#

4 Galapagos
   4.1 Hypervisor
       4.1.1 Base Infrastructure Hypervisor
       4.1.2 Galapagos Hypervisor
   4.2 Application Region
       4.2.1 Base Infrastructure Application Region
       4.2.2 Galapagos Application Region

5 HUMboldt Communication Layer
   5.1 TMD-MPI
   5.2 MPI Communication Layer
   5.3 Protocol
   5.4 Packetization
   5.5 Reliability Layer
   5.6 HUMboldt Software

6 System Interface and Tool Flow
   6.1 Base Infrastructure System Interface
       6.1.1 Logical File
       6.1.2 Mapping File
       6.1.3 MAC Address File
   6.2 Galapagos System Interface
       6.2.1 Logical File
       6.2.2 Mapping File
   6.3 HUMboldt
   6.4 HUMboldt Tool Flow
       6.4.1 Software Kernels
       6.4.2 Hardware Kernels

7 Evaluation
   7.1 Resource Utilization of Galapagos and HUMboldt
   7.2 Latency and Throughput
   7.3 Scalability and Heterogeneity

8 Conclusion and Future Work

Bibliography
Chapter 1
Introduction
The integration of accelerators in the data center has been shown to be beneficial [1, 2] but using
heterogeneity can be difficult for data center application developers and system administrators.
This heterogeneity is especially difficult for application developers when integrating FPGAs
as this usually requires the user to design the application and management circuitry, includ-
ing the network stack and memory management. For both application developers and system
administrators, the lack of a common communication standard among clusters of accelerators
and CPUs raises challenges in creating communication links between different devices, and
supporting the network connections between these devices. Another challenge for application
developers is the lack of functional portability of a distributed application across different de-
vices, which usually requires a developer to tailor an application specifically for each device
and manage its communication. These challenges become even more daunting at the data
center scale where we can have potentially thousands of nodes interacting. We believe the
challenges can be addressed with a portable standard communication layer across both CPUs
and FPGAs, which we address in this thesis.
The communication layer that a user needs for their distributed application is very specific
to the traffic patterns and programming models of their application. For example, some dis-
tributed applications are more suited to a streaming model of communication, such as video
processing applications. Other applications that have many nodes sending and receiving data
to each other are more amenable to a message passing model, such as many scientific applica-
tions. In software, these communication models are implemented as application network layers
built on top of the OSI network stack [3]. This gives the user the option to choose a commu-
nication layer suitable to their application and the underlying network layers used to support
the chosen communication layer. There is no equivalent stack in hardware, where typically,
each hardware application has to build its own custom communication layer. We would like a
standard FPGA/CPU communication layer to fit as a module on top of an implementation on
different layers of the OSI network stack in hardware.
The main contributions of this thesis are as follows. First, we build on top of an open source
scalable FPGA orchestration tool [4]. This tool allows us to easily map streaming FPGA ker-
nels and connect them across many FPGAs in a data center by changing a configuration file.
We first address scalability limitations in the tool, and we also re-implement the networking stack within it, making the network portion modular so that we can support multiple implementations of different layers of the networking stack. We demonstrate this by having
the same application communicate over both Ethernet and TCP without changing the appli-
cation. We call our rebuilt infrastructure Galapagos. Then we implement a message passing
communication layer on top of Galapagos, which we call HUMboldt. This is implemented
as a high-level synthesis (HLS) library and a software library, allowing an application developed with these libraries to be functionally portable across both CPUs and FPGAs. The functional
portability is important because it enables application development in a pure software environ-
ment. Once correct functionality has been achieved, parts of the code can be ported to run as
hardware without modifying the code. A standard communication protocol is used allowing
heterogeneous devices (CPUs and FPGAs) to communicate at a data center scale.
The remainder of the thesis is organized as follows. Chapter 2 provides information that
is necessary for better understanding this thesis. Chapter 3 explores related works in clusters
with FPGAs and cloud systems as well as other communication layers specifically on multi-
FPGA clusters. Chapter 4 explores our modular rebuild of an FPGA cluster generator, and
the implementation details of our communication layer are described in Chapter 5. The details on how to interface with our system and the tool flow are provided in Chapter 6. Chapter 7 shows our results with microbenchmarks measuring the performance of our communication layer and infrastructure between FPGAs and CPUs. Lastly, the conclusions and future work of this thesis are presented in Chapter 8.
Chapter 2
Background
This chapter presents necessary information for better understanding this work.
2.1 Field Programmable Gate Array
Field Programmable Gate Arrays (FPGAs) are components that have configurable logic that
can be used to implement different digital logic functions [5]. The main components of the
FPGAs are Logic Blocks, Block RAMs (BRAMs), and Digital Signal Processing units (DSPs), and some FPGAs also include an ARM processor. The logic blocks contain several look-up tables and flip-flops, and they are used to implement most of the logic. There is also a sophisticated
programmable routing network that is used to interconnect all of the logic components and
connect to the inputs and outputs of the logic function.
2.2 High-Level Synthesis
High-Level Synthesis (HLS) tools provide a way for developers and hardware designers to build hardware cores using high-level programming languages such as C and C++ rather than Hardware Description Languages (HDLs) such as Verilog or VHDL. HLS tools generate functionally equivalent, FPGA-synthesizable HDL from these higher-level languages, much as compilers for software languages generate machine-specific assembly
code [6]. One of the most notable advantages of HLS is that a hardware designer can ex-
plore design space choices in a much shorter amount of time. This helps to achieve better
results in performance, area, and power trade-offs.
There are various implementations of HLS tools. Some of these tools are academic. For
example, LegUp [7] is an open-source HLS tool. Some industrial versions of these tools are
Vivado HLS [8] and the Intel HLS Compiler [9]. The work in this thesis uses Vivado HLS that
targets Xilinx FPGAs.
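As a concrete illustration, the following is a minimal sketch of an HLS-synthesizable C++ kernel in the style accepted by Vivado HLS; the kernel, its interfaces, and the pragma choices are illustrative assumptions rather than code from this thesis.

    #include <hls_stream.h>
    #include <ap_int.h>

    // A simple streaming accumulator. Vivado HLS infers a pipelined datapath
    // from the loop; the pragmas request AXI4-Stream interfaces and a
    // pipeline initiation interval of one.
    void mac_kernel(hls::stream<ap_uint<32> > &in, hls::stream<ap_uint<32> > &out) {
    #pragma HLS INTERFACE axis port=in
    #pragma HLS INTERFACE axis port=out
    #pragma HLS INTERFACE ap_ctrl_none port=return
        ap_uint<32> acc = 0;
        for (int i = 0; i < 16; i++) {
    #pragma HLS PIPELINE II=1
            acc += in.read();
        }
        out.write(acc);
    }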
2.3 Data Center Architecture
Figure 2.1 shows the architecture of our testbed platform. All processing elements in our plat-
form, CPUs and FPGAs, have direct connections to the network. This allows any processing
element to communicate directly with any other processing element on the network. Some of
the FPGAs are hosted in a server so there would also be a link over PCIe between the FPGA
and the CPU. Other FPGAs are part of a System on Chip (SoC), where there is an ARM proces-
sor on the same chip as the FPGA, so in this case there is also a connection between the ARM
processor and the FPGA fabric. All of our network links use either 10G or 100G Ethernet.
2.4 Network Stack
The standard OSI network stack is shown in Figure 2.2. It uses a layered approach where
the lower layers provide services to the layers above [10]. The Physical layer is the actual connection between nodes in a network, such as optical fiber or Ethernet cable. The Data Link
layer is responsible for delivering packets within local hops using some addressing scheme. In
this work, we are using Ethernet-based communications as the Data Link layer so we will also
refer to this as the Ethernet layer. This layer does not provide reliability, so there is a chance
for a packet to be lost or corrupted. The layer above the Data link layer is the Network layer,
CHAPTER 2. BACKGROUND 6
Figure 2.1: Data center architecture
which in our case is concerned with routing among networks or within a network using IP
addresses. The Network Layer is an end-to-end communication layer that does not care about
the path that a packet may take. The Transport Layer provides end-to-end communications with more services such as reliable communication or fragmentation. The most popular Transport
protocols are TCP and UDP.
TCP guarantees the recipient will receive the packets in order by numbering them. The
recipient sends messages back to the sender saying it received the messages. If the sender
does not get a correct response, it will resend the packets to ensure the recipient received them.
Packets are also checked for errors. TCP is all about this reliability, so packets sent with TCP
are tracked so no data is lost or corrupted in transit.
When using UDP, packets are just sent to the recipient. The sender will not wait to make
sure the recipient received the packet, and it will just continue sending the next packets. If
the recipient misses some UDP packets, it cannot ask for those packets again. There is no guarantee that the receiver gets all the packets, and no way to request a retransmission if a packet is lost, but removing this overhead means the nodes can communicate more quickly.
In this thesis we support up to the Transport layer of the network stack, and the user is
able to choose between using the raw Ethernet protocol (Data Link layer) and the TCP Protocol
(Transport layer) for the communication between nodes of an application using the network.
Figure 2.2: Network Stack
2.5 AXI Bus
AXI is part of the AMBA [11] bus specification, which Xilinx uses for interconnection between different circuit blocks (IPs) in a large circuit system. The current generation of AXI is AXI4, which comes in three variants [12].
• AXI4: for high-performance memory-mapped requirements.
• AXI4-Lite: for simple, low-throughput memory-mapped communication (for example,
to and from control and status registers).
• AXI4-Stream: for high-speed streaming data.
Most of the interconnections between the components in this work are AXI4-Stream. The
AXI4-Stream protocol defines a channel for transmission of streaming data. Unlike AXI4,
AXI4-Stream interfaces can burst an unlimited amount of data. AXI4-Stream also provides a back-pressure signal to the upstream producer IP, indicating that the consumer is not ready to receive data.
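To make this concrete, the sketch below shows a hypothetical HLS kernel that forwards AXI4-Stream beats and sets the dest side-channel identifying a destination kernel, as used for routing later in this thesis; the 64-bit width, field widths, and kernel ID are assumptions.

    #include <hls_stream.h>
    #include <ap_axi_sdata.h>

    // One AXI4-Stream beat: 64 data bits plus side channels, including an
    // 8-bit dest field (ap_axiu also carries keep, last, and so on).
    typedef ap_axiu<64, 1, 1, 8> axis_word;

    void forward(hls::stream<axis_word> &in, hls::stream<axis_word> &out) {
    #pragma HLS INTERFACE axis port=in
    #pragma HLS INTERFACE axis port=out
    #pragma HLS INTERFACE ap_ctrl_none port=return
        axis_word w = in.read();   // blocks until the producer asserts valid
        w.dest = 3;                // address the beat to kernel 3 (hypothetical ID)
        out.write(w);              // blocks while the consumer applies back-pressure
    }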
2.6 Partial Reconfiguration
By using Partial Reconfiguration, a user is able to program a part of an FPGA without inter-
rupting the remaining logic [13]. This capability allows the user to change the functionality on
the fly, and it enhances the flexibility of the FPGA devices. It also makes the reconfiguration
time shorter, and could be used to reduce the power consumption of the FPGA. We also point
out a further advantage of using Partial reconfiguration in Section 4.1.2.
2.7 Xilinx SDAccel Platform
The Xilinx SDAccel Platform [14] provides several programming interfaces (APIs) by which FPGAs can be used as accelerators for a host. A host is a CPU that can have several accelerator
devices connected to it, and it manages the devices. Using these APIs the user is able to pro-
gram the FPGA, send data to it, and read back the result of the computation on the FPGA. The
FPGAs in SDAccel are connected through a PCIe interface. The Xilinx SDAccel platform is
based on the OpenCL language. OpenCL is a C-like language that is both a programming language for parallel and heterogeneous devices and a host-side API for managing OpenCL devices. Xilinx has introduced its own OpenCL driver called
the Hardware Abstraction Layer (HAL) that provides functions by which it is possible to send
and receive data to and from the FPGA, and program the FPGA [15].
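As an illustration of this host-side flow, here is a minimal sketch using the standard OpenCL C API that SDAccel builds on; the binary file name, kernel name, and buffer sizes are assumptions, and error checking is omitted for brevity.

    #include <CL/cl.h>
    #include <fstream>
    #include <iterator>
    #include <vector>
    #include <cstdio>

    static std::vector<unsigned char> load_file(const char *path) {
        std::ifstream f(path, std::ios::binary);
        return std::vector<unsigned char>((std::istreambuf_iterator<char>(f)),
                                          std::istreambuf_iterator<char>());
    }

    int main() {
        cl_int err;
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

        // Program the FPGA with a precompiled binary container (hypothetical file).
        std::vector<unsigned char> bin = load_file("kernel.xclbin");
        const unsigned char *bin_ptr = bin.data();
        size_t bin_size = bin.size();
        cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &bin_size,
                                                    &bin_ptr, NULL, &err);
        cl_kernel krnl = clCreateKernel(prog, "doubler", &err); // assumed kernel name

        // Send data to the FPGA, run the kernel, and read back the result.
        int in[16], out[16];
        for (int i = 0; i < 16; i++) in[i] = i;
        cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof(in),  NULL, &err);
        cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(out), NULL, &err);
        clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, sizeof(in), in, 0, NULL, NULL);
        clSetKernelArg(krnl, 0, sizeof(cl_mem), &d_in);
        clSetKernelArg(krnl, 1, sizeof(cl_mem), &d_out);
        clEnqueueTask(q, krnl, 0, NULL, NULL); // single work-item, common for FPGAs
        clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, sizeof(out), out, 0, NULL, NULL);
        clFinish(q);
        printf("out[0] = %d\n", out[0]);
        return 0;
    }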
2.8 MicroBlaze Soft Processor
MicroBlaze is a 32-bit soft core processor that is optimized for Xilinx FPGAs [16]. Soft core
means that the processor is realized using the available resources in FPGAs rather than being
fabricated as an ASIC core. Using this core, a user can design a system that contains a software
part. This core can connect to hardware cores of the system through the different types of AXI
bus.
Chapter 3
Related Work
Recently, cloud systems have deployed FPGAs and provide services that include FPGAs as computation nodes. The reason behind this incorporation is that FPGAs have been shown to be excellent accelerators for many applications, and their power consumption is relatively low in comparison with other cloud resources, e.g., GPUs and CPUs. However, current tool flows for deploying FPGAs in the cloud are hard to use. This is even more cumbersome when the system is scaled up to multiple FPGAs, or when the FPGA is deployed in a heterogeneous system that has other cloud resources it needs to communicate with. Therefore, an easy-to-use tool flow with the following characteristics is necessary:
• Provisioning of resources (CPUs and FPGAs) from the cloud, including their connectivity.
• A heterogeneous communication layer for the provisioned cluster that makes communication high-performance and functionally portable.
There are different communication models such as streaming and message-passing, and in
this work we chose to implement a heterogeneous message-passing communication layer. A
communication layer for heterogeneous systems should be flexible enough to allow a user to
trade off application requirements against communication overheads. One dimension of this
flexibility is to allow the user to choose between different layers of the network stack, such as
using raw Ethernet or TCP to trade off reliability and scalability against network overhead. With
this context, the remainder of this chapter describes related work.
The base infrastructure used as the starting point for this work is presented in [4]; it orchestrates an FPGA cluster in the cloud by taking cluster description files as input (more details will be presented in Section 6.1) and returning to the user a handle to a network-connected multi-FPGA cluster. The layered architecture of the baseline infrastructure makes it easy to improve, and to add a modular and flexible layer for functionally portable and scalable communication between hardware and software.
3.1 Current Data Centers/Clouds with FPGAs
In this section, we explore some works that have deployed FPGAs in cloud systems. These works fall into two major categories:
1. Host/Slave: In this model, each FPGA is tightly coupled with a CPU, so control of and data transmission to the FPGA are done by the CPU, and there is no direct network
communication to another FPGA or CPU in the cluster.
2. Standalone: In this model, FPGAs have direct connections to the network, so there is a
potential for them to communicate to each other or CPUs. This approach is shown in
Figure 2.1.
The remainder of this section introduces some notable and representative works in this area and
describes their characteristics. There are many other works, but they do not introduce any other
features beyond the systems presented here.
3.1.1 FPGA in the cloud: Booting Virtualized Hardware Accelerators
with OpenStack
Byma et al. [17] introduced an approach for FPGAs to be used as accessible resources in
the SAVI testbed [18], which is an OpenStack-based research cloud system [19]. In this
OpenStack-based cloud, when a resource request arrives, OpenStack finds available resources
in the database, and finds which physical machine they are located on. The OpenStack con-
troller has been modified to communicate with the resource manager via a separate process
running beside the software hypervisor, called an Agent [17]. The Agent is a simple program
that can communicate with Nova, which is a component of OpenStack. In this system, a single
FPGA can be used as a resource for more than one cloud application. Each FPGA has four par-
tially reconfigurable regions called VFRs (Virtual FPGA Resources) that are directly connected
to the network over a Layer Two network (Ethernet). Since the FPGAs in this infrastructure
are directly connected to the network they could be considered as a standalone node. This in-
frastructure fits into Category 2. A load-balancing application was used to observe the impact
of using an FPGA in the cloud in terms of latency and throughput. The latency of the software
Virtual Machine (VM) load-balancer varies and increases when the injection rate increases. In
contrast, the VFR load balancer latency is constant across different injection rates. The VM load balancer starts to drop packets even at a 25 MB/s injection rate, whereas the VFRs never drop a single packet.
3.1.2 Catapult v1
Microsoft's Catapult v1 [1] was the most important pioneer in deploying FPGAs into large-scale data centers, demonstrating the advantage of using FPGAs in data centers. They used
Microsoft Bing’s page ranking as a case study. A PCIe card with a Stratix V FPGA was added
to every CPU. The FPGAs in one rack are connected by a secondary network, which has a
two-dimensional, 6x8 torus topology [1].
In Catapult v1, the inter-FPGA and host-to-FPGA interconnections are abstracted from the user's point of view by implementing a Shell in the FPGAs, which handles communication with the CPU via the PCIe link, the network, and DDR memory. The block diagram of this shell is shown in Figure 3.1. In our work, a new open-source and general shell is implemented, which is described in Section 4.1. Instead of Shell, we use the term Hypervisor as we intend to provide more functionality and services than just a hardware abstraction.
Figure 3.1: Microsoft Catapult Shell.
Using Catapult v1, Microsoft implemented a significant fraction of Bing’s page ranking en-
gine on FPGAs. In comparison with the pure software implementation, at the expense of 10%
more power consumption, this system achieves a 95% improvement in throughput at equivalent latency, and reduces tail latency by 29% [1].
Catapult v1 does not exactly fit either of the two categories described above. It is a hy-
brid in that the PCIe link is used to access and control the FPGA, while the torus network provides FPGA-to-FPGA communication within the same rack, but not beyond it. Therefore, this is not generally scalable. Furthermore, this infrastructure is not elastic
and flexible, which is desired for a cloud platform.
3.1.3 Catapult v2
In Catapult v1 [1], the FPGAs have network access only to the other FPGAs in the same rack (48 FPGAs). This limits elasticity and scalability. The architecture of Catapult v2 [2] is shown in Figure 3.2. In the new cloud architecture, the FPGA hardware plane is decoupled from the traditional software plane. Furthermore, instead of directly connecting the server NIC to the
top-of-rack (ToR) switch, the NIC is connected to the FPGA, and the FPGA is connected to
the ToR switch. The PCIe connection still exists. With this topology, it is possible to make a large pool of FPGA resources. Two key components are added to the shell: an Elastic Router
for intra-FPGA connections and a Light-weight Transport Layer Protocol Engine that enables
inter-FPGA connections. With the direct network connection for the FPGA, it is possible to do
what Microsoft calls Remote Acceleration where tasks can be accelerated using an FPGA that is
not attached to the CPU requiring the acceleration. Using this approach, the Bing ranking was
implemented and it was shown that the local acceleration and remote acceleration throughputs
are the same. Catapult v2 fits into the second category because the FPGAs can be orchestrated
as standalone nodes in the cluster.
3.1.4 Enabling FPGAs in the Hyperscale Data Center
In this work, the authors propose a method to decouple FPGAs from the CPUs in a cloud
environment [20]. In this cloud, the FPGA can be used as a standalone resource (Category 2).
In their system architecture, they introduce a high-level “shell” in FPGAs that can manage the
interconnections in all network layers. For example, with this shell, they support FPGAs that
are securely isolated in a multi-tenant environment by offering VLANs or OVNs in their cloud
structure. VLAN (Virtual LAN) and OVN (Open Virtual Network) provide a partitioned and
isolated network at the Ethernet layer (layer 2). In their cloud system, instead of using Nova
in OpenStack, which is responsible for provisioning resources in the cloud, they add their own service called the “Accelerator Service”. This service is responsible for employing FPGAs as a
standalone resource for a cloud application. Their cloud system network is a Software Defined Network (SDN).

Figure 3.2: (a) Microsoft Catapult v2 decoupled the hardware plane from the traditional software plane. (b) Server plus FPGA schematic [2].
3.2 Base Infrastructure
In this section, we describe the open source framework as it was before we upgraded it to
support our communication layer. This framework is described in [4] and further extended
in [21]. In these works, Tarafdar et al. introduce a multi-FPGA abstraction layer that maps a
graph of streaming IP blocks connected by a large logical switch onto a multi-FPGA network-
connected cluster that is provisioned from an elastic pool of cloud resources. At a high level, the user provides a collection of kernels, a logical file describing the entire cluster and
a mapping file of kernels to physical FPGAs. The user is returned a network handle for their
FPGA cluster. The logical view of the kernels and how they are mapped into multiple FPGAs
is shown in Figure 3.3. The details of these files, which are the user interface to the base
infrastructure, are described in Section 6.1.
Figure 3.3: Logical view of the streaming kernels and how they are mapped into physical FPGAs.
Each FPGA in the cluster has its physical resources abstracted with a Hypervisor. The Hy-
pervisor exposes a control interface through PCIe and a data interface through the 10G Ethernet
port, connecting to an application region. Within the application region, the framework places
an interconnect on each FPGA to interface with all kernels within the cluster, either connecting directly to kernels on the same FPGA or encapsulating the packet with network information to make a network hop. This direct connection to all kernels has an inherent limitation as the interconnect has at most 16 ports, thus limiting the total number of kernels within the entire cluster to 16. Furthermore, the modularity of this implementation is limited as the user is forced to use Ethernet packets between FPGAs, making the communication unreliable. More details about the data center infrastructure and network architecture can be found in [4], and more details on the FPGA Hypervisor can be found in [21].
3.3 Communication Models
The communication model is very application-specific as different applications exhibit differ-
ent communication traffic patterns. Two popular communication models are streaming and
message passing.
3.3.1 Streaming Model
In a streaming model, data is being sent continuously through a point-to-point channel. Some
implementations of streaming communication models include the Real-Time Streaming Protocol (RTSP) [22] and MPEG-DASH [23]. These protocols are typically built on top of network pro-
tocols like UDP, or even Ethernet as these protocols have better latency but lack reliability by
default.
3.3.2 Message Passing Model
In a message passing model, data can be transferred between arbitrary nodes. Generally, shorter
packets provide less latency while longer packets provide better throughput because message
overheads are minimized. Users must partition data into messages and include a destination
address when sending, unlike a stream where it is a continuous flow of data to one or more
preconfigured receivers. The most common implementation of a message passing model is
the Message Passing Interface (MPI) [24]. We have decided to implement our communication
layer as a subset of MPI because there has been prior work [25] in implementing MPI on an
FPGA cluster that was very successful. MPI is also a well-known standard API that is widely
used in various types of HPC applications. This helps us with our goals of heterogeneity,
functional portability and scalability.
Another messaging protocol that has a significant user community is ZeroMQ (0MQ) [26].
ZeroMQ uses a socket-like interface that supports multiple message patterns such as request-
reply and publish-subscribe. Its socket-like interface also makes it easy to use like a FIFO or streaming interface.
3.4 Other Heterogeneous Communication Layers
In this section we explore other implementations of heterogeneous communication layers. We
look at two in particular: the first being TMD-MPI [27] and the second being Novo-G# [28].
3.4.1 TMD-MPI
The work presented in [25] explores an implementation of MPI within a multi-FPGA environ-
ment called TMD-MPI. TMD-MPI implements a subset of the MPI protocol to allow hardware
or software processing engines on the FPGA to communicate amongst each other on the same
FPGA and across multiple FPGAs. The hardware versions of the MPI functions are imple-
mented in VHDL. In this thesis, these functions are described in HLS-synthesizable C code.
TMD-MPI was created to be portable across multiple physical platforms so it was imple-
mented in several layers, including layers that correspond to the physical setup of the network
connected FPGAs. In this thesis, we avoid the need to create an equivalent of these layers as
our communication layer modularly builds on an improved cluster generator tool that handles
the communication of blocks within an FPGA and across multiple FPGAs in a data center.
3.4.2 Novo-G#
The work presented in [28] is a heterogeneous environment with 24 CPU servers that are con-
nected via PCIe to FPGA boards that contain four Stratix V FPGAs. There are direct connections between the FPGAs on an individual board, forming a 3D torus, with a custom hardware net-
work stack to support these direct connections. If needed, communication between host nodes
can use MPI. The Novo-G# is a system that shows both the use of a custom hardware network
stack for FPGA-to-FPGA communication as well as a model where accelerators are connected
to host nodes and the host nodes can communicate using a standard software MPI library.
In this thesis, hardware and software components of the same application can communicate
with the same communication layer as peers, which makes it much easier to use hardware or
software for any computing kernel because the programming model is the same for either case.
Chapter 4
Galapagos
We rebuilt the open source framework described in [4] to improve scalability, modularity, and
reliability. These improvements allow the user to implement designs with different network
protocols (e.g., Ethernet or TCP) and communication layers by changing a configuration file
describing the heterogeneous cluster. The user can target any number of available devices
(FPGA and/or CPU) with a limit of 16 kernels per FPGA due to the number of ports on the
Xilinx switch IP core being used. We call this rebuilt framework Galapagos.
In this chapter we will describe how we improved the framework from [4] by introducing a new hypervisor that is openly available, portable to other boards, and more general. Furthermore, we
will describe how we addressed the limitations of the base infrastructure, and how we made
this platform reliable and more modular. These changes led to building a communication layer
on top of Galapagos that is described in Chapter 5.
4.1 Hypervisor
All the I/O interfaces on the FPGA (such as network, PCIe, and DDR memory) are abstracted away from the user by introducing the notion of the Hypervisor. Note that this could also be
called a Shell in Microsoft terminology, but our goal is to add much more than an abstraction
layer to the hardware and provide other services in the same manner as a software hypervisor
is used to support virtual machines (VMs). The FPGA Hypervisor we implemented has two
regions: a static region that provides standard interfaces for the user, such as network and memory, and an application region where user kernels are placed. The static region
needs to be programmed only once, and it will remain intact when the user programs the
application region.
The Hypervisor achieves one of the main goals of this work, which is ease of use. The user who develops the application only needs to deal with some AXI interfaces; all the infrastructure below is already implemented and ready to use, even for different types of FPGAs. An application developer can enjoy all the services provided by the hypervisor with little effort, as much of the difficulty is removed through abstraction.
4.1.1 Base Infrastructure Hypervisor
The first iteration of the Hypervisor that is used in the base infrastructure was introduced in [4].
That Hypervisor is a modified version of the SDAccel Hypervisor described in Section 2.7. A
part of the SDAccel platform provides some high-level functions to send and receive data
through the PCIe link connecting to the CPU. Tarafdar et al. [4] added a 1G Ethernet capability
to this Hypervisor. Furthermore, in [21] they improved the Hypervisor by upgrading the Ethernet speed to 10G. They also defined the application region as a partial reconfiguration
region. The block diagram of the base infrastructure hypervisor is shown in Figure 4.1, and the
notable features that the base infrastructure hypervisor provides are as follows:
• Network: This Hypervisor provides access to the network through a 10G Ethernet link. It interfaces with the Application Region through input and output AXI-Stream ports.
• PCIe: The SDAccel platform provides an easy way to use the PCIe link. Using the
SDAccel as a Hypervisor on the hardware side, and the HAL as a driver for PCIe (de-
scribed in Section 2.7), some high-level functions can be used to communicate with the
FPGA through the PCIe link. The interface of this block to the Application Region is
through an AXI-full slave.

Figure 4.1: Base infrastructure Hypervisor block diagram
• DDR Memory: The SDAccel platform has the needed infrastructure to give the user the
capability to use DDR memory. Using some high-level functions a CPU can send and
receive data to and from DDR memory through PCIe. The FPGA fabric can send and
receive data to and from DDR memory through an AXI-full master.
• Application Region Interface: The Application Region in this hypervisor has several
interfaces:
– AXI-Stream inputs and outputs connected to the network port of the FPGA.
– An AXI-full slave connection to the PCIe link.
– An AXI-full master connection to DDR memory.
In addition, the application region is a partial reconfiguration region, which is done for
several reasons. There is an advantage of shorter reconfiguration time because only part
of the FPGA is configured. Partial reconfiguration also solves another problem with the
PCIe interface.
After full reconfiguration of the FPGA, the PCIe device is no longer discoverable by the host. This is because some of the PCIe configuration information disappears during FPGA programming and can only be rescanned by the root complex during booting. This requires rebooting the system, which is not always an option. The application region of this hypervisor is a partial reconfiguration region, as described in Section 2.6. This means that the PCIe hardware is placed in the static region of the hypervisor, which remains intact during partial reconfiguration of the application region. This solves the PCIe problem, and the PCIe interface of the FPGA remains discoverable by the host.
Reconfiguration of a partial region should happen only when the network is not in the middle of sending a packet to the partial reconfiguration region. The gates in this hypervisor are responsible for ensuring that the current packet transmission has completed before programming of the partial bitstream starts.
4.1.2 Galapagos Hypervisor
The Hypervisor in [4] was built specifically for the Alpha Data 7v3 boards [29], and it had
proprietary IPs that were not publicly available. This thesis is intended to be an open-source
project. Therefore, as the first step, a general Hypervisor is built for the Alpha Data 8k5
boards [30] using cores that are available in the Xilinx IP catalog rather than proprietary cores.
Figure 4.2 shows the block diagram of the Hypervisor. The characteristics of this Hypervisor
are as follows:
• Network: Same as the base infrastructure, this Hypervisor provides access to the network
through a 10G Ethernet link. Furthermore, by adding the Xilinx Network Stack [31], it
supports higher levels of network communication such as an IP layer and a Transport layer (e.g., UDP and TCP). However, in this thesis Ethernet is chosen as the main interface to the
application region via AXI streams due to its compatibility with the base infrastructure.
The modified version of the base infrastructure application region is still able to give the
user the option of using higher layers of the network in an easy manner. This will be
described in Section 4.2.
In this Hypervisor, there is a monitoring module called Minitor (mini-monitor) that measures network load on the FPGA. This simple module counts the number of incoming and outgoing packets over specific time intervals, which can help the cloud orchestration tool perform load balancing; a counting sketch under stated assumptions appears at the end of this section. A MicroBlaze in the hypervisor has access to this information, and users may define their own use cases for it.
• PCIe: For the PCIe interface of this Hypervisor, the Xilinx DMA/Bridge Subsystem is
used. This module is available in the Xilinx IP catalog, and its driver and some sample
codes are provided as well. In the integration of this Hypervisor with the base infrastruc-
ture, the PCIe link is used as a control path for the system.
• DDR Memory: For the DDR memory I/O interface in this thesis, the Xilinx Memory
Interface Generator (MIG) is used. This module is available in the Xilinx IP catalog as
well. The memory management unit in this Hypervisor is a simple AXI interconnect
because this hypervisor does not support multiple concurrent users.
• Application Region Interface: The application region interface is exactly the same as the
base infrastructure Hypervisor.
This thesis is focused on heterogeneous platforms, in which different boards and hosts are able to communicate with each other. All the modules in this hypervisor are from the Xilinx IP catalog, so with some simple changes to configurations and constraint files, the Hypervisor
can be easily ported. The link to this work can be found at
https://github.com/eskandarinariman/8k5-shell.
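As referenced above, here is a minimal sketch of Minitor-style counting logic, assuming an AXI4-Stream pass-through with tlast-delimited packets; the beat format, window length, and register interface are all assumptions, not the actual Minitor implementation.

    #include <hls_stream.h>
    #include <ap_axi_sdata.h>
    #include <ap_int.h>

    typedef ap_axiu<64, 1, 1, 8> axis_word; // assumed beat format

    // Hypothetical Minitor-style counter: passes traffic through untouched and
    // counts completed packets (delimited by tlast) over a fixed window. The
    // count would be exposed to the MicroBlaze, e.g., through an AXI4-Lite
    // register; the window length is an arbitrary assumption.
    void minitor(hls::stream<axis_word> &in, hls::stream<axis_word> &out,
                 ap_uint<32> *pkts_in_window) {
    #pragma HLS INTERFACE axis port=in
    #pragma HLS INTERFACE axis port=out
    #pragma HLS INTERFACE s_axilite port=pkts_in_window
        ap_uint<32> count = 0;
        const ap_uint<32> WINDOW = 1000000; // beats per measurement window
        for (ap_uint<32> cycle = 0; cycle < WINDOW; cycle++) {
    #pragma HLS PIPELINE II=1
            if (!in.empty()) {
                axis_word w = in.read();
                out.write(w);
                if (w.last) count++; // one packet completed
            }
        }
        *pkts_in_window = count;     // publish this window's packet count
    }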
Figure 4.2: Hypervisor/Shell block diagram
4.2 Application Region
After building a new Hypervisor for the base infrastructure, some modifications are required for
making the application region of the base infrastructure more scalable, modular, and reliable.
These changes are applied to the application region and the automated flow that creates it. In
this chapter, we focus on the hardware design of the application region; the tool flow will be described in Chapter 6.
4.2.1 Base Infrastructure Application Region
This framework takes a description of a cluster composed of streaming kernels with a unique
ID and maps it to multiple FPGAs. Each streaming kernel uses the AXI stream protocol with a
dest field to specify which kernel the packet is destined for. As shown in Figure 3.3, a logical
view of this infrastructure is a large switch connecting all the kernels within the cluster. The
framework transforms this logical switch into two physical switches, with the first being an
AXI Stream switch on the FPGA and the second being a top-of-rack network switch. The
application region of the original base infrastructure is shown in Figure 4.3.
Figure 4.3: The original open-source framework from [4].
In the base infrastructure, Ethernet packets are transformed into AXI stream packets through
the use of an Input Bridge that matches the MAC address of the arriving packet with the FPGA
MAC address. This then connects to an Input Switch, which is a Xilinx AXI-stream switch.
This switch is connected to all kernels within the cluster, either directly if on the same FPGA,
or through a Packet-Formatter module that encapsulates the AXI stream packet with the ap-
propriate Ethernet headers, and places the AXI stream dest field in the Ethernet payload (one
packet formatter for each kernel outside the FPGA). These direct connections limit scalability
as the number of kernels in the cluster is limited by the number of ports in the switch (16).
4.2.2 Galapagos Application Region
The modified version of the base infrastructure application region is illustrated in Figure 4.4.
In Galapagos, we first address scalability by creating a general router. The block diagram of
the router is shown in Figure 4.5. The router on each FPGA includes a routing table indexed
by the unique ID of each kernel in the entire cluster (including kernels not on this FPGA) and
the network address (MAC or IP) of the FPGA that contains each kernel. All kernels output
their packets to the output switch of the router. Then a module called filter reads the dest field
of the packet, and looks up the network destination in the routing table. Afterwards, the packet
is either routed back to the input switch of the router or routed out to the network. The tool
flow of Galapagos specifies the number of ports of the router by reading the cluster description
files (described in Chapter 6) and it can be any number between 1 and 16. The number of ports
on this router is equal to the number of kernels on this FPGA only, since all packets leaving the FPGA share one channel. This limits us to at most 16 kernels on a particular FPGA, as opposed to 16 across the entire cluster as in the original design. The routing table is automatically
generated by our modifications to the cluster generator.
Figure 4.4: A high-level overview of Galapagos application region.
Figure 4.5: Router block diagram
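A minimal sketch of the filter's lookup step described above, under assumed types (an 8-bit kernel ID indexing a 256-entry table of network addresses plus a local flag); the real router is generated by the tool flow, and its table format is not reproduced here.

    #include <hls_stream.h>
    #include <ap_axi_sdata.h>
    #include <ap_int.h>

    typedef ap_axiu<64, 1, 1, 8> axis_word;

    struct route_entry {
        bool        local;  // true if the kernel is on this FPGA
        ap_uint<48> addr;   // network address of the hosting node (MAC here)
    };

    // Hypothetical filter logic: look up the dest side-channel of each beat in
    // the routing table and steer it back on-chip or out to the network bridge.
    // A full implementation would latch the decision until tlast; this sketch
    // routes beat by beat (all beats of a packet share one dest).
    void filter(hls::stream<axis_word> &from_output_switch,
                hls::stream<axis_word> &to_input_switch,
                hls::stream<axis_word> &to_network_bridge,
                const route_entry table[256]) {
    #pragma HLS INTERFACE axis port=from_output_switch
    #pragma HLS INTERFACE axis port=to_input_switch
    #pragma HLS INTERFACE axis port=to_network_bridge
        axis_word w = from_output_switch.read();
        if (table[w.dest].local)
            to_input_switch.write(w);   // kernel is on this FPGA: loop back
        else
            to_network_bridge.write(w); // bridge adds headers from table[w.dest].addr
    }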
The Network Bridge, which is shown in Figure 4.4, is responsible for converting network
packets into AXI-stream format and vice versa. For the Ethernet Network Bridge of Fig-
ure 4.4, this thesis combines the Input Bridge and a modified version of the Packet Formatter
of Figure 4.3 into a single module. The block diagram of this module is shown in Figure 4.6.
The modified packet formatter is equipped with a lookup table that has the MAC address for each destination kernel. Using the MAC address lookup table in the packet formatter decreases resource utilization, as it uses only one packet formatter with a small memory instead of one packet formatter per kernel outside the FPGA, and it helps Galapagos be more scalable.
In [4] the multi-FPGA communication uses the Ethernet protocol, which is not reliable. In
Galapagos, to address reliability, an optional TCP core [32] is integrated into the framework.
In this thesis an additional Network Bridge is created for TCP, allowing us to standardize the
interface between the Hypervisor and Application Region. The standardization of the interface
allows us to use both TCP and Ethernet interchangeably, thus addressing modularity. The block
diagram of the TCP Network Bridge is illustrated in Figure 4.7. Observe that the interfaces are
the same as for the Ethernet Network Bridge in Figure 4.6.
Figure 4.6: Ethernet Network Bridge.
Figure 4.7: TCP Network Bridge.
The Ethernet Network Bridge receives all the packets that are sent to the same VLAN. It
matches the FPGA MAC address with the destination MAC address of the incoming packet.
Afterwards, it either drops the packet or forwards it to the intra-FPGA network. The TCP
Network Bridge is more complex since it operates at a higher level of the network stack. In
this bridge there are notions of server and client. The server listens to a specific port, waiting
for a connection. A client can make a connection to the server using the specific port and IP
address of the server. When a connection is accepted by the server, a session will be created on
both sides of this communication. Afterwards, the client can send a request to the server, and
the server will respond to the request using the session.
The TCP Network Bridge developed in this thesis operates in the following manner. At
reset time, all the servers (FPGA and CPU nodes) of the entire cluster start listening to a specific
port. When a kernel sends data to another kernel that is located on another node, the client side
of the sender creates a connection, using the port and IP address of the destination kernel that is
available in a look up table called the IP table (the automated tool flow of Galapagos generates
this). When the destination server accepts the connection, a new session will be created for
both client and server. Each side then puts the session ID into another lookup table, indexed by destination.
This session will be used for all future communication between these two kernels in the cluster.
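As an illustration, here is a sketch of the session bookkeeping described above, with assumed types and table sizes; the connect call stands in for the TCP core's interface [32], which is not reproduced here.

    #include <ap_int.h>

    // Hypothetical connect call standing in for the TCP core's open interface.
    ap_uint<16> open_connection(ap_uint<32> ip, ap_uint<16> port);

    // One session entry per remote kernel ID (256 assumed as the table size).
    struct session_entry {
        bool        open;        // has a TCP session been established yet?
        ap_uint<16> session_id;  // session handle returned by the TCP core
    };
    static session_entry sessions[256];
    static ap_uint<32>   ip_table[256]; // kernel ID -> node IP, tool-generated

    // Return the session for a destination kernel, opening one on first use.
    // The session is then reused for all later messages to that kernel.
    ap_uint<16> get_session(ap_uint<8> dest_kernel, ap_uint<16> port) {
        if (!sessions[dest_kernel].open) {
            sessions[dest_kernel].session_id =
                open_connection(ip_table[dest_kernel], port);
            sessions[dest_kernel].open = true;
        }
        return sessions[dest_kernel].session_id;
    }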
A user may wish to create a communication layer on top of standard network layers (e.g.,
MPI). A communication bridge (Comm Bridge in Figure 4.4) is used to transform network
packets to communication layer compliant packets. In Chapter 5, the HUMboldt communi-
cation layer will be described as an example of a communication bridge. In Galapagos, we
further address modularity by allowing the user to configure the network protocol (Ethernet or
TCP) and communication bridge via the mapping configuration file. Details of the mapping
file can be found in Chapter 6.
Chapter 5
HUMboldt Communication Layer
In this chapter a heterogeneous communication layer called HUMboldt is described. The rea-
sons that this communication layer is called HUMboldt are as follows:
• In the Galapagos Islands, Humboldt is the name of an ocean current that flows between the islands.
• Humboldt is also the name of an explorer ship that carries passengers between the islands.
• The communication layer in this work is heterogeneous and based on a message-passing model, so “HUM” could stand for Heterogeneous Uniform Messaging.
Due to the modular design of Galapagos, the implementation of the communication layer can
be any communication model as long as it adheres to the AXI stream interface, and the appro-
priate communication bridge to convert AXI stream packets into communication layer specific
packets is provided. The communication layer we present here, HUMboldt, implements a minimal subset of MPI that is sufficient to enable basic message passing between kernels, as explained in Section 5.2. MPI is a standard API that defines signatures for functions such
as sending and receiving messages. These signatures must remain the same for all implemen-
tations of MPI. Even though the implementation for these functions varies according to the
platform (e.g., FPGA or CPU), the standardization of the HUMboldt protocol allows for
communication between heterogeneous platforms. In this chapter, we first provide more de-
tails about TMD-MPI, the predecessor to HUMboldt, followed by some background on MPI.
We then present the details of HUMboldt.
5.1 TMD-MPI
TMD-MPI [25] implements a heterogeneous MPI infrastructure that can be considered a baseline heterogeneous communication layer supporting CPUs and FPGAs. The high-level architecture of a TMD-MPI FPGA node is shown in Figure 5.1. The Off-chip Communication
Controller (OCCC) module handles the communication with off-chip nodes. The bridge trans-
lates the packets to a format that the OCCC can send outside the FPGA. The Network Interface
(NetIf) modules route packets to their destination internal to an FPGA.
In TMD-MPI the user has to deal with all the low-level infrastructure. However, in Gala-
pagos all the infrastructure is already implemented, and the user can choose between different
options for communication by creating the cluster description files, and the cluster will be gen-
erated automatically. Furthermore, in Galapagos, by introducing the communication bridge,
a user can define any protocol for communication. However, the communication protocol of
TMD-MPI is fixed and integrated within the whole infrastructure.
Figure 5.1: TMD-MPI high-level view.
The kernels that use TMD-MPI for communication should be connected to a module called
TMD-MPE (TMD-Message Passing Engine). This module implements functionality equivalent to the TMD-MPI software library. The user has to manage the scheduling of the TMD-MPE themselves, which is done through an MPE wrapper and an FSM, as shown in Figure 5.2. TMD-
MPI was developed before high-level synthesis tools existed so the kernels and the TMD-MPE
were all written in HDL. The user had to manually connect them.
The HUMboldt communication layer provides an HLS library that has equivalents of all the HUMboldt software functions. Therefore, a user can use the same code for both software and hardware, and just call HUMboldt library functions. In TMD-MPI, creating a hardware core is difficult because the user has to manually schedule all the MPI function calls and add the handler logic that works with the TMD-MPE, whereas in HUMboldt, HLS handles the scheduling of MPI calls.
Figure 5.2: TMD-MPI kernel.
5.2 MPI Communication Layer
This section explains MPI and the subset of the MPI library that we support, which we call
Heterogeneous Uniform Messaging (HUMboldt). MPI is a common communication and pro-
gramming layer in the High Performance Computing field. Using MPI, many parallel processes
called “ranks”, communicating via messages, can be run on multi-node platforms. In this work,
instead of “rank” we use “kernel” to be consistent with the base infrastructure terminology.
MPI has been mainly developed for multi-CPU network-connected platforms, which is the
standard high-performance computing architecture. However, it can also be used as a message
passing model and a programming layer for heterogeneous platforms.
In MPI software implementations such as MPICH [33] and OpenMPI [34], functions are
provided to transmit data among different kernels in various ways. The two main categories of
send and receive are blocking and non-blocking. In blocking send and receive, the functions
do not return until the data transmission is successfully done. In contrast, in non-blocking send
and receive, the functions immediately return, and continue the normal flow of the program. In
this thesis, only blocking send and receive are implemented. Other types of functions provided in
MPI include collective functions, e.g., MPI_Gather, MPI_Reduce, and MPI_Allreduce. These functions collect the data from different nodes at an optional root kernel, combining it with a user-specified operation, e.g., add or subtract. This thesis has not implemented any of these functions; however, implementing them using the send and receive functions as building blocks is straightforward.
The minimum subset of MPI functions needed for communication are:
1. MPI_Init: This function initializes the MPI environment and does the basic setup, such as network interface initialization.
2. MPI_Send and MPI_Recv: These two functions are the building blocks of the MPI programming model that enable data transmission among kernels. For every Send to a kernel, there must be a matching Receive on that kernel to get the data from the sender.
3. MPI_Finalize: Makes sure that all kernel processes are completed.
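To illustrate, here is a minimal sketch of a kernel written against this subset; the argument forms below are modeled on standard MPI and are assumptions, since the exact HUMboldt signatures are described in Chapter 6.

    #include "HUMboldt.h" // hypothetical header name

    // A kernel that receives 64 integers from kernel 0, doubles them, and
    // sends them back. The same source can be compiled for a CPU with the
    // software library or synthesized for an FPGA with the HLS library.
    void kernel_main() {
        int data[64];
        MPI_Init();
        MPI_Recv(data, 64, MPI_INT, 0 /* source kernel */);
        for (int i = 0; i < 64; i++)
            data[i] *= 2;
        MPI_Send(data, 64, MPI_INT, 0 /* destination kernel */);
        MPI_Finalize();
    }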
5.3 Protocol
In this section we describe the protocol by which all hardware and software kernels communi-
cate with each other in HUMboldt. The first step in data transmission is to send an envelope
as a send request. This envelope has information about routing, message size, and data type.
Once the receiver kernel is ready to receive the data, it sends a clear to send envelope to the
sender. After the sender receives the clear to send packet, it starts to send the data packets to
the receiver, and at the end, when the receiver has received all the data successfully, it sends a
packet to the sender indicating that the transmission is done. These steps are shown in Figure
5.3.
Figure 5.3: HUMboldt protocol
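A sketch of the sender side of this handshake, with packet-type constants taken from the description above (send request, clear to send, data, done); the encodings and transport hooks are assumptions.

    #include <cstdint>
    #include <cstddef>

    // Packet types from the protocol description (encodings are assumptions).
    enum PacketType : uint8_t { SEND_REQUEST = 0, CLEAR_TO_SEND = 1, DATA = 2, DONE = 3 };

    // Hypothetical transport hooks; in hardware these would be AXI-Stream
    // reads and writes through the communication bridge.
    void send_envelope(uint8_t dest, uint8_t src, PacketType type, uint32_t size);
    PacketType wait_for_envelope(uint8_t from);
    void send_data(uint8_t dest, const void *buf, size_t bytes);

    // Blocking send following the HUMboldt handshake.
    void hum_send(uint8_t my_id, uint8_t dest, const void *buf, uint32_t bytes) {
        send_envelope(dest, my_id, SEND_REQUEST, bytes); // 1. request to send
        while (wait_for_envelope(dest) != CLEAR_TO_SEND) // 2. receiver signals ready
            ;                                            //    (other envelopes buffered)
        send_data(dest, buf, bytes);                     // 3. stream the data packets
        while (wait_for_envelope(dest) != DONE)          // 4. receiver confirms
            ;
    }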
Buffering is necessary because there are scenarios when send or receive envelopes might
be sent out of order. For example, consider kernel A receiving data from kernel B when kernel
C sends an envelope to kernel A. To avoid losing the envelope, the envelope from kernel C
must be buffered by kernel A until the previous transaction with kernel B is completed.
In parallel processing platforms, synchronization is an important aspect since different processes run at different speeds. At some point, one node will need to wait for other nodes to
reach a specific point in the program. In HUMboldt, we currently only implement blocking
sends and receives, meaning that the kernels wait for these functions to finish before resuming
processing. This is sufficient for the synchronization we currently need, but when we add the
ability to do non-blocking sends and receives we will also need to add barriers into the next
version of the HUMboldt library.
There are two types of networks in our HUMboldt communication layer: the intra-FPGA AXI-Stream network is used for kernels that are located in the same physical FPGA, and inter-node communication between FPGAs and CPUs uses the data center network. The network com-
munication currently supports TCP or Ethernet but any network protocol with an AXI stream
interface can be used. We use the same underlying HUMboldt protocol for kernels communicating within the same node (FPGA or CPU) and between nodes, as our Communication Bridge described in Section 4.2.2 encapsulates a HUMboldt-compliant packet with the appropriate network header (e.g., TCP or Ethernet) to handle the inter-node communication.
Figure 5.4 shows the envelope that carries the HUMboldt messages between kernels. The first two bytes correspond to the destination and source kernels. The Packet Type is one of send request, clear to send, data, or done; these packet types implement the message passing protocol underlying HUMboldt. The next three bytes specify the size of the message being sent. In the standard implementation of MPI, the Tag lets the user attach optional metadata to transactions, and for compatibility purposes we keep this as a field in the HUMboldt envelope. Because HLS lacks a void pointer, the HMPI library provides separate Send and Receive functions for the different data types, and the Data Type field allows the different data types to be processed properly.
Figure 5.4: HUMboldt packet format
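In C terms, the envelope of Figure 5.4 corresponds roughly to the struct below. Only the one-byte destination and source fields and the three-byte size field are fixed by the description above; the widths assumed for the Packet Type, Tag, and Data Type fields are illustrative.

#include <stdint.h>

/* Approximate layout of the HUMboldt envelope in Figure 5.4.
 * The widths of pkt_type, tag, and data_type are assumptions. */
typedef struct {
    uint8_t dest;      /* destination kernel ID (byte 0)               */
    uint8_t src;       /* source kernel ID (byte 1)                    */
    uint8_t pkt_type;  /* send request, clear to send, data, or done   */
    uint8_t size[3];   /* message size, three bytes                    */
    uint8_t tag;       /* optional user metadata, kept for MPI compat. */
    uint8_t data_type; /* selects the typed Send/Recv variant          */
} hum_envelope;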
5.4 Packetization
For the sake of modularity, the HUMboldt library should be able to send messages of any size to other kernels. This is not a problem in the intra-FPGA network because there is no packet size limitation. However, when a packet goes through the inter-node network, it faces the 1518-byte packet size limitation, the maximum number of bytes allowed for an Ethernet frame. Therefore, packetization logic is implemented in the Communication Bridge, because every outgoing packet goes through this module first. The Ethernet Communication Bridge divides the message into Ethernet-frame-sized packets, adds a data envelope to each, and sends them to the Network Bridge. The TCP Network Bridge can handle larger packets. Nonetheless, in our first experiments with the TCP core, we saw that the TCP/IP core buffers all the data before starting to send it, and it is still limited to a 65 KB buffer. Furthermore, the core eventually divides a large TCP packet into small Ethernet frames anyway. The same packetization logic is therefore added to the TCP Communication Bridge to lower the latency of TCP communication (it does not wait to buffer the data), to address the TCP buffer size limitation, and to keep the TCP and Ethernet implementations of the Communication Layer consistent.
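The chunking step itself reduces to a loop like the one sketched below. The MAX_PAYLOAD value and the send_frame helper are assumptions: the real per-frame payload is whatever remains of the 1518-byte frame after the Ethernet header and the HUMboldt data envelope are accounted for.

#include <stddef.h>
#include <stdint.h>

/* Assumed usable payload per Ethernet frame: 1518 bytes minus the
 * Ethernet header/CRC and the HUMboldt data envelope. */
#define MAX_PAYLOAD 1464

/* Hypothetical helper that wraps one chunk in a data envelope
 * (carrying the sequence number) and hands it to the network bridge. */
void send_frame(uint32_t seq, const uint8_t *chunk, size_t len);

/* Split one HUMboldt message into Ethernet-frame-sized packets,
 * numbering each one for the reliability layer of Section 5.5. */
void packetize(const uint8_t *msg, size_t total)
{
    uint32_t seq = 0;
    for (size_t off = 0; off < total; off += MAX_PAYLOAD) {
        size_t len = total - off;
        if (len > MAX_PAYLOAD)
            len = MAX_PAYLOAD;
        send_frame(seq++, msg + off, len);
    }
}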
5.5 Reliability Layer
Given that Galapagos uses Layer 2 communication (Ethernet), packets can be dropped in the network or received out of order. There are two options for addressing these issues. The first is adding a reliable transport layer like TCP, which is described in Section 4.2.2. The other alternative is to add a simple and lightweight reliability layer on top of Layer 2. Reliability on top of Layer 2 can offer lower latency in network environments that are not busy and where the packet drop rate is low. The other advantage of a lightweight reliability layer is lower resource utilization.
As mentioned in Section 5.4, all the packets that leave the FPGA go through the Communication Bridge, which is responsible for packetizing the data and adding a data envelope to each network packet. It adds a sequence number to each data packet. The sequence number starts at 0 and increments by one per outgoing network packet, and the message size field of the data packet envelope is used to carry this extra information. This number is the key element of our reliability layer. The receiver keeps track of the sequence numbers from each node of the cluster, for each transaction. When a packet gets dropped or packets are received out of order, the sequence number does not match the one that the receiver is expecting. The receiver then sends an error envelope with the last correct sequence number it received. Once the sender receives the error envelope, it starts sending data from that sequence number again. The receiver keeps discarding data packets until it gets the correct one, and data transmission resumes in the normal way. We expect that this simple protocol will be sufficient because packets are transmitted over relatively short distances within a data center, as compared to large geographical distances, so packet loss and ordering issues should be infrequent. This lightweight reliability layer is implemented, but it has many corner cases that still need to be addressed. Therefore, completing a bug-free version of this reliability layer is left to future work.
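The receiver-side check amounts to the logic sketched below; the helper functions, the MAX_NODES constant, and the exact recovery details are assumptions for illustration.

#include <stdint.h>

#define MAX_NODES 256   /* assumed upper bound on cluster nodes */

/* Hypothetical helpers standing in for the real implementation. */
void deliver(const uint8_t *payload, int len);
void send_error_envelope(int src, uint32_t last_good_seq);

/* Next expected sequence number, tracked per sending node. */
static uint32_t expected_seq[MAX_NODES];

/* Called for every incoming data packet: deliver in-order packets;
 * on a gap or reordering, request retransmission from the last
 * correctly received sequence number and discard until it arrives. */
void on_data_packet(int src, uint32_t seq, const uint8_t *payload, int len)
{
    if (seq == expected_seq[src]) {
        deliver(payload, len);
        expected_seq[src]++;
    } else {
        send_error_envelope(src, expected_seq[src] - 1);
    }
}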
5.6 HUMboldt Software
A heterogeneous communication layer needs an implementation for each platform that communicates using the same protocol. For the software side of HUMboldt, standard socket programming libraries are used to implement functions that communicate using the same protocol as the hardware kernels. These libraries are available for many types of CPU platforms, which helps improve functional portability and heterogeneity. The Galapagos infrastructure, how it handles the different layers of the network, and how it delivers packets to the HUMboldt layer are described in Chapter 4, and the underlying protocol of the HUMboldt communication layer is described earlier in this chapter. The base infrastructure in [4] has no notion of software kernels because it focuses on the FPGA cluster. However, with the modifications made in Galapagos and the addition of HUMboldt as a communication layer, software kernels are now part of the platform.
Because different layers of the network stack are supported, different software functions are implemented that act as the software equivalents of the hardware network bridges (described in Section 4.2.2). Given the heterogeneity of the HUMboldt communication layer, it provides a higher level of abstraction that covers the different combinations of communication between software and hardware kernels. In its initialization phase, the HUMboldt software opens an Ethernet socket as well as starting a TCP server; every HUMboldt software kernel always has both Ethernet and TCP, whereas in hardware, because of the limited available resources, the user is restricted to a single protocol. Furthermore, the software parses the cluster description files to determine the type of each kernel (software or hardware) and the communication protocol (Ethernet or TCP) chosen for each node. The following communication scenarios can occur.
• Software and Software: MPICH is a mature and well-optimized software implementation that supports all the MPI features, so it is included as the part of the HUMboldt layer responsible for communication among software kernels. For software-to-software communication, the MPICH functions are called instead of our custom protocol and functions. The command that runs an MPICH program, called mpirun, has multiple options such as the number of kernels and the distribution of kernels across hosts, which is specified in a host file. Using this information, it creates a process for each kernel and distributes the processes across the hosts according to the host file. When HUMboldt has multiple kernels in software, the mpirun command should be used. All the information it needs, such as the number of software kernels and their hosts (IP addresses), is available in the cluster description files. In the current state of this work, the user runs this command manually; as future work, a simple HUMboldtrun command could be created that first parses those files and then sets the mpirun options.
• Software and Ethernet Hardware: In this case a HUMboldt function calls the set of functions that communicate with the hardware kernel through the Ethernet protocol. A receiving thread always listens on an Ethernet socket with a specific HUMboldt EtherType and stores all incoming packets into different FIFOs that the HUMboldt functions use when executing the protocol.
• Software and TCP Hardware: In this case the TCP set of functions is called. The software equivalent of the TCP Network Bridge forks a server thread that listens on the specific HUMboldt port. When a connection is accepted, a new thread is created to receive data from that session, and the session is stored in an array indexed by the kernel number for future use. When sending data, the main thread of the program searches the sessions array. If the session already exists, the main thread uses it to send data to the target kernel. Otherwise, it creates a new connection for sending, creates a new receiving thread for this session, and stores the session in the sessions array (a sketch of this session management follows this list). Similar to the Ethernet network bridge, all incoming packets are stored in a buffer and queued to be processed at the right time.
• Hardware and Hardware: The hardware kernels must communicate through the same protocol (Ethernet or TCP) because there is only one supported protocol per hardware node. However, the HUMboldt software implementation makes it possible to bridge between the different protocols. As mentioned above, all software kernels have both the Ethernet and TCP protocols, so, for example, an Ethernet hardware node can send data to a software kernel, which forwards the data to a TCP hardware node.
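As referenced in the TCP scenario above, the send-path session management can be sketched as follows. The helper functions and the MAX_KERNELS bound are hypothetical, but the structure mirrors the description: reuse a stored session if one exists, otherwise connect, spawn a receiver thread, and remember the session.

#include <pthread.h>

#define MAX_KERNELS 256  /* assumed bound on kernels in the cluster */

/* Hypothetical helpers standing in for the HUMboldt software bridge. */
int   open_connection(int kernel_id);           /* connect() to the kernel's host   */
void *recv_session_thread(void *session_sock);  /* queues incoming packets in FIFOs */
void  send_all(int sock, const void *buf, int len);

/* One TCP session per remote kernel, indexed by kernel number;
 * zero means no session has been opened yet. */
static int sessions[MAX_KERNELS];

void tcp_send_to_kernel(int kernel_id, const void *buf, int len)
{
    if (sessions[kernel_id] <= 0) {
        /* no existing session: connect and start a receiver thread */
        pthread_t t;
        sessions[kernel_id] = open_connection(kernel_id);
        pthread_create(&t, NULL, recv_session_thread, &sessions[kernel_id]);
    }
    send_all(sessions[kernel_id], buf, len);
}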
We have described the Galapagos infrastructure and the HUMboldt communication layer built on top of Galapagos, and we have shown how this platform is scalable, modular, and heterogeneous. How the user can easily take advantage of these characteristics of the platform is described in Chapter 6, which introduces the user interface and tool flow of Galapagos and HUMboldt.
Chapter 6
System Interface and Tool Flow
This chapter specifies how a user builds an application using the HUMboldt communication layer as well as the Galapagos infrastructure underneath it. The ease of use, heterogeneity, and scalability of this infrastructure are shown more clearly in this chapter.
6.1 Base Infrastructure System Interface
The tool flow of the base infrastructure [4] [21] takes several streaming IP blocks and the description files, which are the logical file, mapping file, and MAC address file in XML format (described in Section 3.2). These files are the user interface to the base infrastructure, and they are parsed by the tool flow of the base infrastructure to automatically connect the kernels in the user-defined cluster. Afterwards, the tool flow reserves the resources from the cloud, puts them on the same network, and gives the user the handle to program and send data to the created cluster.
6.1.1 Logical File
A sample Logical file is shown in Listing 6.1. The “kernel” field specifies the user hardware core, which has two AXI-Stream interfaces whose names are required by the automated tool flow. In the “interface” field of the logical file, the user can specify the name and the direction of each AXI-Stream port. The “num” field of the logical file is used in the mapping file as a unique identifier when the user maps kernels onto the physical FPGAs. The “rep” field specifies the number of replications of a kernel within the entire cluster. For example, in Listing 6.1, the “rep” field of hardware core A is 4, meaning that kernels numbered 0 to 3 are instances of hardware core A.
<?xml version="1.0" encoding="UTF-8"?>
<cluster>
  <kernel> hardware_core_name_A
    <num> 0 </num>
    <rep> 4 </rep>
    <interface>
      <direction> in </direction>
      <name> stream_in_V </name>
    </interface>
    <interface>
      <direction> out </direction>
      <name> stream_out_V </name>
    </interface>
  </kernel>
  <kernel> hardware_core_name_B
    <num> 4 </num>
    <rep> 4 </rep>
    <interface>
      <direction> in </direction>
      <name> stream_in_V </name>
    </interface>
    <interface>
      <direction> out </direction>
      <name> stream_out_V </name>
    </interface>
  </kernel>
</cluster>
Listing 6.1: Sample of base infrastructure Logical file
6.1.2 Mapping File
Once the logical view of the kernels is defined, the kernels must be mapped onto actual physical FPGAs. A sample of the mapping file is illustrated in Listing 6.2. In this file, the kernel numbers (defined in the Logical file) are used to map the kernels onto the FPGAs. In this example, kernels 0 to 3, which are hardware core A, are mapped to the first FPGA, and the rest of the kernels, which are hardware core B, are mapped to the second FPGA.
<?xml version="1.0" encoding="UTF-8"?>
<cluster>
  <FPGA>
    <kernel> 0 </kernel>
    <kernel> 1 </kernel>
    <kernel> 2 </kernel>
    <kernel> 3 </kernel>
  </FPGA>
  <FPGA>
    <kernel> 4 </kernel>
    <kernel> 5 </kernel>
    <kernel> 6 </kernel>
    <kernel> 7 </kernel>
  </FPGA>
</cluster>
Listing 6.2: Sample of base infrastructure Mapping file
6.1.3 MAC Address File
The base infrastructure needs another file that contains all the MAC addresses. For each FPGA defined in the mapping file, there must be a MAC address. The automated tool flow assigns these MAC addresses to the physical FPGAs. A sample of this file is shown in Listing 6.3.
fa:16:3e:55:ca:02
fa:16:3e:55:ca:03
Listing 6.3: Sample of base infrastructure MAC address file
6.2 Galapagos System Interface
Section 4.2.2 describes the application region of Galapagos, focusing on its hardware aspects. When the underlying design of an infrastructure changes, the system interface must change accordingly to support those changes. This section describes the Galapagos system interface that supports the underlying infrastructure.
The features added to the Galapagos system interface were developed in collaboration with Naif Tarafdar, the designer of the system interface of the base infrastructure. He modified the part of the tool flow that parses the cluster description files to build the Application Region of Galapagos automatically. These modifications were needed by the HUMboldt communication layer.
6.2.1 Logical File
Listing 6.4 shows a sample logical file. It is largely the same as the base infrastructure logical file in [4], with a few small changes, as follows. There are different naming conventions for the reset and clock ports in the Vivado environment, so in the modified logical file the user can specify the clock and reset port names. One feature that makes this complex heterogeneous system easier to use is the ability to debug it by monitoring signals. This thesis has added a debug capability to the system, by which the signals marked as debug (Lines 12 and 17 of Listing 6.4) are connected to a Xilinx Integrated Logic Analyzer (ILA) IP core [35], a logic analyzer that can be used to monitor the internal signals of a design running on an FPGA. Another capability added to the Galapagos system interface is the definition of a constant port that can be assigned automatically by the tool flow.
1 <?xml version="1.0" encoding="UTF-8"?>
2 <cluster>
3 <kernel> hardware_core_name
4 <num> 1 </num>
5 <rep> 96 </rep>
6 <clk> aclk </clk>
7 <aresetn> aresetn </aresetn>
8 <id_port> kernel_id </id_port>
9 <interface>
10 <direction> in </direction>
11 <name> stream_in_V </name>
12 <debug/>
13 </interface>
14 <interface>
15 <direction> out </direction>
16 <name> stream_out_V </name>
17 <debug/>
18 </interface>
19 <const>
20 <name> size </name>
21 <val> 4 </val>
22 <width> 16 </width>
23 </const>
24 </kernel>
25 <kernel> cpu
26 <num> 0 </num>
27 <rep> 1 </rep>
28 </kernel>
29 </cluster>
Listing 6.4: Sample of Logical File
6.2.2 Mapping File
Listing 6.5 shows how the kernel replications can be mapped onto one or more FPGAs; in lines 20 to 24 it can be seen that kernels 1 to 16 are mapped to a single FPGA. This shows how easy it is to scale the system by changing a few lines in the configuration files. The mapping file of the base infrastructure, described in Section 6.1.2, has the same capability, which Galapagos inherits. However, as mentioned in Section 4.2, the base infrastructure is limited to 16 kernels within the entire cluster, whereas Galapagos has no limit on the total number of kernels, only a limit of 16 kernels per FPGA.
Some additional features are added to the mapping file to support heterogeneity and modularity. For example, in Listing 6.5, lines 4 and 18 show two different types of nodes (software and hardware), which addresses heterogeneity and shows how easy it is to change a kernel from hardware to software, or vice versa. Furthermore, to address modularity, the user can specify a bridge for the communication layer, as shown in lines 10 to 16. If the user does not specify a bridge, it is assumed the kernels will communicate directly via AXI-Stream. Modularity within the network layer can be observed in line 19, where the user can specify the network protocol (e.g., TCP or Ethernet), with network addresses given in lines 6-7 and 25-26. In Galapagos the MAC address file is removed from the description files (only the logical file and the mapping file are needed to describe a cluster), and the user defines the addresses in the mapping file. The network addresses would be supplied by the manager of the data center, which could be either the user or a cloud resource manager.
1 <?xml version="1.0" encoding="UTF-8"?>
2 <cluster>
3 <node>
4 <type> sw </type>
5 <kernel> 0 </kernel>
6 <mac_addr> ac:c4:7a:88:c0:47 </mac_addr>
7 <ip_addr> 10.1.2.152 </ip_addr>
8 </node>
9 <node>
10 <appBridge>
11 <name> communication_bridge_eth_mpi </name>
12 <to_app> to_app_V </to_app>
13 <from_app> from_app_V </from_app>
14 <to_net> to_net_V </to_net>
15 <from_net> from_net_V </from_net>
16 </appBridge>
17 <board> adm-8k5-debug </board>
18 <type> hw </type>
19 <comm> eth </comm>
20 <kernel> 1 </kernel>
21 .
22 .
23 .
24 <kernel> 16 </kernel>
25 <mac_addr> fa:16:3e:55:ca:02 </mac_addr>
26 <ip_addr> 10.1.2.101 </ip_addr>
27 </node>
28 </cluster>
Listing 6.5: Sample of Map File
6.3 HUMboldt
The system interface of Galapagos is described in Section 6.2. In this section, the system interface of the HUMboldt communication layer, which is built on top of Galapagos, is described.
The HUMboldt communication layer is a higher layer on top of Galapagos, and its user interface is as high as the application layer: the user needs to provide application code as the input to the HUMboldt layer. Recall that HUMboldt code can be used for hardware or software kernels, so HUMboldt can be considered a programming layer for this heterogeneous platform. Listing 6.6 illustrates an example of a common model for High Performance Computing applications, in which one kernel is responsible for distributing data to several other kernels and gathers the processed data when all kernels are done.
1 #include "HUMboldt.h"
2 #define MAX_ITR 10
3 #define DATA_SIZE 1000
4 #define TAG 0
5
6 int main(int argc, char* argv[])
7 {
8 HUM_Init(&argc,&argv);
9 int data_array[DATA_SIZE];
10 int size = atoi(argv[1]);
11 int kernel_id = atoi(argv[2]);
12
13 for(int i = 0 ; i < MAX_ITR ; i++){
14 if(kernel_id == 0){
15 for(int r = 1; r < size ; r++)
16 HUM_Send(data_array,DATA_SIZE,MPI_FLOAT,r,TAG,MPI_COMM_WORLD);
17 }
18 else{
19 HUM_Recv(data_array,DATA_SIZE,MPI_FLOAT,0,TAG,MPI_COMM_WORLD);
20 }
21 /* process data*/
22 if(kernel_id == 0){
23 for(int r = 1 ; r < size ; r++)
24 HUM_Recv(data_array,DATA_SIZE,MPI_FLOAT,r,TAG,MPI_COMM_WORLD);
25 }
26 else
27 HUM_Send(data_array,DATA_SIZE,MPI_FLOAT,0,TAG,MPI_COMM_WORLD);
28 }
29 HUM_Finalize();
30 return 0;
31 }
Listing 6.6: HUMboldt sample code
The code in Listing 6.6 needs a few minor modifications to be synthesizable by Vivado HLS. For example, in line 6, instead of argc and argv to input the kernel ID and the size, we use two constant ports. These values are assigned automatically by the tool flow using the logical description file of the kernels, so lines 10 and 11 are not needed. These changes are shown in Listing 6.7. Furthermore, some HLS pragmas must be added, but only for the interfaces, which are always the same. These modifications can be made with a very simple script, and essentially the same code can be run as software or implemented as hardware in an FPGA. This code demonstrates that HUMboldt is heterogeneous and functionally portable across the different processing nodes in a cluster.
int main(const int size, const int kernel_id)
{
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS resource core=AXI4Stream variable=stream_out
#pragma HLS resource core=AXI4Stream variable=stream_in
#pragma HLS DATA_PACK variable=stream_out
#pragma HLS DATA_PACK variable=stream_in
Listing 6.7: Modifications needed to make HUMboldt code synthesizable
As mentioned in Section 5.6, HUMboldt is considered a higher-level communication layer. In Listing 6.6, all the send and receive functions start with “HUM”, meaning that the source and destination kernels of these functions can be in hardware or software. It is possible to port any MPI code to HUMboldt provided that all the MPI functions used in the code are supported by HUMboldt: by changing the “MPI” prefix at the beginning of the function names to “HUM”, the new code can be used as an input file for the HUMboldt tool flow. These changes can be made by a simple script run as a stage of the HUMboldt tool flow.
6.4 HUMboldt Tool Flow
To make HUMboldt work transparently and conveniently across a heterogeneous platform, a tool flow is required that takes the Galapagos description files and the HUMboldt code and creates the whole cluster automatically. Recall that one of the goals is to use identical code whether it is to run as a software kernel or as a hardware kernel. This means that the tool flow has two paths: one to create software executables and the other to build FPGA bitstreams for a user-defined platform. Figure 6.1 shows the flows for hardware and software kernels.

Figure 6.1: Software and Hardware Tool Flow
6.4.1 Software Kernels
Building software kernels is essentially the same as what is currently done for standard MPI software distributions. The first step is to link the HUMboldt software library into the user code. The HUMboldt software library includes the MPI library for software kernel communication. MPICC is used to compile the code because there are MPI functions in the HUMboldt library.
6.4.2 Hardware Kernels
Building the hardware kernels requires transforming the original source code into a form that can be used with high-level synthesis (HLS), which essentially creates a block of hardware from software code. The hardware path of the tool flow passes these hardware blocks, along with the cluster description files, to the Galapagos tool flow, which creates the bitstreams for the FPGAs in the cluster.
Chapter 7
Evaluation
In this chapter, the evaluation of our platform is presented. We consider the resource utilization, the latency and throughput, and the scaling and heterogeneity of our platform.
The testbed that we use to run our test scenarios is a cluster of servers with Intel Xeon E5-2650 CPUs running at 2.20 GHz, each with 12 physical cores, so via hyper-threading 24 software threads can be running. The FPGAs located on the same network as this cluster are Xilinx UltraScale XCKU115-2-FLVA1517E devices on Alpha Data ADM-PCIE-8K5 boards. All network connections are 10G Ethernet connected to a Dell Networking S4048-ON 10G switch. Our HUMboldt implementation is used for any configuration that includes a hardware node. To test the best software-to-software implementation, we simply use MPICH, which is a mature open-source MPI implementation.
7.1 Resource Utilization of Galapagos and HUMboldt
The resource utilization of the different layers of Galapagos, including the Hypervisor that we took from [4] plus the off-chip memory support that we added, the network bridges, the communication layer bridges, and the router within the application region, is shown in Table 7.1. The percentages, shown in parentheses, are relative to the KU115 FPGA. Observe that the resource utilization of Galapagos is about 20% when the user chooses TCP and about 15% when the user chooses Ethernet. The resources used here are not necessarily extra overhead, as a developer would also require resources to create a custom multi-FPGA interconnect.

Table 7.1: Resource Utilization of Galapagos

Galapagos Layer                                  LUTs             Flip-Flops       BRAMs
I) Hypervisor                                    95332 (14.4%)    120367 (9.1%)    255 (11.8%)
II) Network Bridge, TCP                          29146 (4.39%)    32582 (2.4%)     86 (4.0%)
III) Network Bridge, Ethernet                    582 (0.09%)      1087 (0.08%)     2 (0.09%)
IV) Communication Bridge, TCP to HUMboldt        1039 (0.1%)      1585 (0.1%)      1 (0.046%)
V) Communication Bridge, Ethernet to HUMboldt    729 (0.1%)       1332 (0.1%)      1 (0.046%)
VI) Router with 16 ports                         5067 (0.8%)      6310 (0.5%)      1 (0.046%)
Total TCP (I + II + IV + VI)                     130584 (19.7%)   160847 (12.1%)   343 (15.9%)
Total Ethernet (I + III + V + VI)                101710 (15.3%)   129096 (9.7%)    259 (12.0%)
The other resource utilization to consider is that of the HUMboldt kernels. Each kernel can use any of the functions defined in the HUMboldt communication layer library. Once a HUMboldt function is called in the user code, the module for that function is added to the hardware of that kernel; multiple calls to the same function do not increase the number of instantiated hardware modules. The resource utilization of each function is presented in Table 7.2.

Table 7.2: Resource Overhead of HUMboldt Communication Layer API Functions

HUMboldt Function   LUTs           Flip-Flops     BRAMs
HUM_Send            389 (0.06%)    372 (0.03%)    0 (0%)
HUM_Recv            1180 (0.18%)   1072 (0.08%)   0 (0%)
7.2 Latency and Throughput
We have created a microbenchmark to test the send and receive functionality of our system, with one kernel sending and another kernel receiving. We change the implementation of each kernel between hardware and software and test several configurations: software to hardware, hardware to hardware (on the same FPGA), hardware to hardware (on different FPGAs), and hardware to software. These configurations are tested with both TCP and Ethernet. Furthermore, we test the following software configurations with MPICH: software to software (on the same CPU) and software to software (on different CPUs). MPICH uses TCP for network communication. We test these using MPICH to compare the best software implementation of MPI against our HUMboldt communication layer. The measurements reported were averaged over many runs until the results converged; for MPICH that required close to a million runs, whereas our hardware results required close to 10 runs.
Figures 7.1 and 7.2 show the throughput of our benchmark. Our HUMboldt communication layer and the respective bridge transforming HUMboldt packets into network packets perform at line rate, and the throughput is limited by the 10G Ethernet core in the Hypervisor along with the respective packet headers required for TCP and Ethernet (hence the higher throughput for Ethernet than for TCP). Between hardware and software in Figure 7.1 we cannot scale past a 128 KB payload, after which we notice packet drops using Ethernet; this is because the software kernel cannot receive data as fast as the hardware can send it. We roughly achieve the maximum TCP core bandwidth mentioned in [32]. The curves show the expected shape, where the bandwidth improves with the payload size, and hardware to hardware works best when compared to links involving a software node. Note that the curve for hardware to hardware on the same FPGA is the same in both the Ethernet and TCP cases because the routing is done internally to the FPGA without needing to add the Ethernet or TCP headers. The hardware-to-software TCP throughput is extremely low; we would expect it to be much higher, but at the time of writing we have not determined the reason, which we believe to be an issue in our software.
We do not plot the latency and bandwidth between two software nodes on the same CPU because they would be difficult to show on the same graph as the others due to the scaling required. MPICH uses shared memory for this communication, much of which can fit in the cache, so the bandwidth can be very high and the latency very low: the observed bandwidth for a 512 KB payload is approximately 60 GB/s with a latency of 0.21 µs. It can also be seen that MPICH shows a very high throughput for small packets, which then drops off. We do not understand the inner workings of MPICH well enough to explain this behavior, but we suspect optimizations for small packets or possibly cache effects.
Figure 7.1: Ethernet Throughput.

Figure 7.2: TCP Throughput.
The latency is shown in Figures 7.3 and 7.4. We define latency as the time for a zero-payload transaction: sending an envelope, receiving a clear to send, sending a zero-payload packet, and receiving a done. We measure the cycle counts with a probe (Xilinx Integrated Logic Analyzer) running on the hardware with a 156.25 MHz clock. The latency for two kernels sending and receiving an entire transaction (all four packets) on the same FPGA is deterministic and takes 29 cycles (about 186 ns at 156.25 MHz). When the receiving kernel is on another node (a different FPGA or CPU), we incur additional latency to transform HUMboldt packets into network packets: each packet is processed through a communication bridge and a network bridge, whose per-packet latencies are shown in Table 7.3.

Table 7.3: Per Packet Additional Latency (cycles)

Component and Protocol          Send   Receive
TCP Communication Bridge        9      5
Ethernet Communication Bridge   6      5
TCP Network Bridge              177    199
Ethernet Network Bridge         7      12
On top of the internal FPGA latencies mentioned above, there is a non-deterministic network latency that depends on the network topology. It can be seen that whenever a communication uses a network link, the additional cycles for the bridges are very small except when TCP is used. Also, whenever a software node is involved, it is clear that handling protocols in software is much slower than in hardware.
As a sanity check of our numbers, we can make an approximate comparison to the latency measured for Microsoft Catapult [2], where they report an FPGA-to-FPGA round-trip latency of 2.88 µs when using their LTL communication layer over a 40G Ethernet link through a single top-of-rack switch. Figure 7.3 shows that the FPGA-to-FPGA latency when using the Ethernet network link (hw to hw diff) on our platform is about 6 µs for a one-way message, which would be about 12 µs for a round trip. In addition to Ethernet, Catapult uses UDP frame encapsulation and adds their LTL protocol [2], which we do not have. They do not report how many cycles LTL adds to their latency, but given that the protocol is lightweight, we would expect the number of cycles to be small. At their reported 156 MHz clock rate, each cycle is about 6.4 ns, so as a fraction of the overall latency LTL would be very small and almost negligible. The other main difference from our platform is that Catapult uses 40G links and we use 10G links. Again, with the crude approximation that the latency at 40G is four-fold lower than at 10G, the latency for Catapult assuming 10G links would be about 4 × 2.88 µs = 11.5 µs, which is close to our estimated 12 µs for a round trip.
Figure 7.3: Ethernet Latency.

Figure 7.4: TCP Latency.
7.3 Scalability and Heterogeneity
To test scalability and heterogeneity, we built the simple application proxy shown in Listing 6.6, which is representative of a common computing pattern: a central node forks tasks to many processes and gathers the results. The goal of this application is to exercise our platform and show how easy it is to scale an application to run with different numbers of nodes. The number of nodes within our system can scale to the depth of the routing table, which is currently set to 256 but can be enlarged as required. Furthermore, we have a limit of 16 kernels per FPGA. Within our experiment we easily implemented several working configurations of up to 96 kernels across 6 FPGAs (limited by the number of FPGAs available in our cluster) by making just a few changes in the configuration files, as described in Chapter 6. We also tested heterogeneity by implementing multiple kernels in both hardware and software.
Chapter 8
Conclusion and Future Work
A communication layer that supports heterogeneity and functional portability allows users to easily target CPU and FPGA clusters, as it handles the communication complexities that arise at a large scale. We improved the open-source framework in [4] and introduced a modular way to create a communication layer on top of it. Our rebuilt infrastructure is called Galapagos.
Our HUMboldt layer is a message passing communication layer that implements a subset of MPI. The HUMboldt and Galapagos hardware implementation works at line rate for 10G links, and the latency is quite small. We show that we can target both heterogeneity and scalability quite easily, as we can use multiple configurations that scale simply by changing two configuration files. We have also shown that the resources required to support a general and easy-to-use communication layer (one that is not application specific) are quite minimal.
The implementation of Galapagos and HUMboldt has shown that a layered approach to building infrastructure that supports heterogeneous computing works well and is important for making heterogeneous applications portable and easier to build.
We have built Galapagos and HUMboldt to make it easier to build multi-FPGA and heterogeneous systems. To show the true power of this infrastructure, we will build some showcase applications that can leverage such a platform.
The HUMboldt layer currently supports message passing using a minimal subset of the MPI standard. To support MPI more fully, more functions need to be implemented, which simply adds to the current library.
To support more types of applications, it would be good to add a streaming communication model to HUMboldt. Because of the modularity that we added to Galapagos, and the HLS implementation of HUMboldt, it will not be difficult to add streaming. Just as we have used MPI for message passing, it would be good to use a popular programming model such as ZeroMQ [26] to define the streaming interfaces.
This work is open-source and can be downloaded at
https://github.com/eskandarinariman/HMPI.
Bibliography

[1] Andrew Putnam, Adrian M Caulfield, Eric S Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, et al. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. ACM SIGARCH Computer Architecture News, 42(3):13–24, 2014.

[2] Adrian M Caulfield, Eric S Chung, Andrew Putnam, Hari Angepat, Daniel Firestone, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, et al. Configurable Clouds. IEEE Micro, 37(3):52–61, 2017.

[3] Hubert Zimmermann. OSI Reference Model–The ISO Model of Architecture for Open Systems Interconnection. IEEE Transactions on Communications, 28(4):425–432, 1980.

[4] Naif Tarafdar, Thomas Lin, Eric Fukuda, Hadi Bannazadeh, Alberto Leon-Garcia, and Paul Chow. Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 237–246. ACM, 2017.

[5] Scott Hauck and Andre DeHon. Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation, volume 1. Elsevier, 2010.

[6] Razvan Nane, Vlad-Mihai Sima, Christian Pilato, Jongsok Choi, Blair Fort, Andrew Canis, Yu Ting Chen, Hsuan Hsiao, Stephen Brown, Fabrizio Ferrandi, et al. A Survey and Evaluation of FPGA High-Level Synthesis Tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 35(10):1591–1604, 2016.

[7] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H Anderson, Stephen Brown, and Tomasz Czajkowski. LegUp: High-Level Synthesis for FPGA-Based Processor/Accelerator Systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 33–36. ACM, 2011.

[8] Xilinx. Vivado Design Suite User Guide: High-Level Synthesis, 2018.

[9] Intel. Product Brief: Intel HLS Compiler.

[10] Alberto Leon-Garcia and Indra Widjaja. Communication Networks. McGraw-Hill, Inc., 2003.

[11] Universal Serial Bus Specification, Rev. 2.0. ARM, http://www.arm.com, 2000.

[12] Xilinx. AXI Reference Guide, UG761 (v13.1). http://www.xilinx.com/support/documentation/ip_documentation/ug761_axi_reference_guide.pdf, 2011.

[13] Xilinx. Vivado Design Suite User Guide: Partial Reconfiguration, 2018.

[14] Xilinx Inc. SDAccel Environment Platform Development Guide, 2016.

[15] Naif Tarafdar. Building and Using Virtual FPGA Clusters in Datacenters, 2017.

[16] Xilinx. MicroBlaze Processor Reference Guide, UG081, 2009.

[17] Stuart Byma, J Gregory Steffan, Hadi Bannazadeh, Alberto Leon-Garcia, and Paul Chow. FPGAs in the Cloud: Booting Virtualized Hardware Accelerators with OpenStack. In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on, pages 109–116. IEEE, 2014.

[18] Joon-Myung Kang, Hadi Bannazadeh, Hesam Rahimi, Thomas Lin, Mohammad Faraji, and Alberto Leon-Garcia. Software-Defined Infrastructure and the Future Central Office. In Communications Workshops (ICC), 2013 IEEE International Conference on, pages 225–229. IEEE, 2013.

[19] Joon-Myung Kang, Hadi Bannazadeh, and Alberto Leon-Garcia. SAVI Testbed: Control and Management of Converged Virtual ICT Resources. In Integrated Network Management (IM 2013), 2013 IFIP/IEEE International Symposium on, pages 664–667. IEEE, 2013.

[20] Jagath Weerasinghe, Francois Abel, Christoph Hagleitner, and Andreas Herkersdorf. Enabling FPGAs in Hyperscale Data Centers. In Ubiquitous Intelligence and Computing, Autonomic and Trusted Computing, and Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom), 2015 IEEE Intl Conf on, pages 1078–1086. IEEE, 2015.

[21] Naif Tarafdar, Thomas Lin, Nariman Eskandari, David Lion, Alberto Leon-Garcia, and Paul Chow. Heterogeneous Virtualized Network Function Framework for the Data Center. In Field Programmable Logic and Applications (FPL), 2017 27th International Conference on, pages 1–8. IEEE, 2017.

[22] Henning Schulzrinne, Anup Rao, and Robert Lanphier. Real Time Streaming Protocol (RTSP). Technical report, 1998.

[23] Iraj Sodagar. The MPEG-DASH Standard for Multimedia Streaming over the Internet. IEEE MultiMedia, (4):62–67, 2011.

[24] Marc Snir. MPI–The Complete Reference: The MPI Core, volume 1. MIT Press, 1998.

[25] Manuel Saldana and Paul Chow. TMD-MPI: An MPI Implementation for Multiple Processors Across Multiple FPGAs. In Field Programmable Logic and Applications (FPL), 2006 International Conference on, pages 1–6. IEEE, 2006.

[26] Pieter Hintjens. ZeroMQ: Messaging for Many Applications. O'Reilly Media, Inc., 2013.

[27] Manuel Saldana, Arun Patel, Christopher Madill, Daniel Nunes, Danyao Wang, Paul Chow, Ralph Wittig, Henry Styles, and Andrew Putnam. MPI as a Programming Model for High-Performance Reconfigurable Computers. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 3(4):22, 2010.

[28] Alan D George, Martin C Herbordt, Herman Lam, Abhijeet G Lawande, Jiayi Sheng, and Chen Yang. Novo-G#: Large-Scale Reconfigurable Computing with Direct and Programmable Interconnects. In High Performance Extreme Computing Conference (HPEC), 2016 IEEE, pages 1–7. IEEE, 2016.

[29] Alpha Data. Alpha Data 7v3 boards. https://www.alpha-data.com/dcp/products.php?product=adm-pcie-7v3, 2016.

[30] Alpha Data. Alpha Data 8k5 boards. https://www.alpha-data.com/dcp/products.php?product=adm-pcie-8k5, 2017.

[31] David Sidler, Gustavo Alonso, Michaela Blott, Kimon Karras, Kees Vissers, and Raymond Carley. Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware. In Field-Programmable Custom Computing Machines (FCCM), 2015 IEEE 23rd Annual International Symposium on, pages 36–43. IEEE, 2015.

[32] D. Sidler et al. Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware. In 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, pages 36–43, May 2015.

[33] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22(6):789–828, 1996.

[34] Edgar Gabriel, Graham E Fagg, George Bosilca, Thara Angskun, Jack J Dongarra, Jeffrey M Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, pages 97–104. Springer, 2004.

[35] Xilinx Inc. LogiCORE IP Soft Error Mitigation Controller v3.3 (Product Guide), 2016.