Virtualizing FPGAs for Cloud Computing Applications
by
Stuart A. Byma
A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2014 by Stuart A. Byma
Abstract
Virtualizing FPGAs for Cloud Computing Applications
Stuart A. Byma
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2014
Cloud computing has become a multi-billion dollar industry, and represents a computing
paradigm where all resources are virtualized, flexible and scalable. Field Programmable
Gate Arrays (FPGAs) have the potential to accelerate many cloud-based applications,
but are not yet available as cloud resources because they are so different from the
conventional microprocessors that virtual machines (VMs) are based on. This thesis
presents a first attempt at virtualizing and integrating FPGAs into cloud computing sys-
tems, making them available as generic cloud resources to end users. A novel architecture
enabling this integration is presented and explored, and several custom hardware
applications are evaluated on a prototype system. These applications show that Virtualized
FPGA Resources can significantly outperform VMs in certain classes of common cloud
computing applications, demonstrating the potential to increase user compute power while
reducing datacenter power consumption and operating costs.
Dedication
To Jennifer.
Acknowledgements
First I must sincerely thank my advisors Professors Greg Steffan and Paul Chow. I owe
my success in graduate school to them and their invaluable guidance and advice. I could
not have asked for better mentorship throughout my Master's research, or in this chapter
of my life. Thank you both.
Also to my esteemed colleagues and office mates Xander Chin, Charles Lo, Ruedi
Willenberg, Robert Heße, Fernando Martin Del Campo, Andrew Shorten, Jimmy Lin
and others: You have made my time at the U of T a true pleasure – thank you for all
the good times, and for being an ever present sounding board for thoughts and ideas.
A special thanks as well to members of the SAVI testbed: Professor Alberto Leon-
Garcia, Hadi Bannazadeh, Thomas Lin and Hesam Rahimi. Your help and advice,
technical and otherwise, have made the work presented here possible.
Finally and most importantly, an everlasting thanks to my wife. Thank you for
encouraging me to pursue my passions, and thank you for your unfaltering belief in me.
None of this would have happened without you.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 4
2.1 Field-Programmable Gate Arrays . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 The Smart Applications on Virtual Infrastructure Network . . . . . . . . 7
2.3.1 The Smart Edge Node . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Heterogeneous Resources in the SAVI Testbed . . . . . . . . . . . 10
2.4 FPGA Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Architecture for FPGA Virtualization 14
3.1 OpenStack Resource Management . . . . . . . . . . . . . . . . . . . . . . 14
3.1.1 SAVI Testbed FPGA Resources . . . . . . . . . . . . . . . . . . . 16
3.1.2 Requirements for FPGA Virtualization in OpenStack . . . . . . . 18
3.2 Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Fully Virtualized Hardware . . . . . . . . . . . . . . . . . . . . . 19
3.2.2 FPGA Partial Reconfiguration . . . . . . . . . . . . . . . . . . . . 20
3.2.3 Virtualization via PR . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.4 Static Logic Design . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Agent Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Booting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.2 Deleting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Booting VFRs in OpenStack . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Compiling Custom Hardware . . . . . . . . . . . . . . . . . . . . . . . . 31
4 SAVI Testbed Prototype 35
4.1 FPGA Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Agent Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.1 statusTable and Associated Objects . . . . . . . . . . . . . . . . 38
4.2.2 Initialization and Operation . . . . . . . . . . . . . . . . . . . . . 42
5 Platform Evaluation and Application Case Studies 44
5.1 Platform Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 Case Study: Load Balancer . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2.1 Load Balancer Designs . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.2 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3 Case Study: Extending OpenFlow Capabilities . . . . . . . . . . . . . . . 53
5.3.1 VXLAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3.2 Virtualized Hardware for New OpenFlow Capabilities . . . . . . . 55
5.3.3 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 57
6 Conclusion 62
6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.1.1 Architectural Enhancements . . . . . . . . . . . . . . . . . . . . . 63
6.1.2 Failures and Migration . . . . . . . . . . . . . . . . . . . . . . . . 64
6.1.3 Further Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . 65
6.1.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.1.5 FPGA CAD as a Service . . . . . . . . . . . . . . . . . . . . . . . 66
6.1.6 Complementary Studies . . . . . . . . . . . . . . . . . . . . . . . 67
Bibliography 68
Acronyms
BRAM Block Random Access Memory. 27
CAD Computer Aided Design. 16, 24
CAM Content Addressable Memory. 17–19
DRAM Dynamic Random Access Memory. 19, 20, 26
FIFO First In First Out. 20
FPGA Field Programmable Gate Array. 1–4, 9–17, 19–22, 24
GPIO General-Purpose Input-Output. 18, 19, 27
LUT Look-up Table. 27
MAC Media Access Control. 17–20, 22
PR Partial Reconfiguration. 9, 15, 16, 19, 22, 24
PRM Partially Reconfigurable Module. 16, 17, 24, 30
PRR Partially Reconfigurable Region. 15–17, 22, 24, 25, 27–30
SAVI Smart Applications on Virtual Infrastructure. 5–9
TCP Transmission Control Protocol. 21
UART Universal Asynchronous Receiver-Transmitter. 18
UUID Universal Unique Identifier. 21–23
VFR Virtualized FPGA Resource. 2, 3, 17–25
VM Virtual Machine. 2, 3, 17
List of Tables
4.1 Resource Usage for System Static Hardware . . . . . . . . . . . . . . . . 37
5.1 Boot Times for VMs and VFRs . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Resource Usage for VFR Load Balancer . . . . . . . . . . . . . . . . . . 49
5.3 Resource Usage for VFR VXLAN Port Firewall . . . . . . . . . . . . . . 57
5.4 Throughput and Latency for VXLAN Port Firewall . . . . . . . . . . . . 59
List of Figures
2.1 Diagram of the SAVI testbed. . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 The SAVI testbed Smart Edge node. . . . . . . . . . . . . . . . . . . . . 9
2.3 The Driver-Agent abstraction used in the SAVI testbed OpenStack system. 10
3.1 A simplified view of resource management in OpenStack/SAVI Testbed. . 15
3.2 FPGA Partial Reconfiguration. . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 System view of the on-FPGA portion of the virtualization hardware. . . 24
3.4 Virtualization hardware input arbiter block. . . . . . . . . . . . . . . . . 26
3.5 The VFR wrapper design. . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.6 VFR boot sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.7 Compile flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1 A sequence diagram of the entire boot procedure in the SAVI testbed
prototype system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1 Experiment setup for load balancer tests. . . . . . . . . . . . . . . . . . . 51
5.2 VFR load balancer latency at different throughput levels. . . . . . . . . . 52
5.3 VM load balancer latency at different throughput levels. . . . . . . . . . 52
5.4 Number of dropped packets for the VM load balancer. . . . . . . . . . . 53
5.5 Packet diagram for the VXLAN protocol. . . . . . . . . . . . . . . . . . . 55
5.6 VXLAN Port Firewall . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.7 Experimental setups for VFR-based VXLAN firewall. . . . . . . . . . . . 59
Chapter 1
Introduction
Datacenter-based cloud computing has evolved into a multi-billion dollar industry, with
continued growth forecast [1]. Cloud computing is based on virtualization technology,
which abstracts physical resources into virtualized resources. This virtualization provides
flexibility and system scalability (or elasticity) [2], and also allows many users to share
available resources in a datacenter in a transparent way. Cloud computing can also greatly
reduce Information Technology (IT) operating costs of companies and organizations [3],
making it a very attractive option for IT needs.
1.1 Motivation
Field-Programmable Gate Arrays (FPGAs) have the potential to accelerate many com-
mon cloud computing and datacenter-centric applications, such as encryption [4], com-
pression [5], or low-level packet processing [6]. FPGAs have begun to make their way
into datacenters, and their use in this context can be organized into three categories. The
first sees FPGAs being used in the technology that enables the datacenter itself, such as
switches and routers. FPGAs in this category are transparent, as neither the end user
nor the datacenter operator is necessarily aware of their existence. In the second category,
FPGAs are used in special “appliances” – essentially boxes that accelerate certain tasks
or processing. An example could be FPGA-based Memcached appliances [7]. The appli-
ance may be available to end users, but the FPGAs inside are themselves not accessible,
programmable resources – they are still relatively transparent. The third category, which
is the focus of this thesis, sees FPGAs becoming fully user-accessible, programmable
resources. Users would be able to allocate FPGA resources just as they would a virtual
machine, using the same control infrastructure – making FPGAs first-class citizens of the cloud.
Consider a motivating example: A large organization runs its Information Technology
(IT) services and website on an infrastructure-as-a-service cloud, using hundreds or even
thousands of VMs to serve their site and services to millions of users. Their applications
may require compute-intensive processing, or application-level packet processing that re-
quires many VMs to do efficiently. If user-accessible FPGA resources are available in the
cloud, the organization could design custom hardware to accelerate these tasks – elimi-
nating a number of VMs in exchange for a few FPGA resources, and potentially gaining
a boost in throughput and a reduction in latency. At the same time, the user retains
all the benefits of using a compute cloud, such as dynamic scalability, flexibility, and
reliability. The cloud provider also benefits by freeing up VMs, which could potentially
reduce power consumption and operating costs.
The work presented in this thesis aims to explore methods of enabling these FPGA
resources in commercial cloud computing systems.
1.2 Contributions
A hardware/software architecture enabling the virtualization of FPGAs and manage-
ment thereof using the OpenStack cloud system is presented in this thesis. The major
contributions are outlined as follows:
• A hardware and software infrastructure that splits an FPGA into a number of recon-
figurable regions, and allows these regions to be managed as individual resources in
an OpenStack cloud system. Introduction of the term Virtualized FPGA Resource
(VFR).
• A functional implementation of such an architecture using the Smart Applications
on Virtual Infrastructures testbed.
• A comparison of VFRs and VMs in terms of boot-time performance.
• An evaluation and proof of concept of the prototype system by means of two ap-
plications:
– A hardware load balancer using a hypothetical UDP-based protocol
– A method of using virtualized hardware to extend capabilities in an OpenFlow
software-defined network (SDN).
1.3 Overview
The rest of this thesis is organized as follows: Chapter 2 will provide background and
context – reviewing prior work in virtualizing FPGAs or work similar to the techniques
used in this thesis. Chapter 3 will introduce the hardware and software architecture en-
abling FPGA virtualization. Chapter 4 will introduce the SAVI [8] testbed prototype [9]
and implementation details. Comparisons of VMs and VFRs, as well as evaluations of
the proof of concept applications, are shown in Chapter 5. Chapter 6 provides some
future vision and concludes the thesis.
Chapter 2
Background
This chapter introduces concepts and definitions used throughout this thesis, and provides
context. Work related to the techniques used in this thesis will also be examined.
2.1 Field-Programmable Gate Arrays
This thesis focuses on the use of Field-Programmable Gate Arrays (FPGAs) in data-
centers and cloud computing. An FPGA is a silicon chip whose functionality can be
reprogrammed an arbitrary number of times to become nearly any digital circuit – it
is a type of reconfigurable hardware. Modern FPGAs are typically made up of a large
array of programmable Look-up Tables (LUTs), each of which can implement a four, five
or six variable logic function, depending on the device architecture. LUTs are usually
coupled with flip-flops and organized into logic blocks that can then be connected to-
gether through a dense, programmable routing fabric. For further reading on FPGA
architectures, the reader is directed to [10]. A set of CAD tools can map arbitrary hard-
ware designs described in Hardware Description Languages (HDLs) to the FPGA fabric.
The most common HDLs include Verilog HDL [11] and VHDL [12], but others such as
BlueSpec Verilog [13] are gaining popularity.
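To make the LUT abstraction concrete, the following minimal Python sketch models a k-input LUT as a 2^k-entry truth table. This is purely illustrative – on a real device the stored bits are set by the bitstream at configuration time, not at run time.

```python
class LUT:
    """A k-input look-up table: any Boolean function of k variables,
    stored as a 2**k-entry truth table (as in an FPGA logic block)."""

    def __init__(self, k, truth_table):
        assert len(truth_table) == 2 ** k
        self.k = k
        self.table = list(truth_table)

    def evaluate(self, *bits):
        # Pack the k input bits into a table index, LSB first.
        index = sum(bit << i for i, bit in enumerate(bits))
        return self.table[index]


# Configure a 4-input LUT as a 4-way AND: only entry 0b1111 holds a 1.
and4 = LUT(4, [0] * 15 + [1])
```

Rewriting the 16 stored bits turns the same structure into any other 4-input function, which is precisely the flexibility the CAD tools exploit when mapping an HDL design onto the fabric.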
Modern FPGAs also have embedded hard blocks to increase their capabilities – these
include Digital Signal Processor (DSP) or multiplier blocks, block Random Access Mem-
ories (BRAMs), high speed serialize-deserialize (SERDES) transceivers, communication
controllers (Ethernet [14], PCIe [15]), and even full microprocessors.
2.2 Cloud Computing
It is useful to define what is meant by the term cloud computing, as many companies,
individuals and other sources may use the term in different ways. This thesis will fol-
low the definition of cloud computing given by NIST that describes several essential
characteristics [16], summarized here:
1. On-Demand Self-Service: the ability to provision resources at any time, on demand,
without human interaction.
2. Broad Network Access: all resources are available and accessible over the network.
3. Resource Pooling: provider resources are organized into pools enabling multi-tenant
service.
4. Rapid Elasticity: the amount of allocated resources can be dynamically expanded or contracted.
5. Measured Service: ability to monitor, control and report resource usage.
In addition there are also several different cloud service models. The NIST definition
above covers all these models, and they are described briefly here:
• Software as a Service (SaaS): Allocatable resources are software programs, usually
provided over the Internet via web browsers.
• Platform as a Service (PaaS): Resources are operating systems, development tools
and frameworks for creating software and services.
• Infrastructure as a Service (IaaS): Resources are virtualized datacenter components
such as Virtual Machines (VMs), virtual storage, networking, and bandwidth.
This thesis will focus primarily on IaaS type cloud computing, where allocatable
resources are virtualized datacenter components. A good example of this type of cloud
computing would be Amazon Web Services [17], or the SAVI testbed, which will be
discussed shortly.
The cloud computing paradigm has become immensely popular in the IT services and
related industries because it frees organizations from the physical aspects of computing
and IT infrastructures. No major capital investment is required for physical servers
and networking equipment, nor are there maintenance costs for that equipment. These burdens
are shifted to the Cloud Provider, and the IT organization simply pays a set rate for
the cloud-based resources that it uses. The fact that the organization pays only for
what it uses, combined with the lack of capital investment, represents a significant cost
reduction to the organization. The cloud generally guarantees a Service Level Agreement
(SLA), leaving the organization assured that its IT infrastructure will experience little
to zero downtime due to hardware problems. Cloud computing also allows the end
user to scale their systems up or down seamlessly, avoiding the need to over-provision
computing capabilities or bandwidth usually needed to mitigate the effects of bursty
traffic, again saving on operating costs. From a technical and cost perspective, cloud
computing is generally extremely attractive to organizations with both large and small
IT needs. Certain other factors may influence the attractiveness of cloud computing,
usually legal issues arising from the geographic location of the cloud provider's datacenter,
or privacy concerns because a user's data is effectively in the hands of a third
party; however, these points are outside the scope of this thesis.
2.3 The Smart Applications on Virtual Infrastructure Network
SAVI [8] is a Canada-wide research network aimed at exploring next-generation appli-
cation marketplaces that make use of fully virtualized infrastructure, as well as future
Internet alternatives. A central vision of the SAVI network is the notion of a Smart
Edge node – a smaller-scale datacenter situated close to the network edge, providing
specialized low-latency processing for future application platforms. SAVI joins a number
of other networking research testbeds such as GENI [18], Emulab [19], PlanetLab [20]
and Internet2 [21], many of which are also federated.
Figure 2.1: Diagram of the SAVI testbed.
The SAVI testbed is one of the SAVI network research themes. The goal of the SAVI
testbed is to realize a future application platform that will provide a testing ground for
other SAVI research themes. The testbed consists of several Core nodes and many Smart
Edge nodes, deployed at various Universities and institutions across Canada. These
Core and Edge nodes are interconnected by a fully virtualized Software Defined Network
(SDN), and the whole system is orchestrated by a Control and Management (C & M)
system. Users can allocate virtualized resources via the C & M system across all nodes in
the testbed, as well as private virtual networks that provide complete isolation from other
users' experiments and systems. The ORION [22] and CANARIE [23] networks connect
all components over a large geographic area of Canada. Figure 2.1 shows a diagram of
the testbed architecture.
2.3.1 The Smart Edge Node
SAVI Smart Edge nodes are small-scale datacenters situated close to the edge of the
network and are the primary connection point for application users. SAVI Smart Edge
nodes are unique in that they make use of heterogeneous resources in addition to virtual
machines to accelerate processing – GPUs and reconfigurable hardware, as well as regular
bare-metal servers. These resources put a large amount of processing power close to the
edge of the network, and allow applications to do a majority of intensive processing
before having to traverse the possibly high-latency network to the Core datacenter. Such
intensive processing may include things like advanced signal processing for wirelessly
connected devices, encryption and decryption, multimedia streaming acceleration, new
types of switching and routing, and other packet-oriented processing.
Figure 2.2 shows a diagram of a Smart Edge node. In the SAVI testbed, the Smart
Edge is an OpenStack cloud system [24]. OpenStack management forms the Smart
Edge C & M plane through a number of subsystems, all of which are reachable through
RESTful [25] APIs.
• Nova [26] – manages all compute resources through a Driver-Agent abstraction.
• Keystone [27] – performs authentication and identity management.
• Glance [28] – manages Virtual Machine images and other images.
• Quantum [29] – performs network management functions (later renamed "Neutron"
due to a trademark conflict).
• Swift [30] – object storage system.
• Cinder [31] – block storage service.
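All of these subsystems are driven over HTTP. As a rough illustration, a client first obtains a token from Keystone and then attaches it to subsequent Nova or Glance calls. In the sketch below the endpoint address and credentials are hypothetical; the request body follows the Keystone v2.0 token API that was current in this era. The request is only constructed, not sent.

```python
import json
import urllib.request

# Hypothetical Keystone endpoint; a real deployment would expose its own.
KEYSTONE_URL = "http://keystone.example:5000/v2.0/tokens"


def build_token_request(username, password, tenant):
    """Construct (but do not send) a Keystone v2.0 authentication request."""
    body = json.dumps({
        "auth": {
            "tenantName": tenant,
            "passwordCredentials": {"username": username,
                                    "password": password},
        }
    }).encode()
    return urllib.request.Request(
        KEYSTONE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_token_request("demo", "secret", "demo-tenant")
```

The token returned by Keystone would then be placed in an `X-Auth-Token` header on every later request to the other subsystems.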
The SAVI smart edge also has a custom Software-Defined Infrastructure (SDI) man-
ager, called Janus. Janus offloads certain tasks from OpenStack, such as network control
and resource scheduling, and also performs configuration management and orchestration
of the testbed’s OpenFlow-based Software-Defined Network (SDN). Network control is
accomplished through an OpenFlow Controller implemented using Ryu [32]. Janus also
virtualizes the network into slices using FlowVisor [33] (an OpenFlow-based network vir-
tualization layer), and users can run their own User OpenFlow Controller to manage
their own private network slice. Essentially, the SDI Manager brings Cloud Computing
and Software-Defined Networking together under one management system.
More information on SAVI testbed infrastructure management and Janus is provided
in [34].
Figure 2.2: The SAVI testbed Smart Edge node.
Of particular interest to this thesis in Figure 2.2 is the Nova component of OpenStack,
which is the part that allocates resources. The standard Nova only supports processor
virtualization, where Virtual Machines (VMs) are booted on top of hypervisors that
abstract away the physical hardware. The vision of the Smart Edge however, incorporates
Heterogeneous Resources in addition to VMs. Thus Nova in the SAVI testbed is extended
to enable it to manage these new resources.
2.3.2 Heterogeneous Resources in the SAVI Testbed
For OpenStack to manage different types of resources, they must all appear homoge-
neous in nature. To accomplish this, the SAVI testbed OpenStack uses a Driver-Agent
system. A driver for any resource implements required OpenStack management API
methods, such as boot, reboot, start, stop and release. The driver then communicates
these OpenStack management commands to an Agent, which carries them out directly
on the resource, via a hypervisor or otherwise. In this fashion, OpenStack can manage
all resources through the same interface. Figure 2.3 shows a diagram of the Driver-Agent
system. Essentially, the Agent is performing resource-specific management, while the
driver facilitates resource-agnostic management for OpenStack. The method of commu-
nication between the Agent and the Resource, and the Driver and the Agent, is entirely
resource-dependent.
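The split described above can be sketched in a few lines of Python. This is a hypothetical illustration of the abstraction, not the actual Nova or SAVI code; all class and method names are made up.

```python
class Agent:
    """Resource-specific management: runs beside the resource and carries
    OpenStack commands out on it directly (via a hypervisor or otherwise)."""

    def __init__(self, resource_type):
        self.resource_type = resource_type
        self.instances = {}

    def boot(self, instance_id, image):
        # A VM agent would ask the hypervisor to boot here; an FPGA agent
        # would program the device with a bitstream instead.
        self.instances[instance_id] = {"image": image, "state": "ACTIVE"}
        return self.instances[instance_id]

    def release(self, instance_id):
        self.instances.pop(instance_id, None)


class Driver:
    """Resource-agnostic management: implements the OpenStack API methods
    (boot, release, ...) and forwards each one to the resource's Agent."""

    def __init__(self, agent):
        self.agent = agent

    def boot(self, instance_id, image):
        return self.agent.boot(instance_id, image)

    def release(self, instance_id):
        self.agent.release(instance_id)


vm_agent = Agent("vm")
driver = Driver(vm_agent)
info = driver.boot("instance-1", "ubuntu-12.04")
```

Because every Driver presents the same methods regardless of what its Agent does underneath, OpenStack sees a homogeneous pool of resources.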
Figure 2.3: The Driver-Agent abstraction used in the SAVI testbed OpenStack system.
If a user desires to allocate a resource, they need to be able to specify what resource
type they want – the SAVI testbed extends the OpenStack notion of resource flavor
to enable this. Usually, resource flavor refers to the number of virtual processors and amount
of RAM to allocate to a VM. The SAVI testbed extends the definition of flavor to also
include resource type. The SAVI testbed currently has several of these additional resource
types including GPUs, bare-metal servers, and reconfigurable hardware. Although
reconfigurable hardware is included in SAVI (and its precursor VANI [35]), it is still relatively
non-virtualized – simply FPGA cards in bare-metal servers managed by OpenStack.
For OpenStack to be aware of these resources, resource references must be placed
in its database – one for each allocatable resource. This is done using the nova-manage
tool. The resource database entry includes the address of the Agent that provides the
resource, a type name that can be associated with a flavor, and how many physical
network interfaces the resource has. A flavor is created for each unique resource type.
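The database entries and extended flavors just described can be sketched as follows. This is a hypothetical illustration; the field names, addresses, and flavor names are invented and do not reflect the actual OpenStack schema.

```python
resource_db = [
    # One entry per allocatable resource, as added with nova-manage: the
    # Agent's address, a type name tied to a flavor, and the number of
    # physical network interfaces the resource has.
    {"agent_addr": "10.0.0.5:9000", "type": "fpga", "num_ifaces": 4,
     "in_use": False},
    {"agent_addr": "10.0.0.6:9000", "type": "gpu", "num_ifaces": 1,
     "in_use": False},
]

flavors = {
    # A standard flavor describes vCPUs and RAM; the SAVI extension adds a
    # resource type so heterogeneous resources are requested the same way.
    "m1.small": {"vcpus": 1, "ram_mb": 2048, "resource_type": "vm"},
    "fpga1.small": {"vcpus": 0, "ram_mb": 0, "resource_type": "fpga"},
}


def find_resource(flavor_name):
    """Return (and mark busy) the first free resource of the flavor's type."""
    rtype = flavors[flavor_name]["resource_type"]
    for entry in resource_db:
        if entry["type"] == rtype and not entry["in_use"]:
            entry["in_use"] = True
            return entry
    return None


chosen = find_resource("fpga1.small")
```

A request for an `fpga1.small` flavor thus resolves to a concrete Agent address, after which allocation proceeds exactly as it would for a VM.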
2.4 FPGA Virtualization
This thesis focuses on exploring methods of virtualizing FPGAs and managing the device
or portions thereof using OpenStack in the SAVI testbed. A number of prior works have
examined “virtualization” of FPGAs in different contexts, described in the following
subsection.
2.4.1 Related Work
Hardware virtualization, especially that pertaining to FPGAs, has been explored for
some time. Initially realized through time multiplexing hardware [36], most hardware
virtualization schemes now generally use run-time reconfiguration of the FPGA. This
can be full reconfiguration of the entire device, but usually refers to dynamic Partial
Reconfiguration (PR) of a portion of the FPGA.
In terms of network applications, which is a theme in this thesis, there has been
work examining the use of partial reconfiguration to virtualize forwarding data planes in
routers [37], although this is a very specific case and does not involve user-designed custom
hardware. Recent works involving virtualized FPGAs for custom user hardware virtually
increase the number of available FPGA partially reconfigurable resources [38] or virtualize
non-PR coprocessors [39], to maintain parity with the number of microprocessors in a high
performance computing environment. Others use the partial reconfiguration technique
to make reconfigurable hardware sharable by multiple software processes. This generally
involves some sort of virtualization on the level of the operating system in addition to
the FPGA or gate-level virtualization done using PR. Some works investigate operating
system and scheduler design specifically to manage reconfigurable hardware tasks [40, 41].
On a lower level, Huang et al. use a hot plugin technique to provide access to PR based
accelerators via a unified Linux kernel module, allowing multiple processes to efficiently
share different accelerators [42]. Pham et al. propose a microkernel hypervisor for new
FPGA/CPU hybrid devices, which facilitates access to either a CGRA-like Intermediate
Fabric or a regular PR region running user accelerators [43].
What all these schemes and others like them have in common is that they view
the reconfigurable accelerators as rather short-lived entities executing hardware “tasks”,
which supplement software tasks running on a conventional processor. Thus they focus
heavily on reconfiguration times and concurrent access to the same PR region for multiple
processes, as well as high bandwidth between the FPGA and CPU.
The context of FPGA virtualization in this thesis is markedly different. This thesis
does not assume the virtualized accelerators to be closely coupled with CPUs or software
processes, rather, the accelerators are seen as being a major or supplemental component
of massive, distributed, cloud-based infrastructures. Most virtualization techniques like
the ones mentioned above cannot readily be applied to IaaS clouds and VMs because the
end user does not have any sort of access to the underlying physical hardware. It may be
possible for a cloud provider to make so-called hardware tasks available to VMs and thus
end users, but the users would likely be unable to define their own hardware because of
the low-level access it would still require, which somewhat defeats the purpose.
Additionally, the hardware task model may not suit all IaaS users – this thesis also
envisions streaming, packet processing, and network centric applications as well – all
things that the user of a virtualized datacenter may need. Because of this, the work
presented here focuses on providing virtualized, in-network hardware resources that are
analogous in the “resource” sense to VMs.
Chapter 3
Architecture for FPGA
Virtualization
The general approach for virtualization is modelled after that for virtual machines. To
do this, it is important to understand how OpenStack manages resources. This chapter
will briefly examine how OpenStack operates and manages heterogeneous resources in
the SAVI Testbed. Then, the architecture of the system enabling FPGA virtualization
is presented.
3.1 OpenStack Resource Management
Figure 3.1 shows a simplified diagram of resource management in OpenStack. The Open-
Stack Controller runs on a Commodity Server inside the SAVI Testbed, and provides an
API (specifically Nova) to allow a user to request resources. In general, VM resources in
the cloud system are booted on top of hypervisors on physically separate machines from
the Controller - OpenStack maintains a database of all resources in the system, both in
use and free for allocation.
When a resource request comes in, OpenStack finds available resources in the Re-
source Database, and finds which physical machine they are located on. The Controller
Figure 3.1: A simplified view of resource management in OpenStack/SAVI Testbed.
communicates with the Resource Server via a separate process running beside the Hyper-
visor, called an Agent. The Agent is a piece of software that interprets commands from
the main OpenStack Controller. As described briefly in Chapter 2, the Agent is part of
a Driver-Agent abstraction – a Driver integrates with the OpenStack compute controller
(Nova), implementing resource control API functions. Through the Driver, OpenStack
requests the resources from the Agent, which in turn instructs the Hypervisor to boot a
VM with the operating system image and parameters specified by the User (sent to the
Agent by the Controller). Networking information is also sent by the Controller, which
is used by the Agent and Hypervisor to set up network access for the VM. A reference,
usually in the form of an IP address, is returned to the User such that they can connect
to their resource, and run whatever application or system they want on top of it. For
more details on OpenStack, the reader is referred to [24].
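The allocation path just described can be condensed into a short end-to-end sketch: the Controller consults its database, dispatches to the Agent on the chosen host, the Agent asks the Hypervisor to boot, and a reference (an IP address) is returned to the User. Everything here is hypothetical and simplified – real OpenStack performs these steps across separate services and hosts, and all names and addresses below are invented.

```python
host_db = [
    {"host": "server-1", "free": True},
    {"host": "server-2", "free": True},
]


class Hypervisor:
    def boot_vm(self, image, params):
        # Stand-in for the real hypervisor starting a VM from `image`.
        return {"image": image, **params, "state": "ACTIVE"}


class Agent:
    """Runs beside the Hypervisor and interprets Controller commands."""

    def __init__(self):
        self.hypervisor = Hypervisor()

    def boot(self, image, params, net_info):
        vm = self.hypervisor.boot_vm(image, params)
        vm["ip"] = net_info["ip"]  # networking info sent by the Controller
        return vm


def controller_boot(image, params, agents):
    """Controller side: pick a free host and ask its Agent to boot there."""
    for entry in host_db:
        if entry["free"]:
            entry["free"] = False
            vm = agents[entry["host"]].boot(image, params,
                                            {"ip": "10.1.0.7"})
            return vm["ip"]  # the reference handed back to the User
    raise RuntimeError("no free resources")


agents = {"server-1": Agent(), "server-2": Agent()}
ip = controller_boot("ubuntu-12.04", {"vcpus": 1}, agents)
```

The same flow carries over to heterogeneous resources: only the Agent's `boot` implementation changes, while the Controller-side logic stays identical.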
Figure 3.1 also shows that the SAVI Testbed OpenStack Controller manages heterogeneous resources through similar mechanisms. A custom Driver communicates with Agents managing bare-metal servers, some of which are regular servers while others contain GPUs or reconfigurable hardware.
The current state of reconfigurable hardware in the SAVI Testbed is relatively non-
virtualized – bare-metal servers with PCIe FPGA cards and several BEECube systems.
They are non-virtualized because the resource provided is a fully physical resource, not
sharable between Users and thus not very scalable or flexible. The current reconfigurable
resources in the SAVI Testbed are described briefly below.
3.1.1 SAVI Testbed FPGA Resources
This subsection briefly describes the current, non-virtualized FPGA resources in the
SAVI Testbed.
BEE2 Boards
The SAVI Testbed has a number of BEE2 systems [44]. The BEE2 is equipped with five
Xilinx FPGAs, with one used to control the others. In the Testbed, an Agent runs on an
embedded system on the control FPGA, and manages the other FPGAs as resources that
can be allocated. Each FPGA resource has four 10G-capable CX4 interfaces that connect
to the Testbed SDN, allowing the user to send and receive data from their hardware on
the FPGA.
Since the user simply gets the entire device as a resource, they are responsible for
designing and compiling their hardware using vendor tools, ensuring that their hardware
ports match the correct pin locations on the BEE2, and ensuring that the hardware will
function correctly. Once they generate a bitstream file for programming the FPGA, it is
uploaded to OpenStack as an image, and is loaded onto the FPGA by the Agent.
Note that again the definition of a concept in OpenStack is being extended. Normally,
an “image” refers only to an Operating System (OS) image, however OpenStack allows
any file type to be uploaded as an image. Therefore, for a BEE2 FPGA resource, the im-
age will be a bitstream generated by the FPGA tools. For the BEE2 resource, the Agent
will receive this image from the OpenStack Controller via the driver, and simply config-
ures it onto an unused FPGA. OpenStack sees the FPGA as any other resource thanks
to the Driver-Agent abstraction, and the user can now make use of custom hardware
acceleration in the SAVI Testbed.
PCIe-Based FPGA Cards
To increase the range of different FPGA applications available to researchers, it is useful
to have FPGAs closely coupled to processors so that the reconfigurable hardware can
accelerate compute-intensive portions of software. The SAVI Testbed provides several
PCI-Express-based FPGA boards connected to physical servers: The NetFPGA, the
NetFPGA10G [45] and the DE5Net [46]. The boards have varying FPGA device sizes
and on-board memory, but have in common four network interfaces that are connected to
the Testbed SDN. The NetFPGA has four 1G Ethernet ports, while the NetFPGA10G
and DE5Net have four 10G Ethernet ports. A researcher can now design custom hardware
that can accelerate software tasks, provide line-rate packet processing, or a combination
of both.
In addition to these boards, the Testbed also contains MiniBEE [47] resources. The
MiniBEE contains a conventional processor and an on-board FPGA connected through
PCIe. It also has 10G network interfaces, a large amount of memory and an expansion
port for additional FPGA peripherals.
Since the PCIe boards must be mounted inside physical servers, the SAVI
Testbed provides the server itself with the FPGA card attached as a resource. In the
case of the MiniBEE, the entire system is also offered as a resource.
3.1.2 Requirements for FPGA Virtualization in OpenStack
Using Virtual Machines as a model for full FPGA virtualization, it is clear there are
several required components: An agent to provide the FPGA (or pieces thereof) as a
resource, and a driver to integrate into OpenStack so that OpenStack can communicate
with the Agent. The Agent is responsible for managing the actual resource provided to the
user, in this case an FPGA or portion thereof, and therefore must be capable of receiving
an FPGA programming file (hereafter referred to as a “bitstream”) and configuring or
reconfiguring the device. It must also track which FPGAs or FPGA portions have user
hardware running in them, and which hardware belongs to which user.
Additionally, if full virtualization is to be achieved, the physical FPGA device must
be abstracted and sharable between different users. This will require a base hardware
architecture to “virtualize” the device, somewhat similar to a hypervisor. The Agent
must be aware of this virtualization layer to manage the resources as well.
The following sections will describe the design of a system that meets the afore-
mentioned requirements, using OpenStack in the SAVI Testbed as the cloud computing
platform. The base hardware architecture virtualizing the device will be described, and
then the Agent that provides the virtualized hardware to OpenStack.
3.2 Hardware Architecture
Though the current FPGA resources in the SAVI Testbed are managed by OpenStack, they still fall short of commercial systems in flexibility and user-friendliness.
The resources are still relatively non-virtualized – a single physical device is allocated to
one user, whereas in a fully virtualized system, one physical device should be sharable
among different users simultaneously. There is further motivation for this when one considers that a single user may not make full use of an entire FPGA, wasting some reconfigurable fabric that could go to another user. A full FPGA is also more difficult to design and program, especially when integrating complex IP (such as memory controllers), and especially without physical access to the device.
The architecture presented in this thesis seeks to resolve these problems through full
hardware virtualization.
3.2.1 Fully Virtualized Hardware
Virtualization in a cloud computing context has several characteristics that are usually
presented in terms of Virtual Machines:
• Physical Abstraction – The physical device itself is abstracted and the user is not aware of the underlying hardware. For example, a VM may run via a hypervisor on an Intel Xeon processor, but the user is aware only of how many Virtual CPUs they have, not of the real hardware.
• Sharing – A single physical device provides one or more virtual instances to one or
more users. Such devices can also be referred to as Multi-tenant devices.
• Illusion of Infinity – The actual number and physical location of resources is also
abstracted, and from the user’s view there exists a seemingly infinite pool of re-
sources.
The objective of the work presented in this thesis is to enable fully virtualized FPGA
hardware by designing an architecture and system that has the above characteristics (or
characteristics that are analogous). The physical device should be abstracted – a user
should be able to specify a hardware design, in HDL for example, and rely on the system
to run their design in the cloud. They should not have to worry about what specific device
their hardware must run on, nor about compiling for different devices. They should also
not be aware of other physical aspects of the system – such as the physical location of
resources or the number of available resources (i.e. the illusion of infinite resources in the
cloud should be maintained).
FPGA size and density have grown to the point where it is feasible for the device to be virtualized and shared between users, with enough fabric left over for each user to run non-trivial hardware designs. This also has the benefit of improving the usage efficiency
and cost-effectiveness of a device, since many useful hardware designs do not need as
much logic as the entire device provides. Full hardware virtualization would allow the
cloud provider to put the unused fabric to work for other users.
Another issue fully virtualized hardware would solve is that of security. Giving a user
full control over an entire FPGA directly connected to the network may be risky for a
cloud provider. It would allow a nefarious user the ability to inject malicious data directly
into the provider datacenter, at extremely high rates (10Gb/s or more). A hypervisor, in
the case of a regular VM, acts as a buffer between the user’s guest OS and the provider
hardware, allowing the provider to set up security and police network traffic before it
gets onto the internal network. The hardware virtualization layer is therefore designed
to allow the provider to police the data going in and coming out of the user hardware.
The general approach for virtualization of the hardware is based on Partial Reconfig-
uration (PR) of the FPGA. This technique of reconfiguring specific portions of an FPGA
while the rest remains running can be used to effectively split the device into several
regions that can be offered individually as resources to cloud users. To familiarize the
reader, the basics of Partial Reconfiguration will be reviewed in the following section.
3.2.2 FPGA Partial Reconfiguration
Partial Reconfiguration is a capability of some FPGAs where portions of the device can
be reconfigured independently, without affecting other circuits running on the device.
Physical portions of the device must be specified to be a Partially Reconfigurable Region
(PRR), and specific hardware modules of the overall design must be mapped to one of
these PRRs. Multiple modules can be compiled for one PRR, however only one can
be configured at run time. These modules are called Partially Reconfigurable Modules
(PRMs). Generally, the logic surrounding the PRRs is fixed, and is referred to as the
static logic. Major FPGA vendors support partial reconfiguration [48, 49, 50].
Figure 3.2 depicts a partially reconfigurable FPGA system. There is one PRR, and
three PRMs (PRM A, B and C). Each PRM contains different hardware implementing
different functionality, and each PRM can be dynamically configured into the PRR at
run-time while the Static Logic remains running.
PR introduces complexities into the hardware compilation process. The interface from
the static logic to a PRR, called the PR Boundary, must be dealt with carefully by the
CAD compile process. The static logic must be compiled once, since it does not change,
along with one of the PRMs. After placement and routing, the physical wires crossing
the PR boundary are set permanently since they connect to the static logic, shown in
Figure 3.2 as Static Connection Points. Further PRMs compiled with the static logic
must have their connections routed to the same physical locations, so that when they
are partially reconfigured, their connections actually connect to the running static logic.
This is also shown in Figure 3.2, where any PR Boundary crossing signal is routed to the
same location in each PRM. This is usually accomplished by locking the placement of
the logic cells whose wires cross the boundary (called anchor LUTs) after compiling the
static logic. Obviously, every PRM for a given PRR must have the same logical top-level
ports, whether or not they use them all.
Other considerations must be made by designers using PR. During reconfiguration,
outputs of a PRR may be in flux and have unknown values. The static logic should
have a method of freezing these outputs or ignoring them while reconfiguration takes
place. Timing constraints can also be harder to meet in PR systems, since the CAD
tools are unable to perform any logic optimization across the PR boundary. Xilinx Inc.
suggests registering signals both before and after the PR boundary to improve timing
performance [48]. Lastly, the designer should ensure that a freshly reconfigured PRM is
fully reset to a known state.
Figure 3.2: FPGA Partial Reconfiguration.
3.2.3 Virtualization via PR
In a cloud computing context, the static logic surrounding the Partially Reconfigurable
Regions implements hypervisor functions – providing a buffer under control of the cloud
provider in between the network and the user-defined hardware in the PRRs. Just as
in a VM hypervisor, this will allow the cloud provider to implement some measure of
security, and possibly other required management functions. The static logic also has
several other functions that will become apparent as the full design is described below.
3.2.4 Static Logic Design
As mentioned previously, partial reconfiguration is used to split a single FPGA into
several reconfigurable regions, each of which are managed as a single Virtualized FPGA
Resource (VFR). In effect, this virtualizes the FPGA and makes it a multi-tenant device,
although still requiring the external control of the Agent. A user can now allocate one
of these VFRs and have their own custom designed hardware placed within it. The
static logic surrounding the VFRs is still under control of the cloud provider, and must
accomplish several functions. The method of data transfer between a user and their VFR
is over the network, and therefore the static logic must facilitate forwarding of packets to
the correct VFR. To do this, the static logic system must track the network information
(i.e. MAC addresses) of each VFR as provided by the OpenStack Controller. The static
logic is also designed in a way that enables it, and thus the cloud provider, to police
the interfaces to the VFRs, to maintain some basic network security, such as prevention
of sniffing traffic and spoofing addresses. The static logic contains interface hardware
for the 10G Ethernet ports, memory controllers, all chip-level I/O, and a method of
communicating with the Agent. The following paragraphs and subsections describe the
design choices made for this thesis to accomplish these functions.
Figure 3.3 shows a block diagram of the on-FPGA portion of the system. A Soft
Processor (that is, a microprocessor implemented inside the FPGA fabric) communicates
with the Agent that runs on the Host machine. The Soft Processor is attached to a Bus
that allows it to communicate with and control the different components of the system. A
bus is a two-way communication system with two kinds of actors: Masters and Slaves. Masters can
initiate read and write transactions by addressing the Slave they wish to communicate
with, while Slaves can only respond to reads or writes. The Soft Processor is a Bus
Master. The DRAM Controller is a Bus Slave that facilitates access to Off-Chip DRAM.
The MAC Memory-mapped (Memmap) Registers are also Slaves that allow the Soft
Processor to control the Input Arbiter and Output Queues (described in the following
subsection). The VFRs are wrapped in Bus Masters, which allows them to access the
DRAM Controller slave. Packet streams from 10G interfaces pass through the Input
Arbiter and the VFR Bus Master wrappers and enter the VFRs. Output streams exit
the VFR and connect to the Output Queues and subsequently the egress interfaces. The
Agent can reprogram the entire chip or individual VFRs through an external Programmer
that operates over JTAG. The major subcomponents shown in Figure 3.3 are described
in the following subsections.
Figure 3.3: System view of the on-FPGA portion of the virtualization hardware.
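As a software analogy for the bus just described, the following sketch models Masters issuing reads and writes that the bus routes to the memory-mapped Slave owning the addressed range. The address map and register layout are invented for illustration.

```python
# Minimal model of a memory-mapped bus: a master addresses a slave by
# bus address; the bus routes each transaction to the slave whose range
# contains that address. (Illustrative only; addresses are made up.)

class Bus:
    def __init__(self):
        self.slaves = []  # list of (base, size, slave)

    def attach(self, base, size, slave):
        self.slaves.append((base, size, slave))

    def _find(self, addr):
        for base, size, slave in self.slaves:
            if base <= addr < base + size:
                return base, slave
        raise ValueError("no slave mapped at 0x%08x" % addr)

    def write(self, addr, value):
        base, slave = self._find(addr)
        slave.write(addr - base, value)

    def read(self, addr):
        base, slave = self._find(addr)
        return slave.read(addr - base)

class RegisterSlave:
    """E.g. the MAC memory-mapped registers written by the soft processor."""
    def __init__(self, nregs):
        self.regs = [0] * nregs
    def write(self, offset, value):
        self.regs[offset // 4] = value   # 32-bit word-aligned registers
    def read(self, offset):
        return self.regs[offset // 4]
```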
Data Transfer and I/O
A Streaming Interface facilitates packet transfer into the system from 10G Ethernet ports.
Streaming Interfaces are one-way communication channels. They consist at a minimum
of four signals:
1. A variable bit width Data signal.
2. A single bit Valid signal.
3. A single bit Ready signal.
4. A Clock signal.
The actual data of the packet is transferred in chunks through the Data field, where
each chunk is the bit width of the Data field. These chunks are referred to as flits. At
the positive edge of the Clock signal (that is, a logic low to logic high transition), one
flit is transferred if and only if both the Valid and the Ready signals are logic high. The
Valid signal is asserted by the sender along with the flit in the Data field, and the Ready
signal is asserted by the receiver to indicate it is ready to receive data. The Valid and
Ready signals are also referred to as handshake signals. One flit can be transferred over
the Data signal every clock cycle.
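The valid/ready handshake can be modelled cycle by cycle in a few lines of Python. This is an illustrative simulation of the rule "one flit transfers when both Valid and Ready are high", not production code.

```python
# Cycle-level model of a streaming interface: one flit moves per clock
# edge if and only if both Valid and Ready are logic high.

def simulate_stream(flits, ready_pattern):
    """flits: data the sender wants to transmit, in order.
    ready_pattern: the receiver's Ready signal for each cycle (True/False).
    Returns (flits received, cycles elapsed)."""
    received, idx = [], 0
    for cycle, ready in enumerate(ready_pattern):
        valid = idx < len(flits)   # sender asserts Valid while data remains
        if valid and ready:        # handshake: transfer exactly one flit
            received.append(flits[idx])
            idx += 1
        if idx == len(flits):
            return received, cycle + 1
    return received, len(ready_pattern)
```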
The Input Arbiter block, shown in Figure 3.4, is responsible for directing incoming
packets to the correct VFR. The Input Arbiter contains a Content Addressable Memory
(CAM) that the arbiter uses to match a packet’s Destination MAC address to a specific
VFR. An incoming packet is stalled for one clock cycle while the CAM looks up the
destination VFR. In the case that there is no matching VFR in the CAM, the packet
is simply dropped. This has the benefit of preventing VFRs from inadvertently (or
intentionally) sniffing Ethernet traffic that does not belong to them, but the drawback of
being unable to receive broadcast packets. This could be addressed by designing a more
complex switching fabric within the Input Arbiter.
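The arbiter's CAM behaviour, including the drop-on-no-match policy, can be sketched as follows. This is a hypothetical software model of the hardware block; the real CAM is implemented in FPGA fabric.

```python
# Model of the Input Arbiter's CAM: match a packet's destination MAC to a
# VFR region code, or drop the packet when no entry matches.

class InputArbiter:
    def __init__(self):
        self.cam = {}  # destination MAC -> VFR region code

    def program(self, mac, region_code):
        # Done by the soft processor via memory-mapped registers.
        self.cam[mac] = region_code

    def route(self, packet):
        """Return the destination region code, or None to drop the packet.
        Dropping unmatched packets prevents VFRs from sniffing traffic,
        at the cost of not delivering broadcast frames."""
        return self.cam.get(packet["dst_mac"])  # None => drop
```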
The CAM must be programmed with the VFR MAC addresses (provided by Open-
Stack) before any packets can be received. MAC addresses for new VFRs are programmed
into the CAM by the soft processor via several memory-mapped registers. The software
running on the soft processor receives the MAC address and corresponding VFR region
code over a UART link from the Agent running on the host. The software running on the
processor and the communication protocol with the Agent will be described in more detail
in Section 3.3.
The output queues operate similarly, except this block simply tracks each VFR’s MAC
in a register, and prevents spoofing by forcing an outgoing packet’s source MAC address
to be the VFR’s MAC address. The output queue MAC addresses are also updated via
the same memory mapped registers as the input arbiter.
Virtualized Accelerator Wrappers
The VFRs are wrapped inside Bus Masters, as can be seen in Figure 3.3. These are
labelled 1 to N. Figure 3.5 shows a closer view of this wrapper design, which must
Figure 3.4: Virtualization hardware input arbiter block.
accomplish two things – facilitate safe partial reconfiguration of a VFR, and provide
the VFR access to off-chip DRAM. Safe reconfiguration is of paramount importance,
since the cloud provider should seek to guarantee that user hardware will be configured
correctly. The system must ensure that no information or packet is being transferred
across the PR boundary during reconfiguration. This can result in lost data, or a newly
reconfigured VFR receiving data starting in the middle of a packet, which could cause
errors in the user hardware.
The VFR reconfiguration process is handled as follows: When a request comes to
the Agent to configure a new VFR, the Agent sends a command to the soft proces-
sor to freeze the interfaces of the selected region. The soft processor will de-assert the
PR INTERFACE ENABLE signal by writing to the register, which causes the wrapper
hardware to set all streaming interface handshake signals low after any current transfer
finishes. This is accomplished by using a register for each Stream Interface that records
whether or not the stream is in the middle of a transaction. When both transaction reg-
isters are logic 0, the hardware sends an ACK to the processor through another memory-
Figure 3.5: The VFR wrapper design.
mapped register that the processor is polling, and the processor asserts the PR RESET
signal and notifies the Agent that it is now safe to reconfigure that region. The VFR
is held in reset, and after the new user hardware is configured via an external JTAG
connection and the MAC address programmed into the input arbiter and output queue,
the Agent instructs the soft processor to release PR RESET and enable the interfaces
again. This method also ensures that new user hardware is fully reset before interfaces
are enabled.
The wrapper also facilitates access to low-latency off-chip DRAM for a VFR. The VFR
wrapper has read and write ports so that user hardware can insert memory operations
into a queue that acts as a bus master. The queue would be implemented as a FIFO
so that all writes added to the queue are done in order, and read data is also returned
in order. The queue, which is part of the wrapper hardware under provider control,
partitions the memory address space so that each VFR in the physical FPGA only sees a
subset of the total DRAM – effectively giving each VFR a private off-chip memory, while
making it impossible for any VFR to access another’s memory. This is done by dividing
the address space and offsetting addresses by an appropriate amount before entry into
the operation queue. For example, if the memory is 64 MB then the memory address space spans 0x00000000 to 0x03FFFFFF, a total of 0x04000000 address locations. If there are four VFRs then this space is divided into four, and each VFR has log2(0x01000000) = 24 address bits, or a range of 0x00000000 to 0x00FFFFFF (16 MB each). When a VFR makes a read or write operation, the 24 address bits are zero-extended to 32 bits (the system bus width) and offset by an appropriate amount through the addition of a multiple of 0x01000000. The offset is done outside the VFR and inside the
provider-controlled static logic, making the user hardware entirely unaware that memory
beyond its partition exists. The queue is reset and cleared upon VFR reconfiguration via
the PR RESET signal.
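The address partitioning described above amounts to a simple offset calculation, sketched here for the four-VFR, 64 MB configuration used in the example:

```python
# Sketch of the wrapper's address translation: each VFR sees a private
# 24-bit (16 MB) address space; the static logic adds the partition offset
# before the operation enters the memory operation queue.

PARTITION_SIZE = 0x01000000   # 16 MB per VFR (24 address bits)

def translate(vfr_index, vfr_addr):
    """Map a VFR-local address to the physical DRAM address."""
    assert 0 <= vfr_addr < PARTITION_SIZE, "address outside the 24-bit space"
    # Zero-extend to the 32-bit bus width and add the partition offset.
    return vfr_index * PARTITION_SIZE + vfr_addr
```

Because the offset lives in provider-controlled static logic, the user hardware is entirely unaware that memory beyond its partition exists.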
In the prototype system using the NetFPGA10G, the RLDRAM modules have a
minimum read latency of three clock cycles, and a minimum write latency of four clock
cycles, plus several cycles for the controller running on the FPGA. The module can burst
read and write up to a length of eight. Maximum aggregate bandwidth is 38.4 Gb/s.
This memory system is admittedly rudimentary compared to other possibilities such
as a multi-port memory controller, but it was chosen to keep the static logic simple and
low-area.
3.3 Agent Design
This section describes the Agent – the piece of software that communicates with Open-
Stack, performing resource specific management. One Agent manages all the VFRs on
one FPGA, although the design could easily be extended to manage VFRs on multiple
FPGAs. This section will describe Agent requirements in general, but also focus on the
Agent implemented in the SAVI Testbed prototype, which uses the NetFPGA10G card.
Generally, the Agent must implement the resource-specific management commands
from OpenStack (issued via the Driver). At the very least, these must include boot
(instantiate a new resource with the specified parameters) and delete (tear down a running
resource).
The Agent and embedded software use a simple command-acknowledge protocol to
communicate: the Agent will send a command string, and the embedded software will
respond with an acknowledge, and an additional acknowledge for each data parameter
sent as part of the command. If no acknowledge is received, the command is aborted
and re-attempted. In the prototype hardware, the embedded software is run on a Xilinx
Microblaze soft processor. The Agent and OpenStack Driver communicate over the
Testbed network using a text-based protocol over TCP.
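A minimal model of this command-acknowledge protocol might look as follows. The command strings and the link object are placeholders for the real UART framing, not the actual protocol implementation.

```python
# Sketch of the Agent's command-acknowledge exchange: send a command,
# expect an ACK, then one further ACK per data parameter; abort and
# re-attempt if any acknowledge is missing.

def send_command(link, command, params=(), retries=3):
    for _ in range(retries):
        link.send(command)
        if link.recv() != "ACK":
            continue                # no acknowledge: abort and re-attempt
        ok = True
        for p in params:
            link.send(p)
            if link.recv() != "ACK":
                ok = False
                break
        if ok:
            return True
    return False

class FakeLink:
    """Test double for the UART link, with scripted responses."""
    def __init__(self, responses):
        self.responses = list(responses)
        self.sent = []
    def send(self, msg):
        self.sent.append(msg)
    def recv(self):
        return self.responses.pop(0)
```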
The following subsections describe how the Agent boots and deletes VFRs.
3.3.1 Booting
When a boot command is issued in OpenStack, several pieces of information are received
by the Agent:
• UUID – a universal unique identifier for the resource.
• Network information – an IP and MAC address for the resource.
• Image – usually an OS image for VMs, but repurposed for VFRs.
The Agent uses the UUID to track VFRs – UUIDs in OpenStack are an absolute
reference for any object or resource. The image is the most important piece of data
received. It contains PR bitstreams corresponding to the user-designed hardware. How
this is compiled and created will be explained in Section 3.5. The image contains one
PR bitstream corresponding to each physical PRR on the FPGA, numbered so that the
Agent can tell which one is for which region. This is explained in detail in Section 4.2.1.
The Agent chooses the first available unconfigured PRR for the incoming VFR to be
booted, and then begins the reconfiguration process – note that this requires the Agent
to maintain the current state of the system, remembering what regions are currently
configured and what users they belong to. A simple data structure could be used to
store the state of each physical VFR and, if configured, the associated UUID.
First, a “disable interfaces” command is sent to the embedded software. This command has a single parameter: the region code corresponding to the PRR about to be reconfigured. The soft processor freezes the packet stream interfaces as described in Section
3.2.3, and places the PRR (VFR) into reset. The Agent then reconfigures the PRR using
an external reconfiguration tool, in the case of the prototype system, Xilinx iMPACT
over JTAG. Then, the MAC address is programmed into the static logic’s input arbiter and output queues, which allows packets in and out of the newly configured VFR.
The “load MAC address” command is sent to the embedded software followed by the six
byte MAC address received from the OpenStack Controller. Once successful, the “enable
interfaces” command is issued, and the embedded software releases the VFR from reset
and enables the packet stream interfaces.
This method ensures that no packet is in the middle of transfer during a reconfigura-
tion, avoiding the situation where a piece of user hardware might receive a fragmented
packet. It also ensures that user hardware is fully reset before running.
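The boot and delete sequences can be summarized in a sketch of the Agent's state handling. All helper objects here (the UART command channel and JTAG programmer) are hypothetical stand-ins for the real command protocol and iMPACT-based configuration.

```python
# Sketch of the Agent tracking which PRR belongs to which UUID, and the
# freeze -> configure -> program MAC -> enable boot sequence.

class CallLog:
    """Test double standing in for both the UART link and the programmer."""
    def __init__(self):
        self.calls = []
    def command(self, name, args):       # UART command-acknowledge exchange
        self.calls.append((name, tuple(args)))
    def configure(self, bitstream):      # JTAG partial reconfiguration
        self.calls.append(("configure", bitstream))

class FPGAAgent:
    def __init__(self, num_regions, uart, programmer):
        self.regions = {i: None for i in range(num_regions)}  # region -> UUID
        self.uart, self.programmer = uart, programmer

    def boot(self, uuid, mac, bitstreams):
        # Pick the first free PRR, freeze it, configure, program MAC, enable.
        region = next(r for r, owner in self.regions.items() if owner is None)
        self.uart.command("disable_interfaces", [region])
        self.programmer.configure(bitstreams[region])
        self.uart.command("load_mac", [region, mac])
        self.uart.command("enable_interfaces", [region])
        self.regions[region] = uuid
        return region

    def delete(self, uuid, blackbox_bitstreams):
        # Replace the user's hardware with a black-box bitstream.
        region = next(r for r, owner in self.regions.items() if owner == uuid)
        self.uart.command("disable_interfaces", [region])
        self.programmer.configure(blackbox_bitstreams[region])
        self.regions[region] = None
```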
3.3.2 Deleting
Deleting a VFR is similar to booting in terms of what the Agent must do; however, the only piece of data sent is a UUID. The Agent finds the VFR with the corresponding
UUID, and proceeds to reconfigure the PRR with a black box bitstream. This effectively
removes any user hardware from the system. Again this requires that the Agent store
the state of each VFR and associated UUIDs.
3.4 Booting VFRs in OpenStack
Several modifications were needed in the OpenStack Controller (Nova) to get a complete
working system. Major modifications were already made by the SAVI Testbed team to
enable management of bare-metal servers – specifically modifications that allowed the
integration of custom Drivers and therefore custom Agents.
Recall that in Chapter 2 the notion of resource flavor was discussed. Normally, flavor
refers to the number of virtual processors and amount of memory for a VM, but it can
be repurposed to refer to more of a resource type. This is important because the Nova
scheduler uses the flavor submitted by the user to select which Driver to use to contact
the Agent. Each resource in the database references a flavor, so many single resources
can fall under one flavor.
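The flavor-to-Driver selection amounts to a lookup keyed on the requested flavor. The flavor and driver names below are invented for illustration, not the actual identifiers used in the Testbed.

```python
# Sketch of flavor-based scheduling: the flavor on a resource request
# selects which Driver (and hence which kind of Agent) handles the boot.

DRIVERS = {
    "vm.small":  "libvirt_driver",
    "baremetal": "baremetal_driver",
    "vfr":       "fpga_vfr_driver",   # Virtualized FPGA Resource
}

def select_driver(flavor):
    try:
        return DRIVERS[flavor]
    except KeyError:
        raise ValueError("no driver registered for flavor %r" % flavor)
```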
Figure 3.6 shows a diagram of the boot sequence for a VFR, which proceeds as follows:
Upon receiving the Boot command from the User, the OpenStack Controller uses the
specified flavor to choose a resource in the database. The database entry references
the Agent associated with it through an IP and port number. Nova calls the Driver
implementation of the “boot” function, and passes generated networking information,
the Agent IP and port, UUID and any other required information. The Driver then
communicates with the Agent (FPGA host), instructing it to boot a new resource, and
in the case of VFRs, sending over networking information and the image containing
partial bitstreams. The Agent selects the partial bitstream in the image matching a free
region, and programs it along with the network information as in Section 3.3.1. Success
is indicated to the Controller, which then passes a reference to the user.
3.5 Compiling Custom Hardware
Through the modifications and systems described in this chapter, custom hardware ac-
celeration is now available to cloud computing users. However, there is still the question
Figure 3.6: VFR boot sequence.
of how to compile hardware for the system. The VFRs have specific interfaces that any
user hardware must match exactly – not only logically, but also physically, due to the
nature of PR discussed previously. The static logic remains constantly running while
user hardware is partially reconfigured as resources are booted and deleted. Therefore,
any new user hardware must be compiled with the currently running static logic that
is part of the cloud provider’s systems. A compile flow must be developed to allow end
users to compile their custom hardware for use with the virtualization system.
First, the user hardware top-level ports must match those of the static logic PR
boundary logically – a template HDL file is provided to end users inside which they can
define their hardware. The template file contains a module definition whose top-level
ports match the PR boundary in the static logic.
It is assumed at this point that the provider has already synthesized, placed and
routed the static logic. Any user hardware has to be compiled with the placed and
routed static logic, so that the physical wires crossing the PR boundary are placed in the
right locations in the PRM. How this is done in practice depends entirely on the CAD
tool flow provided by the FPGA vendor. For the prototype system implemented for this
thesis, Xilinx PlanAhead is used to perform PR compilation.
The general hardware compile procedure is performed as follows for the prototype
system, but should generalize well to other vendors also. First, the user’s HDL (written
in the provided template) is synthesized to a netlist. This netlist is added to a new
compile “run” in Xilinx PlanAhead – the netlist is assigned to all physical VFRs present
in the system (in the case of the prototype, all four of them). The run is also configured to
use the already placed and routed static logic, and PlanAhead ensures that all boundary
crossing wires are placed in the correct locations in the PRM. The compile run is initiated,
and the tool maps, places and routes the user netlist in all VFRs (which are PRRs). One
design is placed and routed in all VFRs because the Agent requires the flexibility to
place the user’s design in any VFR in the system – PRMs can only be configured inside
PRRs for which they are specifically compiled. Bitstream generation is then completed,
creating one partial bitstream for each physical VFR. These partial bitstreams will be
used by the Agent to configure the user hardware into a running system when a VFR is
booted in OpenStack.
Recall that user hardware is sent to the Agent as an “image”. All partial bitstreams
generated by the compile are added to a compressed archive, and this archive is uploaded
as an image to OpenStack by the user with the image management tool called glance.
When the Agent selects which physical VFR partition to use for the user's hardware,
it simply selects the partial bitstream corresponding to that PRR, configures it, and
disregards the others.
Figure 3.7 shows the general compile flow, upload and boot procedure. For the
prototype, such a compile system is realized using a script-based approach. The user
places their netlist in a specific folder, and then executes a compile script that uses
PlanAhead to perform the aforementioned compilation steps and bitstream packaging.
The end result is the zip archive containing the partial bitstreams, ready for upload as
an image. Steps inside the dashed lines of Figure 3.7 are part of the compile script.
Figure 3.7: Compile flow. (The user's HDL (user_ip.v) is synthesized with XST (ISE 13.4) into a netlist (user_ip.ngc), which enters the PR compile flow in Xilinx PlanAhead 13.4 together with the placed-and-routed static logic (static.ncd) from the provider; bitstream generation produces the user IP PR bitstreams in a zipped folder, uploaded via glance as an OpenStack image.)
Chapter 4
SAVI Testbed Prototype
A prototype of the system described in Chapter 3 has been created using the SAVI
testbed at the University of Toronto. The goal of the prototype system is to validate the
system architecture and show its feasibility, as well as to evaluate and attempt to quantify
the benefit of reconfigurable hardware resources in an Infrastructure-as-a-Service cloud
system.
4.1 FPGA Hardware
The prototype is based on the SAVI testbed OpenStack cloud, with its ability to man-
age heterogeneous resources. The FPGA-based portion is implemented using the NetF-
PGA10G [45], available in the SAVI testbed as a non-virtualized resource in the form
of a baremetal server with NetFPGA10G connected to a PCIe slot. The NetFPGA10G
comes equipped with a Xilinx Virtex 5 VTX240T, four 10G Ethernet interfaces, and 128
MB of off-chip reduced latency DRAM. Although of an older generation than currently
available, the Virtex 5 [51] is still large enough to realize a sufficiently non-trivial sys-
tem that demonstrates the required functionality. The static logic architecture uses as
a base the NetFPGA10G open source infrastructure [52]. Certain components of the
infrastructure were modified to realize the architecture described in Chapter 3:
1. Input Arbiter: The open source infrastructure provides an input arbiter that simply
forwards packets from all four interfaces to a single output stream in a round-robin
fashion. This was modified to produce several output streams, one corresponding to each
physical VFR in the system, and a CAM was inserted in the pipeline to realize cor-
rect forwarding to said VFRs. The CAM is created using Xilinx IP [53]. Additional
top-level ports were also added to the Input Arbiter that connect to the write ports
of the CAM. These top-level ports then connect to several memory-mapped regis-
ters (implemented as GPIO peripherals) connected to the microprocessor system
bus. This allows a Xilinx Microblaze soft processor to program the MAC address
for any specific VFR, by writing data to the GPIO registers. A single bit in one of
the registers forms a Write signal to the CAM, which the software toggles to write
the MAC and VFR region code into the CAM. The VFR datapath is clocked at a
higher frequency than the embedded processor system, which may cause the CAM
to be written several times in succession before the processor can write a ‘0’ to the
write bit. This is generally not a problem however, since the same location will be
written with valid data each time; it is only a minor inefficiency.
2. Output Queues: The NetFPGA10G infrastructure Output Queues were also modified to accommodate multiple incoming packet streams. Registers that contain the
MAC addresses of each VFR were added, allowing the hardware to force source
MAC address fields. Top-level ports were also added to program these registers via
the same GPIO peripherals as the Input Arbiter.
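The write sequence in these two items can be sketched in software terms as follows; the register offsets, field packing, and write-bit position are illustrative assumptions, not the prototype's actual memory map:

```python
# Hypothetical GPIO register offsets -- the real memory map is not
# specified here, so these values are illustrative only.
GPIO_MAC_LO, GPIO_MAC_HI, GPIO_CTRL = 0x0, 0x4, 0x8
CTRL_WRITE_BIT = 0x1

def program_vfr_mac(write_reg, mac, region_code):
    """Load a MAC address / VFR region code pair into the Input Arbiter
    CAM through memory-mapped GPIO registers (layout assumed)."""
    write_reg(GPIO_MAC_LO, mac & 0xFFFFFFFF)            # low 32 bits of MAC
    write_reg(GPIO_MAC_HI, (mac >> 32) | (region_code << 16))
    write_reg(GPIO_CTRL, CTRL_WRITE_BIT)  # toggle Write to commit the entry
    write_reg(GPIO_CTRL, 0)  # clear; the CAM may latch the same entry repeatedly
```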
All packet streams in the design are AXI Stream [54] streaming interfaces, with a
width of 256 bits running at 160 MHz, equating to just over 40 Gb/s peak throughput.
The stream widths are kept overprovisioned at 256 bits for the sake of simplicity. The
prototype system contains four physical VFRs, each a Partially Reconfigurable Region
(PRR). Each VFR region contains 11376 LUTs and 15 36K BRAMs. Four regions are used because this is a non-trivial number that still permits a region size large enough to implement meaningful hardware. It should also be noted that
increasing the number of VFRs in the system also increases the number of required
streaming interfaces, causing a large increase in required routing resources. Normally the placement and routing algorithms could handle this acceptably, but the problem is made vastly more complicated because PRRs are physically fixed and the CAD tools are unable to optimize across the PR boundary. These constraints make it
significantly more difficult for the CAD tools to meet timing, especially as the number
of PRRs increases. This also contributed to the limiting of the number of VFRs.
Resource utilization for the static logic is shown in Table 4.1. The counts in Table 4.1
do not include the counts for the VFR regions. The design with four VFRs successfully
meets timing with the AXI streams running at 160MHz and the soft processor system
running at 100MHz. The VFRs are connected to the AXI streams, so they also run at
160 MHz and make this clock available to user hardware. User hardware must meet the
160 MHz timing constraint to work properly, as there are no other clocks provided.
Table 4.1: Resource Usage for System Static Hardware
Resource Usage (Used / Device Total)
Flip-flop 29327 / 149760 (19%)
LUT 28711 / 149760 (19%)
36K BRAM 105 / 324 (32%)
4.2 Agent Software
The Agent for the prototype is implemented in Python, in keeping with the rest of the
OpenStack project, and because Python provides high functionality with low coding
effort. Although this can come at the cost of performance, the Agent is not required to
be a high performance component.
The Agent uses a collection of Python objects and standalone functions to accomplish
its required tasks, and is designed in a way to make it FPGA hardware agnostic – the
idea being that the same Agent software can be used for any FPGA device realizing the
VFR system, with minimal modifications, provided the communication protocol to the
soft processor is implemented correctly.
The software contains two global objects: a statusTable object, which holds infor-
mation about the status of the hardware, and a serial object, which provides RS232
send and receive functionality for communication with the soft processor system. The
serial object is implemented using the PySerial library.
4.2.1 statusTable and Associated Objects
The statusTable global object stores the needed state of the entire FPGA hardware
system. This includes:
1. The number of physical VFR regions in the managed FPGA hardware.
2. The FPGA system type. This refers to the FPGA device or part number, and
the specific board it is mounted on. For example, the prototype system type is a
NetFPGA10G.
3. A Python list of region objects; each region object corresponds to one unique PRR (VFR) in the FPGA system.
4. A string containing information about the system type, useful for debugging.
The statusTable object also provides several methods (functions) that allow the
Agent to manage the FPGA hardware. The first is (in pseudocode):
statusTable.program(bitPkg, macAddr, serial, uuid)
The first argument to the function (bitPkg) points to a compressed file containing the image – a collection of partial bitstreams, one per PRR in the system, which together form the hardware image supplied by the user and sent to the Agent by the OpenStack controller.
macAddr is a string containing the MAC address of the VFR about to be booted. serial
is a reference to the serial object, and uuid is a string containing the OpenStack gen-
erated UUID of the resource. Algorithm 1 shows the basic operation of the function.
Algorithm 1: The statusTable.program() function
Data: bitPkg, regionList, uuid, MAC
Result: An unconfigured VFR in regionList is configured with the user hardware
in bitPkg, and the MAC address is set to MAC
for region in regionList do
if region is not configured then
region is free, find bitstream file matching this region
pkg = uncompress(bitPkg)
for file in pkg do
if match(file, “*_%d” % region.id) then
region.configure(file, uuid)
region.setMAC(MAC)
return Success
end
end
end
end
a free region was not found, fail
return Failure
The match function call in Algorithm 1 is a regular expression matching the region
object identifier (an integer id ∈ [0..N−1], where N is the number of PRRs) to a portion
of a particular bitstream filename. The filename format of the bitstreams in the package is [name]_[region_id].bit. For example, if the user's hardware system was named myhardware and was compiled for a system containing four VFRs, the bitstreams generated by the script-based compile would be named myhardware_0.bit ... myhardware_3.bit.
This is how the Agent knows which bitstreams in the package correspond to which PRRs.
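This selection step can be sketched in Python as follows; the helper name and archive handling are assumptions, not the Agent's actual implementation:

```python
import re
import zipfile

def find_bitstream(image_path, region_id):
    """Return the name of the partial bitstream in the uploaded image
    archive that matches a given PRR id ([name]_[region_id].bit)."""
    with zipfile.ZipFile(image_path) as pkg:
        for name in pkg.namelist():
            if re.match(r".*_%d\.bit$" % region_id, name):
                return name
    return None  # no bitstream for this region -- the image is malformed
```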
The second method in the statusTable object is the opposite of the first:
statusTable.release(serial, uuid)
This function is used to release a currently running VFR with UUID uuid and configure a black box PRM to remove the user hardware from the system. It also removes that resource's MAC address from the hardware (i.e., from the Input Arbiter CAM and the Output Queues registers). The basic operation is shown in Algorithm 2.
Algorithm 2: The statusTable.release() function
Data: regionList, uuid
Result: A configured VFR in regionList is released
for region in regionList do
if region is configured AND region.uuid == uuid then
region matches uuid to be released, remove MACs and configure black box
region.resetMAC()
bb = “blackbox_” + region.id + “.bit”
region.release(bb)
return Success
end
end
the configured region was not found, fail
return Failure
As can be seen in the program and release functions, the statusTable object holds a list of non-global region objects, each corresponding to one PRR on the FPGA. The region object contains data describing a single VFR:
1. An integer identifier for the particular region.
2. A string containing the configured VFR's OpenStack UUID.
3. A boolean variable indicating whether or not the region is configured.
4. A string containing the MAC address of the region.
The region object also contains several methods for managing a single VFR. These
are:
1. region.resetMAC(serial) : Send a zero MAC address over the serial connection for the soft processor to program into the VFR corresponding to region.id.
2. region.setMAC(serial, newMAC) : Send newMAC over the serial connection for the
soft processor to program into the VFR corresponding to region.id.
3. region.configure(serial, bitstream, uuid, sysType) : Configure bitstream
into VFR region. This involves sending a command to disable stream interfaces,
receiving an acknowledge, using Xilinx iMPACT to program the partial bitstream
for the correct FPGA (dependent on sysType), and re-enabling the stream inter-
faces and receiving a final acknowledge. Also sets region.isConfigured to True.
4. region.release(serial, bitstream, sysType) : Disable stream interfaces as
in the configure method, use Xilinx iMPACT to configure bitstream (a blackbox)
into the FPGA dependent on sysType, and set region.isConfigured to False.
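A minimal Python model of the region object is sketched below. The serial command strings are invented for illustration, and the iMPACT invocation is abstracted into a program_bitstream callable in place of the sysType-based dispatch:

```python
class Region:
    """Sketch of the region object tracking one PRR/VFR. The serial
    protocol and the iMPACT call are abstractions, not the real code."""

    def __init__(self, region_id):
        self.id = region_id
        self.uuid = None
        self.is_configured = False
        self.mac = "00:00:00:00:00:00"

    def set_mac(self, serial, new_mac):
        serial.send("SETMAC %d %s" % (self.id, new_mac))  # command grammar assumed
        self.mac = new_mac

    def reset_mac(self, serial):
        self.set_mac(serial, "00:00:00:00:00:00")

    def configure(self, serial, program_bitstream, bitstream, uuid):
        serial.send("DISABLE %d" % self.id)    # quiesce the stream interfaces
        program_bitstream(bitstream)           # e.g. wraps Xilinx iMPACT
        serial.send("ENABLE %d" % self.id)     # re-enable and acknowledge
        self.uuid, self.is_configured = uuid, True

    def release(self, serial, program_bitstream, blackbox_bitstream):
        serial.send("DISABLE %d" % self.id)
        program_bitstream(blackbox_bitstream)  # configure the black box PRM
        serial.send("ENABLE %d" % self.id)
        self.uuid, self.is_configured = None, False
```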
Standalone Functions
Several standalone functions are also used to implement the required functionality of
the Agent. These include functions to set up a TCP socket for communication with the OpenStack controller; to read and write bytes, null-terminated lines, and files over that socket; and to set up and run a TCP server that the OpenStack controller can connect to. Also included is a main() function where execution begins
when the Agent is started. This is described in the following section.
4.2.2 Initialization and Operation
The Agent's execution starts in a main() function, which first sets up a serial connection (preferred to PCI Express for simplicity) to the FPGA hardware, specifically the soft processor, and queries the embedded system for system information. The embed-
ded system responds with a string containing the number of VFRs and the system type.
With this information, the Agent creates a new statusTable object whose constructor
sets up the list of region objects and initializes them. If the query to the embedded
system fails, the Agent will retry several times and exit with a failure message if there is
no response.
Upon initializing the statusTable object, the Agent acts as a server, listening on
a predefined TCP port for incoming connections from the OpenStack controller. The
Driver, which integrates with the controller, gets the connection parameters from the
resource flavor information passed to it after a boot command is issued by the user. The
Driver sets up a TCP connection to the Agent when a resource needs to be booted or
torn down, and the Agent will accept the connection and begin reading null-terminated
lines from the Driver. These lines contain string-based commands followed by other
string-based parameters and files. The Agent carries out programming the resource as
described in other sections and a successful return is propagated back to the user. Any
failure along the way is also propagated back to the user.
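The line-oriented command channel can be sketched as below; the PROGRAM command grammar shown is an assumption based on the description above:

```python
def read_line(recv):
    """Read one null-terminated command line from a socket-like recv
    callable, as the Agent does when parsing Driver commands."""
    buf = bytearray()
    while True:
        b = recv(1)
        if not b or b == b"\x00":   # connection closed or terminator seen
            return buf.decode()
        buf += b

def parse_command(line):
    """Split a command line, e.g. "PROGRAM <uuid> <mac>" (grammar assumed)."""
    parts = line.split()
    return parts[0], parts[1:]
```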
Chapter 4. SAVI Testbed Prototype 43
Figure 4.1 shows a sequence diagram of the entire boot procedure. A User would use
OpenStack Nova to issue a Boot VFR command and the OpenStack Controller would
map the resource Flavor to the correct Driver (the VFR Driver) and relay the Boot
VFR command. The VFR Driver extracts the Agent IP address and port and sets up
a TCP connection. Then, the VFR Driver sends a PROGRAM command, followed by
arguments containing the Resource UUID, the Resource MAC address, and the Image
(PR bitstreams). The Agent calls statusTable.program(), which configures the PR
bitstream from the image into an available VFR and programs the MAC address. Success
is returned to the Driver, which closes the TCP connection, and returns Success to the
OpenStack controller, which in turn returns Success and a reference to the User.
Figure 4.1: A sequence diagram of the entire boot procedure in the SAVI testbed prototype system.
Chapter 5
Platform Evaluation and Application Case Studies
The previous chapters have described the architecture for FPGA virtualization and the
prototype system in the SAVI testbed used to realize it. In this chapter, a brief analysis
of the prototype is presented. Comparisons are made in terms of time to boot resources
(comparing VMs and VFRs), and the trade-off between virtualized and non-virtualized
hardware is discussed. Furthermore, two application case studies are presented – hard-
ware applications run in the testbed using the virtualization infrastructure.
5.1 Platform Evaluation
It is desirable to see how efficiently OpenStack can handle these new resources, relative
to existing Virtual Machines. At the very least, one would expect that a VFR should be
able to be allocated in the same amount of time or less than a VM. An experiment was
set up to determine how quickly a VFR can be booted compared to a VM.
The experiment attempts to measure the amount of time from when the command
is issued to OpenStack to the point at which the resource becomes usable. Usability will
obviously differ for different resources, and the term is defined for VMs and VFRs as
Table 5.1: Boot Times for VMs and VFRs
Resource Boot Time (seconds)
Virtual Machine 71
VFR 2.6
follows:
• Virtual Machine: A VM is defined as usable at the point where an SSH connection
can successfully be established.
• Virtualized FPGA Resource: A VFR is defined as usable at the point where the
hardware can successfully process packets as its design intends.
The VFR hardware used for the experiment is a simple Layer 2 packet reflector.
An incoming packet is buffered, and has its source and destination MAC address fields
swapped. The packet is sent back out, effectively returning to its original sender. This
provides a low-latency method of determining whether a VFR is usable or not.
The experiment is run as follows: From a node within the SAVI testbed, the boot
command to allocate a VFR is entered. Time is measured while a small program sends
out a packet destined to the MAC address of the new VFR (extracted from OpenStack)
every tenth of a second. Between packet sends, the program listens for the reflected response and, if one is found, exits and stops the timer. This is done five times and
an average is taken. For the VM, the boot command is entered and time is measured
while a shell script continually attempts to make an SSH connection. The flavor of the
VM is a small size, with 2048 MB of RAM and one virtual CPU. The timer stops once
a successful connection is made. Again this is done five times and an average is taken.
Results of these averages are shown in Table 5.1.
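Both measurements follow the same polling pattern, sketched here with the probe (packet reflection for the VFR, an SSH attempt for the VM) abstracted as a callable:

```python
import time

def time_until_usable(probe, interval=0.1, timeout=120.0):
    """Poll `probe` (send a packet and listen for the reflection, or try
    an SSH connection) every `interval` seconds; return the elapsed time
    at the first success."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if probe():
            return time.monotonic() - start
        time.sleep(interval)
    raise TimeoutError("resource never became usable")
```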
As can be seen in the results, a VFR can be booted much faster than a VM. This is
unsurprising however, considering that it takes on the order of milliseconds to partially
reconfigure a PRR over JTAG. The time to reconfigure a PRR may increase with the physical size of the region, but it would still be on the order of milliseconds. A VM must first be initialized by the hypervisor, and then generally takes about as long to boot as a regular non-virtualized machine. It can be concluded that VFR systems can be scaled up and down relatively quickly.
Virtualization Trade-offs
Virtualization usually comes with trade-offs in terms of performance – Virtual Machines
typically have worse performance than a baremetal machine due to the underlying ab-
stractions, although much progress has been made in reducing this performance gap to
the point where it is very small, within a few percentage points [55, 56]. VFRs also come
with some trade-offs in terms of performance, as well as some that are not analogous to
processor virtualization. First, the performance trade-offs are analyzed.
Since VFRs are network connected accelerators, performance can be framed in terms
of throughput and latency. The static logic in the virtualization system will cause addi-
tional latency, although the system has been designed to balance this trade-off in favour
of the user’s hardware. The baseline for the following comparisons is the open source
NetFPGA10G hardware, unmodified. In the virtualization system, the input arbiter is
modified to include a CAM – this CAM is generated using Xilinx IP and adds one cycle
of latency, but this only happens once per packet. This is because only the first flit of the
packet is needed by the CAM to decide which VFR to route the packet to, after which
one flit is transferred per cycle as normal. From the arbiter the streams connect to each
physical VFR wrapper module. Recall that the wrapper modules contain logic under con-
trol of the cloud provider, and the Partially Reconfigurable Region (PRR) into which the
user hardware gets configured. The wrapper module contains two FIFO packet buffers –
one to buffer packets going into the PRR, and another to buffer outgoing packets. Each
of these buffers adds another cycle of latency. The output queues in the NetFPGA10G
hardware were also modified to force outgoing packet source MAC addresses, but this is
done completely in combinational logic during the cycle in which the hardware decides to
which output port the packet should be directed, therefore adding no additional latency.
Thus, in terms of performance, the user sees only a one-cycle pipeline stall per packet,
and two additional cycles of latency, which, at 160MHz, is a mere 12.5 nanoseconds.
Buffering does not affect throughput, but the one-cycle stall per packet does; the larger the packet, the smaller the effect, since there are more flits per packet. Peak theoretical throughput of the datapath is 256 bits per
cycle, or 256 / 6.25ns = 40.96 Gb/s, but with a minimum size packet of 64 bytes and a
one cycle pipeline stall, this falls to 27.3 Gb/s. For maximum size Ethernet packets of
1500 bytes the maximum throughput increases to 40.11 Gb/s.
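These figures follow from the flit arithmetic; the following sketch reproduces the calculation, counting fractional flits per packet plus one stall cycle:

```python
def vfr_throughput_gbps(packet_bytes, flit_bits=256, clock_mhz=160):
    """Effective datapath throughput given one pipeline-stall cycle per packet."""
    bits = packet_bytes * 8
    cycles = bits / flit_bits + 1          # flits per packet + 1 stall cycle
    bits_per_cycle = bits / cycles
    return bits_per_cycle * clock_mhz * 1e6 / 1e9
```

For 64-byte packets this yields roughly 27.3 Gb/s, and for 1500-byte packets roughly 40.1 Gb/s, in line with the figures above.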
There are however, additional penalties for virtualization in terms of available area
for use in the FPGA fabric. For example, in the prototype system, each VFR has 11376
LUTs, which amounts to 7.6% of the entire device, or 9.4% when considering fabric
already taken by the static logic. This was the maximum VFR area that could be
achieved in a four-VFR system while manually placing and adjusting the PRRs. In a
production system, a cloud provider would likely have several different resource flavours
of VFR, with different partition sizes that could help to alleviate this problem.
5.2 Case Study: Load Balancer
In this section the first application case study for the virtualization system is presented
– an application-layer load balancer. Load balancers are an essential part of large dat-
acenter systems, allowing requests to be distributed among active servers, increasing
performance and system stability. In a virtualized cloud environment however, users
generally would not have access to hardware load balancers, especially not if their in-
coming data or requests were in a proprietary or non-HTTP protocol. Their only option
is to use a VM-based software load balancer. In addition, this type of end-user application will become increasingly important as datacenter networks become virtualized
through software defined networking (SDN), and users have full control over private in-
ternal Ethernet LANs. This case study shows how VFRs can be used to implement
such an arbitrary-protocol load balancer that can vastly outperform a software version
in terms of throughput and latency predictability.
5.2.1 Load Balancer Designs
The load balancer operates on a hypothetical protocol that runs on top of UDP. The
protocol has an identifier field in the first 16 bits of the UDP payload, which the load
balancers recognize. There are two identifiers – one designates request or data packets, coming from a client, to be distributed to servers; the other designates update packets, sent from servers to the load balancers so that servers can be added to or removed from the distribution system.
Software
The software load balancer is written in Python using the low-level Berkeley Sockets
library [57]. A list is used to track active servers to distribute request packets to, and
incoming request packets are distributed in round-robin fashion. If an update packet is
received from a server, that server’s IP address is added or removed from the distributing
list accordingly.
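The dispatch logic can be sketched as follows; the identifier values and the toggle semantics of update packets are assumptions about the hypothetical protocol:

```python
REQUEST_ID, UPDATE_ID = 0x0001, 0x0002   # hypothetical identifier values

class Balancer:
    """Round-robin dispatch keyed on a 16-bit identifier in the first
    two bytes of the UDP payload (a sketch, not the thesis code)."""

    def __init__(self):
        self.servers = []    # addresses of active servers
        self.next = 0        # round-robin cursor

    def handle(self, payload, src):
        """Return the server address to forward a request to, or None."""
        ident = int.from_bytes(payload[:2], "big")
        if ident == UPDATE_ID:               # add or remove the sender
            if src in self.servers:
                self.servers.remove(src)
            else:
                self.servers.append(src)
            return None
        if ident == REQUEST_ID and self.servers:
            dst = self.servers[self.next % len(self.servers)]
            self.next += 1
            return dst
        return None
```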
Hardware
The hardware load balancer is implemented in a similar manner. Since the VFR stream
interfaces are 256 bits wide, the first two flits of the packet are required to detect the
identifier field in the UDP payload. If an update packet is detected, the hardware stores
the server’s IP address and MAC address in a memory. For request packets, the hardware
will replace the packet’s destination IP and destination MAC fields with values read from
Table 5.2: Resource Usage for VFR Load Balancer
Resource Usage (used / VFR Total)
Flip-flop 7523 / 11376 (67%)
LUT 3594 / 11376 (31.6%)
36K BRAM 11 / 15 (74%)
the memory location corresponding to the current destination server in the round robin
schedule. The packet’s UDP checksum is then recalculated, and the rest of the packet is
buffered and sent along.
Because the interface to the VFR is a Xilinx AXI Stream, it was possible to com-
pile this hardware using Xilinx’s high level synthesis tool, Vivado HLS [58], and simply
instantiate the generated HDL within the VFR template file. The design is described
in approximately 150 lines of C code, and compiled for the virtualization system as in
Figure 3.7. Using this method, the design iteration time is reduced from days or weeks
to as little as a few hours. The resource utilization for the hardware load balancer is
shown in Table 5.2.
Many sources of frustration are also removed from the design cycle because the static
logic is already placed, routed, tested and guaranteed working by the cloud provider.
The user need not worry about complex memory interfaces, pin constraints and I/O, or
special timing considerations for high speed transceivers. Because the hardware is just
another generic cloud resource, it is possible to quickly scale up or down the number of
active load balancers according to system-level loads. This can also be done much faster
than the software version, since VFRs can be allocated much more quickly than VMs.
5.2.2 Performance Analysis
Test System and Prototype
The prototype system in the SAVI Testbed is used for all tests. Recall that the SAVI
Testbed provides an “extended cloud” consisting of Core nodes (conventional cloud com-
pute) and Edge nodes that contain additional resources such as the NetFPGA platforms
used for the prototype. All of the experiments are limited in scope to a single edge node
in the SAVI testbed.
Comparison of VM and VFR Load Balancers
An experiment is set up to compare latency and throughput of the different load bal-
ancers. The experiment uses a single load balancer, three receiving servers, and two
clients. The servers and clients are all VMs spread amongst two physical machines. Care
is taken to make sure that the two clients are on different physical machines so that they
do not share a single physical interface and interfere with one another. One client is used
to take measurements of latency, while the other is used to inject additional traffic at
prescribed rates onto the network, bound for the load balancer. A third physical machine
hosts the NetFPGA 10G platform, which in turn provides the VFRs to OpenStack as
described in the prototype. The machines are connected via the SAVI Testbed OpenFlow
network. The network physical layer is implemented via a gigabit switch, with four 10G
ports connecting the NetFPGA and VFRs. The test setup is shown in Figure 5.1.
The latency is measured as the round trip time from Client 1, through the VM
or VFR Load Balancer, to a Server, and back to Client 1, averaged over 10000 mid-
sized (760 byte) packets. The load balancer is on the forward path, but not the return.
The latency is continually measured and the number of dropped packets counted as
Client 2 is used to inject additional traffic into the load balancer. In this way it is
possible to see what level of throughput the load balancer can handle. Figures 5.2 and 5.3
Figure 5.1: Experiment setup for load balancer tests. (Clients 1 (measure) and 2 (inject), the servers, and the VM load balancer run as VMs under hypervisors on Physical Servers 1 and 2, connected to a switch over 1G links; Physical Server 3 hosts the NetFPGA providing the VFR load balancer over a 10G link.)
show the results for the VFR and VM load balancers as a function of the Injection
Rate from Client 2 – each latency measurement is an average of 10000 packets, so the
standard deviation is included as well, which gives a sense of system load and performance
predictability. Socket timeout at one second is counted as a dropped packet, and is
not counted in the latency average. Transmission continues until 10000 packets are
successfully received back. Dropped packets for the VM load balancer at each data point
are shown in Figure 5.4. The VFR load balancer had no dropped packets.
Even at 25 MB/s, the VM load balancer begins to drop a number of packets. At
45 MB/s, the VM load balancer drops a significant number of packets, and latency in-
creases and varies widely. The VFR balancer however has predictable and fairly constant
latency. Even at 100 MB/s the VFR balancer dropped no packets and maintained pre-
dictable performance. Latency in general was dominated by software in both send and
receive. Unfortunately, the test system did not allow testing of the full capabilities of the
VFR, since there were not enough physical servers available in the edge node to saturate
the 10GE links to the VFR load balancer. Clearly though, the VFR provides a significant
benefit to the cloud user – several VMs can be replaced with a single VFR, simplifying
and streamlining the user’s systems, and potentially lowering operating costs. The cloud
provider also benefits by having fewer VMs per user, which may reduce overall power
Figure 5.2: VFR load balancer latency at different throughput levels (latency in microseconds, with standard deviation, vs. injection rate in MB/s).

Figure 5.3: VM load balancer latency at different throughput levels (latency in microseconds, with standard deviation, vs. injection rate in MB/s).

Figure 5.4: Number of dropped packets for the VM load balancer (vs. injection rate in MB/s).
consumption and costs of the datacenter.
5.3 Case Study: Extending OpenFlow Capabilities
Software-Defined Networking (SDN) is a growing paradigm that sees network manage-
ment move from a decentralized approach to a centralized, software-managed system.
Network management rules are centrally defined for an entire network, and these rules
are then translated into matching and forwarding actions for the data plane. OpenFlow
is a popular realization of SDN that has also significantly affected the development of
SDN in general [59, 60].
OpenFlow operates by having a central controller specify rules that packets coming
into the network are matched against. These matches and the specific packet fields they
operate on form what are known as flows. A flow consists of a match field, a priority field,
statistics counters, timeouts, and most importantly, actions. When an incoming packet
is matched to a flow (with multiple matches the highest priority is taken), the action(s)
specified in the flow are taken on the packet. This may simply be to forward the packet
out of a particular switch port, or it may be to drop the packet, modify certain fields in
the packet header, swap certain fields or insert VLAN (Virtual Local Area Network) [61]
tags or MPLS (Multi-Protocol Label Switching) [62] shims. In current OpenFlow-enabled
switches, not all of these actions are supported in the hardware datapath – certain
actions, usually more complex ones, must be processed in switch software, which is a
much slower path. Many desirable actions are not possible at all on current OpenFlow
switches, and some are not even specified in the OpenFlow specification [63].
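As an illustrative aside, the flow abstraction described above can be modelled with a short Python sketch. This is purely for illustration: real OpenFlow matches cover many header fields, and the Flow structure here omits timeouts and most counters.

```python
from dataclasses import dataclass

@dataclass
class Flow:
    """Simplified flow entry: match criteria, priority, actions, and a counter."""
    match: dict       # header field -> required value, e.g. {"udp_dst": 4789}
    priority: int
    actions: list     # e.g. ["output:3"], ["drop"]
    packet_count: int = 0

def lookup(flow_table, headers):
    """Return the actions of the highest-priority flow matching `headers`."""
    matching = [f for f in flow_table
                if all(headers.get(k) == v for k, v in f.match.items())]
    if not matching:
        return None   # table miss: typically sent to the controller
    best = max(matching, key=lambda f: f.priority)
    best.packet_count += 1
    return best.actions

table = [
    Flow(match={"udp_dst": 4789}, priority=10, actions=["output:vfr"]),
    Flow(match={},                priority=0,  actions=["output:normal"]),
]
assert lookup(table, {"udp_dst": 4789}) == ["output:vfr"]
assert lookup(table, {"udp_dst": 53})   == ["output:normal"]
```

The empty match acts as a lowest-priority catch-all, mirroring the table-miss or default-forwarding behavior of a real switch.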
In this case study, virtualized hardware is used to extend the capabilities of OpenFlow.
By forwarding matched packets to a VFR, custom hardware can be used to implement
arbitrary actions or matches on packets at 10 Gb/s line rates. In particular this case study
will look at actions pertaining to the Virtual eXtensible Local Area Network protocol,
or VXLAN, which allows bridging of LANs over IP.
5.3.1 VXLAN
Virtual eXtensible Local Area Network (VXLAN) is a method of bridging two OSI
Layer 2 networks by encapsulating the Layer 2 packets inside UDP packets. This “virtu-
ally” connects two Layer 2 networks. The packet structure for VXLAN is straightforward.
It consists of a standard UDP packet, with a port number of 4789, and a payload con-
sisting of a VXLAN header and the encapsulated Layer 2 packet. Figure 5.5 shows the
format of a VXLAN packet. Currently, OpenFlow does not specify any actions that can
act on the encapsulated packet, or in fact anything that lies within the payload of the
encapsulating UDP packet. It would be very useful from a network management perspec-
tive to be able to manage virtually connected LANs directly in OpenFlow without having
to invest in additional infrastructure. This amounts to the ability to match and/or edit
fields within the encapsulated packet.
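The 8-byte VXLAN header layout described above can be sketched with Python's struct module. This is an illustrative encoder/decoder only; the first byte carries the flag bit indicating a valid VNID, and field names are chosen for this sketch.

```python
import struct

VXLAN_UDP_PORT = 4789

def pack_vxlan_header(vnid):
    """Build the 8-byte VXLAN header: flags(8) | reserved(24) | VNID(24) | reserved(8)."""
    assert 0 <= vnid < (1 << 24)        # the VNID is a 24-bit field
    flags = 0x08                        # "I" bit set: VNID field is valid
    return struct.pack("!II", flags << 24, vnid << 8)

def parse_vxlan_header(data):
    """Return the 24-bit VNID from the first 8 bytes of a VXLAN header."""
    _word1, word2 = struct.unpack("!II", data[:8])
    return word2 >> 8

hdr = pack_vxlan_header(0x123456)
assert len(hdr) == 8
assert parse_vxlan_header(hdr) == 0x123456
```

In an actual VXLAN packet this header sits in the payload of a UDP datagram with destination port 4789, followed immediately by the encapsulated Layer 2 frame.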
[Diagram omitted: outer L2 Ethernet, L3 IP and L4 UDP headers, followed by the 8-byte VXLAN header (8 reserved bits, 24 reserved bits, 24-bit VNID, 8 reserved bits) and the encapsulated L2 packet.]
Figure 5.5: Packet diagram for the VXLAN protocol.
5.3.2 Virtualized Hardware for New OpenFlow Capabilities
Hardware can be written and compiled for a VFR via the methods discussed in Chapters
3 and 4. For this application, the chosen enhancement is the ability to perform
matches on the fields of the encapsulated packet and either forward or drop the
packet. This amounts to a custom, OpenFlow-controlled in-network firewall for VXLAN
tunneled networks. The end result will be that a user will be able to control forwarding
in a VXLAN-bridged Layer 2 network using nothing but OpenFlow. Theoretically of
course, the VFR architecture allows access to the entire packet, and therefore any field
could potentially be parsed, modified or inserted.
The VFR used for this case study is designed to be capable of matching transport layer
(i.e. TCP, UDP) port numbers inside VXLAN encapsulated packets, which will allow it
to block certain communication protocols. The hardware is implemented in Verilog, and
consists of a packet datapath, a programmable drop port register, and a simple control
circuit. Control packets are defined as having the Ethernet Type field equal to 0x88B3.
If the hardware detects a control packet in the datapath, it will extract the first 16 bits
of the payload, and place it in the drop port register. The control packet is then dropped.
Any other packet will be parsed to detect destination UDP port 4789, signifying that
the packet is a VXLAN packet. If this port is detected, the hardware then tests whether
the source or destination transport-layer port of the encapsulated packet matches the one currently
stored in the drop port register. If it matches, the packet is dropped. If not, the packet
is simply forwarded. Using the control packets, the user can change what port numbers
the hardware matches. The hardware module also contains a FIFO buffer to help the
design meet timing. The design introduces no pipeline stalls and can run at full pipeline
throughput. To avoid broadcast loops, any broadcast Ethernet packets entering the VFR
are dropped.
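The behavior just described can be summarized in a small Python model. This is a behavioral sketch of the Verilog design, not the hardware itself, and the dict field names (eth_type, udp_dst, inner, and so on) are invented for illustration.

```python
CONTROL_ETHERTYPE = 0x88B3   # EtherType marking a control packet
VXLAN_UDP_PORT = 4789
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"

class VfrPortFirewall:
    """Behavioral model of the VFR VXLAN port firewall."""
    def __init__(self):
        self.drop_port = None            # the programmable drop-port register

    def process(self, pkt):
        """`pkt` is a dict of pre-parsed headers; returns 'forward' or 'drop'."""
        if pkt.get("eth_dst") == BROADCAST_MAC:
            return "drop"                # broadcast frames dropped to avoid loops
        if pkt.get("eth_type") == CONTROL_ETHERTYPE:
            # First 16 bits of the payload reprogram the drop-port register;
            # the control packet itself is then dropped.
            self.drop_port = pkt["payload_first16"]
            return "drop"
        if pkt.get("udp_dst") != VXLAN_UDP_PORT:
            return "drop"                # only VXLAN traffic is expected here
        inner = pkt.get("inner", {})     # the encapsulated packet's headers
        if self.drop_port is not None and \
           self.drop_port in (inner.get("tp_src"), inner.get("tp_dst")):
            return "drop"                # blocked transport-layer port
        return "forward"

fw = VfrPortFirewall()
fw.process({"eth_type": CONTROL_ETHERTYPE, "payload_first16": 5001})
assert fw.drop_port == 5001
assert fw.process({"udp_dst": 4789, "inner": {"tp_src": 5001}}) == "drop"
assert fw.process({"udp_dst": 4789, "inner": {"tp_src": 80, "tp_dst": 443}}) == "forward"
```

Unlike this sequential model, the hardware performs these checks in a pipeline, so the decision adds no per-packet stall.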
All that remains is the question of how the VXLAN packets get to the VFR in the
first place. This is achieved by programming several regular flows into the network such
that all switches that receive packets with a destination UDP port of 4789 will forward
the packet to the VFR. To avoid a forwarding loop, additional flows are added such that
if the VXLAN packet ingress port corresponds to the VFR hardware, it is sent on to its
original destination. In the SAVI testbed, physical topologies can be queried by the user,
and it is possible to determine exactly what physical or virtual switches lie on the path
between VXLAN-tunnelled machines and the VFR. This allows the aforementioned flows
to be set manually. Also recall that the SAVI testbed has a fully virtualized network that
allows users to allocate a private slice of the network with their own OpenFlow controller.
In such an environment, the flows necessary for routing VXLAN tunnels through the VFR
can be set up programmatically via the controller. Figure 5.6 depicts the structure of the
VXLAN port firewall. Incoming packets are buffered, and if a packet is a VXLAN packet
(i.e. the destination UDP port is equal to 4789) and the encapsulated packet contains a
transport layer port (that is, TCP or UDP) equal to the Drop Port register, the packet is
dropped. If the packet is not a VXLAN packet, it is also dropped. Otherwise, it is simply
forwarded out of the VFR, and regular flows move the packet to its original destination,
avoiding a loop.
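The flows described above can be sketched as data. The match fields follow the text; the switch port numbers are hypothetical, since the actual ports depend on the testbed topology.

```python
VFR_PORT = 3          # hypothetical switch port leading to the VFR
VM_PORTS = [1, 2]     # hypothetical ingress ports from the two VMs

flows = []
# Redirect VXLAN packets (UDP destination port 4789) arriving from the VMs
# to the VFR for filtering.
for p in VM_PORTS:
    flows.append({"match": {"udp_dst": 4789, "in_port": p},
                  "priority": 20, "actions": ["output:%d" % VFR_PORT]})
# Packets returning *from* the VFR resume normal forwarding to their
# original destination, which breaks the forwarding loop.
flows.append({"match": {"udp_dst": 4789, "in_port": VFR_PORT},
              "priority": 20, "actions": ["normal"]})
```

Matching on the ingress port is what distinguishes "on the way to the VFR" from "returning from the VFR", so the same UDP port match can safely appear in every rule.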
[Flowchart omitted: packets from the VM1–VM2 VXLAN tunnel enter the VFR; non-VXLAN packets are dropped, VXLAN packets whose encapsulated transport port matches the drop register are dropped, and all others are forwarded.]
Figure 5.6: VXLAN port firewall.
5.3.3 Performance Analysis
The hardware for the VXLAN port firewall was written in Verilog and compiled using
the procedure outlined in Section 3.5. This uses Xilinx ISE 13.4 to synthesize a netlist,
and Xilinx PlanAhead to perform the PR compilation. Resource usage for the hardware
is shown in Table 5.3. At only 12.7% of available LUTs (and 15.6% of flip-flops) in a
VFR, the hardware is lean and leaves room for more features, such as matching multiple ports using a CAM, or
matching and blocking based on other encapsulated packet fields.
Table 5.3: Resource Usage for VFR VXLAN Port Firewall
Resource Usage (used / VFR Total)
Flip-flop 1778 / 11376 (15.6%)
LUT 1446 / 11376 (12.7%)
36K BRAM 0 / 15 (0%)
Experimental Setup
The port firewall VFR is booted using OpenStack along with two VMs. OpenVSwitch
(OVS), a virtual OpenFlow switch, is installed on both VMs and used to set up a VXLAN
tunnel between the two VMs. Note that the two VMs are booted on separate
physical machines and separated by a physical OpenFlow switch. The two VMs are
connected through the physical OpenFlow switch to the VFR port firewall. Flows are
installed in the virtual switches and the physical switch to redirect any VXLAN packet to
the VFR hardware, so it can perform its intended function. It would be preferable
to use a custom OpenFlow controller to accomplish this rather than manually installing
the flows, but at the time of writing this thesis, private OpenFlow networks were not
fully compatible with heterogeneous resources in the SAVI testbed. Therefore, the flows
are manually installed.
Results
After the flows are installed, VXLAN encapsulated traffic between the two VMs operating
on a given transport layer port is successfully blocked. The port can be changed at any
time by sending a control packet with a different port number to the VFR.
The iperf tool is used to measure throughput from one VM to another, and the
ping tool is used to measure latency. First, a baseline measurement is taken with
no additional flows installed, to see what the throughput and latency are between
the VMs without the VFR in the network path (Figure 5.7a). iperf is run five times
and an average is taken, while ping is run until 20 pings are completed. These tests are
run both with and without VXLAN tunnelling. Results are shown under No Tunnel and
VXLAN Tunnel in Table 5.4.
Without tunnelling the throughput is near line rate (1 Gb/s) as expected. When
tunnelling using VXLAN, throughput takes a large performance hit. Normally this is not
the case, but the Ubuntu operating systems used in the machines to boot the VMs have
a dynamic CPU frequency scaling system to save power – the lower clock frequencies
significantly affect the performance of the software switch (OpenVSwitch), since the
VXLAN encapsulation and decapsulation is done inside OVS. The core frequency was
[Diagrams omitted: (a) VM1 and VM2 connected via OVS through the switch with a direct VXLAN tunnel; (b) VXLAN traffic rerouted through the bare-metal server (BM); (c) VXLAN traffic rerouted through the VFR.]
Figure 5.7: Experimental setups for VFR-based VXLAN firewall.
limited by Ubuntu to 1.2 GHz, which was still enough to saturate the 1 Gb/s link when
not running a tunnel, but the processor was not able to maintain this bandwidth when
running the VXLAN tunnel. This effect was verified by disabling the CPU frequency
scaling and observing a rise in throughput over the VXLAN tunnel; however, the CPU
scaling was left on for the experiments to maintain default settings across all VMs in the
system.
Table 5.4: Throughput and Latency for VXLAN Port Firewall
Throughput Latency
No Tunnel 941 Mb/s 0.465 ms
VXLAN Tunnel 517.4 Mb/s 0.532 ms
VXLAN through VFR 513.2 Mb/s 0.600 ms
VXLAN through BM 480.4 Mb/s 0.801 ms
Now that a baseline for comparison is established, the flows are installed to reroute
VXLAN traffic to the running VFR and iperf is run again to determine the overhead of
rerouting the VXLAN traffic through the VFR port firewall, shown in Figure 5.7c. An
average of five runs gives a throughput of 513.2 Mb/s, slightly less than the 517.4 Mb/s
achieved without the VFR in the network path. This is not a large difference and it can
be concluded that this technique introduces little to no overhead in terms of throughput.
The ping test is also run again, and results show that rerouting to the VFR introduces a
small increase in latency (∼ 12%), but this is expected when adding an additional hop.
These results are also summarized in Table 5.4 under VXLAN through VFR.
Lastly, it is also useful to see how the VFR compares in performance to a software
version executing the same task. A software VXLAN port firewall is programmed in C,
based on the packet capture library libpcap. Normally it would be best to run the software
on a VM in the cloud as VMs are the usual processing resource available. However,
many VM network interfaces, including those in the current SAVI testbed, do not have
the ability to put an interface into promiscuous mode. In promiscuous mode, a network
interface passes along all detected packets, even those not addressed to the receiving
interface. To implement a pass-through firewall, a promiscuous interface is required.
Therefore, for the sake of the experiment, a non-standard bare-metal server is used to
run the software on which the network interface can be configured into promiscuous
mode.
The packet flows installed on the switch are slightly modified to enable the network
path shown in Figure 5.7b, where VXLAN packets are sent to the bare-metal server (BM)
for software processing rather than the VFR hardware. The iperf and ping tests are
run again, and the results are also shown in Table 5.4 under VXLAN through BM for
comparison.
The software introduces a rather large increase in latency compared with both the
VFR and the direct path from one VM to another. This is expected since each packet
must travel up through several software layers before being processed. That being said,
network stacks in modern servers are very efficient, and this is evident in the throughput
of 480.4 Mb/s, only a 6.6% decrease from the VFR. The main reason for this good
performance, however, is that the application is relatively simple. For a single port check,
the software only needs to advance a pointer and make a comparison (fewer than a few
tens of instructions) before either sending or dropping the packet. More complex
operations, such as an IP address Longest Prefix Match or multiple exact matches, would
likely widen the performance gap since they can be fully pipelined in hardware. Such
enhancements are left to future work.
Summary
This case study has shown that VFRs can be used in a straightforward manner to enhance
the capabilities of an OpenFlow network, allowing custom, in-network matching and
actions to be added to the system with little to no degradation in throughput. This
comes at the cost of slightly increased latency. It has also been shown that even for
a simple application, a VFR will outperform a software implementation. VFRs thus
provide a promising path toward fully custom complex network configurations, with all
data plane operations done entirely inside hardware on high-speed paths.
Chapter 6
Conclusion
This thesis has presented a hardware and software architecture that represents a first
attempt at integrating FPGA hardware acceleration as a first-class citizen into cloud
computing infrastructures. This work shows that FPGA hardware acceleration can be
virtualized and feasibly integrated with existing infrastructure-as-a-service cloud man-
agement software. It also shows that FPGA Partial Reconfiguration (PR) can be used to
virtualize an FPGA device with very little performance overhead in terms of throughput
and latency when considering network-connected accelerators. The architecture pre-
sented shows that this style of virtualization can be very accessible to the end user as
well, trading off potential application scope for a template-based automated compila-
tion system that effectively removes many common design complexities. End users also
benefit from the fact that the very same management commands for traditional cloud
resources are used to manage the new heterogeneous resources.
6.1 Future Work
As stated, this work is merely a first attempt at the integration of FPGAs into cloud
computing. There are many open avenues of exploration, both for the general concept
and for the particular architecture presented in this thesis. This section on Future Work
will discuss some of these possible avenues, focusing mainly on the work presented in this
thesis.
6.1.1 Architectural Enhancements
While the Agent is fairly generic and not performance critical, there are areas in which
the FPGA static hardware virtualization layer could be improved. VFRs are limited
in size, and many applications may need more than one to fully implement a desired
function. The performance of such a system may be heavily dependent on where each
constituent VFR is placed physically by the scheduling algorithm. A commercial system
could consist of hundreds or thousands of FPGA systems spread across one or possibly
several datacenters. VFRs that are part of the same system could take a severe latency
penalty if placed very far away from each other. This leads to two related future work
items:
1. VFR-aware resource scheduling: modify the scheduling algorithm to recognize that
VFRs that are part of the same system (or belonging to the same user) should be
placed as close as possible to each other physically. Ideally, they should be placed
in the same physical device. Even then, the current implementation is suboptimal:
transferring packets between VFRs still requires the packet to be switched externally
to the FPGA.
2. Static Logic Switching Fabric: Related to the limitation noted in the first item, a
switching fabric in the static logic would allow packets addressed
from one VFR to another in the same device to be redirected without exiting the
device. VFRs could be chained together with extremely low-latency paths, leading
to higher performance for larger, multi-VFR systems.
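A VFR-aware scheduler of the kind proposed in the first item could be sketched as a simple greedy placement policy. This is a future-work sketch under stated assumptions, not the current implementation; the data structure and the co-location heuristic are invented for illustration.

```python
def place_vfr(devices, user):
    """Greedy VFR-aware placement: prefer an FPGA device that already hosts
    VFRs belonging to the same user, falling back to the least-loaded device.
    `devices` maps device id -> {"capacity": int, "vfrs": [owner ids]}."""
    candidates = [d for d, info in devices.items()
                  if len(info["vfrs"]) < info["capacity"]]
    if not candidates:
        raise RuntimeError("no free VFR slots")
    # Co-locate with the user's existing VFRs when possible.
    same_user = [d for d in candidates if user in devices[d]["vfrs"]]
    pool = same_user or candidates
    best = min(pool, key=lambda d: len(devices[d]["vfrs"]))
    devices[best]["vfrs"].append(user)
    return best

devices = {
    "fpga0": {"capacity": 4, "vfrs": ["alice"]},
    "fpga1": {"capacity": 4, "vfrs": []},
}
assert place_vfr(devices, "alice") == "fpga0"  # co-located with existing VFR
assert place_vfr(devices, "bob") == "fpga1"    # otherwise least loaded
```

A production scheduler would also weigh rack-level and datacenter-level distance rather than only same-device co-location.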
Currently, the virtualization architecture lacks a dedicated promiscuous interface
function. A promiscuous interface is one that allows any and all packets detected on
the wire into the device, regardless of whether they are addressed to the device. For ex-
ample, all ports on a regular Ethernet switch would be promiscuous. This means it is
difficult for a user to employ a VFR for generic switching and processing, because
the static logic only allows addressed packets into a VFR. A useful enhancement would
be a mechanism that allows VFRs to act in a promiscuous mode, while still enabling the
provider to enforce security.
Another consideration, especially from the cloud provider's point of view, is what
assumptions can be made about user-designed hardware. A commercial provider
would likely make very few assumptions – user-designed hardware may be bug-ridden,
faulty, or have any number of other problems. Error detection and fault tolerance in the
face of shoddy or unpredictable user hardware may very well become a top priority. For
example, if one VFR begins to make transfers incorrectly on the stream interface, the
entire static logic may be affected by this, at the very worst locking up and forcing an
entire chip reconfiguration. A poorly implemented VFR may also transmit packets below
the network minimum packet size (e.g. below 64 bytes for Ethernet networks), which
could result in network problems for the provider.
Because of these problems, much research in fault-tolerant embedded systems could
likely be applied to the static logic to improve its commercial viability.
Additionally, the memory system connecting the VFRs to off-chip DRAM is currently
simple and inefficient in its partitioning scheme. Future work may address this problem
by replacing the partitioning scheme with a full multi-port memory controller or similar
technology to more efficiently share the off-chip DRAM.
6.1.2 Failures and Migration
As briefly discussed in the previous subsection, VFRs may occasionally fail due to a
variety of circumstances such as faulty user hardware, a provider static logic bug or even
network faults. In the case of a Virtual Machine, the hypervisor always retains the ability
to kill or reboot a faulty virtual machine; however, this feature does not translate well to
VFRs, since they are so tightly integrated with the static logic that performs hypervisor-
like functions. Future work would devise methods of maintaining a stricter separation,
again possibly drawing on fault-tolerant embedded systems research, to allow the static
logic and/or Agent to deal with faults in ways that do not involve bringing down the
entire system.
Another key capability of Virtual Machines that future work would seek to replicate
for VFRs is migration. Migration involves saving the state of a running VM, transferring
it to another hypervisor, and then continuing execution in the new location, all done
transparently to the user. This problem is more difficult for FPGAs because they are
not designed with this feature in mind, and the vendor tools do not support moving a
PRM between different PRRs. This would require accessing a running circuit’s state and
FPGA configuration bits, saving them, and restoring them on another FPGA device.
6.1.3 Further Heterogeneity
In the future, a provider would likely have multiple flavors of VFRs, each with different
sizes. Larger VFRs would provide more LUTs for more complex circuits, but would
cost more for end users. This is analogous to how one can allocate VMs with different
amounts of RAM and virtual CPUs, and different operating systems.
New FPGA devices that combine ARM CPUs with traditional reconfigurable fabric
have also recently entered the market [64, 65]. Future work could investigate how these
devices could be virtualized as resources that pair an ARM-based CPU or VM with
a slice of closely-coupled reconfigurable fabric.
6.1.4 Applications
The application case studies in this thesis have offered only a taste of what may be
possible with this type of system. Multi-VFR applications have yet to be investigated
and analyzed.
The concept of extending available OpenFlow capabilities may also be an area of
potential future work. There are other tunnelling protocols similar to VXLAN that
could be added to OpenFlow (e.g. GRE tunnelling), and there are many other higher
layer protocols that OpenFlow does not consider. Although these higher layer protocols
are complicated because of the large amount of state information required, there is still
potential for implementation via VFRs.
The current VXLAN port firewall application is also an area of future work. The
VFR resource usage was low, and this leaves room for additional functionality, such
as matching and blocking based on multiple ports, encapsulated Layer 2 and Layer 3
headers, or another part of the packet payload. Furthermore, the experiments revealed
a significant loss in performance due to OVS having to encapsulate and decapsulate
packets in software. This function could also be moved into hardware via a VFR, and
the encapsulation and decapsulation done at line rate, incurring no losses in performance.
VFRs are also well suited to the many streaming protocols that exist. Real-time
Transport Protocol packet streams could be redirected to VFRs for hardware accelerated
processing or custom switching and routing. RTP is used for many applications like
telephony, video streaming and Voice over IP (VoIP). Other applications may include
Event Stream processing or financial data stream processing.
6.1.5 FPGA CAD as a Service
Recall that the user must compile their hardware for VFRs in a way that uses pre-placed
and routed static logic. In the implementation discussed in this thesis, this is done using
an automated script-based compile system. Currently a user must have access to the
entire project, but this is not feasible in a production environment. Ideally, the user
would submit their Verilog module to a cloud-based compile service that would analyze
and compile their hardware to run on the cloud provider's systems. CAD as a service
is itself an interesting topic for future work, to see how it could be useful for traditional
hardware design as well as for cloud-based hardware as described in this thesis.
6.1.6 Complementary Studies
Overhead Area Trade-off Study
In the future a study could be performed to determine how the overhead of the static
logic (in terms of FPGA area) scales according to how many VFRs are implemented on
a physical chip. Larger device sizes could also be examined. Ideally, an optimal point
would be discovered where the lowest static logic overhead gives the highest percentage
of chip area to VFRs.
Deployment Costs Analysis
In this future study, a full, commercial-scale deployment of a VFR system would be
examined. The primary focus would be on costs and considerations from a cloud provider
perspective. Approximate values for how many FPGAs and VFRs of different flavors
could fit in one standard rack would be determined, as well as how much they would cost
– this will depend on the system architecture, whether or not there are failsafe or
redundant devices, and the list prices of FPGA chips and boards, which are subject
to volume and availability. Approximate power consumption and other operating costs
would be compared to Virtual Machine deployments. The result of the study would show
how virtualized FPGAs can provide benefits to providers by giving more computation
per unit cost than Virtual Machines alone.
With regard to power consumption and performance per Watt, recent work involving
large-scale datacenter systems using FPGAs has shown that for a small (10%) increase in
power, computation performance can be doubled [66]. These are promising results that
can hopefully be replicated or improved with virtualized reconfigurable hardware.
Bibliography
[1] Gartner. Gartner Says Worldwide Public Cloud Services Market to Total $131
Billion. Gartner, Inc., 2013.
[2] Michael Armbrust, Armando Fox, Rean Griffith, A. Joseph, R. Katz, A. Konwinski,
G. Lee, D. Patterson, A. Rabkin, I. Stoica, and Matei Zaharia. Above the Clouds: A
Berkeley View of Cloud Computing. Dept. Electrical Eng. and Computer Sciences,
University of California, Berkeley, Rep. UCB/EECS, 28, 2009.
[3] Daniele Catteddu. Cloud Computing: Benefits, Risks and Recommendations for
Information Security. Springer, 2010.
[4] Maire McLoone and John V. McCanny. High Performance Single-Chip FPGA Rijndael
Algorithm Implementations. In Cryptographic Hardware and Embedded Systems
(CHES 2001), pages 65–76. Springer, 2001.
[5] S. Rigler, W. Bishop, and A. Kennings. FPGA-Based Lossless Data Compression Using
Huffman and LZ77 Algorithms. In Canadian Conference on Electrical and Computer
Engineering, pages 1235–1238, April 2007.
[6] F. Braun, J. Lockwood, and M. Waldvogel. Protocol Wrappers for Layered Network
Packet Processing in Reconfigurable Hardware. IEEE Micro, 22(1):66–74, Jan 2002.
[7] Sai Rahul Chalamalasetti, Kevin Lim, Mitch Wright, Alvin AuYoung, Parthasarathy
Ranganathan, and Martin Margala. An FPGA Memcached Appliance. In Proceed-
68
Bibliography 69
ings of the ACM/SIGDA International Symposium on Field Programmable Gate
Arrays, pages 245–254. ACM, 2013.
[8] Joon-Myung Kang, H. Bannazadeh, H. Rahimi, T. Lin, M. Faraji, and A. Leon-
Garcia. Software-Defined Infrastructure and the Future Central Office. In IEEE In-
ternational Conference on Communications Workshops (ICC), pages 225–229, 2013.
[9] Joon-Myung Kang, Hadi Bannazadeh, and Alberto Leon-Garcia. SAVI Testbed:
Control and Management of Converged Virtual ICT Resources. In IFIP/IEEE In-
ternational Symposium on Integrated Network Management, pages 664–667. IEEE,
2013.
[10] Ian Kuon, Russell Tessier, and Jonathan Rose. FPGA Architecture: Survey and
Challenges. Foundations and Trends in Electronic Design Automation, 2(2):135–
253, February 2008.
[11] IEEE Standard for Verilog Hardware Description Language. IEEE Std 1364-2005,
pages 1–560, 2006.
[12] IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008, pages
c1–626, Jan 2009.
[13] Rishiyur Nikhil. Bluespec System Verilog: Efficient, Correct RTL from High Level
Specifications. In Proceedings of the Second ACM and IEEE International Confer-
ence on Formal Methods and Models for Co-Design, pages 69–70, June 2004.
[14] Robert M. Metcalfe and David R. Boggs. Ethernet: Distributed Packet Switching
for Local Computer Networks. Communications of the ACM, 19(7):395–404, July
1976.
[15] Don Anderson, Tom Shanley, and Ravi Budruk. PCI Express System Architecture.
Addison-Wesley Professional, 2004.
[16] Peter Mell and Timothy Grance. The NIST Definition of Cloud Computing (draft).
NIST Special Publication, 800(145):7, 2011.
[17] Amazon Web Services Inc. Amazon Web Services (AWS) – Cloud Computing Ser-
vices. http://aws.amazon.com, 2014.
[18] GENI. Global Environment for Networking Innovations (GENI) Project. http:
//geni.net/, 2014.
[19] Emulab. Emulab – Network Emulation Testbed. http://emulab.net/, 2014.
[20] PlanetLab. PlanetLab – An Open Platform for Developing, Deploying, and Accessing
Planetary-Scale Services. http://planet-lab.org/, 2014.
[21] Internet2. http://www.internet2.edu/, 2014.
[22] ORION. Ontario Research and Innovation Optical Network. http://www.orion.
on.ca/, 2014.
[23] CANARIE. Canada’s Advanced Research and Innovation Network. http://www.
canarie.ca/, 2014.
[24] OpenStack. http://www.openstack.org/, 2013.
[25] R.T. Fielding. REST: Architectural Styles and the Design of Network-Based Software
Architectures. PhD thesis, University of California, Irvine, 2000.
[26] OpenStack. Nova Developer Documentation. http://nova.openstack.org, 2014.
[27] OpenStack. Keystone Developer Documentation. http://keystone.openstack.
org, 2014.
[28] OpenStack. Glance Developer Documentation. http://glance.openstack.org,
2014.
[29] OpenStack. Quantum Developer Documentation. http://quantum.openstack.
org, 2014.
[30] OpenStack. Swift Developer Documentation. http://swift.openstack.org, 2014.
[31] OpenStack. Cinder Developer Documentation. http://cinder.openstack.org,
2014.
[32] Ryu. Ryu SDN Framework. http://osrg.github.io/ryu/, 2014.
[33] Rob Sherwood, Glen Gibb, Kok-Kiong Yap, Guido Appenzeller, Martin Casado,
Nick McKeown, and Guru Parulkar. Flowvisor: A Network Virtualization Layer.
OpenFlow Switch Consortium, Technical Reports, 2009.
[34] Joon-Myung Kang, T. Lin, H. Bannazadeh, and A. Leon-Garcia. Software-Defined
Infrastructure and the SAVI Testbed. In TRIDENTCOM 2014, 2014.
[35] K. Redmond, H. Bannazadeh, P. Chow, and A. Leon-Garcia. Development of a
Virtualized Application Networking Infrastructure Node. In IEEE GLOBECOM
Workshops, pages 1–6, 2009.
[36] Steven Trimberger, Dean Carberry, Anders Johnson, and Jennifer Wong. A Time-
Multiplexed FPGA. In Proceedings of the 5th Annual IEEE Symposium on FPGAs
for Custom Computing Machines, pages 22–28. IEEE, 1997.
[37] D. Unnikrishnan, R. Vadlamani, Yong Liao, J. Crenne, Lixin Gao, and R. Tessier.
Reconfigurable Data Planes for Scalable Network Virtualization. IEEE Transactions
on Computers, 62(12):2476–2488, 2013.
[38] Esam El-Araby, Ivan Gonzalez, and Tarek El-Ghazawi. Virtualizing and Sharing
Reconfigurable Resources in High-Performance Reconfigurable Computing Systems.
In Second International Workshop on High-Performance Reconfigurable Computing
Technology and Applications, pages 1–8. IEEE, 2008.
[39] Ivan Gonzalez, Sergio Lopez-Buedo, Gustavo Sutter, Diego Sanchez-Roman, Fran-
cisco J. Gomez-Arribas, and Javier Aracil. Virtualization of Reconfigurable Copro-
cessors in HPRC Systems with Multicore Architecture. Journal of Systems Archi-
tecture, 58(67):247 – 256, 2012.
[40] C. Steiger, H. Walder, and M. Platzner. Operating Systems for Reconfigurable
Embedded Platforms: Online Scheduling of Real-Time Tasks. IEEE Transactions
on Computers, 53(11):1393–1407, 2004.
[41] K. Rupnow. Operating System Management of Reconfigurable Hardware Computing
Systems. In International Conference on Field-Programmable Technology, pages
477–478, 2009.
[42] Chun-Hsian Huang and Pao-Ann Hsiung. Virtualizable Hardware/Software Design
Infrastructure for Dynamically Partially Reconfigurable Systems. ACM Transactions
on Reconfigurable Technology and Systems (TRETS), 6(2):11, 2013.
[43] Khoa Dang Pham, A.K. Jain, Jin Cui, S.A. Fahmy, and D.L. Maskell. Microkernel
Hypervisor for a Hybrid ARM-FPGA Platform. In 24th International Conference
on Application-Specific Systems, Architectures and Processors, pages 219–226, 2013.
[44] C. Chang, J. Wawrzynek, and R.W. Brodersen. BEE2: A High-End Reconfigurable
Computing System. Design Test of Computers, IEEE, 22(2):114–125, 2005.
[45] NetFPGA. NetFPGA 10G. http://netfpga.org/, 2014.
[46] Terasic Technologies Inc. DE5Net. http://de5-net.terasic.com/, 2013.
[47] BEECube Inc. miniBEE - Research in a Box. http://www.beecube.com/products/
miniBEE.asp, 2014.
[48] Xilinx. Xilinx Partial Reconfiguration User Guide v12.3. http://www.xilinx.com/
support/documentation/sw_manuals/xilinx12_3/ug702.pdf, 2010.
[49] Altera. Partial Reconfiguration Megafunction. http://www.altera.com/
literature/ug/ug_partrecon.pdf, 2013.
[50] Xilinx. Vivado Design Suite User Guide - Partial Reconfiguration.
http://www.xilinx.com/support/documentation/sw_manuals/xilinx2014_
2/ug909-vivado-partial-reconfiguration.pdf, 2014.
[51] Xilinx. Virtex 5 User Guide. http://www.xilinx.com/support/documentation/
user_guides/ug190.pdf, 2012.
[52] NetFPGA. NetFPGA 10G Open Source Hardware. https://github.com/NetFPGA/
NetFPGA-public, 2014.
[53] Kyle Locke. Xilinx Parametrizable Content-Addressable Memory. http:
//www.xilinx.com/support/documentation/application_notes/xapp1151_
Param_CAM.pdf, 2011.
[54] Xilinx. UG761 Xilinx AXI Reference Guide v14.3. http://www.xilinx.com/
support/documentation/ip_documentation/axi_ref_guide/latest/ug761_
axi_reference_guide.pdf, 2012.
[55] Jianhua Che, Yong Yu, Congcong Shi, and Weimin Lin. A Synthetical Performance
Evaluation of OpenVZ, Xen and KVM. In Services Computing Conference (AP-
SCC), 2010 IEEE Asia-Pacific, pages 587–594, Dec 2010.
[56] The Xen Project. Baremetal vs. Xen. vs. KVM – Redux. http://blog.xen.org/
index.php/2011/11/29/baremetal-vs-xen-vs-kvm-redux/, 2013.
[57] Python. Socket - Low-level Networking Interface - Python Documentation. http:
//docs.python.org/2/library/socket.html, 2013.
[58] Xilinx. Vivado High-Level Synthesis. http://www.xilinx.com/products/
design-tools/vivado/integration/esl-design/, 2014.
[59] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson,
Jennifer Rexford, Scott Shenker, and Jonathan Turner. OpenFlow: Enabling Innovation
in Campus Networks. ACM SIGCOMM Computer Communication Review,
38(2):69–74, 2008.
[60] Scott Shenker, M Casado, T Koponen, and N McKeown. The Future of Networking,
and the Past of Protocols. Open Networking Summit, 2011.
[61] P.J. Frantz and G.O. Thompson. VLAN Frame Format, September 1999. US Patent
5,959,990.
[62] Bruce Davie and Yakov Rekhter. MPLS: Technology and Applications. Morgan
Kaufmann Publishers Inc., 2000.
[63] OpenFlow Specification. http://www.opennetworking.org/sdn-resources/
onf-specifications/openflow, 2013.
[64] Xilinx. Xilinx All Programmable SoC. http://www.xilinx.com/products/
silicon-devices/soc/index.htm, 2014.
[65] Altera Corporation. Altera SoC Overview. http://www.altera.com/devices/
processor/soc-fpga/overview/proc-soc-fpga.html, 2014.
[66] Andrew Putnam et al. A Reconfigurable Fabric for Accelerating Large-Scale Dat-
acenter Services. In The 41st International Symposium on Computer Architecture.
IEEE, 2014.