Virtualizing FPGAs for Cloud Computing Applications
by
Stuart A. Byma
A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2014 by Stuart A. Byma
Abstract
Virtualizing FPGAs for Cloud Computing Applications
Stuart A. Byma
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2014
Cloud computing has become a multi-billion dollar industry, and represents a computing
paradigm where all resources are virtualized, flexible and scalable. Field Programmable
Gate Arrays (FPGAs) have the potential to accelerate many cloud-based applications,
but are not yet available as cloud resources because they are so different from the
conventional microprocessors that virtual machines (VMs) are based on. This thesis
presents a first attempt at virtualizing and integrating FPGAs into cloud computing sys-
tems, making them available as generic cloud resources to end users. A novel architecture
enabling this integration is presented and explored, and several custom hardware
applications are evaluated on a prototype system. These applications show that Virtualized
FPGA Resources can significantly outperform VMs in certain classes of common cloud
computing applications, demonstrating the potential to increase user compute power while
reducing datacenter power consumption and operating costs.
Dedication
To Jennifer.
Acknowledgements
First I must sincerely thank my advisors Professors Greg Steffan and Paul Chow. I owe
my success in graduate school to them and their invaluable guidance and advice. I could
not have asked for better mentorship throughout my Master's research, or in this chapter
of my life. Thank you both.
Also to my esteemed colleagues and office mates Xander Chin, Charles Lo, Ruedi
Willenberg, Robert Heße, Fernando Martin Del Campo, Andrew Shorten, Jimmy Lin
and others: You have made my time at the U of T a true pleasure – thank you for all
the good times, and for being an ever present sounding board for thoughts and ideas.
A special thanks as well to members of the SAVI testbed: Professor Alberto Leon-
Garcia, Hadi Bannazadeh, Thomas Lin and Hesam Rahimi. Your help and advice,
technical and otherwise, have made the work presented here possible.
Finally and most importantly, an everlasting thanks to my wife. Thank you for
encouraging me to pursue my passions, and thank you for your unfaltering belief in me.
None of this would have happened without you.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 4
2.1 Field-Programmable Gate Arrays . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 The Smart Applications on Virtual Infrastructure Network . . . . . . . . 7
2.3.1 The Smart Edge Node . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Heterogeneous Resources in the SAVI Testbed . . . . . . . . . . . 10
2.4 FPGA Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Architecture for FPGA Virtualization 14
3.1 OpenStack Resource Management . . . . . . . . . . . . . . . . . . . . . . 14
3.1.1 SAVI Testbed FPGA Resources . . . . . . . . . . . . . . . . . . . 16
3.1.2 Requirements for FPGA Virtualization in OpenStack . . . . . . . 18
3.2 Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Fully Virtualized Hardware . . . . . . . . . . . . . . . . . . . . . 19
3.2.2 FPGA Partial Reconfiguration . . . . . . . . . . . . . . . . . . . . 20
3.2.3 Virtualization via PR . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.4 Static Logic Design . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Agent Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Booting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.2 Deleting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Booting VFRs in OpenStack . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Compiling Custom Hardware . . . . . . . . . . . . . . . . . . . . . . . . 31
4 SAVI Testbed Prototype 35
4.1 FPGA Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Agent Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.1 statusTable and Associated Objects . . . . . . . . . . . . . . . . 38
4.2.2 Initialization and Operation . . . . . . . . . . . . . . . . . . . . . 42
5 Platform Evaluation and Application Case Studies 44
5.1 Platform Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 Case Study: Load Balancer . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2.1 Load Balancer Designs . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.2 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3 Case Study: Extending OpenFlow Capabilities . . . . . . . . . . . . . . . 53
5.3.1 VXLAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3.2 Virtualized Hardware for New OpenFlow Capabilities . . . . . . . 55
5.3.3 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 57
6 Conclusion 62
6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.1.1 Architectural Enhancements . . . . . . . . . . . . . . . . . . . . . 63
6.1.2 Failures and Migration . . . . . . . . . . . . . . . . . . . . . . . . 64
6.1.3 Further Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . 65
6.1.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.1.5 FPGA CAD as a Service . . . . . . . . . . . . . . . . . . . . . . . 66
6.1.6 Complementary Studies . . . . . . . . . . . . . . . . . . . . . . . 67
Bibliography 68
Acronyms
BRAM Block Random Access Memory. 27
CAD Computer Aided Design. 16, 24
CAM Content Addressable Memory. 17–19
DRAM Dynamic Random Access Memory. 19, 20, 26
FIFO First In First Out. 20
FPGA Field Programmable Gate Array. 1–4, 9–17, 19–22, 24
GPIO General-Purpose Input-Output. 18, 19, 27
LUT Look-up Table. 27
MAC Media Access Control. 17–20, 22
PR Partial Reconfiguration. 9, 15, 16, 19, 22, 24
PRM Partially Reconfigurable Module. 16, 17, 24, 30
PRR Partially Reconfigurable Region. 15–17, 22, 24, 25, 27–30
SAVI Smart Applications on Virtual Infrastructure. 5–9
TCP Transmission Control Protocol. 21
UART Universal Asynchronous Receiver-Transmitter. 18
UUID Universal Unique Identifier. 21–23
VFR Virtualized FPGA Resource. 2, 3, 17–25
VM Virtual Machine. 2, 3, 17
List of Tables
4.1 Resource Usage for System Static Hardware . . . . . . . . . . . . . . . . 37
5.1 Boot Times for VMs and VFRs . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Resource Usage for VFR Load Balancer . . . . . . . . . . . . . . . . . . 49
5.3 Resource Usage for VFR VXLAN Port Firewall . . . . . . . . . . . . . . 57
5.4 Throughput and Latency for VXLAN Port Firewall . . . . . . . . . . . . 59
List of Figures
2.1 Diagram of the SAVI testbed. . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 The SAVI testbed Smart Edge node. . . . . . . . . . . . . . . . . . . . . 9
2.3 The Driver-Agent abstraction used in the SAVI testbed OpenStack system. 10
3.1 A simplified view of resource management in OpenStack/SAVI Testbed. . 15
3.2 FPGA Partial Reconfiguration. . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 System view of the on-FPGA portion of the virtualization hardware. . . 24
3.4 Virtualization hardware input arbiter block. . . . . . . . . . . . . . . . . 26
3.5 The VFR wrapper design. . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.6 VFR boot sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.7 Compile flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1 A sequence diagram of the entire boot procedure in the SAVI testbed
prototype system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1 Experiment setup for load balancer tests. . . . . . . . . . . . . . . . . . . 51
5.2 VFR load balancer latency at different throughput levels. . . . . . . . . . 52
5.3 VM load balancer latency at different throughput levels. . . . . . . . . . 52
5.4 Number of dropped packets for the VM load balancer. . . . . . . . . . . 53
5.5 Packet diagram for the VXLAN protocol. . . . . . . . . . . . . . . . . . . 55
5.6 VXLAN Port Firewall . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.7 Experimental setups for VFR-based VXLAN firewall. . . . . . . . . . . . 59
Chapter 1
Introduction
Datacenter-based cloud computing has evolved into a multi-billion dollar industry, with
continued growth forecast [1]. Cloud computing is based on virtualization technology,
which abstracts physical resources into virtualized resources. This virtualization provides
flexibility and system scalability (or elasticity) [2], and also allows many users to share
available resources in a datacenter in a transparent way. Cloud computing can also greatly
reduce Information Technology (IT) operating costs of companies and organizations [3],
making it a very attractive option for IT needs.
1.1 Motivation
Field-Programmable Gate Arrays (FPGAs) have the potential to accelerate many com-
mon cloud computing and datacenter-centric applications, such as encryption [4], com-
pression [5], or low-level packet processing [6]. FPGAs have begun to make their way
into datacenters, and their use in this context can be organized into three categories. The
first sees FPGAs being used in the technology that enables the datacenter itself, such as
switches and routers. FPGAs in this category are transparent, as neither the end user
nor the datacenter operator is necessarily aware of their existence. In the second category,
FPGAs are used in special “appliances” – essentially boxes that accelerate certain tasks
or processing. An example could be FPGA-based Memcached appliances [7]. The appli-
ance may be available to end users, but the FPGAs inside are themselves not accessible,
programmable resources – they are still relatively transparent. The third category, which
is the focus of this thesis, sees FPGAs becoming fully user-accessible, programmable
resources. Users would be able to allocate FPGA resources just as they would a virtual
machine, using the same control infrastructure – making FPGAs first-class citizens of the cloud.
Consider a motivating example: A large organization runs its Information Technology
(IT) services and website on an infrastructure-as-a-service cloud, using hundreds or even
thousands of VMs to serve their site and services to millions of users. Their applications
may require compute-intensive processing, or application-level packet processing that re-
quires many VMs to do efficiently. If user-accessible FPGA resources are available in the
cloud, the organization could design custom hardware to accelerate these tasks – elimi-
nating a number of VMs in exchange for a few FPGA resources, and potentially gaining
a boost in throughput and a reduction in latency. At the same time, the user retains
all the benefits of using a compute cloud, such as dynamic scalability, flexibility, and
reliability. The cloud provider also benefits by freeing up VMs, which could potentially
reduce power consumption and operating costs.
The work presented in this thesis aims to explore methods of enabling these FPGA
resources in commercial cloud computing systems.
1.2 Contributions
A hardware/software architecture enabling the virtualization of FPGAs and manage-
ment thereof using the OpenStack cloud system is presented in this thesis. The major
contributions are outlined as follows:
• A hardware and software infrastructure that splits an FPGA into a number of recon-
figurable regions, and allows these regions to be managed as individual resources in
an OpenStack cloud system. Introduction of the term Virtualized FPGA Resource
(VFR).
• A functional implementation of such an architecture using the Smart Applications
on Virtual Infrastructures testbed.
• A comparison of VFRs and VMs in terms of boot-time performance.
• An evaluation and proof of concept of the prototype system by means of two ap-
plications:
– A hardware load balancer using a hypothetical UDP-based protocol
– A method of using virtualized hardware to extend capabilities in an OpenFlow
software-defined network (SDN).
1.3 Overview
The rest of this thesis is organized as follows: Chapter 2 will provide background and
context – reviewing prior work in virtualizing FPGAs or work similar to the techniques
used in this thesis. Chapter 3 will introduce the hardware and software architecture en-
abling FPGA virtualization. Chapter 4 will introduce the SAVI [8] testbed prototype [9]
and implementation details. Comparisons of VMs and VFRs, as well as evaluations of
the proof of concept applications, are shown in Chapter 5. Chapter 6 provides some
future vision and concludes the thesis.
Chapter 2
Background
This chapter introduces concepts and definitions used throughout this thesis, and provides
context. Work related to the techniques used in this thesis will also be examined.
2.1 Field-Programmable Gate Arrays
This thesis focuses on the use of Field-Programmable Gate Arrays (FPGAs) in data-
centers and cloud computing. An FPGA is a silicon chip whose functionality can be
reprogrammed an arbitrary number of times to become nearly any digital circuit – it
is a type of reconfigurable hardware. Modern FPGAs are typically made up of a large
array of programmable Look-up Tables (LUTs), each of which can implement a four, five
or six variable logic function, depending on the device architecture. LUTs are usually
coupled with flip-flops and organized into logic blocks that can then be connected to-
gether through a dense, programmable routing fabric. For further reading on FPGA
architectures, the reader is directed to [10]. A set of CAD tools can map arbitrary hard-
ware designs described in Hardware Description Languages (HDLs) to the FPGA fabric.
The most common HDLs include Verilog HDL [11] and VHDL [12], but others such as
BlueSpec Verilog [13] are gaining popularity.
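To make the LUT abstraction concrete, the following minimal Python sketch models a k-input LUT as a 2^k-entry truth table. This is purely illustrative – on a real device the stored bits are set by the bitstream at configuration time, not at run time.

```python
class LUT:
    """A k-input look-up table: any Boolean function of k variables,
    stored as a 2**k-entry truth table (as in an FPGA logic block)."""

    def __init__(self, k, truth_table):
        assert len(truth_table) == 2 ** k
        self.k = k
        self.table = list(truth_table)

    def evaluate(self, *bits):
        # Pack the k input bits into a table index, LSB first.
        index = sum(bit << i for i, bit in enumerate(bits))
        return self.table[index]


# Configure a 4-input LUT as a 4-way AND: only entry 0b1111 holds a 1.
and4 = LUT(4, [0] * 15 + [1])
```

Rewriting the 16 stored bits turns the same structure into any other 4-input function, which is precisely the flexibility the CAD tools exploit when mapping an HDL design onto the fabric.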
Modern FPGAs also have embedded hard blocks to increase their capabilities – these
include Digital Signal Processor (DSP) or multiplier blocks, block Random Access Mem-
ories (BRAMs), high speed serialize-deserialize (SERDES) transceivers, communication
controllers (Ethernet [14], PCIe [15]), and even full microprocessors.
2.2 Cloud Computing
It is useful to define what is meant by the term cloud computing, as many companies,
individuals and other sources may use the term in different ways. This thesis will fol-
low the definition of cloud computing given by NIST that describes several essential
characteristics [16], summarized here:
1. On-Demand Self-Service: the ability to provision resources at any time, on demand,
without human interaction.
2. Broad Network Access: all resources are available and accessible over the network.
3. Resource Pooling: provider resources are organized into pools enabling multi-tenant
service.
4. Rapid Elasticity: the amount of allocated resources can be dynamically expanded or contracted.
5. Measured Service: ability to monitor, control and report resource usage.
In addition there are also several different cloud service models. The NIST definition
above covers all these models, and they are described briefly here:
• Software as a Service (SaaS): Allocatable resources are software programs, usually
provided over the Internet via web browsers.
• Platform as a Service (PaaS): Resources are operating systems, development tools
and frameworks for creating software and services.
• Infrastructure as a Service (IaaS): Resources are virtualized datacenter components
such as Virtual Machines (VMs), virtual storage, networking, and bandwidth.
This thesis will focus primarily on IaaS type cloud computing, where allocatable
resources are virtualized datacenter components. A good example of this type of cloud
computing would be Amazon Web Services [17], or the SAVI testbed, which will be
discussed shortly.
The cloud computing paradigm has become immensely popular in the IT services and
related industries because it frees organizations from the physical aspects of computing
and IT infrastructures. No major capital investment is required for physical servers
and networking equipment, nor are there maintenance costs for that equipment. These burdens
are shifted to the Cloud Provider, and the IT organization simply pays a set rate for
the cloud-based resources that it uses. The fact that the organization pays only for
what it uses, combined with the lack of capital investment, represents a significant cost
reduction to the organization. The cloud generally guarantees a Service Level Agreement
(SLA), leaving the organization assured that its IT infrastructure will experience little
to zero downtime due to hardware problems. Cloud computing also allows the end
user to scale their systems up or down seamlessly, avoiding the need to over-provision
computing capabilities or bandwidth usually needed to mitigate the effects of bursty
traffic, again saving on operating costs. From a technical and cost perspective, cloud
computing is generally extremely attractive to organizations with both large and small
IT needs. Certain other factors may influence the attractiveness of cloud computing,
usually legal issues arising from the geographic location of the cloud provider's datacenter,
or privacy concerns because a user's data is effectively in the hands of a third
party; however, these points are outside the scope of this thesis.
2.3 The Smart Applications on Virtual Infrastructure Network
SAVI [8] is a Canada-wide research network aimed at exploring next-generation appli-
cation marketplaces that make use of fully virtualized infrastructure, as well as future
Internet alternatives. A central vision of the SAVI network is the notion of a Smart
Edge node – a smaller-scale datacenter situated close to the network edge, providing
specialized low-latency processing for future application platforms. SAVI joins a number
of other networking research testbeds such as GENI [18], Emulab [19], PlanetLab [20]
and Internet2 [21], many of which are also federated.
Figure 2.1: Diagram of the SAVI testbed.
The SAVI testbed is one of the SAVI network research themes. The goal of the SAVI
testbed is to realize a future application platform that will provide a testing ground for
other SAVI research themes. The testbed consists of several Core nodes and many Smart
Edge nodes, deployed at various Universities and institutions across Canada. These
Core and Edge nodes are interconnected by a fully virtualized Software Defined Network
(SDN), and the whole system is orchestrated by a Control and Management (C & M)
system. Users can allocate virtualized resources via the C & M system across all nodes in
the testbed, as well as private virtual networks that provide complete isolation from other
users' experiments and systems. The ORION [22] and CANARIE [23] networks connect
all components over a large geographic area of Canada. Figure 2.1 shows a diagram of
the testbed architecture.
2.3.1 The Smart Edge Node
SAVI Smart Edge nodes are small-scale datacenters situated close to the edge of the
network and are the primary connection point for application users. SAVI Smart Edge
nodes are unique in that they make use of heterogeneous resources in addition to virtual
machines to accelerate processing – GPUs and reconfigurable hardware, as well as regular
bare-metal servers. These resources put a large amount of processing power close to the
edge of the network, and allow applications to do a majority of intensive processing
before having to traverse the possibly high-latency network to the Core datacenter. Such
intensive processing may include things like advanced signal processing for wirelessly
connected devices, encryption and decryption, multimedia streaming acceleration, new
types of switching and routing, and other packet-oriented processing.
Figure 2.2 shows a diagram of a Smart Edge node. In the SAVI testbed, the Smart
Edge is an OpenStack cloud system [24]. OpenStack management forms the Smart
Edge C & M plane through a number of subsystems, all of which are reachable through
RESTful [25] APIs.
• Nova [26] – manages all compute resources through a Driver-Agent abstraction.
• Keystone [27] – performs authentication and identity management.
• Glance [28] – manages Virtual Machine images and other images.
• Quantum [29] – performs network management functions (later renamed "Neutron"
due to a trademark conflict).
• Swift [30] – object storage system.
• Cinder [31] – block storage service.
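All of these subsystems are driven over HTTP. As a rough illustration, a client first obtains a token from Keystone and then attaches it to subsequent Nova or Glance calls. In the sketch below the endpoint address and credentials are hypothetical; the request body follows the Keystone v2.0 token API that was current in this era. The request is only constructed, not sent.

```python
import json
import urllib.request

# Hypothetical Keystone endpoint; a real deployment would expose its own.
KEYSTONE_URL = "http://keystone.example:5000/v2.0/tokens"


def build_token_request(username, password, tenant):
    """Construct (but do not send) a Keystone v2.0 authentication request."""
    body = json.dumps({
        "auth": {
            "tenantName": tenant,
            "passwordCredentials": {"username": username,
                                    "password": password},
        }
    }).encode()
    return urllib.request.Request(
        KEYSTONE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_token_request("demo", "secret", "demo-tenant")
```

The token returned by Keystone would then be placed in an `X-Auth-Token` header on every later request to the other subsystems.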
The SAVI smart edge also has a custom Software-Defined Infrastructure (SDI) man-
ager, called Janus. Janus offloads certain tasks from OpenStack, such as network control
and resource scheduling, and also performs configuration management and orchestration
of the testbed’s OpenFlow-based Software-Defined Network (SDN). Network control is
accomplished through an OpenFlow Controller implemented using Ryu [32]. Janus also
virtualizes the network into slices using FlowVisor [33] (an OpenFlow-based network vir-
tualization layer), and users can run their own User OpenFlow Controller to manage
their own private network slice. Essentially, the SDI Manager brings Cloud Computing
and Software-Defined Networking together under one management system.
More information on SAVI testbed infrastructure management and Janus is provided
in [34].
Figure 2.2: The SAVI testbed Smart Edge node.
Of particular interest to this thesis in Figure 2.2 is the Nova component of OpenStack,
which is the part that allocates resources. The standard Nova only supports processor
virtualization, where Virtual Machines (VMs) are booted on top of hypervisors that
abstract away the physical hardware. The vision of the Smart Edge however, incorporates
Heterogeneous Resources in addition to VMs. Thus Nova in the SAVI testbed is extended
to enable it to manage these new resources.
2.3.2 Heterogeneous Resources in the SAVI Testbed
For OpenStack to manage different types of resources, they must all appear homoge-
neous in nature. To accomplish this, the SAVI testbed OpenStack uses a Driver-Agent
system. A driver for any resource implements required OpenStack management API
methods, such as boot, reboot, start, stop and release. The driver then communicates
these OpenStack management commands to an Agent, which carries them out directly
on the resource, via a hypervisor or otherwise. In this fashion, OpenStack can manage
all resources through the same interface. Figure 2.3 shows a diagram of the Driver-Agent
system. Essentially, the Agent is performing resource-specific management, while the
driver facilitates resource-agnostic management for OpenStack. The method of commu-
nication between the Agent and the Resource, and the Driver and the Agent, is entirely
resource-dependent.
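The split described above can be sketched in a few lines of Python. This is a hypothetical illustration of the abstraction, not the actual Nova or SAVI code; all class and method names are made up.

```python
class Agent:
    """Resource-specific management: runs beside the resource and carries
    OpenStack commands out on it directly (via a hypervisor or otherwise)."""

    def __init__(self, resource_type):
        self.resource_type = resource_type
        self.instances = {}

    def boot(self, instance_id, image):
        # A VM agent would ask the hypervisor to boot here; an FPGA agent
        # would program the device with a bitstream instead.
        self.instances[instance_id] = {"image": image, "state": "ACTIVE"}
        return self.instances[instance_id]

    def release(self, instance_id):
        self.instances.pop(instance_id, None)


class Driver:
    """Resource-agnostic management: implements the OpenStack API methods
    (boot, release, ...) and forwards each one to the resource's Agent."""

    def __init__(self, agent):
        self.agent = agent

    def boot(self, instance_id, image):
        return self.agent.boot(instance_id, image)

    def release(self, instance_id):
        self.agent.release(instance_id)


vm_agent = Agent("vm")
driver = Driver(vm_agent)
info = driver.boot("instance-1", "ubuntu-12.04")
```

Because every Driver presents the same methods regardless of what its Agent does underneath, OpenStack sees a homogeneous pool of resources.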
Figure 2.3: The Driver-Agent abstraction used in the SAVI testbed OpenStack system.
If a user desires to allocate a resource, they need to be able to specify what resource
type they want – the SAVI testbed extends the OpenStack notion of resource flavor
to enable this. Usually, resource flavor refers to the number of virtual processors and amount
of RAM to allocate to a VM. The SAVI testbed extends the definition of flavor to also
include resource type. The SAVI testbed currently has several of these additional resource
types including GPUs, bare-metal servers, and reconfigurable hardware. Although
reconfigurable hardware is included in SAVI (and its precursor VANI [35]), it is still relatively
non-virtualized – simply FPGA cards in bare-metal servers managed by OpenStack.
For OpenStack to be aware of these resources, resource references must be placed
in its database – one for each allocatable resource. This is done using the nova-manage
tool. The resource database entry includes the address of the Agent that provides the
resource, a type name that can be associated with a flavor, and how many physical
network interfaces the resource has. A flavor is created for each unique resource type.
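The database entries and extended flavors just described can be sketched as follows. This is a hypothetical illustration; the field names, addresses, and flavor names are invented and do not reflect the actual OpenStack schema.

```python
resource_db = [
    # One entry per allocatable resource, as added with nova-manage: the
    # Agent's address, a type name tied to a flavor, and the number of
    # physical network interfaces the resource has.
    {"agent_addr": "10.0.0.5:9000", "type": "fpga", "num_ifaces": 4,
     "in_use": False},
    {"agent_addr": "10.0.0.6:9000", "type": "gpu", "num_ifaces": 1,
     "in_use": False},
]

flavors = {
    # A standard flavor describes vCPUs and RAM; the SAVI extension adds a
    # resource type so heterogeneous resources are requested the same way.
    "m1.small": {"vcpus": 1, "ram_mb": 2048, "resource_type": "vm"},
    "fpga1.small": {"vcpus": 0, "ram_mb": 0, "resource_type": "fpga"},
}


def find_resource(flavor_name):
    """Return (and mark busy) the first free resource of the flavor's type."""
    rtype = flavors[flavor_name]["resource_type"]
    for entry in resource_db:
        if entry["type"] == rtype and not entry["in_use"]:
            entry["in_use"] = True
            return entry
    return None


chosen = find_resource("fpga1.small")
```

A request for an `fpga1.small` flavor thus resolves to a concrete Agent address, after which allocation proceeds exactly as it would for a VM.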
2.4 FPGA Virtualization
This thesis focuses on exploring methods of virtualizing FPGAs and managing the device
or portions thereof using OpenStack in the SAVI testbed. A number of prior works have
examined “virtualization” of FPGAs in different contexts, described in the following
subsection.
2.4.1 Related Work
Hardware virtualization, especially that pertaining to FPGAs, has been explored for
some time. Initially realized through time multiplexing hardware [36], most hardware
virtualization schemes now generally use run-time reconfiguration of the FPGA. This
can be full reconfiguration of the entire device, but usually refers to dynamic Partial
Reconfiguration (PR) of a portion of the FPGA.
In terms of network applications, which is a theme in this thesis, there has been
work examining the use of partial reconfiguration to virtualize forwarding data planes in
routers [37], although this is a very specific case and does not involve user-designed custom
hardware. Recent works involving virtualized FPGAs for custom user hardware virtually
increase the number of available FPGA partially reconfigurable resources [38] or virtualize
non-PR coprocessors [39], to maintain parity with the number of microprocessors in a high
performance computing environment. Others use the partial reconfiguration technique
to make reconfigurable hardware sharable by multiple software processes. This generally
involves some sort of virtualization on the level of the operating system in addition to
the FPGA or gate-level virtualization done using PR. Some works investigate operating
system and scheduler design specifically to manage reconfigurable hardware tasks [40, 41].
On a lower level, Huang et al. use a hot plugin technique to provide access to PR based
accelerators via a unified Linux kernel module, allowing multiple processes to efficiently
share different accelerators [42]. Pham et al. propose a microkernel hypervisor for new
FPGA/CPU hybrid devices, which facilitates access to either a CGRA-like Intermediate
Fabric or a regular PR region running user accelerators [43].
What all these schemes and others like them have in common is that they view
the reconfigurable accelerators as rather short-lived entities executing hardware “tasks”,
which supplement software tasks running on a conventional processor. Thus they focus
heavily on reconfiguration times and concurrent access to the same PR region for multiple
processes, as well as high bandwidth between the FPGA and CPU.
The context of FPGA virtualization in this thesis is markedly different. This thesis
does not assume the virtualized accelerators to be closely coupled with CPUs or software
processes, rather, the accelerators are seen as being a major or supplemental component
of massive, distributed, cloud-based infrastructures. Most virtualization techniques like
the ones mentioned above cannot readily be applied to IaaS clouds and VMs because the
end user does not have any sort of access to the underlying physical hardware. It may be
possible for a cloud provider to make so-called hardware tasks available to VMs and thus
end users, but the users would likely be unable to define their own hardware because of
the low-level access it would still require, which somewhat defeats the purpose.
Additionally, the hardware task model may not suit all IaaS users – this thesis also
envisions streaming, packet processing, and network centric applications as well – all
things that the user of a virtualized datacenter may need. Because of this, the work
presented here focuses on providing virtualized, in-network hardware resources that are
analogous in the “resource” sense to VMs.
Chapter 3
Architecture for FPGA
Virtualization
The general approach for virtualization is modelled after that for virtual machines. To
do this, it is important to understand how OpenStack manages resources. This chapter
will briefly examine how OpenStack operates and manages heterogeneous resources in
the SAVI Testbed. Then, the architecture of the system enabling FPGA virtualization
is presented.
3.1 OpenStack Resource Management
Figure 3.1 shows a simplified diagram of resource management in OpenStack. The Open-
Stack Controller runs on a Commodity Server inside the SAVI Testbed, and provides an
API (specifically Nova) to allow a user to request resources. In general, VM resources in
the cloud system are booted on top of hypervisors on physically separate machines from
the Controller - OpenStack maintains a database of all resources in the system, both in
use and free for allocation.
When a resource request comes in, OpenStack finds available resources in the Re-
source Database, and finds which physical machine they are located on. The Controller
Figure 3.1: A simplified view of resource management in OpenStack/SAVI Testbed.
communicates with the Resource Server via a separate process running beside the Hyper-
visor, called an Agent. The Agent is a piece of software that interprets commands from
the main OpenStack Controller. As described briefly in Chapter 2, the Agent is part of
a Driver-Agent abstraction – a Driver integrates with the OpenStack compute controller
(Nova), implementing resource control API functions. Through the Driver, OpenStack
requests the resources from the Agent, which in turn instructs the Hypervisor to boot a
VM with the operating system image and parameters specified by the User (sent to the
Agent by the Controller). Networking information is also sent by the Controller, which
is used by the Agent and Hypervisor to set up network access for the VM. A reference,
usually in the form of an IP address, is returned to the User such that they can connect
to their resource, and run whatever application or system they want on top of it. For
more details on OpenStack, the reader is referred to [24].
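The allocation path just described can be condensed into a short end-to-end sketch: the Controller consults its database, dispatches to the Agent on the chosen host, the Agent asks the Hypervisor to boot, and a reference (an IP address) is returned to the User. Everything here is hypothetical and simplified – real OpenStack performs these steps across separate services and hosts, and all names and addresses below are invented.

```python
host_db = [
    {"host": "server-1", "free": True},
    {"host": "server-2", "free": True},
]


class Hypervisor:
    def boot_vm(self, image, params):
        # Stand-in for the real hypervisor starting a VM from `image`.
        return {"image": image, **params, "state": "ACTIVE"}


class Agent:
    """Runs beside the Hypervisor and interprets Controller commands."""

    def __init__(self):
        self.hypervisor = Hypervisor()

    def boot(self, image, params, net_info):
        vm = self.hypervisor.boot_vm(image, params)
        vm["ip"] = net_info["ip"]  # networking info sent by the Controller
        return vm


def controller_boot(image, params, agents):
    """Controller side: pick a free host and ask its Agent to boot there."""
    for entry in host_db:
        if entry["free"]:
            entry["free"] = False
            vm = agents[entry["host"]].boot(image, params,
                                            {"ip": "10.1.0.7"})
            return vm["ip"]  # the reference handed back to the User
    raise RuntimeError("no free resources")


agents = {"server-1": Agent(), "server-2": Agent()}
ip = controller_boot("ubuntu-12.04", {"vcpus": 1}, agents)
```

The same flow carries over to heterogeneous resources: only the Agent's `boot` implementation changes, while the Controller-side logic stays identical.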
Figure 3.1 also shows that the SAVI Testbed OpenStack Controller manages heterogeneous resources through similar mechanisms. A custom Driver communicates with Agents managing bare-metal servers, some of which are regular servers while others contain GPUs or reconfigurable hardware.
The current state of reconfigurable hardware in the SAVI Testbed is relatively non-
virtualized – bare-metal servers with PCIe FPGA cards and several BEECube systems.
They are non-virtualized because the resource provided is a fully physical resource, not
sharable between Users and thus not very scalable or flexible. The current reconfigurable
resources in the SAVI Testbed are described briefly below.
3.1.1 SAVI Testbed FPGA Resources
This subsection briefly describes the current, non-virtualized FPGA resources in the
SAVI Testbed.
BEE2 Boards
The SAVI Testbed has a number of BEE2 systems [44]. The BEE2 is equipped with five
Xilinx FPGAs, with one used to control the others. In the Testbed, an Agent runs on an
embedded system on the control FPGA, and manages the other FPGAs as resources that
can be allocated. Each FPGA resource has four 10G-capable CX4 interfaces that connect
to the Testbed SDN, allowing the user to send and receive data from their hardware on
the FPGA.
Since the user simply gets the entire device as a resource, they are responsible for
designing and compiling their hardware using vendor tools, ensuring that their hardware
ports match the correct pin locations on the BEE2, and ensuring that the hardware will
function correctly. Once they generate a bitstream file for programming the FPGA, it is
uploaded to OpenStack as an image, and is loaded onto the FPGA by the Agent.
Note that again the definition of a concept in OpenStack is being extended. Normally,
an “image” refers only to an Operating System (OS) image, however OpenStack allows
any file type to be uploaded as an image. Therefore, for a BEE2 FPGA resource, the im-
age will be a bitstream generated by the FPGA tools. For the BEE2 resource, the Agent
will receive this image from the OpenStack Controller via the driver, and simply config-
ures it onto an unused FPGA. OpenStack sees the FPGA as any other resource thanks
to the Driver-Agent abstraction, and the user can now make use of custom hardware
acceleration in the SAVI Testbed.
PCIe-Based FPGA Cards
To increase the range of different FPGA applications available to researchers, it is useful
to have FPGAs closely coupled to processors so that the reconfigurable hardware can
accelerate compute-intensive portions of software. The SAVI Testbed provides several
PCI-Express-based FPGA boards connected to physical servers: The NetFPGA, the
NetFPGA10G [45] and the DE5Net [46]. The boards have varying FPGA device sizes
and on-board memory, but have in common four network interfaces that are connected to
the Testbed SDN. The NetFPGA has four 1G Ethernet ports, while the NetFPGA10G
and DE5Net have four 10G Ethernet ports. A researcher can now design custom hardware
that can accelerate software tasks, provide line-rate packet processing, or a combination
of both.
In addition to these boards, the Testbed also contains MiniBEE [47] resources. The
MiniBEE contains a conventional processor and an on-board FPGA connected through
PCIe. It also has 10G network interfaces, a large amount of memory and an expansion
port for additional FPGA peripherals.
Since the PCIe boards must be mounted inside physical servers, the SAVI
Testbed provides the server itself with the FPGA card attached as a resource. In the
case of the MiniBEE, the entire system is also offered as a resource.
3.1.2 Requirements for FPGA Virtualization in OpenStack
Using Virtual Machines as a model for full FPGA virtualization, it is clear there are
several required components: An agent to provide the FPGA (or pieces thereof) as a
resource, and a driver to integrate into OpenStack so that OpenStack can communicate
with the Agent. The Agent is responsible for managing the actual resource provided to the
user, in this case an FPGA or portion thereof, and therefore must be capable of receiving
an FPGA programming file (hereafter referred to as a “bitstream”) and configuring or
reconfiguring the device. It must also track which FPGAs or FPGA portions have user
hardware running in them, and which hardware belongs to which user.
Additionally, if full virtualization is to be achieved, the physical FPGA device must
be abstracted and sharable between different users. This will require a base hardware
architecture to “virtualize” the device, somewhat similar to a hypervisor. The Agent
must be aware of this virtualization layer to manage the resources as well.
The following sections will describe the design of a system that meets the afore-
mentioned requirements, using OpenStack in the SAVI Testbed as the cloud computing
platform. The base hardware architecture virtualizing the device will be described, and
then the Agent that provides the virtualized hardware to OpenStack.
3.2 Hardware Architecture
Though the current FPGA resources in the SAVI Testbed are managed by OpenStack, they still fall short of commercial systems in flexibility and user-friendliness.
The resources are still relatively non-virtualized – a single physical device is allocated to
one user, whereas in a fully virtualized system, one physical device should be sharable
among different users simultaneously. There is further motivation for this when one considers that a single user may not make full use of an entire FPGA, wasting some reconfigurable fabric that could go to another user. A full FPGA is also more difficult to design and program, especially when integrating complex IP (such as memory controllers), and especially without physical access to the device.
The architecture presented in this thesis seeks to resolve these problems through full
hardware virtualization.
3.2.1 Fully Virtualized Hardware
Virtualization in a cloud computing context has several characteristics that are usually
presented in terms of Virtual Machines:
• Physical Abstraction – The physical device itself is abstracted and the user is not aware of the underlying hardware. For example, a VM may run via a hypervisor on an Intel Xeon processor, but the user is aware only of how many Virtual CPUs they have, not of the real hardware.
• Sharing – A single physical device provides one or more virtual instances to one or
more users. Such devices can also be referred to as Multi-tenant devices.
• Illusion of Infinity – The actual number and physical location of resources is also
abstracted, and from the user’s view there exists a seemingly infinite pool of re-
sources.
The objective of the work presented in this thesis is to enable fully virtualized FPGA
hardware by designing an architecture and system that has the above characteristics (or
characteristics that are analogous). The physical device should be abstracted – a user
should be able to specify a hardware design, in HDL for example, and rely on the system
to run their design in the cloud. They should not have to worry about what specific device
their hardware must run on, nor about compiling for different devices. They should also
not be aware of other physical aspects of the system – such as the physical location of
resources or the number of available resources (i.e. the illusion of infinite resources in the
cloud should be maintained).
FPGA size and density have grown to the point where it is feasible for the device to be virtualized and shared between users, with enough fabric left over for each user to run non-trivial hardware designs. This also has the benefit of improving the usage efficiency
and cost-effectiveness of a device, since many useful hardware designs do not need as
much logic as the entire device provides. Full hardware virtualization would allow the
cloud provider to put the unused fabric to work for other users.
Another issue fully virtualized hardware would solve is that of security. Giving a user
full control over an entire FPGA directly connected to the network may be risky for a
cloud provider. It would allow a nefarious user the ability to inject malicious data directly
into the provider datacenter, at extremely high rates (10Gb/s or more). A hypervisor, in
the case of a regular VM, acts as a buffer between the user’s guest OS and the provider
hardware, allowing the provider to set up security and police network traffic before it
gets onto the internal network. The hardware virtualization layer is therefore designed
to allow the provider to police the data going in and coming out of the user hardware.
The general approach for virtualization of the hardware is based on Partial Reconfig-
uration (PR) of the FPGA. This technique of reconfiguring specific portions of an FPGA
while the rest remains running can be used to effectively split the device into several
regions that can be offered individually as resources to cloud users. To familiarize the
reader, the basics of Partial Reconfiguration will be reviewed in the following section.
3.2.2 FPGA Partial Reconfiguration
Partial Reconfiguration is a capability of some FPGAs where portions of the device can
be reconfigured independently, without affecting other circuits running on the device.
Physical portions of the device must be specified to be a Partially Reconfigurable Region
(PRR), and specific hardware modules of the overall design must be mapped to one of
these PRRs. Multiple modules can be compiled for one PRR, however only one can
be configured at run time. These modules are called Partially Reconfigurable Modules
(PRMs). Generally, the logic surrounding the PRRs is fixed, and is referred to as the
static logic. Major FPGA vendors support partial reconfiguration [48, 49, 50].
Figure 3.2 depicts a partially reconfigurable FPGA system. There is one PRR, and
three PRMs (PRM A, B and C). Each PRM contains different hardware implementing
different functionality, and each PRM can be dynamically configured into the PRR at
run-time while the Static Logic remains running.
PR introduces complexities into the hardware compilation process. The interface from
the static logic to a PRR, called the PR Boundary, must be dealt with carefully by the
CAD compile process. The static logic must be compiled once, since it does not change,
along with one of the PRMs. After placement and routing, the physical wires crossing
the PR boundary are set permanently since they connect to the static logic, shown in
Figure 3.2 as Static Connection Points. Further PRMs compiled with the static logic
must have their connections routed to the same physical locations, so that when they
are partially reconfigured, their connections actually connect to the running static logic.
This is also shown in Figure 3.2, where any PR Boundary crossing signal is routed to the
same location in each PRM. This is usually accomplished by locking the placement of
the logic cells whose wires cross the boundary (called anchor LUTs) after compiling the
static logic. Obviously, every PRM for a given PRR must have the same logical top-level
ports, whether or not they use them all.
Other considerations must be made by designers using PR. During reconfiguration,
outputs of a PRR may be in flux and have unknown values. The static logic should
have a method of freezing these outputs or ignoring them while reconfiguration takes
place. Timing constraints can also be harder to meet in PR systems, since the CAD
tools are unable to perform any logic optimization across the PR boundary. Xilinx Inc.
suggests registering signals both before and after the PR boundary to improve timing
performance [48]. Lastly, the designer should ensure that a freshly reconfigured PRM is
fully reset to a known state.
Figure 3.2: FPGA Partial Reconfiguration.
3.2.3 Virtualization via PR
In a cloud computing context, the static logic surrounding the Partially Reconfigurable
Regions implements hypervisor functions – providing a buffer under control of the cloud
provider in between the network and the user-defined hardware in the PRRs. Just as
in a VM hypervisor, this will allow the cloud provider to implement some measure of
security, and possibly other required management functions. The static logic also has
several other functions that will become apparent as the full design is described below.
3.2.4 Static Logic Design
As mentioned previously, partial reconfiguration is used to split a single FPGA into
several reconfigurable regions, each of which are managed as a single Virtualized FPGA
Resource (VFR). In effect, this virtualizes the FPGA and makes it a multi-tenant device,
although still requiring the external control of the Agent. A user can now allocate one
of these VFRs and have their own custom designed hardware placed within it. The
static logic surrounding the VFRs is still under control of the cloud provider, and must
accomplish several functions. The method of data transfer between a user and their VFR
is over the network, and therefore the static logic must facilitate forwarding of packets to
the correct VFR. To do this, the static logic system must track the network information
(i.e. MAC addresses) of each VFR as provided by the OpenStack Controller. The static
logic is also designed in a way that enables it, and thus the cloud provider, to police
the interfaces to the VFRs, to maintain some basic network security, such as prevention
of sniffing traffic and spoofing addresses. The static logic contains interface hardware
for the 10G Ethernet ports, memory controllers, all chip-level I/O, and a method of
communicating with the Agent. The following paragraphs and subsections describe the
design choices made for this thesis to accomplish these functions.
Figure 3.3 shows a block diagram of the on-FPGA portion of the system. A Soft
Processor (that is, a microprocessor implemented inside the FPGA fabric) communicates
with the Agent that runs on the Host machine. The Soft Processor is attached to a Bus
that allows it to communicate with and control the different components of the system. A
bus is a two-way communication system with two kinds of actors: Masters and Slaves. Masters can
initiate read and write transactions by addressing the Slave they wish to communicate
with, while Slaves can only respond to reads or writes. The Soft Processor is a Bus
Master. The DRAM Controller is a Bus Slave that facilitates access to Off-Chip DRAM.
The MAC Memory-mapped (Memmap) Registers are also Slaves that allow the Soft
Processor to control the Input Arbiter and Output Queues (described in the following
subsection). The VFRs are wrapped in Bus Masters, which allows them to access the
DRAM Controller slave. Packet streams from 10G interfaces pass through the Input
Arbiter and the VFR Bus Master wrappers and enter the VFRs. Output streams exit
the VFR and connect to the Output Queues and subsequently the egress interfaces. The
Agent can reprogram the entire chip or individual VFRs through an external Programmer
that operates over JTAG. The major subcomponents shown in Figure 3.3 are described
in the following subsections.
Figure 3.3: System view of the on-FPGA portion of the virtualization hardware.
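As a software analogy for the bus just described, the following sketch models Masters issuing reads and writes that the bus routes to the memory-mapped Slave owning the addressed range. The address map and register layout are invented for illustration.

```python
# Minimal model of a memory-mapped bus: a master addresses a slave by
# bus address; the bus routes each transaction to the slave whose range
# contains that address. (Illustrative only; addresses are made up.)

class Bus:
    def __init__(self):
        self.slaves = []  # list of (base, size, slave)

    def attach(self, base, size, slave):
        self.slaves.append((base, size, slave))

    def _find(self, addr):
        for base, size, slave in self.slaves:
            if base <= addr < base + size:
                return base, slave
        raise ValueError("no slave mapped at 0x%08x" % addr)

    def write(self, addr, value):
        base, slave = self._find(addr)
        slave.write(addr - base, value)

    def read(self, addr):
        base, slave = self._find(addr)
        return slave.read(addr - base)

class RegisterSlave:
    """E.g. the MAC memory-mapped registers written by the soft processor."""
    def __init__(self, nregs):
        self.regs = [0] * nregs
    def write(self, offset, value):
        self.regs[offset // 4] = value   # 32-bit word-aligned registers
    def read(self, offset):
        return self.regs[offset // 4]
```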
Data Transfer and I/O
A Streaming Interface facilitates packet transfer into the system from 10G Ethernet ports.
Streaming Interfaces are one-way communication channels. They consist at a minimum
of four signals:
1. A variable bit width Data signal.
2. A single bit Valid signal.
3. A single bit Ready signal.
4. A Clock signal.
The actual data of the packet is transferred in chunks through the Data field, where
each chunk is the bit width of the Data field. These chunks are referred to as flits. At
the positive edge of the Clock signal (that is, a logic low to logic high transition), one
flit is transferred if and only if both the Valid and the Ready signals are logic high. The
Valid signal is asserted by the sender along with the flit in the Data field, and the Ready
signal is asserted by the receiver to indicate it is ready to receive data. The Valid and
Ready signals are also referred to as handshake signals. One flit can be transferred over
the Data signal every clock cycle.
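The valid/ready handshake can be modelled cycle by cycle in a few lines of Python. This is an illustrative simulation of the rule "one flit transfers when both Valid and Ready are high", not production code.

```python
# Cycle-level model of a streaming interface: one flit moves per clock
# edge if and only if both Valid and Ready are logic high.

def simulate_stream(flits, ready_pattern):
    """flits: data the sender wants to transmit, in order.
    ready_pattern: the receiver's Ready signal for each cycle (True/False).
    Returns (flits received, cycles elapsed)."""
    received, idx = [], 0
    for cycle, ready in enumerate(ready_pattern):
        valid = idx < len(flits)   # sender asserts Valid while data remains
        if valid and ready:        # handshake: transfer exactly one flit
            received.append(flits[idx])
            idx += 1
        if idx == len(flits):
            return received, cycle + 1
    return received, len(ready_pattern)
```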
The Input Arbiter block, shown in Figure 3.4, is responsible for directing incoming
packets to the correct VFR. The Input Arbiter contains a Content Addressable Memory
(CAM) that the arbiter uses to match a packet’s Destination MAC address to a specific
VFR. An incoming packet is stalled for one clock cycle while the CAM looks up the
destination VFR. In the case that there is no matching VFR in the CAM, the packet
is simply dropped. This has the benefit of preventing VFRs from inadvertently (or
intentionally) sniffing Ethernet traffic that does not belong to them, but the drawback of
being unable to receive broadcast packets. This could be addressed by designing a more
complex switching fabric within the Input Arbiter.
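The arbiter's CAM behaviour, including the drop-on-no-match policy, can be sketched as follows. This is a hypothetical software model of the hardware block; the real CAM is implemented in FPGA fabric.

```python
# Model of the Input Arbiter's CAM: match a packet's destination MAC to a
# VFR region code, or drop the packet when no entry matches.

class InputArbiter:
    def __init__(self):
        self.cam = {}  # destination MAC -> VFR region code

    def program(self, mac, region_code):
        # Done by the soft processor via memory-mapped registers.
        self.cam[mac] = region_code

    def route(self, packet):
        """Return the destination region code, or None to drop the packet.
        Dropping unmatched packets prevents VFRs from sniffing traffic,
        at the cost of not delivering broadcast frames."""
        return self.cam.get(packet["dst_mac"])  # None => drop
```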
The CAM must be programmed with the VFR MAC addresses (provided by Open-
Stack) before any packets can be received. MAC addresses for new VFRs are programmed
into the CAM by the soft processor via several memory-mapped registers. The software
running on the soft processor receives the MAC address and corresponding VFR region
code over a UART link from the Agent running on the host. The software running on the
processor and the communication protocol with the Agent will be described in more detail
in Section 3.3.
The output queues operate similarly, except this block simply tracks each VFR’s MAC
in a register, and prevents spoofing by forcing an outgoing packet’s source MAC address
to be the VFR’s MAC address. The output queue MAC addresses are also updated via
the same memory mapped registers as the input arbiter.
Virtualized Accelerator Wrappers
The VFRs are wrapped inside Bus Masters, as can be seen in Figure 3.3. These are
labelled 1 to N. Figure 3.5 shows a closer view of this wrapper design, which must
Figure 3.4: Virtualization hardware input arbiter block.
accomplish two things – facilitate safe partial reconfiguration of a VFR, and provide
the VFR access to off-chip DRAM. Safe reconfiguration is of paramount importance,
since the cloud provider should seek to guarantee that user hardware will be configured
correctly. The system must ensure that no information or packet is being transferred
across the PR boundary during reconfiguration. This can result in lost data, or a newly
reconfigured VFR receiving data starting in the middle of a packet, which could cause
errors in the user hardware.
The VFR reconfiguration process is handled as follows: When a request comes to
the Agent to configure a new VFR, the Agent sends a command to the soft proces-
sor to freeze the interfaces of the selected region. The soft processor will de-assert the
PR INTERFACE ENABLE signal by writing to the register, which causes the wrapper
hardware to set all streaming interface handshake signals low after any current transfer
finishes. This is accomplished by using a register for each Stream Interface that records
whether or not the stream is in the middle of a transaction. When both transaction reg-
isters are logic 0, the hardware sends an ACK to the processor through another memory-
Figure 3.5: The VFR wrapper design.
mapped register that the processor is polling, and the processor asserts the PR RESET
signal and notifies the Agent that it is now safe to reconfigure that region. The VFR
is held in reset, and after the new user hardware is configured via an external JTAG
connection and the MAC address programmed into the input arbiter and output queue,
the Agent instructs the soft processor to release PR RESET and enable the interfaces
again. This method also ensures that new user hardware is fully reset before interfaces
are enabled.
The wrapper also facilitates access to low-latency off-chip DRAM for a VFR. The VFR
wrapper has read and write ports so that user hardware can insert memory operations
into a queue that acts as a bus master. The queue would be implemented as a FIFO
so that all writes added to the queue are done in order, and read data is also returned
in order. The queue, which is part of the wrapper hardware under provider control,
partitions the memory address space so that each VFR in the physical FPGA only sees a
subset of the total DRAM – effectively giving each VFR a private off-chip memory, while
making it impossible for any VFR to access another’s memory. This is done by dividing
the address space and offsetting addresses by an appropriate amount before entry into
the operation queue. For example, if the memory is 64 MB then the memory address space spans 0x00000000 to 0x03FFFFFF, a total of 0x04000000 address locations. If there are four VFRs then this space is divided into four, and each VFR has log2(0x01000000) = 24 address bits, or a range of 0x00000000 to 0x00FFFFFF (16 MB each). When a VFR makes a read or write operation, the 24 address bits are zero-extended to 32 bits (the system bus width) and offset by an appropriate amount through the addition of a multiple of 0x01000000. The offset is done outside the VFR and inside the
provider-controlled static logic, making the user hardware entirely unaware that memory
beyond its partition exists. The queue is reset and cleared upon VFR reconfiguration via
the PR RESET signal.
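The address partitioning described above amounts to a simple offset calculation, sketched here for the four-VFR, 64 MB configuration used in the example:

```python
# Sketch of the wrapper's address translation: each VFR sees a private
# 24-bit (16 MB) address space; the static logic adds the partition offset
# before the operation enters the memory operation queue.

PARTITION_SIZE = 0x01000000   # 16 MB per VFR (24 address bits)

def translate(vfr_index, vfr_addr):
    """Map a VFR-local address to the physical DRAM address."""
    assert 0 <= vfr_addr < PARTITION_SIZE, "address outside the 24-bit space"
    # Zero-extend to the 32-bit bus width and add the partition offset.
    return vfr_index * PARTITION_SIZE + vfr_addr
```

Because the offset lives in provider-controlled static logic, the user hardware is entirely unaware that memory beyond its partition exists.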
In the prototype system using the NetFPGA10G, the RLDRAM modules have a
minimum read latency of three clock cycles, and a minimum write latency of four clock
cycles, plus several cycles for the controller running on the FPGA. The module can burst
read and write up to a length of eight. Maximum aggregate bandwidth is 38.4 Gb/s.
This memory system is admittedly rudimentary compared to other possibilities such
as a multi-port memory controller, but it was chosen to keep the static logic simple and
low-area.
3.3 Agent Design
This section describes the Agent – the piece of software that communicates with Open-
Stack, performing resource specific management. One Agent manages all the VFRs on
one FPGA, although the design could easily be extended to manage VFRs on multiple
FPGAs. This section will describe Agent requirements in general, but also focus on the
Agent implemented in the SAVI Testbed prototype, which uses the NetFPGA10G card.
Generally, the Agent must implement the resource-specific management commands
from OpenStack (issued via the Driver). At the very least, these must include boot
(instantiate a new resource with the specified parameters) and delete (tear down a running
resource).
The Agent and embedded software use a simple command-acknowledge protocol to
communicate: the Agent will send a command string, and the embedded software will
respond with an acknowledge, and an additional acknowledge for each data parameter
sent as part of the command. If no acknowledge is received, the command is aborted
and re-attempted. In the prototype hardware, the embedded software is run on a Xilinx
Microblaze soft processor. The Agent and OpenStack Driver communicate over the
Testbed network using a text-based protocol over TCP.
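A minimal model of this command-acknowledge protocol might look as follows. The command strings and the link object are placeholders for the real UART framing, not the actual protocol implementation.

```python
# Sketch of the Agent's command-acknowledge exchange: send a command,
# expect an ACK, then one further ACK per data parameter; abort and
# re-attempt if any acknowledge is missing.

def send_command(link, command, params=(), retries=3):
    for _ in range(retries):
        link.send(command)
        if link.recv() != "ACK":
            continue                # no acknowledge: abort and re-attempt
        ok = True
        for p in params:
            link.send(p)
            if link.recv() != "ACK":
                ok = False
                break
        if ok:
            return True
    return False

class FakeLink:
    """Test double for the UART link, with scripted responses."""
    def __init__(self, responses):
        self.responses = list(responses)
        self.sent = []
    def send(self, msg):
        self.sent.append(msg)
    def recv(self):
        return self.responses.pop(0)
```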
The following subsections describe how the Agent boots and deletes VFRs.
3.3.1 Booting
When a boot command is issued in OpenStack, several pieces of information are received
by the Agent:
• UUID – a universal unique identifier for the resource.
• Network information – an IP and MAC address for the resource.
• Image – usually an OS image for VMs, but repurposed for VFRs.
The Agent uses the UUID to track VFRs – UUIDs in OpenStack are an absolute
reference for any object or resource. The image is the most important piece of data
received. It contains PR bitstreams corresponding to the user-designed hardware. How
this is compiled and created will be explained in Section 3.5. The image contains one
PR bitstream corresponding to each physical PRR on the FPGA, numbered so that the
Agent can tell which one is for which region. This is explained in detail in Section 4.2.1.
The Agent chooses the first available unconfigured PRR for the incoming VFR to be
booted, and then begins the reconfiguration process – note that this requires the Agent
to maintain the current state of the system, remembering what regions are currently
configured and what users they belong to. A simple data structure could be used to
store the state of each physical VFR and, if configured, the associated UUID.
First, a “disable interfaces” command is sent to the embedded software. This command has a single parameter: the region code corresponding to the PRR about to be reconfigured. The soft processor freezes the packet stream interfaces as described in Section
3.2.3, and places the PRR (VFR) into reset. The Agent then reconfigures the PRR using
an external reconfiguration tool, in the case of the prototype system, Xilinx iMPACT
over JTAG. Then, the MAC address is programmed into the static logic’s input arbiter and output queues, which allows packets in and out of the newly configured VFR.
The “load MAC address” command is sent to the embedded software followed by the six
byte MAC address received from the OpenStack Controller. Once successful, the “enable
interfaces” command is issued, and the embedded software releases the VFR from reset
and enables the packet stream interfaces.
This method ensures that no packet is in the middle of transfer during a reconfigura-
tion, avoiding the situation where a piece of user hardware might receive a fragmented
packet. It also ensures that user hardware is fully reset before running.
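The boot and delete sequences can be summarized in a sketch of the Agent's state handling. All helper objects here (the UART command channel and JTAG programmer) are hypothetical stand-ins for the real command protocol and iMPACT-based configuration.

```python
# Sketch of the Agent tracking which PRR belongs to which UUID, and the
# freeze -> configure -> program MAC -> enable boot sequence.

class CallLog:
    """Test double standing in for both the UART link and the programmer."""
    def __init__(self):
        self.calls = []
    def command(self, name, args):       # UART command-acknowledge exchange
        self.calls.append((name, tuple(args)))
    def configure(self, bitstream):      # JTAG partial reconfiguration
        self.calls.append(("configure", bitstream))

class FPGAAgent:
    def __init__(self, num_regions, uart, programmer):
        self.regions = {i: None for i in range(num_regions)}  # region -> UUID
        self.uart, self.programmer = uart, programmer

    def boot(self, uuid, mac, bitstreams):
        # Pick the first free PRR, freeze it, configure, program MAC, enable.
        region = next(r for r, owner in self.regions.items() if owner is None)
        self.uart.command("disable_interfaces", [region])
        self.programmer.configure(bitstreams[region])
        self.uart.command("load_mac", [region, mac])
        self.uart.command("enable_interfaces", [region])
        self.regions[region] = uuid
        return region

    def delete(self, uuid, blackbox_bitstreams):
        # Replace the user's hardware with a black-box bitstream.
        region = next(r for r, owner in self.regions.items() if owner == uuid)
        self.uart.command("disable_interfaces", [region])
        self.programmer.configure(blackbox_bitstreams[region])
        self.regions[region] = None
```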
3.3.2 Deleting
Deleting a VFR is similar to booting in terms of what the Agent must do; however, the only piece of data sent is a UUID. The Agent finds the VFR with the corresponding
UUID, and proceeds to reconfigure the PRR with a black box bitstream. This effectively
removes any user hardware from the system. Again this requires that the Agent store
the state of each VFR and associated UUIDs.
3.4 Booting VFRs in OpenStack
Several modifications were needed in the OpenStack Controller (Nova) to get a complete
working system. Major modifications were already made by the SAVI Testbed team to
enable management of bare-metal servers – specifically modifications that allowed the
integration of custom Drivers and therefore custom Agents.
Recall that in Chapter 2 the notion of resource flavor was discussed. Normally, flavor
refers to the number of virtual processors and amount of memory for a VM, but it can
be repurposed to refer to more of a resource type. This is important because the Nova
scheduler uses the flavor submitted by the user to select which Driver to use to contact
the Agent. Each resource in the database references a flavor, so many single resources
can fall under one flavor.
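The flavor-to-Driver selection amounts to a lookup keyed on the requested flavor. The flavor and driver names below are invented for illustration, not the actual identifiers used in the Testbed.

```python
# Sketch of flavor-based scheduling: the flavor on a resource request
# selects which Driver (and hence which kind of Agent) handles the boot.

DRIVERS = {
    "vm.small":  "libvirt_driver",
    "baremetal": "baremetal_driver",
    "vfr":       "fpga_vfr_driver",   # Virtualized FPGA Resource
}

def select_driver(flavor):
    try:
        return DRIVERS[flavor]
    except KeyError:
        raise ValueError("no driver registered for flavor %r" % flavor)
```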
Figure 3.6 shows a diagram of the boot sequence for a VFR, which proceeds as follows:
Upon receiving the Boot command from the User, the OpenStack Controller uses the
specified flavor to choose a resource in the database. The database entry references
the Agent associated with it through an IP and port number. Nova calls the Driver
implementation of the “boot” function, and passes generated networking information,
the Agent IP and port, UUID and any other required information. The Driver then
communicates with the Agent (FPGA host), instructing it to boot a new resource, and
in the case of VFRs, sending over networking information and the image containing
partial bitstreams. The Agent selects the partial bitstream in the image matching a free
region, and programs it along with the network information as in Section 3.3.1. Success
is indicated to the Controller, which then passes a reference to the user.
3.5 Compiling Custom Hardware
Through the modifications and systems described in this chapter, custom hardware ac-
celeration is now available to cloud computing users. However, there is still the question
Figure 3.6: VFR boot sequence.
of how to compile hardware for the system. The VFRs have specific interfaces that any
user hardware must match exactly – not only logically, but also physically, due to the
nature of PR discussed previously. The static logic remains constantly running while
user hardware is partially reconfigured as resources are booted and deleted. Therefore,
any new user hardware must be compiled with the currently running static logic that
is part of the cloud provider’s systems. A compile flow must be developed to allow end
users to compile their custom hardware for use with the virtualization system.
First, the user hardware top-level ports must match those of the static logic PR
boundary logically – a template HDL file is provided to end users inside which they can
define their hardware. The template file contains a module definition whose top-level
ports match the PR boundary in the static logic.
It is assumed at this point that the provider has already synthesized, placed and
routed the static logic. Any user hardware has to be compiled with the placed and
routed static logic, so that the physical wires crossing the PR boundary are placed in the
right locations in the PRM. How this is done in practice depends entirely on the CAD
tool flow provided by the FPGA vendor. For the prototype system implemented for this
thesis, Xilinx PlanAhead is used to perform PR compilation.
The general hardware compile procedure is performed as follows for the prototype
system, but should generalize well to other vendors also. First, the user’s HDL (written
in the provided template) is synthesized to a netlist. This netlist is added to a new
compile “run” in Xilinx PlanAhead – the netlist is assigned to all physical VFRs present
in the system (in the case of the prototype, all four of them). The run is also configured to
use the already placed and routed static logic, and PlanAhead ensures that all boundary
crossing wires are placed in the correct locations in the PRM. The compile run is initiated,
and the tool maps, places and routes the user netlist in all VFRs (which are PRRs). One
design is placed and routed in all VFRs because the Agent requires the flexibility to
place the user’s design in any VFR in the system – PRMs can only be configured inside
PRRs for which they are specifically compiled. Bitstream generation is then completed,
creating one partial bitstream for each physical VFR. These partial bitstreams will be
used by the Agent to configure the user hardware into a running system when a VFR is
booted in OpenStack.
Recall that user hardware is sent to the Agent as an “image”. All partial bitstreams
generated by the compile are added to a compressed archive, and this archive is uploaded
as an image to OpenStack by the user with the image management tool called glance.
When the Agent selects which physical VFR partition to use for the user's hardware,
it simply selects the partial bitstream corresponding to that PRR, configures it, and
disregards the others.
Figure 3.7 shows the general compile flow, upload and boot procedure. For the
prototype, such a compile system is realized using a script-based approach. The user
places their netlist in a specific folder, and then executes a compile script that uses
PlanAhead to perform the aforementioned compilation steps and bitstream packaging.
The end result is the zip archive containing the partial bitstreams, ready for upload as
an image. Steps inside the dashed lines of Figure 3.7 are part of the compile script.
Figure 3.7: Compile flow. (The user's HDL (user_ip.v) is synthesized with XST (ISE 13.4) into a netlist (user_ip.ngc), which enters the PR compile flow in Xilinx PlanAhead 13.4 together with the placed-and-routed static logic (static.ncd) from the provider; bitstream generation produces the user IP PR bitstreams in a zipped folder, uploaded via glance as an OpenStack image.)
Chapter 4
SAVI Testbed Prototype
A prototype of the system described in Chapter 3 has been created using the SAVI
testbed at the University of Toronto. The goal of the prototype system is to validate the
system architecture and show its feasibility, as well as to evaluate and attempt to quantify
the benefit of reconfigurable hardware resources in an Infrastructure-as-a-Service cloud
system.
4.1 FPGA Hardware
The prototype is based on the SAVI testbed OpenStack cloud, with its ability to man-
age heterogeneous resources. The FPGA-based portion is implemented using the NetF-
PGA10G [45], available in the SAVI testbed as a non-virtualized resource in the form
of a baremetal server with NetFPGA10G connected to a PCIe slot. The NetFPGA10G
comes equipped with a Xilinx Virtex 5 VTX240T, four 10G Ethernet interfaces, and 128
MB of off-chip reduced latency DRAM. Although of an older generation than currently
available, the Virtex 5 [51] is still large enough to realize a sufficiently non-trivial sys-
tem that demonstrates the required functionality. The static logic architecture uses as
a base the NetFPGA10G open source infrastructure [52]. Certain components of the
infrastructure were modified to realize the architecture described in Chapter 3:
1. Input Arbiter: The open source infrastructure provides an input arbiter that simply
forwards packets from all four interfaces to a single output stream in a round-robin
fashion. This was modified to produce several output streams, one corresponding to each
physical VFR in the system, and a CAM was inserted in the pipeline to realize cor-
rect forwarding to said VFRs. The CAM is created using Xilinx IP [53]. Additional
top-level ports were also added to the Input Arbiter that connect to the write ports
of the CAM. These top-level ports then connect to several memory-mapped regis-
ters (implemented as GPIO peripherals) connected to the microprocessor system
bus. This allows a Xilinx Microblaze soft processor to program the MAC address
for any specific VFR, by writing data to the GPIO registers. A single bit in one of
the registers forms a Write signal to the CAM, which the software toggles to write
the MAC and VFR region code into the CAM. The VFR datapath is clocked at a
higher frequency than the embedded processor system, which may cause the CAM
to be written several times in succession before the processor can write a ‘0’ to the
write bit. This is generally not a problem however, since the same location will be
written with valid data each time; it is only a minor inefficiency.
2. Output Queues: The NetFPGA10G infrastructure Output Queues were also modified to accommodate multiple incoming packet streams. Registers that contain the
MAC addresses of each VFR were added, allowing the hardware to force source
MAC address fields. Top-level ports were also added to program these registers via
the same GPIO peripherals as the Input Arbiter.
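The write sequence in these two items can be sketched in software terms as follows; the register offsets, field packing, and write-bit position are illustrative assumptions, not the prototype's actual memory map:

```python
# Hypothetical GPIO register offsets -- the real memory map is not
# specified here, so these values are illustrative only.
GPIO_MAC_LO, GPIO_MAC_HI, GPIO_CTRL = 0x0, 0x4, 0x8
CTRL_WRITE_BIT = 0x1

def program_vfr_mac(write_reg, mac, region_code):
    """Load a MAC address / VFR region code pair into the Input Arbiter
    CAM through memory-mapped GPIO registers (layout assumed)."""
    write_reg(GPIO_MAC_LO, mac & 0xFFFFFFFF)            # low 32 bits of MAC
    write_reg(GPIO_MAC_HI, (mac >> 32) | (region_code << 16))
    write_reg(GPIO_CTRL, CTRL_WRITE_BIT)  # toggle Write to commit the entry
    write_reg(GPIO_CTRL, 0)  # clear; the CAM may latch the same entry repeatedly
```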
All packet streams in the design are AXI Stream [54] streaming interfaces, with a
width of 256 bits running at 160 MHz, equating to just over 40 Gb/s peak throughput.
The stream widths are kept overprovisioned at 256 bits for the sake of simplicity. The
prototype system contains four physical VFRs, each a Partially Reconfigurable Region
(PRR). Each VFR region contains 11376 LUTs and 15 36K BRAMs. Four regions are used because this is a non-trivial number that still permits a region size large enough to implement meaningful hardware. It should also be noted that
increasing the number of VFRs in the system also increases the number of required
streaming interfaces, causing a large increase in required routing resources. Normally the placement and routing algorithms could handle this acceptably, but the problem is made vastly more complicated because PRRs are physically fixed and the CAD tools are unable to optimize across the PR boundary. These constraints make it
significantly more difficult for the CAD tools to meet timing, especially as the number
of PRRs increases. This also contributed to the limiting of the number of VFRs.
Resource utilization for the static logic is shown in Table 4.1. The counts in Table 4.1
do not include the counts for the VFR regions. The design with four VFRs successfully
meets timing with the AXI streams running at 160MHz and the soft processor system
running at 100MHz. The VFRs are connected to the AXI streams, so they also run at
160 MHz and make this clock available to user hardware. User hardware must meet the
160 MHz timing constraint to work properly, as there are no other clocks provided.
Table 4.1: Resource Usage for System Static Hardware
Resource Usage (Used / Device Total)
Flip-flop 29327 / 149760 (19%)
LUT 28711 / 149760 (19%)
36K BRAM 105 / 324 (32%)
4.2 Agent Software
The Agent for the prototype is implemented in Python, in keeping with the rest of the
OpenStack project, and because Python provides high functionality with low coding
effort. Although this can come at the cost of performance, the Agent is not required to
be a high performance component.
The Agent uses a collection of Python objects and standalone functions to accomplish
its required tasks, and is designed in a way to make it FPGA hardware agnostic – the
idea being that the same Agent software can be used for any FPGA device realizing the
VFR system, with minimal modifications, provided the communication protocol to the
soft processor is implemented correctly.
The software contains two global objects: a statusTable object, which holds infor-
mation about the status of the hardware, and a serial object, which provides RS232
send and receive functionality for communication with the soft processor system. The
serial object is implemented using the PySerial library.
4.2.1 statusTable and Associated Objects
The statusTable global object stores the needed state of the entire FPGA hardware
system. This includes:
1. The number of physical VFR regions in the managed FPGA hardware.
2. The FPGA system type. This refers to the FPGA device or part number, and
the specific board it is mounted on. For example, the prototype system type is a
NetFPGA10G.
3. A Python list of region objects; each region object corresponds to one unique PRR (VFR) in the FPGA system.
4. A string containing information about the system type, useful for debugging.
The statusTable object also provides several methods (functions) that allow the
Agent to manage the FPGA hardware. The first is (in pseudocode):
statusTable.program(bitPkg, macAddr, serial, uuid)
The first argument to the function (bitPkg) points to a compressed file containing the image – a collection of partial bitstreams, one per PRR in the system, which together form the hardware image supplied by the user and sent to the Agent by the OpenStack controller.
macAddr is a string containing the MAC address of the VFR about to be booted. serial
is a reference to the serial object, and uuid is a string containing the OpenStack gen-
erated UUID of the resource. Algorithm 1 shows the basic operation of the function.
Algorithm 1: The statusTable.program() function
Data: bitPkg, regionList, uuid, MAC
Result: An unconfigured VFR in regionList is configured with the user hardware
in bitPkg, and the MAC address is set to MAC
for region in regionList do
if region is not configured then
region is free, find bitstream file matching this region
pkg = uncompress(bitPkg)
for file in pkg do
if match(file, “*_%d” % region.id) then
region.configure(file, uuid)
region.setMAC(MAC)
return Success
end
end
end
end
a free region was not found, fail
return Failure
The match function call in Algorithm 1 is a regular expression matching the region
object identifier (an integer id ∈ [0..N−1], where N is the number of PRRs) to a portion
of a particular bitstream filename. The filename format of the bitstreams in the package is [name]_[region_id].bit. For example, if the user's hardware system was named myhardware and was compiled for a system containing four VFRs, the bitstreams generated by the script-based compile would be named myhardware_0.bit ... myhardware_3.bit.
This is how the Agent knows which bitstreams in the package correspond to which PRRs.
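This selection step can be sketched in Python as follows; the helper name and archive handling are assumptions, not the Agent's actual implementation:

```python
import re
import zipfile

def find_bitstream(image_path, region_id):
    """Return the name of the partial bitstream in the uploaded image
    archive that matches a given PRR id ([name]_[region_id].bit)."""
    with zipfile.ZipFile(image_path) as pkg:
        for name in pkg.namelist():
            if re.match(r".*_%d\.bit$" % region_id, name):
                return name
    return None  # no bitstream for this region -- the image is malformed
```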
The second method in the statusTable object is the opposite of the first:
statusTable.release(serial, uuid)
This function is used to release a currently running VFR with UUID uuid and configure a black box PRM to remove the user hardware from the system. It also removes that resource's MAC address from the hardware (i.e., from the Input Arbiter CAM and the Output Queues registers). The basic operation is shown in Algorithm 2.
Algorithm 2: The statusTable.release() function
Data: regionList, uuid
Result: A configured VFR in regionList is released
for region in regionList do
if region is configured AND region.uuid == uuid then
region matches uuid to be released, remove MACs and configure black box
region.resetMAC()
bb = “blackbox_” + region.id + “.bit”
region.release(bb)
return Success
end
end
the configured region was not found, fail
return Failure
As can be seen in the program and release functions, the statusTable object holds a list of non-global region objects, each corresponding to one PRR on the FPGA. The region object contains data describing a single VFR:
1. An integer identifier for the particular region.
2. A string containing the configured VFR's OpenStack UUID.
3. A boolean variable indicating whether or not the region is configured.
4. A string containing the MAC address of the region.
The region object also contains several methods for managing a single VFR. These
are:
1. region.resetMAC(serial) : Send a zero MAC address over the serial connection for the soft processor to program into the VFR corresponding to region.id.
2. region.setMAC(serial, newMAC) : Send newMAC over the serial connection for the
soft processor to program into the VFR corresponding to region.id.
3. region.configure(serial, bitstream, uuid, sysType) : Configure bitstream
into VFR region. This involves sending a command to disable stream interfaces,
receiving an acknowledge, using Xilinx iMPACT to program the partial bitstream
for the correct FPGA (dependent on sysType), and re-enabling the stream inter-
faces and receiving a final acknowledge. Also sets region.isConfigured to True.
4. region.release(serial, bitstream, sysType) : Disable stream interfaces as
in the configure method, use Xilinx iMPACT to configure bitstream (a blackbox)
into the FPGA dependent on sysType, and set region.isConfigured to False.
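A minimal Python model of the region object is sketched below. The serial command strings are invented for illustration, and the iMPACT invocation is abstracted into a program_bitstream callable in place of the sysType-based dispatch:

```python
class Region:
    """Sketch of the region object tracking one PRR/VFR. The serial
    protocol and the iMPACT call are abstractions, not the real code."""

    def __init__(self, region_id):
        self.id = region_id
        self.uuid = None
        self.is_configured = False
        self.mac = "00:00:00:00:00:00"

    def set_mac(self, serial, new_mac):
        serial.send("SETMAC %d %s" % (self.id, new_mac))  # command grammar assumed
        self.mac = new_mac

    def reset_mac(self, serial):
        self.set_mac(serial, "00:00:00:00:00:00")

    def configure(self, serial, program_bitstream, bitstream, uuid):
        serial.send("DISABLE %d" % self.id)    # quiesce the stream interfaces
        program_bitstream(bitstream)           # e.g. wraps Xilinx iMPACT
        serial.send("ENABLE %d" % self.id)     # re-enable and acknowledge
        self.uuid, self.is_configured = uuid, True

    def release(self, serial, program_bitstream, blackbox_bitstream):
        serial.send("DISABLE %d" % self.id)
        program_bitstream(blackbox_bitstream)  # configure the black box PRM
        serial.send("ENABLE %d" % self.id)
        self.uuid, self.is_configured = None, False
```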
Standalone Functions
Several standalone functions are also used to implement the required functionality of
the Agent. These include functions to set up a TCP socket for communication with the OpenStack controller; to read and write bytes, null-terminated lines, and files over that socket; and to set up and run a TCP server that the OpenStack controller can connect to. Also included is a main() function where execution begins
when the Agent is started. This is described in the following section.
4.2.2 Initialization and Operation
The Agent's execution starts in a main() function, which first sets up a serial connection (preferred to PCI Express for simplicity) to the FPGA hardware, specifically the soft processor, and queries the embedded system for system information. The embed-
ded system responds with a string containing the number of VFRs and the system type.
With this information, the Agent creates a new statusTable object whose constructor
sets up the list of region objects and initializes them. If the query to the embedded
system fails, the Agent will retry several times and exit with a failure message if there is
no response.
Upon initializing the statusTable object, the Agent acts as a server, listening on
a predefined TCP port for incoming connections from the OpenStack controller. The
Driver, which integrates with the controller, gets the connection parameters from the
resource flavor information passed to it after a boot command is issued by the user. The
Driver sets up a TCP connection to the Agent when a resource needs to be booted or
torn down, and the Agent will accept the connection and begin reading null-terminated
lines from the Driver. These lines contain string-based commands followed by other
string-based parameters and files. The Agent carries out programming the resource as
described in other sections and a successful return is propagated back to the user. Any
failure along the way is also propagated back to the user.
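The line-oriented command channel can be sketched as below; the PROGRAM command grammar shown is an assumption based on the description above:

```python
def read_line(recv):
    """Read one null-terminated command line from a socket-like recv
    callable, as the Agent does when parsing Driver commands."""
    buf = bytearray()
    while True:
        b = recv(1)
        if not b or b == b"\x00":   # connection closed or terminator seen
            return buf.decode()
        buf += b

def parse_command(line):
    """Split a command line, e.g. "PROGRAM <uuid> <mac>" (grammar assumed)."""
    parts = line.split()
    return parts[0], parts[1:]
```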
Chapter 4. SAVI Testbed Prototype 43
Figure 4.1 shows a sequence diagram of the entire boot procedure. A User would use
OpenStack Nova to issue a Boot VFR command and the OpenStack Controller would
map the resource Flavor to the correct Driver (the VFR Driver) and relay the Boot
VFR command. The VFR Driver extracts the Agent IP address and port and sets up
a TCP connection. Then, the VFR Driver sends a PROGRAM command, followed by
arguments containing the Resource UUID, the Resource MAC address, and the Image
(PR bitstreams). The Agent calls statusTable.program(), which configures the PR
bitstream from the image into an available VFR and programs the MAC address. Success
is returned to the Driver, which closes the TCP connection, and returns Success to the
OpenStack controller, which in turn returns Success and a reference to the User.
Figure 4.1: A sequence diagram of the entire boot procedure in the SAVI testbed prototype system.
Chapter 5
Platform Evaluation and Application Case Studies
The previous chapters have described the architecture for FPGA virtualization and the
prototype system in the SAVI testbed used to realize it. In this chapter, a brief analysis
of the prototype is presented. Comparisons are made in terms of time to boot resources
(comparing VMs and VFRs), and the trade-off between virtualized and non-virtualized
hardware is discussed. Furthermore, two application case studies are presented – hard-
ware applications run in the testbed using the virtualization infrastructure.
5.1 Platform Evaluation
It is desirable to see how efficiently OpenStack can handle these new resources, relative
to existing Virtual Machines. At the very least, one would expect that a VFR should be
able to be allocated in the same amount of time or less than a VM. An experiment was
set up to determine how quickly a VFR can be booted compared to a VM.
The experiment attempts to measure the amount of time from when the command
is issued to OpenStack to the point at which the resource becomes usable. Usability will
obviously differ for different resources, and the term is defined for VMs and VFRs as
Table 5.1: Boot Times for VMs and VFRs
Resource Boot Time (seconds)
Virtual Machine 71
VFR 2.6
follows:
• Virtual Machine: A VM is defined as usable at the point where an SSH connection
can successfully be established.
• Virtualized FPGA Resource: A VFR is defined as usable at the point where the
hardware can successfully process packets as its design intends.
The VFR hardware used for the experiment is a simple Layer 2 packet reflector.
An incoming packet is buffered, and has its source and destination MAC address fields
swapped. The packet is sent back out, effectively returning to its original sender. This
provides a low-latency method of determining whether a VFR is usable or not.
The experiment is run as follows: From a node within the SAVI testbed, the boot
command to allocate a VFR is entered. Time is measured while a small program sends
out a packet destined to the MAC address of the new VFR (extracted from OpenStack)
every tenth of a second. Between packet sends, the program listens for the reflected response and, if one is found, exits and stops the timer. This is done five times and
an average is taken. For the VM, the boot command is entered and time is measured
while a shell script continually attempts to make an SSH connection. The flavor of the
VM is a small size, with 2048 MB of RAM and one virtual CPU. The timer stops once
a successful connection is made. Again this is done five times and an average is taken.
Results of these averages are shown in Table 5.1.
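Both measurements follow the same polling pattern, sketched here with the probe (packet reflection for the VFR, an SSH attempt for the VM) abstracted as a callable:

```python
import time

def time_until_usable(probe, interval=0.1, timeout=120.0):
    """Poll `probe` (send a packet and listen for the reflection, or try
    an SSH connection) every `interval` seconds; return the elapsed time
    at the first success."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if probe():
            return time.monotonic() - start
        time.sleep(interval)
    raise TimeoutError("resource never became usable")
```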
As can be seen in the results, a VFR can be booted much faster than a VM. This is
unsurprising however, considering that it takes on the order of milliseconds to partially
reconfigure a PRR over JTAG. The time to reconfigure a PRR may increase with the physical size of the region, but it would still be on the order of milliseconds. A VM must first be initialized by the hypervisor, and then generally takes about as long to boot as a regular non-virtualized machine. It can be concluded that VFR systems can be scaled up and down relatively quickly.
Virtualization Trade-offs
Virtualization usually comes with trade-offs in terms of performance – Virtual Machines
typically have worse performance than a baremetal machine due to the underlying ab-
stractions, although much progress has been made in reducing this performance gap to
the point where it is very small, within a few percentage points [55, 56]. VFRs also come
with some trade-offs in terms of performance, as well as some that are not analogous to
processor virtualization. First, the performance trade-offs are analyzed.
Since VFRs are network connected accelerators, performance can be framed in terms
of throughput and latency. The static logic in the virtualization system will cause addi-
tional latency, although the system has been designed to balance this trade-off in favour
of the user’s hardware. The baseline for the following comparisons is the open source
NetFPGA10G hardware, unmodified. In the virtualization system, the input arbiter is
modified to include a CAM – this CAM is generated using Xilinx IP and adds one cycle
of latency, but this only happens once per packet. This is because only the first flit of the
packet is needed by the CAM to decide which VFR to route the packet to, after which
one flit is transferred per cycle as normal. From the arbiter the streams connect to each
physical VFR wrapper module. Recall that the wrapper modules contain logic under con-
trol of the cloud provider, and the Partially Reconfigurable Region (PRR) into which the
user hardware gets configured. The wrapper module contains two FIFO packet buffers –
one to buffer packets going into the PRR, and another to buffer outgoing packets. Each
of these buffers adds another cycle of latency. The output queues in the NetFPGA10G
hardware were also modified to force outgoing packet source MAC addresses, but this is
done completely in combinational logic during the cycle in which the hardware decides to
which output port the packet should be directed, therefore adding no additional latency.
Thus, in terms of performance, the user sees only a one-cycle pipeline stall per packet,
and two additional cycles of latency, which, at 160MHz, is a mere 12.5 nanoseconds.
Buffering does not affect throughput, but the one-cycle stall per packet does; the larger the packet, the smaller the effect, since there are more flits per packet. Peak theoretical throughput of the datapath is 256 bits per
cycle, or 256 / 6.25ns = 40.96 Gb/s, but with a minimum size packet of 64 bytes and a
one cycle pipeline stall, this falls to 27.3 Gb/s. For maximum size Ethernet packets of
1500 bytes the maximum throughput increases to 40.11 Gb/s.
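These figures follow from the flit arithmetic; the following sketch reproduces the calculation, counting fractional flits per packet plus one stall cycle:

```python
def vfr_throughput_gbps(packet_bytes, flit_bits=256, clock_mhz=160):
    """Effective datapath throughput given one pipeline-stall cycle per packet."""
    bits = packet_bytes * 8
    cycles = bits / flit_bits + 1          # flits per packet + 1 stall cycle
    bits_per_cycle = bits / cycles
    return bits_per_cycle * clock_mhz * 1e6 / 1e9
```

For 64-byte packets this yields roughly 27.3 Gb/s, and for 1500-byte packets roughly 40.1 Gb/s, in line with the figures above.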
There are however, additional penalties for virtualization in terms of available area
for use in the FPGA fabric. For example, in the prototype system, each VFR has 11376
LUTs, which amounts to 7.6% of the entire device, or 9.4% when considering fabric
already taken by the static logic. This was the maximum VFR area that could be
achieved in a four-VFR system while manually placing and adjusting the PRRs. In a
production system, a cloud provider would likely have several different resource flavours
of VFR, with different partition sizes that could help to alleviate this problem.
5.2 Case Study: Load Balancer
In this section the first application case study for the virtualization system is presented
– an application-layer load balancer. Load balancers are an essential part of large dat-
acenter systems, allowing requests to be distributed among active servers, increasing
performance and system stability. In a virtualized cloud environment however, users
generally would not have access to hardware load balancers, especially not if their in-
coming data or requests were in a proprietary or non-HTTP protocol. Their only option
is to use a VM-based software load balancer. In addition, this type of end-user application will become increasingly important as datacenter networks become virtualized
through software defined networking (SDN), and users have full control over private in-
ternal Ethernet LANs. This case study shows how VFRs can be used to implement
such an arbitrary-protocol load balancer that can vastly outperform a software version
in terms of throughput and latency predictability.
5.2.1 Load Balancer Designs
The load balancer operates on a hypothetical protocol that runs on top of UDP. The
protocol has an identifier field in the first 16 bits of the UDP payload, which the load
balancers recognize. There are two identifiers – one designates request or data packets, coming from a client, to be distributed to servers; the other designates update packets, sent from servers to the load balancers so that servers can be added to or removed from the distribution system.
Software
The software load balancer is written in Python using the low-level Berkeley Sockets
library [57]. A list is used to track active servers to distribute request packets to, and
incoming request packets are distributed in round-robin fashion. If an update packet is
received from a server, that server’s IP address is added or removed from the distributing
list accordingly.
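The dispatch logic can be sketched as follows; the identifier values and the toggle semantics of update packets are assumptions about the hypothetical protocol:

```python
REQUEST_ID, UPDATE_ID = 0x0001, 0x0002   # hypothetical identifier values

class Balancer:
    """Round-robin dispatch keyed on a 16-bit identifier in the first
    two bytes of the UDP payload (a sketch, not the thesis code)."""

    def __init__(self):
        self.servers = []    # addresses of active servers
        self.next = 0        # round-robin cursor

    def handle(self, payload, src):
        """Return the server address to forward a request to, or None."""
        ident = int.from_bytes(payload[:2], "big")
        if ident == UPDATE_ID:               # add or remove the sender
            if src in self.servers:
                self.servers.remove(src)
            else:
                self.servers.append(src)
            return None
        if ident == REQUEST_ID and self.servers:
            dst = self.servers[self.next % len(self.servers)]
            self.next += 1
            return dst
        return None
```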
Hardware
The hardware load balancer is implemented in a similar manner. Since the VFR stream
interfaces are 256 bits wide, the first two flits of the packet are required to detect the
identifier field in the UDP payload. If an update packet is detected, the hardware stores
the server’s IP address and MAC address in a memory. For request packets, the hardware
will replace the packet’s destination IP and destination MAC fields with values read from
Table 5.2: Resource Usage for VFR Load Balancer
Resource Usage (used / VFR Total)
Flip-flop 7523 / 11376 (67%)
LUT 3594 / 11376 (31.6%)
36K BRAM 11 / 15 (74%)
the memory location corresponding to the current destination server in the round robin
schedule. The packet’s UDP checksum is then recalculated, and the rest of the packet is
buffered and sent along.
Because the interface to the VFR is a Xilinx AXI Stream, it was possible to com-
pile this hardware using Xilinx’s high level synthesis tool, Vivado HLS [58], and simply
instantiate the generated HDL within the VFR template file. The design is described
in approximately 150 lines of C code, and compiled for the virtualization system as in
Figure 3.7. Using this method, the design iteration time is reduced from days or weeks
to as little as a few hours. The resource utilization for the hardware load balancer is
shown in Table 5.2.
Many sources of frustration are also removed from the design cycle because the static
logic is already placed, routed, tested and guaranteed working by the cloud provider.
The user need not worry about complex memory interfaces, pin constraints and I/O, or
special timing considerations for high speed transceivers. Because the hardware is just
another generic cloud resource, it is possible to quickly scale up or down the number of
active load balancers according to system-level loads. This can also be done much faster
than the software version, since VFRs can be allocated much more quickly than VMs.
5.2.2 Performance Analysis
Test System and Prototype
The prototype system in the SAVI Testbed is used for all tests. Recall that the SAVI
Testbed provides an “extended cloud” consisting of Core nodes (conventional cloud com-
pute) and Edge nodes that contain additional resources such as the NetFPGA platforms
used for the prototype. All of the experiments are limited in scope to a single edge node
in the SAVI testbed.
Comparison of VM and VFR Load Balancers
An experiment is set up to compare latency and throughput of the different load bal-
ancers. The experiment uses a single load balancer, three receiving servers, and two
clients. The servers and clients are all VMs spread amongst two physical machines. Care
is taken to make sure that the two clients are on different physical machines so that they
do not share a single physical interface and interfere with one another. One client is used
to take measurements of latency, while the other is used to inject additional traffic at
prescribed rates onto the network, bound for the load balancer. A third physical machine
hosts the NetFPGA 10G platform, which in turn provides the VFRs to OpenStack as
described in the prototype. The machines are connected via the SAVI Testbed OpenFlow
network. The network physical layer is implemented via a gigabit switch, with four 10G
ports connecting the NetFPGA and VFRs. The test setup is shown in Figure 5.1.
The latency is measured as the round trip time from Client 1, through the VM
or VFR Load Balancer, to a Server, and back to Client 1, averaged over 10000 mid-
sized (760 byte) packets. The load balancer is on the forward path, but not the return.
The latency is continually measured and the number of dropped packets counted as
Client 2 is used to inject additional traffic into the load balancer. In this way it is
possible to see what level of throughput the load balancer can handle. Figures 5.2 and 5.3
Figure 5.1: Experiment setup for load balancer tests. (Clients 1 (measure) and 2 (inject), the servers, and the VM load balancer run as VMs under hypervisors on Physical Servers 1 and 2, connected to a switch over 1G links; Physical Server 3 hosts the NetFPGA providing the VFR load balancer over a 10G link.)
show the results for the VFR and VM load balancers as a function of the Injection
Rate from Client 2 – each latency measurement is an average of 10000 packets, so the
standard deviation is included as well, which gives a sense of system load and performance
predictability. Socket timeout at one second is counted as a dropped packet, and is
not counted in the latency average. Transmission continues until 10000 packets are
successfully received back. Dropped packets for the VM load balancer at each data point
are shown in Figure 5.4. The VFR load balancer had no dropped packets.
Even at 25 MB/s, the VM load balancer begins to drop a number of packets. At
45 MB/s, the VM load balancer drops a significant number of packets, and latency in-
creases and varies widely. The VFR balancer however has predictable and fairly constant
latency. Even at 100 MB/s the VFR balancer dropped no packets and maintained pre-
dictable performance. Latency in general was dominated by software in both send and
receive. Unfortunately, the test system did not allow testing of the full capabilities of the
VFR, since there were not enough physical servers available in the edge node to saturate
the 10GE links to the VFR load balancer. Clearly though, the VFR provides a significant
benefit to the cloud user – several VMs can be replaced with a single VFR, simplifying
and streamlining the user’s systems, and potentially lowering operating costs. The cloud
provider also benefits by having fewer VMs per user, which may reduce overall power
Figure 5.2: VFR load balancer latency at different throughput levels (latency in microseconds, with standard deviation, vs. injection rate in MB/s).

Figure 5.3: VM load balancer latency at different throughput levels (latency in microseconds, with standard deviation, vs. injection rate in MB/s).

Figure 5.4: Number of dropped packets for the VM load balancer (vs. injection rate in MB/s).
consumption and costs of the datacenter.
5.3 Case Study: Extending OpenFlow Capabilities
Software-Defined Networking (SDN) is a growing paradigm that sees network manage-
ment move from a decentralized approach to a centralized, software-managed system.
Network management rules are centrally defined for an entire network, and these rules
are then translated into matching and forwarding actions for the data plane. OpenFlow
is a popular realization of SDN that has also significantly affected the development of
SDN in general [59, 60].
OpenFlow operates by having a central controller specify rules that packets coming
into the network are matched against. These matches and the specific packet fields they
operate on form what are known as flows. A flow consists of a match field, a priority field,
statistics counters, timeouts, and most importantly, actions. When an incoming packet
is matched to a flow (with multiple matches the highest priority is taken), the action(s)
specified in the flow are taken on the packet. This may simply be to forward the packet
out of a particular switch port, or it may be to drop the packet, modify certain fields in
the packet header, swap certain fields or insert VLAN (Virtual Local Area Network) [61]
tags or MPLS (Multi-Protocol Label Switching) [62] shims. In current OpenFlow-enabled
switches, not all of these actions are supported in the hardware datapath – certain
actions, usually more complex ones, must be processed in switch software, which is a
much slower path. Many desirable actions are not possible at all on current OpenFlow
switches, and some are not even specified in the OpenFlow specification [63].
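As an illustrative aside, the flow abstraction described above can be modelled with a short Python sketch. This is purely for illustration: real OpenFlow matches cover many header fields, and the Flow structure here omits timeouts and most counters.

```python
from dataclasses import dataclass

@dataclass
class Flow:
    """Simplified flow entry: match criteria, priority, actions, and a counter."""
    match: dict       # header field -> required value, e.g. {"udp_dst": 4789}
    priority: int
    actions: list     # e.g. ["output:3"], ["drop"]
    packet_count: int = 0

def lookup(flow_table, headers):
    """Return the actions of the highest-priority flow matching `headers`."""
    matching = [f for f in flow_table
                if all(headers.get(k) == v for k, v in f.match.items())]
    if not matching:
        return None   # table miss: typically sent to the controller
    best = max(matching, key=lambda f: f.priority)
    best.packet_count += 1
    return best.actions

table = [
    Flow(match={"udp_dst": 4789}, priority=10, actions=["output:vfr"]),
    Flow(match={},                priority=0,  actions=["output:normal"]),
]
assert lookup(table, {"udp_dst": 4789}) == ["output:vfr"]
assert lookup(table, {"udp_dst": 53})   == ["output:normal"]
```

The empty match acts as a lowest-priority catch-all, mirroring the table-miss or default-forwarding behavior of a real switch.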
In this case study, virtualized hardware is used to extend the capabilities of OpenFlow.
By forwarding matched packets to a VFR, custom hardware can be used to implement
arbitrary actions or matches on packets at 10 Gb/s line rates. In particular this case study
will look at actions pertaining to the Virtual eXtensible Local Area Network protocol,
or VXLAN, which allows bridging of LANs over IP.
5.3.1 VXLAN
Virtual eXtensible Local Area Network (VXLAN) is a method of bridging two OSI
Layer 2 networks by encapsulating the Layer 2 packets inside UDP packets. This “virtu-
ally” connects two Layer 2 networks. The packet structure for VXLAN is straightforward.
It consists of a standard UDP packet, with a port number of 4789, and a payload con-
sisting of a VXLAN header and the encapsulated Layer 2 packet. Figure 5.5 shows the
format of a VXLAN packet. Currently, OpenFlow does not specify any actions that can
act on the encapsulated packet, or in fact anything that lies within the payload of the
encapsulating UDP packet. It would be very useful from a network management perspec-
tive to be able to manage virtually connected LANs directly in OpenFlow without having
to invest in additional infrastructure. This amounts to the ability to match and/or edit
fields within the encapsulated packet.
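The 8-byte VXLAN header layout described above can be sketched with Python's struct module. This is an illustrative encoder/decoder only; the first byte carries the flag bit indicating a valid VNID, and field names are chosen for this sketch.

```python
import struct

VXLAN_UDP_PORT = 4789

def pack_vxlan_header(vnid):
    """Build the 8-byte VXLAN header: flags(8) | reserved(24) | VNID(24) | reserved(8)."""
    assert 0 <= vnid < (1 << 24)        # the VNID is a 24-bit field
    flags = 0x08                        # "I" bit set: VNID field is valid
    return struct.pack("!II", flags << 24, vnid << 8)

def parse_vxlan_header(data):
    """Return the 24-bit VNID from the first 8 bytes of a VXLAN header."""
    _word1, word2 = struct.unpack("!II", data[:8])
    return word2 >> 8

hdr = pack_vxlan_header(0x123456)
assert len(hdr) == 8
assert parse_vxlan_header(hdr) == 0x123456
```

In an actual VXLAN packet this header sits in the payload of a UDP datagram with destination port 4789, followed immediately by the encapsulated Layer 2 frame.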
[Diagram omitted: outer L2 Ethernet, L3 IP and L4 UDP headers, followed by the 8-byte VXLAN header (8 reserved bits, 24 reserved bits, 24-bit VNID, 8 reserved bits) and the encapsulated L2 packet.]
Figure 5.5: Packet diagram for the VXLAN protocol.
5.3.2 Virtualized Hardware for New OpenFlow Capabilities
Hardware can be written and compiled for a VFR via the methods discussed in Chapters
3 and 4. For this application, the chosen enhancement is the ability to perform
matches on the fields of the encapsulated packet and either forward or drop the
packet. This amounts to a custom, OpenFlow-controlled in-network firewall for VXLAN
tunneled networks. The end result will be that a user will be able to control forwarding
in a VXLAN-bridged Layer 2 network using nothing but OpenFlow. Theoretically of
course, the VFR architecture allows access to the entire packet, and therefore any field
could potentially be parsed, modified or inserted.
The VFR used for this case study is designed to be capable of matching transport layer
(i.e. TCP, UDP) port numbers inside VXLAN encapsulated packets, which will allow it
to block certain communication protocols. The hardware is implemented in Verilog, and
consists of a packet datapath, a programmable drop port register, and a simple control
circuit. Control packets are defined as having the Ethernet Type field equal to 0x88B3.
If the hardware detects a control packet in the datapath, it will extract the first 16 bits
of the payload, and place it in the drop port register. The control packet is then dropped.
Any other packet will be parsed to detect destination UDP port 4789, signifying that
the packet is a VXLAN packet. If this port is detected, the hardware then tests whether
the source or destination transport-layer port of the encapsulated packet matches the one currently
stored in the drop port register. If it matches, the packet is dropped. If not, the packet
is simply forwarded. Using the control packets, the user can change what port numbers
the hardware matches. The hardware module also contains a FIFO buffer to help the
design meet timing. The design introduces no pipeline stalls and can run at full pipeline
throughput. To avoid broadcast loops, any broadcast Ethernet packets entering the VFR
are dropped.
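The behavior just described can be summarized in a small Python model. This is a behavioral sketch of the Verilog design, not the hardware itself, and the dict field names (eth_type, udp_dst, inner, and so on) are invented for illustration.

```python
CONTROL_ETHERTYPE = 0x88B3   # EtherType marking a control packet
VXLAN_UDP_PORT = 4789
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"

class VfrPortFirewall:
    """Behavioral model of the VFR VXLAN port firewall."""
    def __init__(self):
        self.drop_port = None            # the programmable drop-port register

    def process(self, pkt):
        """`pkt` is a dict of pre-parsed headers; returns 'forward' or 'drop'."""
        if pkt.get("eth_dst") == BROADCAST_MAC:
            return "drop"                # broadcast frames dropped to avoid loops
        if pkt.get("eth_type") == CONTROL_ETHERTYPE:
            # First 16 bits of the payload reprogram the drop-port register;
            # the control packet itself is then dropped.
            self.drop_port = pkt["payload_first16"]
            return "drop"
        if pkt.get("udp_dst") != VXLAN_UDP_PORT:
            return "drop"                # only VXLAN traffic is expected here
        inner = pkt.get("inner", {})     # the encapsulated packet's headers
        if self.drop_port is not None and \
           self.drop_port in (inner.get("tp_src"), inner.get("tp_dst")):
            return "drop"                # blocked transport-layer port
        return "forward"

fw = VfrPortFirewall()
fw.process({"eth_type": CONTROL_ETHERTYPE, "payload_first16": 5001})
assert fw.drop_port == 5001
assert fw.process({"udp_dst": 4789, "inner": {"tp_src": 5001}}) == "drop"
assert fw.process({"udp_dst": 4789, "inner": {"tp_src": 80, "tp_dst": 443}}) == "forward"
```

Unlike this sequential model, the hardware performs these checks in a pipeline, so the decision adds no per-packet stall.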
All that remains is the question of how the VXLAN packets get to the VFR in the
first place. This is achieved by programming several regular flows into the network such
that all switches that receive packets with a destination UDP port of 4789 will forward
the packet to the VFR. To avoid a forwarding loop, additional flows are added such that
if the VXLAN packet ingress port corresponds to the VFR hardware, it is sent on to its
original destination. In the SAVI testbed, physical topologies can be queried by the user,
and it is possible to determine exactly what physical or virtual switches lie on the path
between VXLAN-tunnelled machines and the VFR. This allows the aforementioned flows
to be set manually. Also recall that the SAVI testbed has a fully virtualized network that
allows users to allocate a private slice of the network with their own OpenFlow controller.
In such an environment, the flows necessary for routing VXLAN tunnels through the VFR
can be set up programmatically via the controller. Figure 5.6 depicts the structure of the
VXLAN port firewall. Incoming packets are buffered, and if a packet is a VXLAN packet
(i.e. the destination UDP port is equal to 4789) and the encapsulated packet contains a
transport layer port (that is, TCP or UDP) equal to the Drop Port register, the packet is
dropped. If the packet is not a VXLAN packet, it is also dropped. Otherwise, it is simply
forwarded out of the VFR, and regular flows move the packet to its original destination,
avoiding a loop.
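The flows described above can be sketched as data. The match fields follow the text; the switch port numbers are hypothetical, since the actual ports depend on the testbed topology.

```python
VFR_PORT = 3          # hypothetical switch port leading to the VFR
VM_PORTS = [1, 2]     # hypothetical ingress ports from the two VMs

flows = []
# Redirect VXLAN packets (UDP destination port 4789) arriving from the VMs
# to the VFR for filtering.
for p in VM_PORTS:
    flows.append({"match": {"udp_dst": 4789, "in_port": p},
                  "priority": 20, "actions": ["output:%d" % VFR_PORT]})
# Packets returning *from* the VFR resume normal forwarding to their
# original destination, which breaks the forwarding loop.
flows.append({"match": {"udp_dst": 4789, "in_port": VFR_PORT},
              "priority": 20, "actions": ["normal"]})
```

Matching on the ingress port is what distinguishes "on the way to the VFR" from "returning from the VFR", so the same UDP port match can safely appear in every rule.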
[Flowchart omitted: packets from the VM1–VM2 VXLAN tunnel enter the VFR; non-VXLAN packets are dropped, VXLAN packets whose encapsulated transport port matches the drop register are dropped, and all others are forwarded.]
Figure 5.6: VXLAN port firewall.
5.3.3 Performance Analysis
The hardware for the VXLAN port firewall was written in Verilog and compiled using
the procedure outlined in Section 3.5. This uses Xilinx ISE 13.4 to synthesize a netlist,
and Xilinx PlanAhead to perform the PR compilation. Resource usage for the hardware
is shown in Table 5.3. At only 12.7% of available LUTs (and 15.6% of flip-flops) in a
VFR, the hardware is lean and leaves room for more features, such as matching multiple ports using a CAM, or
matching and blocking based on other encapsulated packet fields.
Table 5.3: Resource Usage for VFR VXLAN Port Firewall
Resource Usage (used / VFR Total)
Flip-flop 1778 / 11376 (15.6%)
LUT 1446 / 11376 (12.7%)
36K BRAM 0 / 15 (0%)
Experimental Setup
The port firewall VFR is booted using OpenStack along with two VMs. OpenVSwitch
(OVS), a virtual OpenFlow switch, is installed on both VMs and used to set up a VXLAN
tunnel between the two VMs. Note that the two VMs are booted on separate
physical machines and separated by a physical OpenFlow switch. The two VMs are
connected through the physical OpenFlow switch to the VFR port firewall. Flows are
installed in the virtual switches and the physical switch to redirect any VXLAN packet to
the VFR hardware, so it can perform its intended function. It would be preferable
to use a custom OpenFlow controller to accomplish this rather than manually installing
the flows, but at the time of writing this thesis, private OpenFlow networks were not
fully compatible with heterogeneous resources in the SAVI testbed. Therefore, the flows
are manually installed.
Results
After the flows are installed, VXLAN encapsulated traffic between the two VMs operating
on a given transport layer port is successfully blocked. The port can be changed at any
time by sending a control packet with a different port number to the VFR.
The iperf tool is used to measure throughput from one VM to another, and the
ping tool is used to measure latency. First, a baseline measurement is taken with
no additional flows installed, to see what the throughput and latency are between
the VMs without the VFR in the network path (Figure 5.7a). iperf is run five times
and an average is taken, while ping is run until 20 pings are completed. These tests are
run both with and without VXLAN tunnelling. Results are shown under No Tunnel and
VXLAN Tunnel in Table 5.4.
Without tunnelling the throughput is near line rate (1 Gb/s) as expected. When
tunnelling using VXLAN, throughput takes a large performance hit. Normally this is not
the case, but the Ubuntu operating systems used in the machines to boot the VMs have
a dynamic CPU frequency scaling system to save power – the lower clock frequencies
significantly affect the performance of the software switch (OpenVSwitch), since the
VXLAN encapsulation and decapsulation is done inside OVS. The core frequency was
[Diagrams omitted: (a) VM1 and VM2 connected via OVS through the switch with a direct VXLAN tunnel; (b) VXLAN traffic rerouted through the bare-metal server (BM); (c) VXLAN traffic rerouted through the VFR.]
Figure 5.7: Experimental setups for VFR-based VXLAN firewall.
limited by Ubuntu to 1.2 GHz, which was still enough to saturate the 1 Gb/s link when
not running a tunnel, but the processor was not able to maintain this bandwidth when
running the VXLAN tunnel. This effect was verified by disabling the CPU frequency
scaling and observing a rise in throughput over the VXLAN tunnel; however, the CPU
scaling was left on for the experiments to maintain default settings across all VMs in the
system.
Table 5.4: Throughput and Latency for VXLAN Port Firewall
Throughput Latency
No Tunnel 941 Mb/s 0.465 ms
VXLAN Tunnel 517.4 Mb/s 0.532 ms
VXLAN through VFR 513.2 Mb/s 0.600 ms
VXLAN through BM 480.4 Mb/s 0.801 ms
Now that a baseline for comparison is established, the flows are installed to reroute
VXLAN traffic to the running VFR and iperf is run again to determine the overhead of
rerouting the VXLAN traffic through the VFR port firewall, shown in Figure 5.7c. An
average of five runs gives a throughput of 513.2 Mb/s, slightly less than the 517.4 Mb/s
achieved without the VFR in the network path. This is not a large difference and it can
be concluded that this technique introduces little to no overhead in terms of throughput.
The ping test is also run again, and results show that rerouting to the VFR introduces a
small increase in latency (∼ 12%), but this is expected when adding an additional hop.
These results are also summarized in Table 5.4 under VXLAN through VFR.
Lastly, it is also useful to see how the VFR compares in performance to a software
version executing the same task. A software VXLAN port firewall is programmed in C,
based on the packet capture library libpcap. Normally it would be best to run the software
on a VM in the cloud as VMs are the usual processing resource available. However,
many VM network interfaces, including those in the current SAVI testbed, do not have
the ability to put an interface into promiscuous mode. In promiscuous mode, a network
interface passes along all detected packets, even those not addressed to the receiving
interface. To implement a pass-through firewall, a promiscuous interface is required.
Therefore, for the sake of the experiment, a non-standard bare-metal server is used to
run the software on which the network interface can be configured into promiscuous
mode.
The packet flows installed on the switch are slightly modified to enable the network
path shown in Figure 5.7b, where VXLAN packets are sent to the bare-metal server (BM)
for software processing rather than the VFR hardware. The iperf and ping tests are
run again, and the results are also shown in Table 5.4 under VXLAN through BM for
comparison.
The software introduces a rather large increase in latency compared with both the
VFR and the direct path from one VM to another. This is expected since each packet
must travel up through several software layers before being processed. That being said,
network stacks in modern servers are very efficient, and this is evident in the throughput
of 480.4 Mb/s, only a 6.6% decrease from the VFR. The main reason for this good
performance, however, is that the application is relatively simple. For a single port check,
the software only needs to advance a pointer and make a comparison (fewer than a few
tens of instructions) before either sending or dropping the packet. More complex
operations, such as an IP address Longest Prefix Match or multiple exact matches, would
likely widen the performance gap since they can be fully pipelined in hardware. Such
enhancements are left to future work.
Summary
This case study has shown that VFRs can be used in a straightforward manner to enhance
the capabilities of an OpenFlow network, allowing custom, in-network matching and
actions to be added to the system with little to no degradation in throughput. This
comes at the cost of slightly increased latency. It has also been shown that even for
a simple application, a VFR will outperform a software implementation. VFRs thus
provide a promising path toward fully custom complex network configurations, with all
data plane operations done entirely inside hardware on high-speed paths.
Chapter 6
Conclusion
This thesis has presented a hardware and software architecture that represents a first
attempt at integrating FPGA hardware acceleration as a first-class citizen into cloud
computing infrastructures. This work shows that FPGA hardware acceleration can be
virtualized and feasibly integrated with existing infrastructure-as-a-service cloud man-
agement software. It also shows that FPGA Partial Reconfiguration (PR) can be used to
virtualize an FPGA device with very little performance overhead in terms of throughput
and latency when considering network-connected accelerators. The architecture pre-
sented shows that this style of virtualization can be very accessible to the end user as
well, trading off potential application scope for a template-based automated compila-
tion system that effectively removes many common design complexities. End users also
benefit from the fact that the very same management commands for traditional cloud
resources are used to manage the new heterogeneous resources.
6.1 Future Work
As stated, this work is merely a first attempt at the integration of FPGAs into cloud
computing. There are many open avenues of exploration, both for the general concept
and for the particular architecture presented in this thesis. This section on Future Work
will discuss some of these possible avenues, focusing mainly on the work presented in this
thesis.
6.1.1 Architectural Enhancements
While the Agent is fairly generic and not performance critical, there are areas in which
the FPGA static hardware virtualization layer could be improved. VFRs are limited
in size, and many applications may need more than one to fully implement a desired
function. The performance of such a system may be heavily dependent on where each
constituent VFR is placed physically by the scheduling algorithm. A commercial system
could consist of hundreds or thousands of FPGA systems spread across one or possibly
several datacenters. VFRs that are part of the same system could take a severe latency
penalty if placed very far away from each other. This leads to two related future work
items:
1. VFR-aware resource scheduling: modify the scheduling algorithm to recognize that
VFRs that are part of the same system (or belonging to the same user) should be
placed as close as possible to each other physically. Ideally, they should be placed
in the same physical device. Even then, the current implementation is suboptimal:
transferring packets between VFRs still requires the packet to be switched externally
to the FPGA.
2. Static Logic Switching Fabric: Related to the limitation noted in the first item, a
switching fabric in the static logic would allow packets addressed
from one VFR to another in the same device to be redirected without exiting the
device. VFRs could be chained together with extremely low-latency paths, leading
to higher performance for larger, multi-VFR systems.
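A VFR-aware scheduler of the kind proposed in the first item could be sketched as a simple greedy placement policy. This is a future-work sketch under stated assumptions, not the current implementation; the data structure and the co-location heuristic are invented for illustration.

```python
def place_vfr(devices, user):
    """Greedy VFR-aware placement: prefer an FPGA device that already hosts
    VFRs belonging to the same user, falling back to the least-loaded device.
    `devices` maps device id -> {"capacity": int, "vfrs": [owner ids]}."""
    candidates = [d for d, info in devices.items()
                  if len(info["vfrs"]) < info["capacity"]]
    if not candidates:
        raise RuntimeError("no free VFR slots")
    # Co-locate with the user's existing VFRs when possible.
    same_user = [d for d in candidates if user in devices[d]["vfrs"]]
    pool = same_user or candidates
    best = min(pool, key=lambda d: len(devices[d]["vfrs"]))
    devices[best]["vfrs"].append(user)
    return best

devices = {
    "fpga0": {"capacity": 4, "vfrs": ["alice"]},
    "fpga1": {"capacity": 4, "vfrs": []},
}
assert place_vfr(devices, "alice") == "fpga0"  # co-located with existing VFR
assert place_vfr(devices, "bob") == "fpga1"    # otherwise least loaded
```

A production scheduler would also weigh rack-level and datacenter-level distance rather than only same-device co-location.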
Currently, the virtualization architecture lacks a dedicated promiscuous interface
function. A promiscuous interface is one that allows any and all packets detected on
the wire into the device, regardless of whether they are addressed to the device. For ex-
ample, all ports on a regular Ethernet switch would be promiscuous. This means it is
difficult for a user to employ a VFR for generic switching and processing, because
the static logic only allows addressed packets into a VFR. A useful enhancement would
be a mechanism that allows VFRs to act in a promiscuous mode, while still enabling the
provider to enforce security.
Another consideration, especially from the cloud provider's point of view, is what
assumptions can be made about user-designed hardware. A commercial provider
would likely make very few assumptions – user-designed hardware may be bug-ridden,
faulty, or have any number of other problems. Error detection and fault tolerance in the
face of shoddy or unpredictable user hardware may very well become a top priority. For
example, if one VFR begins to make transfers incorrectly on the stream interface, the
entire static logic may be affected by this, at the very worst locking up and forcing an
entire chip reconfiguration. A poorly implemented VFR may also transmit packets below
the network minimum packet size (e.g. below 64 bytes for Ethernet networks), which
could result in network problems for the provider.
Because of these problems, much research in fault-tolerant embedded systems could
likely be applied to the static logic to improve its commercial viability.
Additionally, the memory system connecting the VFRs to off-chip DRAM is currently
simple and inefficient in its partitioning scheme. Future work may address this problem
by replacing the partitioning scheme with a full multi-port memory controller or similar
technology to more efficiently share the off-chip DRAM.
6.1.2 Failures and Migration
As briefly discussed in the previous subsection, VFRs may occasionally fail due to a
variety of circumstances such as faulty user hardware, a provider static logic bug or even
network faults. In the case of a Virtual Machine, the hypervisor always retains the ability
to kill or reboot a faulty virtual machine; however, this feature does not translate well to
VFRs, since they are so tightly integrated with the static logic that performs hypervisor-
like functions. Future work would devise methods of maintaining a stricter separation,
again possibly drawing on fault-tolerant embedded systems research, to allow the static
logic and/or Agent to deal with faults in ways that do not involve bringing down the
entire system.
Another key capability of Virtual Machines that future work would seek to replicate
for VFRs is migration. Migration involves saving the state of a running VM, transferring
it to another hypervisor, and then continuing execution in the new location, all done
transparently to the user. This problem is more difficult for FPGAs because they are
not designed with this feature in mind, and the vendor tools do not support moving a
PRM between different PRRs. This would require accessing a running circuit’s state and
FPGA configuration bits, saving them, and restoring them on another FPGA device.
6.1.3 Further Heterogeneity
In the future, a provider would likely have multiple flavors of VFRs, each with different
sizes. Larger VFRs would provide more LUTs for more complex circuits, but would
cost more for end users. This is analogous to how one can allocate VMs with different
amounts of RAM and virtual CPUs, and different operating systems.
New FPGA devices that combine ARM CPUs with traditional reconfigurable fabric
have also recently entered the market [64, 65]. Future work could investigate how these
devices could be virtualized as resources that pair an ARM-based CPU or VM with
a slice of closely-coupled reconfigurable fabric.
6.1.4 Applications
The application case studies in this thesis have offered only a taste of what may be
possible with this type of system. Multi-VFR applications have yet to be investigated
and analyzed.
The concept of extending available OpenFlow capabilities may also be an area of
potential future work. There are other tunnelling protocols similar to VXLAN that
could be added to OpenFlow (e.g. GRE tunnelling), and there are many other higher
layer protocols that OpenFlow does not consider. Although these higher layer protocols
are complicated because of the large amount of state information required, there is still
potential for implementation via VFRs.
The current VXLAN port firewall application is also an area of future work. The
VFR resource usage was low, and this leaves room for additional functionality, such
as matching and blocking based on multiple ports, encapsulated Layer 2 and Layer 3
headers, or another part of the packet payload. Furthermore, the experiments revealed
a significant loss in performance due to OVS having to encapsulate and decapsulate
packets in software. This function could also be moved into hardware via a VFR, and
the encapsulation and decapsulation done at line rate, incurring no losses in performance.
VFRs are also well suited to the many streaming protocols that exist. Real-time
Transport Protocol packet streams could be redirected to VFRs for hardware accelerated
processing or custom switching and routing. RTP is used for many applications like
telephony, video streaming and Voice over IP (VoIP). Other applications may include
Event Stream processing or financial data stream processing.
6.1.5 FPGA CAD as a Service
Recall that the user must compile their hardware for VFRs in a way that uses pre-placed
and routed static logic. In the implementation discussed in this thesis, this is done using
an automated script-based compile system. Currently a user must have access to the
entire project, but this is not feasible in a production environment. Ideally, the user
would submit their Verilog module to a cloud-based compile service that would analyze
and compile their hardware to run on the cloud provider's systems. CAD as a service
is itself an interesting topic for future work, to see how it could be useful for traditional
hardware design as well as for cloud-based hardware as described in this thesis.
6.1.6 Complementary Studies
Overhead Area Trade-off Study
In the future a study could be performed to determine how the overhead of the static
logic (in terms of FPGA area) scales according to how many VFRs are implemented on
a physical chip. Larger device sizes could also be examined. Ideally, an optimal point
would be discovered where the lowest static logic overhead gives the highest percentage
of chip area to VFRs.
Deployment Costs Analysis
In this future study, a full, commercial-scale deployment of a VFR system would be
examined. The primary focus would be on costs and considerations from a cloud provider
perspective. Approximate values for how many FPGAs and VFRs of different flavors
could fit in one standard rack would be determined, as well as how much they would cost
– this will depend on the system architecture, whether or not there are failsafe or
redundant devices, and the list prices of FPGA chips and boards, which are subject
to volume and availability. Approximate power consumption and other operating costs
would be compared to Virtual Machine deployments. The result of the study would show
how virtualized FPGAs can provide benefits to providers by giving more computation
per unit cost than Virtual Machines alone.
With regard to power consumption and performance per Watt, recent work involving
large-scale datacenter systems using FPGAs has shown that for a small (10%) increase in
power, computation performance can be doubled [66]. These are promising results that
can hopefully be replicated or improved with virtualized reconfigurable hardware.
Bibliography
[1] Gartner. Gartner Says Worldwide Public Cloud Services Market to Total $131
Billion. Gartner, Inc., 2013.
[2] Michael Armbrust, Armando Fox, Rean Griffith, A. Joseph, R. Katz, A. Konwinski,
G. Lee, D. Patterson, A. Rabkin, I. Stoica, and Matei Zaharia. Above the Clouds: A
Berkeley View of Cloud Computing. Dept. Electrical Eng. and Computer Sciences,
University of California, Berkeley, Rep. UCB/EECS, 28, 2009.
[3] Daniele Catteddu. Cloud Computing: Benefits, Risks and Recommendations for
Information Security. Springer, 2010.
[4] Maire McLoone and John V. McCanny. High Performance Single-Chip FPGA Rijndael
Algorithm Implementations. In Cryptographic Hardware and Embedded Systems
(CHES 2001), pages 65–76. Springer, 2001.
[5] S. Rigler, W. Bishop, and A. Kennings. FPGA-Based Lossless Data Compression Using
Huffman and LZ77 Algorithms. In Canadian Conference on Electrical and Computer
Engineering, pages 1235–1238, April 2007.
[6] F. Braun, J. Lockwood, and M. Waldvogel. Protocol Wrappers for Layered Network
Packet Processing in Reconfigurable Hardware. IEEE Micro, 22(1):66–74, Jan 2002.
[7] Sai Rahul Chalamalasetti, Kevin Lim, Mitch Wright, Alvin AuYoung, Parthasarathy
Ranganathan, and Martin Margala. An FPGA Memcached Appliance. In Proceed-
68
Bibliography 69
ings of the ACM/SIGDA International Symposium on Field Programmable Gate
Arrays, pages 245–254. ACM, 2013.
[8] Joon-Myung Kang, H. Bannazadeh, H. Rahimi, T. Lin, M. Faraji, and A. Leon-
Garcia. Software-Defined Infrastructure and the Future Central Office. In IEEE In-
ternational Conference on Communications Workshops (ICC), pages 225–229, 2013.
[9] Joon-Myung Kang, Hadi Bannazadeh, and Alberto Leon-Garcia. SAVI Testbed:
Control and Management of Converged Virtual ICT Resources. In IFIP/IEEE In-
ternational Symposium on Integrated Network Management, pages 664–667. IEEE,
2013.
[10] Ian Kuon, Russell Tessier, and Jonathan Rose. FPGA Architecture: Survey and
Challenges. Foundations and Trends in Electronic Design Automation, 2(2):135–
253, February 2008.
[11] IEEE Standard for Verilog Hardware Description Language. IEEE Std 1364-2005,
pages 1–560, 2006.
[12] IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008, pages
c1–626, Jan 2009.
[13] Rishiyur Nikhil. Bluespec System Verilog: Efficient, Correct RTL from High Level
Specifications. In Proceedings of the Second ACM and IEEE International Confer-
ence on Formal Methods and Models for Co-Design, pages 69–70, June 2004.
[14] Robert M. Metcalfe and David R. Boggs. Ethernet: Distributed Packet Switching
for Local Computer Networks. Communications of the ACM, 19(7):395–404, July
1976.
[15] Don Anderson, Tom Shanley, and Ravi Budruk. PCI Express System Architecture.
Addison-Wesley Professional, 2004.
[16] Peter Mell and Timothy Grance. The NIST Definition of Cloud Computing (draft).
NIST Special Publication, 800(145):7, 2011.
[17] Amazon Web Services Inc. Amazon Web Services (AWS) – Cloud Computing Ser-
vices. http://aws.amazon.com, 2014.
[18] GENI. Global Environment for Networking Innovations (GENI) Project. http:
//geni.net/, 2014.
[19] Emulab. Emulab – Network Emulation Testbed. http://emulab.net/, 2014.
[20] PlanetLab. PlanetLab – An Open Platform for Developing, Deploying, and Accessing
Planetary-Scale Services. http://planet-lab.org/, 2014.
[21] Internet2. http://www.internet2.edu/, 2014.
[22] ORION. Ontario Research and Innovation Optical Network. http://www.orion.
on.ca/, 2014.
[23] CANARIE. Canada’s Advanced Research and Innovation Network. http://www.
canarie.ca/, 2014.
[24] OpenStack. http://www.openstack.org/, 2013.
[25] R.T. Fielding. REST: Architectural Styles and the Design of Network-Based Software
Architectures. PhD thesis, University of California, Irvine, 2000.
[26] OpenStack. Nova Developer Documentation. http://nova.openstack.org, 2014.
[27] OpenStack. Keystone Developer Documentation. http://keystone.openstack.
org, 2014.
[28] OpenStack. Glance Developer Documentation. http://glance.openstack.org,
2014.
[29] OpenStack. Quantum Developer Documentation. http://quantum.openstack.
org, 2014.
[30] OpenStack. Swift Developer Documentation. http://swift.openstack.org, 2014.
[31] OpenStack. Cinder Developer Documentation. http://cinder.openstack.org,
2014.
[32] Ryu. Ryu SDN Framework. http://osrg.github.io/ryu/, 2014.
[33] Rob Sherwood, Glen Gibb, Kok-Kiong Yap, Guido Appenzeller, Martin Casado,
Nick McKeown, and Guru Parulkar. Flowvisor: A Network Virtualization Layer.
OpenFlow Switch Consortium, Technical Reports, 2009.
[34] Joon-Myung Kang, T. Lin, H. Bannazadeh, and A. Leon-Garcia. Software-Defined
Infrastructure and the SAVI Testbed. In TRIDENTCOM 2014, 2014.
[35] K. Redmond, H. Bannazadeh, P. Chow, and A. Leon-Garcia. Development of a
Virtualized Application Networking Infrastructure Node. In IEEE GLOBECOM
Workshops, pages 1–6, 2009.
[36] Steven Trimberger, Dean Carberry, Anders Johnson, and Jennifer Wong. A Time-
Multiplexed FPGA. In Proceedings of the 5th Annual IEEE Symposium on FPGAs
for Custom Computing Machines, pages 22–28. IEEE, 1997.
[37] D. Unnikrishnan, R. Vadlamani, Yong Liao, J. Crenne, Lixin Gao, and R. Tessier.
Reconfigurable Data Planes for Scalable Network Virtualization. IEEE Transactions
on Computers, 62(12):2476–2488, 2013.
[38] Esam El-Araby, Ivan Gonzalez, and Tarek El-Ghazawi. Virtualizing and Sharing
Reconfigurable Resources in High-Performance Reconfigurable Computing Systems.
In Second International Workshop on High-Performance Reconfigurable Computing
Technology and Applications, pages 1–8. IEEE, 2008.
[39] Ivan Gonzalez, Sergio Lopez-Buedo, Gustavo Sutter, Diego Sanchez-Roman, Fran-
cisco J. Gomez-Arribas, and Javier Aracil. Virtualization of Reconfigurable Copro-
cessors in HPRC Systems with Multicore Architecture. Journal of Systems Archi-
tecture, 58(67):247 – 256, 2012.
[40] C. Steiger, H. Walder, and M. Platzner. Operating Systems for Reconfigurable
Embedded Platforms: Online Scheduling of Real-Time Tasks. IEEE Transactions
on Computers, 53(11):1393–1407, 2004.
[41] K. Rupnow. Operating System Management of Reconfigurable Hardware Computing
Systems. In International Conference on Field-Programmable Technology, pages
477–478, 2009.
[42] Chun-Hsian Huang and Pao-Ann Hsiung. Virtualizable Hardware/Software Design
Infrastructure for Dynamically Partially Reconfigurable Systems. ACM Transactions
on Reconfigurable Technology and Systems (TRETS), 6(2):11, 2013.
[43] Khoa Dang Pham, A.K. Jain, Jin Cui, S.A. Fahmy, and D.L. Maskell. Microkernel
Hypervisor for a Hybrid ARM-FPGA Platform. In 24th International Conference
on Application-Specific Systems, Architectures and Processors, pages 219–226, 2013.
[44] C. Chang, J. Wawrzynek, and R.W. Brodersen. BEE2: A High-End Reconfigurable
Computing System. Design Test of Computers, IEEE, 22(2):114–125, 2005.
[45] NetFPGA. NetFPGA 10G. http://netfpga.org/, 2014.
[46] Terasic Technologies Inc. DE5Net. http://de5-net.terasic.com/, 2013.
[47] BEECube Inc. miniBEE - Research in a Box. http://www.beecube.com/products/
miniBEE.asp, 2014.
[48] Xilinx. Xilinx Partial Reconfiguration User Guide v12.3. http://www.xilinx.com/
support/documentation/sw_manuals/xilinx12_3/ug702.pdf, 2010.
[49] Altera. Partial Reconfiguration Megafunction. http://www.altera.com/
literature/ug/ug_partrecon.pdf, 2013.
[50] Xilinx. Vivado Design Suite User Guide - Partial Reconfiguration.
http://www.xilinx.com/support/documentation/sw_manuals/xilinx2014_
2/ug909-vivado-partial-reconfiguration.pdf, 2014.
[51] Xilinx. Virtex 5 User Guide. http://www.xilinx.com/support/documentation/
user_guides/ug190.pdf, 2012.
[52] NetFPGA. NetFPGA 10G Open Source Hardware. https://github.com/NetFPGA/
NetFPGA-public, 2014.
[53] Kyle Locke. Xilinx Parametrizable Content-Addressable Memory. http:
//www.xilinx.com/support/documentation/application_notes/xapp1151_
Param_CAM.pdf, 2011.
[54] Xilinx. UG761 Xilinx AXI Reference Guide v14.3. http://www.xilinx.com/
support/documentation/ip_documentation/axi_ref_guide/latest/ug761_
axi_reference_guide.pdf, 2012.
[55] Jianhua Che, Yong Yu, Congcong Shi, and Weimin Lin. A Synthetical Performance
Evaluation of OpenVZ, Xen and KVM. In Services Computing Conference (AP-
SCC), 2010 IEEE Asia-Pacific, pages 587–594, Dec 2010.
[56] The Xen Project. Baremetal vs. Xen. vs. KVM – Redux. http://blog.xen.org/
index.php/2011/11/29/baremetal-vs-xen-vs-kvm-redux/, 2013.
[57] Python. Socket - Low-level Networking Interface - Python Documentation. http:
//docs.python.org/2/library/socket.html, 2013.
[58] Xilinx. Vivado High-Level Synthesis. http://www.xilinx.com/products/
design-tools/vivado/integration/esl-design/, 2014.
[59] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson,
Jennifer Rexford, Scott Shenker, and Jonathan Turner. OpenFlow: Enabling Innovation
in Campus Networks. ACM SIGCOMM Computer Communication Review,
38(2):69–74, 2008.
[60] Scott Shenker, M Casado, T Koponen, and N McKeown. The Future of Networking,
and the Past of Protocols. Open Networking Summit, 2011.
[61] P.J. Frantz and G.O. Thompson. VLAN Frame Format, September 1999. US Patent
5,959,990.
[62] Bruce Davie and Yakov Rekhter. MPLS: Technology and Applications. Morgan
Kaufmann Publishers Inc., 2000.
[63] OpenFlow Specification. http://www.opennetworking.org/sdn-resources/
onf-specifications/openflow, 2013.
[64] Xilinx. Xilinx All Programmable SoC. http://www.xilinx.com/products/
silicon-devices/soc/index.htm, 2014.
[65] Altera Corporation. Altera SoC Overview. http://www.altera.com/devices/
processor/soc-fpga/overview/proc-soc-fpga.html, 2014.
[66] Andrew Putnam et al. A Reconfigurable Fabric for Accelerating Large-Scale Dat-
acenter Services. In The 41st International Symposium on Computer Architecture.
IEEE, 2014.