DOMAIN SPECIFIC HARDWARE ACCELERATION
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Jared Casper
January 2015
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/pw135js0060
© 2015 by Jared Arthur Casper. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Oyekunle Olukotun, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Mark Horowitz
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Christos Kozyrakis
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
The performance of microprocessors has grown by three orders of magnitude since
their beginnings in the 1970s; however, this exponential growth in performance has
not been achieved without overcoming substantial obstacles. These obstacles were over-
come due in large part to the exponential increase in the number of transistors
available to architects as transistor technology scaled. Many today call the largest of
the hurdles impeding performance gain “walls”. Such walls include the Memory Wall,
which is memory bandwidth and latency not scaling with processor performance; the
Power Wall, which is the processor generating too much heat to be feasibly cooled; and
the ILP wall, which is the diminishing return seen when making processor pipelines
deeper due to the lack of available instruction level parallelism.
Today, computer architects continually overcome new walls to extend this ex-
ponential growth in performance. Many of these walls have been circumvented by
moving from large monolithic architectures to multi-core architectures. Instead of
using more transistors on bigger, more complicated single processors, transistors are
partitioned into separate processing cores. These multi-core processors require less
power and are better able to exploit data level parallelism, leading to increased per-
formance for a wide range of applications. However, as the number of transistors
available continues to increase, the current trend of increasing the number of ho-
mogeneous cores will soon run into a “Capability Wall” where increasing the core
count will not increase the capability of a processor as much as it has in the past.
Amdahl’s law limits the scalability of many applications and power constraints will
make it unfeasible to power all the transistors available at the same time. Thus, the
capability of a single processor chip to compute more things in a given time slot will
stop improving unless new techniques are developed.
In this work, we study how to build hardware components that provide new ca-
pabilities by performing specific tasks more quickly and with less power than general
purpose processors. We explore two broad classes of such domain specific hardware
accelerators: those that require fine-grained communication and tight coupling with
the general purpose computation and those that require a much looser coupling with
the rest of the computation. To drive the study, we examine a representative example
in each class.
For fine-grained accelerators, we present a transactional memory accelerator. We
see that dealing with the latency and lack of ordering in the communication chan-
nel between the processor and accelerator presents significant challenges to efficiently
accelerating transactional memory. We then present multiple techniques that over-
come these problems, resulting in an accelerator that improves the performance of
transactional memory applications by an average of 69%.
For coarse-grained, loosely coupled accelerators, we turn to accelerating database
operations. We observe that, because these accelerators often deal with large
amounts of data, one of the key attributes of a useful database accelerator is the
ability to fully saturate the system's available memory bandwidth. We provide
insight into how to design an accelerator that does so by looking at designs to perform
selection, sorting, and joining of database tables and how they are able to make the
most efficient use of memory bandwidth.
Acknowledgements
In the last few years I've learned that the proverb is true: it takes a village to raise a
child. I have also learned that it takes a village to get a Ph.D. I sincerely appreciate
the help and encouragement of all those, too many to name, that I have interacted
with along the way.
In particular, my loving and incredible wife Colleen has been my rock throughout
the entire process and has never faltered in her support. She and my two daughters,
Elliot and Amelia, have been there to share the joys of accomplishment and buoy my
spirits during the depths of the lows. They have made it all worth it.
My principal advisor, Kunle Olukotun, has been the epitome of the patient and
wise master to see me through the maze of academia, for which I will be eternally
grateful. Many of the other incredible scholars on the Stanford CS faculty, Christos
Kozyrakis especially, have provided insight and advice that considerably advanced
my work and saved me many hours of frustration. My fellow graduate students have
likewise been an incredible source of inspiration.
Finally, my parents Art and Luana and their unconditional love of me and my
family (and oft-needed financial support) have provided the foundation upon which
I have built my life. Without them being who they are, none of this would have ever
been possible.
Contents
Abstract iv
Acknowledgements vi
1 Introduction 1
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 The Free Lunch . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Multi-Core Processors . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 The Capability Wall . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Tightly Coupled Acceleration 9
2.1 FARM: Flexible Architecture Research Machine . . . . . . . . . . . . 11
2.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 FARM System Architecture . . . . . . . . . . . . . . . . . . . 14
2.1.3 Module Implementation . . . . . . . . . . . . . . . . . . . . . 18
2.2 Techniques for fine-grain acceleration . . . . . . . . . . . . . . . . . . 21
2.2.1 Communication Mechanisms . . . . . . . . . . . . . . . . . . . 22
2.2.2 Tolerating latency and reordering . . . . . . . . . . . . . . . . 26
2.3 Microbenchmark Analysis . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Transactional Memory Case Study . . . . . . . . . . . . . . . . . . . 31
2.4.1 TM Design Alternatives and Related Work . . . . . . . . . . . 32
2.4.2 Accelerating TM . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.3 Implementing TMACC on FARM . . . . . . . . . . . . . . . . 38
2.4.4 Algorithm Details . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 46
2.4.6 Comparison with Simulation . . . . . . . . . . . . . . . . . . . 62
2.5 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3 Loosely Coupled Acceleration 64
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2 Hardware Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.2.1 Barrel shifting and multiplexing . . . . . . . . . . . . . . . . . 68
3.2.2 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.2.3 Merge Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2.4 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.2.5 Sort Merge Join . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.3 Implementation and Results . . . . . . . . . . . . . . . . . . . . . . . 85
3.3.1 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.3.2 Merge Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.3.3 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.3.4 Sort Merge Join . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4 Conclusions 100
Bibliography 102
List of Tables
2.1 Hardware specifications of the FARM system. . . . . . . . . . . . . . 15
2.2 Summary of FPGA resource usage. . . . . . . . . . . . . . . . . . . . 21
2.3 Comparison of Cache Miss latency . . . . . . . . . . . . . . . . . . . 24
2.4 Summary of communication mechanisms. . . . . . . . . . . . . . . . . 24
2.5 TMACC hardware functions used by TMACC-GE. . . . . . . . . . . 42
2.6 TMACC hardware functions used by TMACC-LE. . . . . . . . . . . . 45
2.7 TMACC Microbenchmark Parameter Sets . . . . . . . . . . . . . . . 49
2.8 STAMP benchmark input parameters. . . . . . . . . . . . . . . . . . 52
2.9 STAMP benchmark application characteristics. . . . . . . . . . . . . 52
3.1 Memory port usage in sort merge unit. . . . . . . . . . . . . . . . . . 93
3.2 Summary of sort merge join results. . . . . . . . . . . . . . . . . . . . 97
List of Figures
2.1 Diagram of the Procyon system with the FARM hardware on the FPGA. 14
2.2 Photo of the Procyon system . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 FARM Data Transfer Engine . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 FARM Coherent Cache . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Comparison of DMA schemes. . . . . . . . . . . . . . . . . . . . . . . 24
2.6 Comparison of non-coherent and coherent polling. . . . . . . . . . . . 25
2.7 Local and Global Epochs . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.8 FARM Communication Mechanisms . . . . . . . . . . . . . . . . . . . 29
2.9 FARM Experiment Visualization . . . . . . . . . . . . . . . . . . . 30
2.10 TMACC Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.11 Logical block diagram of Bloom filters. . . . . . . . . . . . . . . . . . 37
2.12 TMACC Microbenchmark Results . . . . . . . . . . . . . . . . . . . . 48
2.13 STAMP performance on the FARM prototype. . . . . . . . . . . . . . 54
2.14 Single threaded execution time relative to sequential execution. . . . . 57
2.15 TMACC ASIC Comparison - Short Transactions . . . . . . . . . . . . 59
2.16 Projected microbenchmark performance with TMACC ASIC. . . . . . 60
2.17 Projection of STAMP performance with TMACC ASIC . . . . . . . . 61
3.1 A pipelineable eight word barrel shifter. . . . . . . . . . . . . . . . . . 69
3.2 Data and control paths for selection of four elements. . . . . . . . . . 71
3.3 Control logic for the selection unit. . . . . . . . . . . . . . . . . . . . 72
3.4 Merge Join Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.5 Merge Join Optimization . . . . . . . . . . . . . . . . . . . . . . . 75
3.6 Sorting using a sort merge tree. . . . . . . . . . . . . . . . . . . . . . 77
3.7 Multi-Way Merge Unit . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.8 Sort Merge Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.9 High bandwidth sort merge unit. . . . . . . . . . . . . . . . . . . . . 82
3.10 Full system block diagram and data paths. . . . . . . . . . . . . . . . 82
3.11 Block diagram of prototyping platform from Maxeler Technologies. . 86
3.12 Measured throughput of the select block prototype. . . . . . . . . . . 88
3.13 Select Hardware Resources . . . . . . . . . . . . . . . . . . . . . . . . 90
3.14 Throughput of the merge join prototype. . . . . . . . . . . . . . . . . 92
3.15 Throughput of the sort tree prototype. . . . . . . . . . . . . . . . . . 94
3.16 Sort Hardware Memory Usage . . . . . . . . . . . . . . . . . . . . . . 95
3.17 Full Multi-FPGA Join Process . . . . . . . . . . . . . . . . . . . . . . 97
Chapter 1
Introduction
1.1 Background and Motivation
To understand why domain specific accelerators are necessary, we must first under-
stand the problems facing computer architects today and why traditional approaches
fall short. This section takes a brief look back at how performance gains have histor-
ically been achieved and discusses how these techniques are not able to cope with the
challenges that computer architects face today.
1.1.1 The Free Lunch
The performance of general purpose processors, beginning with the introduction of
the Intel 4004 in 1971, grew exponentially until a few years into the 21st century.
This increase in performance was due largely to two contributing factors: improve-
ments in the underlying technologies, i.e. transistor scaling, and improvements in the
microarchitecture techniques used by chip designers, including pipelining and cache
hierarchies. We call this period “The Free Lunch” because software developers did
not have to do anything to realize performance gains in their application. Software
companies could simply wait for the next generation of processors to be released and
their product would automatically become faster, allowing them to add new features
and new capabilities without improving the performance of the existing code.
This exponential gain in processor performance was due in large part to the scaling
of the MOS transistor, both in terms of speed and size. The scaling is typically known
as Dennard Scaling, as it was predicted by Robert Dennard in the early 1970s [31].
Dennard stated that the power density of transistors would remain constant as they
decreased in size. Thus, as transistors got smaller, more of them could be put into
a chip without substantially increasing the power consumption. During the
Free Lunch period, dimensions of transistors were reduced by 30% every two years,
or every generation, while the electric fields required to maintain reliability were
held constant. Reducing transistor dimensions by 30% results in a 50% reduction
in the area needed for a given number of transistors. Thus, in the same die size,
developers had twice the number of transistors to use (i.e. Moore’s Law). Reducing
transistor dimensions also results in an increase in performance, as it takes fewer
electrons to achieve the same electric field required to switch the transistor. The 30%
reduction in size typically resulted in a 40% increase in performance. Finally, these
processors were able to stay within a power budget because the supply voltage scaled
down with the size. Thus, a given number of new transistors consumed the same
amount of energy as half that number in the previous generation.
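The scaling arithmetic described above can be checked directly. The following sketch is purely illustrative; the 30% linear-shrink figure is the one quoted in the text, and the derived quantities follow from simple geometry:

```python
# Illustrative check of the Dennard-scaling arithmetic described above.
# A 30% reduction in linear dimensions per generation is the figure from the text.
shrink = 0.70  # each linear dimension scales to 70% of its previous size

# Area scales with the square of the linear dimension, so the same transistor
# count needs about half the area (i.e. roughly 2x transistors per die).
area_factor = shrink ** 2
print(f"area per transistor: {area_factor:.2f}x (~50% reduction)")

# Gate delay scales roughly with the linear dimension, so switching speed
# improves by about 1/0.7, i.e. roughly 40%.
speed_factor = 1 / shrink
print(f"switching speed: {speed_factor:.2f}x (~40% faster)")
```

Running this confirms that a 30% dimension shrink yields the ~50% area reduction and ~40% speed gain cited above.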
With a significant increase in the number of transistors available, processor de-
signers were also able to incorporate new architectural techniques to increase the
performance of single processors. Techniques such as branch prediction; superscalar,
out-of-order, and speculative execution; deep pipelining; and vector processing all
significantly contributed to fuel the free lunch. These performance increases were
quantified in Pollack’s Rule which states that performance increases as the square
root of the number of transistors in a processor. Thus, with twice the number of
transistors, performance will increase by 40%. More transistors also allowed design-
ers to include larger caches with the processor, which improved overall memory access
times. All of this combined with the performance increase of the transistors them-
selves to allow Moore’s Law to continue uninhibited through much of the past 30
years.
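Pollack's Rule, as stated above, can be expressed as a one-line function; the sketch below simply evaluates the square-root relationship quoted in the text:

```python
import math

# Pollack's Rule: single-core performance grows roughly with the square
# root of the transistor count devoted to the core.
def pollack_speedup(transistor_ratio):
    return math.sqrt(transistor_ratio)

# Doubling the transistor budget yields ~40% more performance, as stated above.
print(f"2x transistors -> {pollack_speedup(2):.2f}x performance")
# Even quadrupling the budget only doubles single-core performance.
print(f"4x transistors -> {pollack_speedup(4):.2f}x performance")
```

The diminishing return is the point: each doubling of transistors buys progressively less single-core performance.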
1.1.2 Multi-Core Processors
Two main factors combined to spell the demise of the free lunch: the ILP wall and
the power wall. Most of the architectural techniques discussed in the previous section
were driving towards the goal of executing as many instructions as possible per clock
cycle: increasing the “instructions per clock” or IPC. Processors contain large complex
structures to analyze the instruction stream to determine what instructions can safely
be executed and multiple pipelines to execute more than one instruction at a time
if they are available. When it is unclear which instructions can or will execute, the
processor will even predict which will run, speculatively execute those, and roll back
that execution if it determines that its prediction was incorrect. The complexity and
design cost of developing these complicated structures made it increasingly expensive
and difficult to increase the IPC. In addition, typical instruction streams have enough
dependencies between instructions that there is a limit to how many can actually
execute in parallel, typically four instructions per cycle [63]. Thus, the high effort
to make processors capable of executing more instructions in parallel resulted in
very little actual performance increase in the vast majority of applications. This lack
of more inherent parallelism in the instruction stream is often referred to as the ILP
wall.
The other main factor that impeded the progress of single processor development
was power. While smaller transistors do require less power to switch, the number of
transistors on a die and the higher frequencies they were running at caused the overall
power needed by the chip to grow exponentially. Eventually, chips were consuming
so much power in such a small space that it became impossible to keep them cool
enough to function. This led to the clock rate of processors levelling o↵ around 4
GHz and limited the complexity of the power-hungry structures that enabled deep
pipelining and super scalar execution; processor had run into what is known as the
power wall.
While single-thread performance still sees marginal gains, it is no longer enough to
enable entirely new capabilities the way the exponential growth previously seen did. Thus
designers turned to using the still increasing number of transistors available to add
more processing cores to a single chip rather than improving upon a single core.
Now, instead of executing a single thread of execution faster, new processors execute
more and more threads of execution at about the same speed as they did before.
Adding new capabilities to software is now a bit more difficult than it was, since the
computation must be partitioned into multiple threads, but it is possible. While the
marketing focus had previously been the core clock rate of the processor, it is now the
number of cores the processor has. See the 2005 ACM QUEUE article “The Future of
Microprocessors” by Kunle Olukotun and Lance Hammond [63] for a full treatment
of the move to multi-core architectures.
1.1.3 The Capability Wall
The switch to multiple cores per processor avoided the power wall in a few ways.
Multiple cores are now able to share certain components such as large caches and
power-hungry high-speed communication circuits that communicate with the rest of
the system. In addition, with more cores to share the workload, the performance of
a single core is not as important as it was before (as long as the workload can be
sufficiently parallelized) and each core can run at a lower frequency while maintaining
overall system performance. On a workload that has two completely independent but
equal tasks to perform, two cores running at half the clock rate can complete that
workload in the same time as a single core running at the full clock rate.
However, this does not solve the fundamental problem that a chip can only con-
sume so much power before it becomes impossible to keep cool. Processors are still
seeing an exponential increase in the number of transistors per chip, which leads to
more cores and bigger caches on a single chip. In addition, as transistors continue
to get smaller, the power consumed even when they are turned off and not
switching, known as leakage power, becomes more dominant. Thus, we are quickly approach-
ing a point where power cannot be supplied to all the transistors that can fit in a
chip [16]. Some of the transistors will have to be left off the chip entirely or completely
powered off, and turning them on means powering off some other part of the chip.
In addition to the power wall still looming, the ILP wall will return with a different
face. Just as there is a limit to the amount of parallelism in a typical instruction
stream, there is a limit to the amount of inherent parallelism in many workloads.
Many computational tasks are serial by nature. The result of one step must be
obtained before the next step can be started. These inherently serial tasks see little
benefit in additional processing cores. While the throughput of performing many of
these tasks can be improved, the latency of completing a single task cannot. In
addition, most workloads that have mostly independent tasks that can be executed in
parallel still have some portion that must be executed serially. Amdahl’s Law provides
an upper bound on the increase in performance given the amount of serial execution
in a workload. For example, if just 5% of a workload is serial, the maximum speedup
from parallelization is just 1/0.05 = 20x, no matter how many cores a system has.
Thus, even if we could power more cores, it won’t help improve the performance of
most applications past a certain point.
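The Amdahl's Law bound discussed above is easy to compute; the sketch below uses the 5%-serial example from the text to show how quickly added cores stop helping:

```python
# Amdahl's Law: speedup on n cores for a workload with serial fraction s.
def amdahl_speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

s = 0.05  # 5% serial, the example from the text
for n in (4, 16, 64, 1024):
    print(f"{n:5d} cores: {amdahl_speedup(s, n):6.2f}x")

# The asymptotic limit is 1/s = 20x, no matter how many cores are added.
print(f"limit: {1.0 / s:.0f}x")
```

Note how the curve flattens well before the 20x limit: 64 cores already deliver over 15x, so the next 960 cores buy less than a 5x additional gain.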
We call this limit on the benefit of adding more cores to a system the “capability
wall”. We can no longer rely on general purpose hardware improvements to enable
more capabilities. To overcome the capability wall, many researchers are turning to
heterogeneous computing. Instead of adding more cores that are all the same, proces-
sors can have cores that each excel at a different type of computation. Heterogeneity
means that processors are specialized for a particular set of workloads, or a domain.
In other words, they are domain specific. These domain specific blocks can be powered
off when not in use to make more power available to other devices. This argument for
dark silicon holds at the chip level, which is power limited by the package's thermal
envelope, all the way to the data center, which is limited by its power and
cooling provisioning. Such acceleration blocks thus make sense at the chip level as
part of an SoC, at the system level such as an external accelerator connected to the
system’s main memory or peripheral bus, or at the rack level, as a separate appliance
in a compute cluster.
In a recent article, Andrew Chien suggested one way to look at this move towards
heterogeneity is to see it as a move from 90/10 optimization, where effort is spent
optimizing the common case, to “10x10 optimization” where “the goal is to attack
performance as a set of 10% optimization opportunities” [26]. If 10 new ideas (or 8,
or 12, etc.) can each improve the performance of 10% of the tasks in a workload,
then the overall performance on that entire workload will improve dramatically. In
this way, heterogeneity can break through the capability wall. Andrew Chien and
Shekhar Borkar published an article in Communications of the ACM in 2011, also
titled “The Future of Microprocessors” [16], that gives full treatment of the case for
heterogeneous processors.
1.2 Contributions
In this thesis, we explore the extreme case of the domain specific processor: the
domain specific accelerator. We differentiate a domain specific accelerator from a
domain specific processor by looking at its generality. A domain specific processor
is a general-purpose processor that is specialized for a broad domain of applications.
For example, a GPU is general-purpose in that it could conceivably perform almost
any computation; however, it is specifically built to perform computation dealing with
graphics processing. It thus excels at workloads that have characteristics similar
to those of rendering a picture and performs poorly on other workloads (though it
can do them). In contrast, a domain specific accelerator is not general-purpose at
all. It is very limited in what type of computation it is able to perform, but what it
does, it does extremely well in terms of speed and efficiency. We will focus entirely
on domain specific accelerators for the remainder of this work; although many of the
concepts presented could apply equally well to domain specific processors.
The primary contribution of this work is to provide insight into the design and
development of domain specific accelerators. To do so, we first classify the different
types of interesting accelerators into two fundamental categories by looking at how
tightly coupled with the rest of the system the accelerator is. We then examine each
of these accelerators in turn to describe the various problems unique to each class
and provide techniques that can be used to mitigate those problems.
We first look at accelerators that are very tightly coupled with the rest of the
system in Chapter 2. Such accelerators are often difficult to prototype and design
due to the rigid interfaces through which general purpose processors communicate
over their lowest latency, highest bandwidth links (i.e. with other processors in the
system). We thus begin our look at tightly coupled accelerators by detailing a system
that allows rapid prototyping of hardware connected directly and coherently to the
other processors in the system (Section 2.1). We then describe useful mechanisms for
processor-accelerator communication in this regime (Section 2.2) and provide bench-
marks to characterize a system’s communication performance (Section 2.3). This
is important because the communication between the accelerator and the other pro-
cessing elements often becomes the dominating characteristic that determines the
performance of the accelerator (Section 2.2). Finally, we put the prototyping system
and communication techniques into practice in an accelerator for Transactional
Memory (Section 2.4).
We then turn to accelerators that are loosely coupled with the rest of the system
in Chapter 3. In these systems, the accelerator typically has a large task to perform
asynchronously with the rest of the system. We will see that the dominating charac-
teristic is often the accelerator’s ability to quickly access large amounts of memory and
make the most efficient use of the supplied memory bandwidth. We thus spend the
majority of the chapter detailing a case study of a database operation accelerator. In
examining the detailed design of each component of the accelerator, we provide useful
examples and patterns that can be emulated to design other accelerators that make
efficient use of memory bandwidth (Section 3.2). We then discuss implementation
details and performance analysis of the accelerator to provide practical knowledge
about gleaning the most out of a particular platform (Section 3.3).
Chapter 2
Tightly Coupled Acceleration
We first look at the class of domain specific accelerators that are tightly coupled
with the computation being performed in the rest of the system. The dominating
characteristic of tightly coupled accelerators is frequent communication with the rest
of the system. This is opposed to loosely coupled accelerators, such as those we will
look at in Chapter 3, where the accelerator works for a large amount of time on a
large amount of data without any synchronization or communication with the rest of
the system. To characterize the space of accelerators that are tightly coupled with
the rest of the system, we look at one application that, as we will show, requires
frequent communication between the general purpose processor and the accelerator:
Transactional Memory (TM). TM provides an ideal proving ground for exploring
issues that arise when designing and implementing such an accelerator.
In tightly coupled accelerators, the frequent communication means that the char-
acteristics of the communication, for example the amount of data and whether the
communication is synchronous or asynchronous, will be a dominant factor in the ac-
celerator’s ability to improve performance. Another dominant factor is where and
how the accelerator connects with the rest of the system. For example, an accelerator
that requires frequent synchronous communication over a high latency link will not
perform well. By reducing the amount of synchronous communication, an accelerator
can be made more resilient to its placement in the overall system. To this end, in this
chapter we present techniques that can be generally employed to deal with asynchronous
communication and thus tolerate a significant amount of latency between the host
system and the accelerator. The TM case study solidifies these techniques by detail-
ing how they are used to make the majority of communication with the accelerator
asynchronous.
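Why asynchrony tolerates link latency can be seen with a toy cost model. The numbers below are purely hypothetical (not FARM measurements), and the model ignores bandwidth limits and queuing; it only contrasts waiting for every round trip against firing operations without waiting:

```python
# Toy model (hypothetical numbers, not FARM measurements): total time to
# issue n fine-grained operations to an accelerator over a link with a
# given one-way-plus-return (round-trip) latency.
def synchronous_time(n, issue_cost, link_latency):
    # Each operation waits for a full round trip before the next can issue.
    return n * (issue_cost + link_latency)

def asynchronous_time(n, issue_cost, link_latency):
    # Operations are fired without waiting; only one round trip, for the
    # final result, is exposed on the critical path.
    return n * issue_cost + link_latency

n, issue, latency = 1000, 10, 500  # cycles, purely illustrative
print("synchronous: ", synchronous_time(n, issue, latency), "cycles")
print("asynchronous:", asynchronous_time(n, issue, latency), "cycles")
```

Under these assumed numbers the synchronous scheme pays the 500-cycle latency a thousand times, while the asynchronous scheme pays it once, which is why converting communication to be asynchronous makes an accelerator far less sensitive to where it sits in the system.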
Even with the techniques presented to tolerate latency, tightly coupled accelerators
will generally perform better when they are connected to the rest of the system with
high-bandwidth low-latency links. Modern processors have eliminated the traditional
front-side bus architecture and have instead moved the memory controller into the
processor and connected multiple processors with a point-to-point mesh network.
AMD processors use HyperTransport links while Intel processors use the QuickPath
Interconnect (QPI). This has led to the emergence of a new method of attaching
custom hardware to the system: directly into the processor interconnect. Several
companies have produced boards with FPGAs on them that plug directly into a
standard processor socket [27]. The main advantage of such accelerators is the ability
to participate in the system's cache coherence protocol and the ability to own a
portion of the system's physical memory space. These systems also provide the custom
logic with the advantage of a high-bandwidth, low-latency link to the rest of the system.
While PCI Express 3.x offers similar bandwidth and latency characteristics [27], we
will see that using the cache coherency in the system allows an accelerator to make
better use of the high performance link than it would be able to on a peripheral bus
such as PCIe.
As this technology is relatively new and has not been heavily utilized to date, it
is interesting to explore the performance characteristics of such systems. We start
this chapter by presenting and analyzing FARM, a framework built on top of the
Procyon system from A&D Technology [3]. FARM not only allows us to measure key
performance characteristics, but serves as the platform upon which we can build our
TM accelerator.
The major contributions of this chapter are thus:
• We present FARM, a novel prototyping environment based on the use of a co-
herent FPGA. We detail its design, implementation, and characteristics. (Sec-
tion 2.1)
• We describe useful mechanisms for processor-accelerator communication, in-
cluding techniques for tolerating the latency of fine-grained asynchronous com-
munication with an out-of-core accelerator. (Section 2.2).
• We provide a thorough study of the performance characteristics of FARM, pro-
viding insight for designers considering the use of coherent FPGA solutions for
their own problems. (Section 2.3)
• We present a system (both software and hardware) for Transactional Memory
Acceleration using Commodity Cores (TMACC). We detail two novel algorithms
for transactional conflict detection, both of which employ general purpose out-
of-core Bloom filters. (Section 2.4).
• We demonstrate the potential of TMACC by evaluating our implementation
using a custom microbenchmark and the STAMP benchmark suite. We show
that, for all but short transactions, it is not necessary to modify the processor
to obtain substantial improvement in TM performance. TMACC outperforms
an STM by an average of 69%, showing maximum speedup within 8% of an
upper bound on TM acceleration (Section 2.4.5).
2.1 FARM: Flexible Architecture Research Machine
Heterogeneous architectures that incorporate domain-specific accelerators are fun-
damentally different from existing hardware and difficult to accurately model using
traditional simulators. In particular, traditional simulation techniques fall short when
domain-specific accelerators are tightly coupled to a general-purpose computer through
the system interconnect (see Section 2.4.6). New hardware prototypes are therefore
extremely useful, being faster and more accurate than simulators. In addition to pro-
viding better insight into the system and being able to run larger and more realistic
pieces of code (such as an OS), prototyping allows researchers to find bugs and design
holes earlier in the development cycle.
FARM is based on an FPGA that is coherently tied to a multiprocessor system.
Effectively, this means that the FPGA contains a cache and participates in coherence
activities with the processors via the system’s coherence protocol. Throughout this
chapter we refer to an FPGA connected coherently as a “coherent FPGA.” Coherent
FPGAs allow for prototyping of some interesting segments of the architectural de-
sign space. For example, architectures requiring rapid, fine-grained communication
between different elements can be easily represented using FARM. Ideas involving
modifications to memory traffic, coherence protocols, and related pursuits can also
be implemented and observed at the hardware level, since the FPGA is part of the
coherence fabric. The close coupling also obviates the need for soft cores or other
processors on the FPGA in many cases, since general computation can be done on
the (nearby) processors. Section 2.1.2 provides details about the system architecture
and implementation of FARM. In addition to prototyping, FARM's architecture
is naturally well-suited to the domain-specific architectures this thesis explores,
with the FPGA functioning as the accelerator.
Using a tightly coupled coherent FPGA, whether as an accelerator or for prototyping,
presents communication and sharing challenges. One must provide efficient
and low-latency methods of communication to and from the FPGA. When functioning
in the capacity of an accelerator, in particular, it is essential to understand
the behavior of the communication mechanisms offered by FARM. These mechanisms
include traditional memory-mapped registers (MMRs), a streaming interface, and a
coherent cached interface. Section 2.2.1 details these methods of communication and
suggests how one important application characteristic, the frequency of synchronization,
could affect the choice of communication mechanism.
System designers must understand the tradeoffs and overheads that accompany
each communication type when using it to accelerate applications with various char-
acteristics, especially differing levels of synchronization between the FPGA and the
processors. In particular, knowledge of the execution overhead introduced by using
a dedicated remote accelerator would suggest a minimum for the speedup benefits
gained when using that accelerator. Furthermore, this overhead is not constant, but
rather a function of the type of communication chosen as well as other characteristics,
such as latency and synchronization. Section 2.3 explores these issues by presenting
the performance of a synthetic benchmark on FARM for all communication mech-
anisms and various other factors. Such data should influence users of FARM-like
systems when deciding on implementations of heterogeneous prototypes or coproces-
sors.
2.1.1 Related Work
The FARM prototyping environment follows in the tradition of previous FPGA-based
hardware emulation systems such as the Rapid Prototyping engine for Multiproces-
sors (RPM) [9]. RPM focused on prototyping multiprocessor architectures where
FPGAs are used primarily for gluing together symmetric cores, but not much for
computation. RAMP White [81] is a similar approach, prototyping an entire SMP
system with an FPGA, including CPU cores and a coherency controller. We differ in
that our approach is more directed at evaluating heterogeneous architectures, where
the FPGA prototypes a special-purpose module (e.g., an energy-efficient accelera-
tor) attached to high-performance CPUs. Convey Computer Corporation's HC-1 is a
high-performance computing node that features a coprocessor with multiple FPGAs
and a coherent cache [28]. Convey's machines are different in that they optimize
for memory bandwidth in high-performance, data-parallel applications. The copro-
cessor’s cache is usually only used for things like synchronizing the start sequence.
Recently, AMD researchers have also implemented a coherent FPGA [7]. AMD's
system and ours use different versions of the University of Heidelberg's cHT core to
handle link-level details of the protocol¹, but AMD does not give a thorough analysis of
system overheads for various configurations and usages.
Indeed, there has not been much discussion on how these coherent FPGA systems
can be well-utilized, and what kinds of applications can benefit from them. In this
section we discuss issues such as system utilization and present some key considera-
tions to account for when building with these systems. We also provide the detailed
¹The cHT core was provided by the University of Heidelberg [60] under an AMD NDA. We made modifications and extensions to the core to improve functionality, increase performance, and integrate with the FARM platform.
[Figure 2.1 content: two quad-core 1.8 GHz AMD Barcelona CPUs (64K L1 per core, 512KB L2 per core, 2MB shared L3 per chip) connected by 32 Gbps HyperTransport links (~60 ns); an Altera Stratix II FPGA (132k logic gates) attached over a 6.4 Gbps coherent HyperTransport link (~380 ns), hosting the cHT core, data transfer engine, configurable coherent cache, cache and data stream interfaces, MMRs, and the user application.]
Figure 2.1: Diagram of the Procyon system with the FARM hardware on the FPGA.
design and implementation of our system.
2.1.2 FARM System Architecture
This section presents the design details of FARM. We begin with a description of
the system architecture and the hardware specifications of our particular implemen-
tation. We then describe the usage of the FPGA in FARM and detail the design and
structure of some of our key units. We also reveal our implementation of the coher-
ent HyperTransport protocol layer and describe methods and strategies for e�ciently
communicating coherently with CPUs.
FARM is implemented as an FPGA coherently connected to two commodity
CPUs. The three chips are logically connected using point-to-point coherent Hy-
perTransport (HT) links. Figure 2.1 shows a diagram of the system topology, along
with bandwidth and latency measurements, as well as the high level design of the
FARM hardware. Memory is attached to each CPU node (not shown). Latency mea-
surements in the figure represent one-way trip time for a packet from transmission to
reception, including de-serialization and bu↵ering logic.
We used the Procyon system, developed by A&D Technology Inc. [3], as a baseline
in the construction of the FARM prototype. Procyon is organized as a set of three
daughter boards inter-connected by a common backplane via HyperTransport. Figure
2.2 shows a photograph of the Procyon system. The first board is a full system board
featuring an AMD Opteron CPU, some memory, and standard system interfaces
Figure 2.2: Photo of the Procyon system with a main board, CPU board, and FPGA board.
CPU Type             AMD Barcelona 4-core (2 CPUs)
Clock Freq           1.8 GHz
L1 Cache             Private: 64KB Data, 64KB Instr
L2 Cache             Private: 512KB Unified
L3 Cache             Shared: 2MB
DRAM                 3GB (2GB on main system board)
HT Link Type         HyperTransport: 16-bit links
CPU-CPU HT Freq      HT1000 (1000 MT/s)
CPU-FPGA HT Freq     HT400 (400 MT/s)
FPGA Device          Stratix II EP2S130
Physical Topology    3 boards connected via backplane
Logical Topology     Single chain of point-to-point links
Table 2.1: Hardware specifications of the FARM system.
such as USB and GigE NIC. The second board houses another Opteron CPU and
additional memory. The third board is an FPGA board with an Altera Stratix II
EP2S130 and support components used for programming and debugging the FPGA.
The photograph shows the FPGA board, secondary CPU board, and full system
board from left to right, respectively. The system runs on both Linux and Solaris;
our experiments were run on Arch Linux with Linux kernel 2.6.31. Table 2.1 gives a
detailed listing of FARM’s hardware specifications.
Our FARM device driver is somewhat unique in that it is the driver for a coherent
device, which looks quite different to the OS than a normal non-coherent device. To
allow for flexibility in communication with the FPGA, the driver reconfigures the
system’s DRAM address map (in the MTRRs and PCI configuration space) to map
a section of the physical address space above actual physical memory to “DRAM” on
the FPGA. We must keep this memory hidden from the OS to prevent it from being
used for normal purposes. Using the mmap mechanism, these addresses are mapped
directly into the user program’s virtual address space. The FPGA then acts as the
memory controller for this address space, allowing the user program to read and write
directly to the FPGA (see Section 2.2.1).
The FARM device driver is also used to pin memory pages and return their phys-
ical address in order to facilitate coherent communication from the FPGA to the
processor. An alternative, albeit more complicated, solution would be to maintain a
coherent TLB on the FPGA.
Reconfigurability in a prototype built with FARM is provided via the attached
FPGA. The FPGA houses modules that allow for general coherent connectivity to
the processors as well as a means by which the coprocessor or accelerator can use
these modules. As shown in Figure 2.1, the FARM platform implements a version of
AMD’s proprietary coherence protocol, called coherent HyperTransport (cHT). With
some exceptions, the cHT definition is a superset of HyperTransport that allows for
the interconnection of CPUs, memory controllers, and other coherent actors. Coher-
ent HyperTransport implements a MOESI coherence protocol. The cHT core, also
described in the introduction, handles only link-level details of the protocol such as
flow control, CRC generation, CRC checking, and link clock management. Primarily,
the core interfaces between the serialized incoming HT data (in LVDS format) and
the standard cHT packets which are exchanged with the logic behind the core. We
designed and implemented the custom transport layer logic, the Data Transfer Engine
(DTE), to process these packets. The DTE handles: enforcement of protocol-level
correctness; piecing together and unpacking HT commands; packing up and sending
HT commands; and HT tag management. The DTE also handles all the details of
being a coherent node in the system, such as responding to snoop requests. In ad-
dition, the FARM platform includes a parameterized set-associative coherent cache.
We will provide design and implementation details for the DTE and the cache later
in this section. Finally, there is also a small memory mapped register (MMR) file for
status checking and other small-scale communication with the processors.
The FARM platform provides three communication interfaces for the hardware
being prototyped by the user on the FPGA, or the user application. One is a co-
herent interface. Having a coherent cache, the FPGA can communicate with a CPU
using the normal coherence protocol. In the current implementation, we circumvent
the need for a coherent TLB on the FPGA by using only physical addresses of a
pinned contiguous memory region. Another interface is a stream interface where we
support streaming (or “fire-and-forget”) non-coherent communication. To implement
this interface, the FPGA is assigned a specific range of the physical address space.
This memory region can be marked as uncacheable, write-combining, write-through,
or write-back. Our original design marked this “FARM memory” as uncacheable
to allow for communication with FARM that bypassed the cache. However, the
Barcelona CPUs impose very strict consistency guarantees on uncacheable memory,
so we instead mark this section as write-combining in FARM. This marks the region as
non-coherent and bypasses the Opteron's store-ordering requirements without imposing
the strict consistency of "uncacheable" memory, which would impede streaming
data. The final interface is standard memory-mapped registers (MMRs). A detailed
comparison of these interfaces can be found in Section 2.2.1.
We use dual-clock buffers and (de-)serialization blocks to partition the FPGA into
three different clock domains: the HyperTransport links, the cHT core, and the rest of
the acceleration logic (everything “above” the cHT core). In our base configuration:
[Figure 2.3 content: the DTE sits between the cHT bus/cHT core and the user application, comprising a snoop handler, data requester, data handler, and stream-in traffic handler, and connecting to the coherent cache and MMRs.]
Figure 2.3: Block diagram of data transfer engine (DTE) components. Arrows represent requests and data buses.
the user application and cHT core run at 100 MHz and the HyperTransport links at
200 MHz.
2.1.3 Module Implementation
The DTE and the cache are two vital units allowing the accelerator to communicate
with the processors, process snoops, and store coherent data. In this section, we
briefly describe the design and structure of these modules as implemented on our
FPGA.
Data Transfer Engine
The DTE’s primary responsibility is ensuring protocol-level correctness in Hyper-
Transport transactions. Figure 2.3 shows a block diagram of the components of the
DTE. A typical transaction is the following: If the data requester on the FPGA
requests data from remote memory (owned by one of the Opteron CPUs), snoops
and responses must be sent among all coherent nodes of the system (assuming no
directory) to ensure that any dirty cached data is accounted for. In this example,
because the FPGA is the requester, the DTE’s data handler is responsible for count-
ing the responses from all caches as well as the data’s home memory controller and
selecting the correct version of the data. Evictions from the FPGA’s cache to remote
memory are also fed to the cHT core via the data requester. In addition, snoops
incoming to the FPGA are processed by the snoop handler in the DTE. The DTE
also handles incoming traffic for stream and MMR interfaces. In doing so, the DTE
acts as a pseudo-memory controller for memory requests belonging to the FPGA’s
memory range. Coherent HyperTransport supports up to 32 simultaneously active
transactions by assigning tags to each transaction, so the design must be robust to
transaction responses and requests arriving out of order. The DTE handles this by
using tag-indexed data structures and tracking tags of incoming and outgoing packets
in the data stream interface.
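The tag-indexed bookkeeping this requires can be sketched as below. The entry fields and retirement rule are illustrative, not FARM's actual DTE structures; the point is that a tag can retire out of order as soon as all of its responses have arrived:

```c
/* Sketch of tag-indexed transaction tracking: cHT allows up to 32
 * simultaneously active transactions, each identified by a tag, and
 * responses may arrive in any order. */
#include <stdint.h>

#define NUM_TAGS 32

typedef struct {
    uint8_t  busy;        /* tag currently owns an outstanding request */
    uint64_t addr;        /* cache-line address of the request         */
    uint8_t  resp_count;  /* snoop/memory responses seen so far        */
    uint8_t  resp_needed; /* responses required before data is final   */
} txn_entry;

static txn_entry txn_table[NUM_TAGS];

/* Claim a free tag for a new outgoing request; -1 if all 32 in flight. */
int txn_alloc(uint64_t addr, uint8_t resp_needed)
{
    for (int t = 0; t < NUM_TAGS; t++) {
        if (!txn_table[t].busy) {
            txn_table[t] = (txn_entry){1, addr, 0, resp_needed};
            return t;
        }
    }
    return -1;
}

/* Record one response for `tag`; returns 1 when every expected
 * response has arrived and the transaction can retire, else 0. */
int txn_response(int tag)
{
    txn_entry *e = &txn_table[tag];
    if (++e->resp_count < e->resp_needed)
        return 0;
    e->busy = 0;   /* all responses in: free the tag for reuse */
    return 1;
}
```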
Configurable Coherent Cache
In general, FARM’s configurable coherent cache behaves like an ordinary data cache;
it coherently keeps data in the vicinity of the computation by initiating data transfers
and responding to snoop requests. However, we have made a few different choices in the
design and implementation of the cache to best serve our target applications. For
example, in the current implementation, we do not have a coherent TLB on the FPGA,
but instead use only physical addresses of a pinned contiguous memory region.
Figure 2.4 shows the block diagram of our coherent cache module. The cache is
composed of three major subblocks. The core is where the traditional set-associative
memory lookup happens; the write buffer keeps track of evicted cache lines until they
are completely written back to memory; and the prefetch buffer is an extended fill
buffer to increase data fetch bandwidth. There are three distinct data paths from the
cache to the DTE: fetching data, writing data back, and snooping. All data transfers
happen at cache line granularity. The user application can request that the cache
prefetch a line and read or write to memory using a normal cache interface.
Our normal cache interface supports simple in-order reads and writes at word
granularity.2 This is a valid compromise of design complexity (and power, area, and
2In actuality, our cache is not strictly in-order but supports hit-under-miss. That is, the interface
[Figure 2.4 content: the coherent cache comprises a configurable cache core, prefetch buffer, and write buffer, with fetch-data, write-back, and snoop paths between the user application and the DTE.]
Figure 2.4: Block diagram of coherent cache components. Arrows represent the direction of data flows, rather than that of requests.
verification) against application performance since we seldom expect complex out-
of-order computation behind our cache. However, the user application can initiate
multiple data fetch transfers through the prefetch interface. Unlike the normal inter-
face, the prefetch interface is non-blocking as long as there is an empty slot in the
buffer. This design is based on the observation that in many cases the user application
can pre-compute a set of addresses to be accessed.
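The intended usage pattern can be sketched as follows; the buffer depth and function names are illustrative stand-ins for the prefetch interface, with the non-blocking behavior modeled in software:

```c
/* Usage sketch: the user logic pre-computes a set of line addresses
 * and issues them through a non-blocking prefetch interface, which
 * accepts requests only while the prefetch buffer has a free slot. */
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_SLOTS 4                 /* hypothetical buffer depth */

static uint64_t slots[PREFETCH_SLOTS];
static size_t   used;

/* Returns 1 if the line fetch was accepted, 0 if the buffer is full
 * (the caller retries later rather than stalling). */
int cache_prefetch(uint64_t line_addr)
{
    if (used == PREFETCH_SLOTS)
        return 0;
    slots[used++] = line_addr;
    return 1;
}

/* Issue as many pre-computed addresses as fit; returns how many were
 * accepted. */
size_t prefetch_batch(const uint64_t *addrs, size_t n)
{
    size_t issued = 0;
    while (issued < n && cache_prefetch(addrs[issued]))
        issued++;
    return issued;
}
```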
The cache module is responsible for maintaining the coherence of the data it has
cached. First, the cache answers incoming snoop requests by searching for the line
in all three subblocks simultaneously. Snoop requests have the highest priority since
their response time is critical to system-wide cache miss latency. Second, the module
must maintain the coherence status of each cached line. For simplicity, our current
implementation assumes that cache lines are either modified or invalid; exclusive
access is requested for each line brought in to the cache. This simplification is based
on the observation that for our current set of target applications, the cache is most
often used for producer-consumer style communication where non-exclusive access to
the line is not beneficial.
stalls at the second miss, not the first.
                     FARM modules
4Kbit Block RAMs     144 (24%)
Logic Registers      16K (15%)
LUTs                 20K
FPGA Device          Stratix II EP2S130
FPGA Speed Grade     -3 (Fastest)
Table 2.2: Summary of FPGA resource usage.
The cache uses physical addresses, not virtual addresses. This saves us from
implementing address translation logic, a TLB, and a page-table walker in hardware
and from modifying the OS to correctly manage the FPGA’s TLB. Instead we rely
on the software to use pinned pages provided by our device driver for shared data.
FPGA Resource Usage
Table 2.2 shows an overview of the resource usage on the FPGA. We made an effort
to minimize the usage of FPGA resources by FARM modules in order to maximize
free resources for the user application. Note that the cache module has several con-
figuration parameters, including total size and associativity of the cache, size of each
cache line, and others. These parameters are configured at synthesis time to meet
area, frequency, and performance constraints for the application. The numbers for FARM
modules in the table reflect a 4KB, 2-way set associative cache.
2.2 Techniques for fine-grain acceleration
When an accelerator requires frequent communication with the computation done
on the general purpose processors, two fundamental design decisions must be made:
how to communicate with the accelerator at the lowest level, and how to tolerate
the adverse characteristics of the underlying interconnect, such as large, variable
latency and out-of-order delivery of data. In this section we first describe various
communication mechanisms and when they should be used, noting how they have
been implemented in FARM when applicable. We then look at methods of tolerating
latency and reordering in the underlying interconnect.
2.2.1 Communication Mechanisms
Fundamentally, communication between an accelerator and a processor can be per-
formed either synchronously or asynchronously. As we will see, breaking down the
communication into these two methods and reducing the synchronous communica-
tion as much as possible is a critical step in the process of designing an acceleration
system.
A single method of communicating with an accelerator will not be sufficient for
all situations. For example, nearly all accelerators will need an asynchronous method
of moving data from the processor to the accelerator, but will also need to occa-
sionally perform synchronous communication just as most parallel algorithms require
synchronization to coordinate the computation across the nodes.
FARM supports multiple communication mechanisms tailored for different situa-
tions. Applications may use traditional memory-mapped registers (MMRs), a stream-
ing interface for pushing large amounts of data to the FPGA with low overhead, or a
coherent cache for communicating with the FPGA as if it were another processor in
a shared memory system.
MMRs are traditionally used for infrequent short communication, such as config-
uration, because of the time required to read and write to them. FARM allows for
much faster access to the MMRs because of the FPGA’s location as a point-to-point
neighbor of the processors. Specifically, we measured the total time to access an
MMR on FARM to be approximately 672 ns, nearly half the measured 1240 ns to read
a register on an Ethernet controller directly connected to the south bridge via second-
generation PCIe x4 on our system, and in line with the latency of PCIe 3.x devices.
This lower latency allows MMRs in FARM to be used for more frequent communica-
tion patterns like polling. More detailed measurements show that most of the 672 ns
is spent handling the access inside the FPGA, indicating that this latency could be
further reduced by upgrading to a faster FPGA.
FARM's MMRs use uncached memory, which provides strong consistency guar-
antees. However, this means that accesses to multiple MMRs will not overlap and the
total access time will grow linearly with the number of register accesses, just like
those to normal PCI registers. With FARM it is just as simple to put the MMRs in
the write-combining space, which has weaker consistency guarantees but would allow
multiple outstanding accesses (although still disallow caching) and thus provide much
faster multi-register access. Section 2.4.5 uses uncached memory for the MMRs, as
the uncached semantics are closer to the expected use of MMRs.
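From software, an MMR access is simply a volatile load or store at a fixed offset in the mapped register region. The register layout below is hypothetical; the pattern is what matters, and each such access to uncached FARM memory pays the full round trip measured above:

```c
/* Minimal sketch of MMR access through a mapped register window.
 * Register offsets are invented for illustration; `volatile` keeps the
 * compiler from merging, reordering, or eliding the accesses. */
#include <stdint.h>

enum {                         /* illustrative register layout */
    MMR_CONFIG = 0,
    MMR_STATUS = 1,
};

static inline void mmr_write(volatile uint64_t *mmr, int reg, uint64_t v)
{
    mmr[reg] = v;              /* volatile store: issued as written */
}

static inline uint64_t mmr_read(volatile uint64_t *mmr, int reg)
{
    return mmr[reg];           /* volatile load: always performed */
}
```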
FARM's streaming interface is an efficient way for the CPU to push data to
the FPGA. To facilitate streaming data, a physical address range marked as write-
combining is mapped to the FPGA. Writes to this address range are immediately
acknowledged and piped directly to the user application module. The internal pipeline
passes 64 bits of data and 40 bits of address to the user application per clock.
On the CPU, write requests to the streaming interface are queued in the core's
write-combining buffer and execution continues without waiting for the request to
be completed. Consecutive accesses to the same cache line are merged in the write-
combining buffer, reducing off-chip bandwidth overhead. Thus, to avoid losing writes,
every streamed write must be to a different, ideally sequential, address. The CPU
periodically sends requests from the buffer to the FPGA, or an explicit flush can be
performed to ensure that all outstanding requests are sent to the FPGA.
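The CPU side of a streamed push can therefore be sketched as below. The sfence-based flush is one way to drain the write-combining buffer on x86; `stream_base` stands in for the mapped write-combining FARM region (an ordinary buffer in the test):

```c
/* Sketch of streaming data to the FPGA: every write targets a
 * different, sequential 64-bit slot so that no two writes to the same
 * location can merge, then the write-combining buffer is flushed. */
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <emmintrin.h>
#define wc_flush() _mm_sfence()          /* drains WC buffers on x86 */
#else
#define wc_flush() __sync_synchronize()  /* portable stand-in */
#endif

void stream_push(volatile uint64_t *stream_base,
                 const uint64_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        stream_base[i] = data[i];  /* sequential addresses: no lost writes */
    wc_flush();  /* force any partially filled WC buffer out to the FPGA */
}
```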
Finally, the coherent cache allows for shared memory communication between the
CPUs and FPGA. Since the cache on the FPGA is kept coherent, the FPGA can
transparently read data either directly from a CPU’s cache or from DRAM, and vice
versa. The communication latency is simply the off-chip cache miss latency, which is
summarized in Table 2.3. In the table, the column labelled FARM shows the cache
miss latency measured on the current FARM system. Except when the requesting
CPU is two hops away from the FPGA, this latency is fairly constant because the
FPGA’s response to the snoop dominates any other latency. For comparison we also
provide measurements using the same system with the FPGA removed. This increase
in latency would be intolerable for an end product, but is reasonable for a prototype
platform and would be mitigated by using a faster FPGA.
The coherent communication mechanism is especially beneficial when performing a
pull-type data transfer (i.e., DMA), or when polling for an infrequent event. Figure 2.5
illustrates two di↵erent ways of performing a DMA from the CPU to the FPGA.
Figure 2.5.(a) is the conventional DRAM-based method, where (1) a CPU first creates
Service location of cache miss   FARM     FARM w/o FPGA
Memory                           495 ns   189 ns
Other cache (on-chip)            495 ns   145 ns
Other cache (off-chip)           500 ns   195 ns
FPGA cache (1-hop)               491 ns   N/A
FPGA cache (2-hop)               685 ns   N/A
Table 2.3: Comparison of cache miss latency.
[Figure 2.5 content: (a) conventional DMA through DRAM, in three steps; (b) DMA through the coherent cache, in two steps.]
Figure 2.5: Comparison of DMA schemes.
Interface   Description                                       Approx. Bandwidth   Proposed Usage
MMR         CPU writes to FPGA's MMR                          25 MB/s             Initialization or change of configuration
MMR         CPU reads from FPGA's MMR                         25 MB/s             Polling (likely to hit)
Stream      CPU writes into FPGA's address space              630 MB/s            Data push
Coherent    CPU reads from FPGA's cache                       630 MB/s            Data pull or polling (likely to miss)
Coherent    FPGA reads from CPU's cache (i.e. coherent DMA)   160 MB/s            Data pull or polling (likely to miss)
Table 2.4: Summary of communication mechanisms.
[Figure 2.6 content: (a) non-coherent polling via MMR reads; (b) coherent polling on a shared address, with a check (1) before and a check (2) after the event.]
Figure 2.6: Comparison of non-coherent and coherent polling.
data in its own cache, (2) the CPU moves the data to DRAM, and (3) the FPGA
reads the data from DRAM. Note that during the data preparation steps, (1) and (2),
the CPU is kept busy. FARM’s coherence allows the method shown in Figure 2.5.(b),
where (1) the CPU leaves the data and proceeds while (2) the FPGA reads the data
directly from the CPU’s cache.
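The CPU side of the coherent handoff in Figure 2.5(b) amounts to a publish-and-proceed pattern, sketched below. The descriptor layout and flag protocol are illustrative assumptions, with the FPGA's pull modeled as a software consumer:

```c
/* Sketch of pull-type (coherent DMA) handoff: the CPU fills a pinned
 * shared buffer, publishes a flag, and proceeds; the FPGA then reads
 * the data directly out of the CPU's cache via coherence. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t    data[8];   /* one cache line of payload (pinned) */
    atomic_uint ready;     /* polled coherently by the FPGA      */
} dma_desc;

void cpu_publish(dma_desc *d, const uint64_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d->data[i] = src[i];
    /* release ordering: payload writes become visible before the flag */
    atomic_store_explicit(&d->ready, 1, memory_order_release);
    /* the CPU continues immediately: no copy to DRAM, no stall */
}

/* What the FPGA-side pull amounts to, modeled in software. */
int device_try_pull(dma_desc *d, uint64_t *dst, size_t n)
{
    if (!atomic_load_explicit(&d->ready, memory_order_acquire))
        return 0;
    for (size_t i = 0; i < n; i++)
        dst[i] = d->data[i];
    return 1;
}
```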
The coherent interface is also beneficial when polling infrequent events [59]. Fig-
ure 2.6 illustrates this by comparing (a) non-coherent polling through MMR reading
and (b) coherent polling through a shared address. In both cases, the event to be
polled is represented as a star, and the CPU polls it before and after the event, de-
noted as (1) and (2) respectively. In Figure 2.6.(a), (1) and (2) have the same MMR
reading latency, while in (b), (1) has the negligible latency of a cache hit and (2) has
up to twice the cache miss latency. Thus, when the event is infrequent, the majority
of checks performed by the CPU are simply a cache hit and do not stall the CPU at
all.
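The coherent polling loop of Figure 2.6(b) reduces, in software, to spinning on a shared cached flag; every failed check is an L1 hit, and only the check that observes the event pays a miss. The flag and its update path are modeled in plain memory here:

```c
/* Sketch of coherent polling on a shared flag.  In FARM the flag would
 * be a shared location the FPGA writes; each failed check below costs
 * only a cache hit, not an off-chip MMR read. */
#include <stdatomic.h>

/* Poll up to `max_checks` times; returns the index of the check that
 * observed the event, or -1 if the event never occurred. */
int coherent_poll(const atomic_int *flag, int max_checks)
{
    for (int i = 0; i < max_checks; i++)
        if (atomic_load_explicit(flag, memory_order_acquire))
            return i;
    return -1;
}
```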
Table 2.4 summarizes communication mechanisms based on FARM’s three inter-
faces and their proposed usages. The MMR bandwidth numbers are for MMRs in
uncached memory. The round-trip latency to the FPGA is the limiting factor for
the MMR bandwidth. The bandwidth of the FPGA reading from the CPU’s cache
is limited by the bandwidth of the cHT core because the data read pathway has not
been optimized. Measurements indicate that optimizing this pathway could bring
this number up to at least 320 MB/s.
2.2.2 Tolerating latency and reordering
For many applications, like TM, that require fine-grained (frequent) communication
between the processor and an accelerator, asynchronous communication is essential
for performance. When using fully asynchronous communication to out-of-core de-
vices, however, it is incorrect to assume that commands are received by the accelerator
in the same order they were dispatched from the processors. Consider the following
example: One processor sends a command to add an address to a transaction’s read
set; this command stalls in the processor’s write-combining bu↵er. Later, a commit-
ting transaction on another processor sends notification that it is writing to that same
address. This notification arrives immediately (before the preceding add to read set
by the first processor) and thus the conflict is missed because the FPGA sees the com-
mit notification and the add to the read set command in reverse order. To avoid the
performance penalty of a more synchronous communication scheme (e.g. an mfence
after each command), accelerators such as those in TMACC must therefore reason
about possible command reorderings.
To address this serious issue, we present epoch-based reasoning and apply the
technique to our Bloom filter accelerators. In this scheme, we split time into variable
sized epochs, either locally determined (local epochs) or globally agreed upon (global
epochs). Global epochs can be implemented using a single shared counter variable
that is atomically incremented when a thread wants to move the system into a new
epoch. To inform the accelerator of the epoch in which a command is executed, the
epoch counter, which will usually be in the L1 cache, is read and included in the com-
mand. The accelerator then compares the epochs of commands to determine a coarse
ordering, with the atomic increment providing the necessary happens-before relation-
ship between threads. The accelerator cannot determine the ordering of commands
with the same epoch number, since it may only assume the commands were fired at
some point during the epoch (see Figure 2.7). Thus, the granularity of epoch changes
[Figure 2.7 content: timelines for threads A, B, and C across epochs N-1, N, and N+1, under both global and local epoch schemes.]
Figure 2.7: To determine the ordering of events, time is divided into epochs, either globally or locally. In the global epochs example, it is known that A comes before B and C, but not the relative ordering of B and C. In the local case, it is known that C comes before B, but not the ordering of A and B or A and C, because their epochs overlap.
determines the granularity at which the accelerator is able to determine ordering.
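The global-epoch scheme can be sketched as below. The command format is illustrative; the essential points are that the counter is bumped atomically (providing the happens-before edge) and that reading it to tag a command is normally an L1 hit:

```c
/* Sketch of global epochs: one shared counter, sampled into every
 * command sent to the accelerator and atomically incremented when a
 * thread needs an ordering point. */
#include <stdatomic.h>
#include <stdint.h>

static atomic_uint_fast64_t global_epoch = 1;

typedef struct {
    uint64_t epoch;   /* epoch in which the command was issued          */
    uint64_t addr;    /* payload, e.g. an address added to a read set   */
} acc_cmd;

/* Tag a command with the current epoch (usually a cheap cached load). */
acc_cmd make_cmd(uint64_t addr)
{
    return (acc_cmd){ atomic_load(&global_epoch), addr };
}

/* Move the system into a new epoch before an ordering-critical action
 * such as a commit; the atomic increment is the happens-before edge. */
uint64_t advance_epoch(void)
{
    return atomic_fetch_add(&global_epoch, 1) + 1;
}

/* The accelerator can order two commands only across distinct epochs;
 * commands sharing an epoch are unordered. */
int ordered_before(acc_cmd a, acc_cmd b)
{
    return a.epoch < b.epoch;
}
```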
The potentially high overhead of maintaining a single global counter can be elim-
inated by using epochs local to each thread. When a thread wants to move into a
new local epoch, it sends a command to the accelerator to inform it of an epoch
change and performs a memory fence to ensure any command tagged with the new
epoch number happens after the accelerator sees the epoch change. The epoch change
command can often be included in an existing synchronous command with low cost.
While this scheme has less overhead, it leaves the accelerator with less information
about the ordering of events. Like the global scheme, the accelerator may only as-
sume the command was fired at some point during the epoch; therefore the relative
ordering of commands from different threads can only be determined if their epochs
do not overlap, as illustrated in Figure 2.7.
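With local epochs, the accelerator knows only an interval of time during which each command may have fired, so two commands can be ordered only when their intervals are disjoint. A sketch of that test, representing each command's epoch as a (start, end) interval on a common timeline (an illustrative representation, not the hardware's):

```python
def can_order(interval_a, interval_b):
    """Order two commands by their epoch intervals.

    Returns 'a_before_b' or 'b_before_a' when the intervals are
    disjoint, and 'unknown' when they overlap (as for A and B or
    A and C in the local epochs example of Figure 2.7).
    """
    a_start, a_end = interval_a
    b_start, b_end = interval_b
    if a_end < b_start:
        return "a_before_b"
    if b_end < a_start:
        return "b_before_a"
    return "unknown"
```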
2.3 Microbenchmark Analysis
Designers of memory-system-based accelerators, such as FARM, would benefit from understanding how key application characteristics affect the overhead introduced by the system. For example, it is clear that one would avoid the fully synchronous MMR write for frequent communication with the accelerator. Less obvious, however, is the choice between using streaming versus DMA for moving data to the accelerator. Side effects such as CPU involvement, which would be considerably greater in the streaming
case, complicate matters further.

Algorithm 1 Microbenchmark to characterize communication mechanisms.
procedure MainLoop(numIter, commType, N, M, K)
    for i = 1 to numIter do
        for j = 1 to K do
            InitCommunication(commType, M)
            DoComputation(N)
        Synchronize(commType)

procedure InitCommunication(commType, M)
    switch commType do
        case MMR: doMMRWrite(M)
        case STREAM: doStreamWrite(M)
        case DMA: InitiateDMA(M)

procedure DoComputation(N)
    for j = 1 to N do
        nop()

procedure Synchronize(commType)
    switch commType do
        case MMR: doNothing()                      ▷ MMR is always synchronous
        case STREAM: flushWriteCombiningBuffer()
        case DMA: waitForDMADone()
To adequately address questions such as these, we constructed a microbenchmark
that allows for variation of key parameters affecting communication overhead. Algorithm 1 displays its pseudocode. Three parameters control the behavior of the
communication:
• N controls the frequency of communication. That is, communication happens
every N CPU operations.
• M controls the granularity of communication by specifying how much data (in
bytes) is transferred per communication.
• K controls the frequency of synchronization. Synchronization occurs after every K sets of communication/computation segments. If K is ∞, we assume synchronization happens only once: at the end of the application.

Figure 2.8: Analysis of communication mechanisms using the microbenchmark in Algorithm 1. Both panels plot measured communication overhead (cycles/B, lower is better) against communication granularity M (bytes) for the STREAM and DMA interfaces: (a) the effect of communication granularity (M) and frequency (N); (b) the effect of synchronization frequency (K). The detailed meaning of parameters M, N, and K can be found in Algorithm 1.
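The loop structure driven by these three parameters can be sketched in Python; the hardware-specific communication and synchronization primitives are stand-ins supplied by the caller, and all names are illustrative:

```python
def main_loop(num_iter, comm_type, n, m, k,
              init_communication, do_computation, synchronize):
    """Structure of the microbenchmark: each iteration performs K
    communication/computation segments, then one synchronization."""
    for _ in range(num_iter):
        for _ in range(k):
            init_communication(comm_type, m)   # MMR, STREAM, or DMA write of M bytes
            do_computation(n)                  # N no-op CPU operations
        synchronize(comm_type)                 # flush or wait, depending on comm_type

# Example instrumentation: count communications and synchronizations.
counts = {"comm": 0, "sync": 0}

def fake_init(comm_type, m):
    counts["comm"] += 1

def fake_compute(n):
    pass

def fake_sync(comm_type):
    counts["sync"] += 1

main_loop(10, "STREAM", n=256, m=1024, k=5,
          init_communication=fake_init,
          do_computation=fake_compute,
          synchronize=fake_sync)
# counts == {"comm": 50, "sync": 10}
```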
Figure 2.8 explores the effects of communication granularity, communication frequency, and synchronization on communication overhead. The vertical axis is communication overhead measured in cycles per byte received by the FPGA (lower is better). We first examine the case of asynchronous communication (i.e. K is ∞) in graph (a).
For the streaming interface (solid lines), the results for all communication frequencies are asymptotic, with the overhead approaching 2.8 cycles/B for large M. After taking into account the CPU clock frequency (1.8 GHz), this value is close to the 630 MB/s bandwidth limit reported in Table 2.4. As we decrease M, however, we see
Figure 2.9: Visualized explanation of Figure 2.8. (a) Stream interface: when the granularity (M) is large, the communication overhead is determined solely by the bandwidth limit, while the CPU's instruction reordering can hide the overhead for small M. (b) DMA interface: a similar explanation applies, and the communication overhead can be completely hidden depending on the choice of M and N.
the overhead decrease and even drop below the bandwidth limit. This is because for smaller amounts of data, the overhead can be hidden by the CPU's out-of-order window. Figure 2.9(a) provides a visualized explanation of this effect. For frequent communication (N=256), there is not enough computation to hide the communication latency, which explains the increased overhead for this data point compared to the other three.
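The streaming asymptote can be checked with direct arithmetic: a 1.8 GHz clock divided by a 630 MB/s link gives roughly 2.86 cycles per byte, consistent with the observed 2.8 cycles/B plateau.

```python
cpu_hz = 1.8e9            # CPU clock frequency (1.8 GHz)
link_bw = 630e6           # streaming bandwidth limit from Table 2.4, bytes/s
cycles_per_byte = cpu_hz / link_bw
# cycles_per_byte ≈ 2.857, matching the ~2.8 cycles/B asymptote
```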
For DMA communication of data from the CPU's cache to the FPGA (dashed lines), we immediately see that the overhead is increased. Note, however, that the general behavior of the curves is similar to that of the streaming case. Figure 2.9(b) provides further insight into DMA behavior. The figure on the left depicts
the case where N=16384 and M=1024. In this scenario, the actual DMA transfer
time is fully overlapped with the subsequent computation. When this is the case, the
overhead is simply the time taken to setup the DMA. For very small M , the small
amount of computation per DMA is not enough to amortize this setup time. As the
amount of data per communication goes up, the setup time is amortized and the
overhead per byte goes down. If we increase M to the point that data transfer time
becomes longer than computation time (seen on the right of Figure 2.9(b)), we see a
dramatic increase in the overhead. As in the streaming case, the overhead converges
to the bandwidth of the DMA transfer (See Table 2.4).
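This DMA behavior can be captured by a simple analytic model, sketched below. The setup cost and bandwidth values are illustrative parameters, not measured constants; the model only assumes, as described above, that setup is always exposed while the transfer overlaps with computation.

```python
def dma_overhead_per_byte(m, n, setup_cycles, cycles_per_byte_bw, cycles_per_op=1.0):
    """Overhead per byte for one DMA of m bytes overlapped with n CPU ops.

    The DMA setup is always on the critical path; the transfer itself
    adds overhead only once it exceeds the overlapping computation time.
    """
    compute = n * cycles_per_op
    transfer = m * cycles_per_byte_bw
    exposed = setup_cycles + max(0.0, transfer - compute)
    return exposed / m

# Small m: setup is poorly amortized. Large m: converges to the bandwidth.
small = dma_overhead_per_byte(m=100, n=16384, setup_cycles=1000, cycles_per_byte_bw=2.0)
large = dma_overhead_per_byte(m=1_000_000, n=16384, setup_cycles=1000, cycles_per_byte_bw=2.0)
# small = 10.0 cycles/B; large ≈ 2.0 cycles/B
```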
Figure 2.8(b) explores the effect of synchronization frequency. Smaller K means more frequent synchronization. We take two data points from graph (a) for both streaming (N=256) and DMA (N=16384), and we vary K. For the streaming interface, synchronization means flushing the write-combining buffer. For coherent DMA, synchronization requires waiting (busy waiting) until all queued DMA operations have finished. For very large communication granularity (M), the overhead is bounded by the bandwidth in both cases and synchronization does not matter. For smaller M, however, both communication methods exhibit an increase in overhead. For the streaming interface, flushing the write buffer cripples the CPU's out-of-order latency-hiding effect, hence the increased overhead for K=1. For DMA, synchronization adds the fixed overhead of setting up the DMA.
2.4 Transactional Memory Case Study
Transactional memory (TM) [39, 49] is a potential way to simplify parallel program-
ming. Ideally, TM would allow programmers to make frequent use of large transac-
tions and have them perform as well as highly optimized fine-grain locks. However,
this ideal cannot be realized until there are real systems capable of executing large
transactions with low overhead. Our aim in this section is to describe a TM system
that strikes a reasonable balance between performance, cost and system implementa-
tion complexity.
Researchers have proposed a wide variety of TM systems. There are systems im-
plemented completely in hardware (HTMs), completely in software (STMs), and more
recently, systems with both hardware and software components (hybrid TMs). To put
our contributions in context, we now briefly review the strengths and weaknesses of
the various TM design alternatives.
2.4.1 TM Design Alternatives and Related Work
STM
Software transactional memory (STM) systems [70, 34, 38, 66, 53, 75] replace the
normal loads and stores of a program with short functions (“barriers”) that pro-
vide versioning and conflict detection. These transactional read and write barriers
must themselves be implemented using the low-level synchronization operations pro-
vided by commodity processors. The barriers can be inserted automatically by a
transaction-aware compiler [8, 80, 5] or managed runtime [66], added by a dynamic
instrumentation system [62], or invoked manually by the programmer. STMs increase
the number of executed instructions, perform extra loads and stores, and require meta-
data that takes up cache space and needs to be synchronized. The resulting inherent
performance penalty means that despite providing good scalability, most STMs fall
far short of the performance offered by hardware-based approaches to TM. There have been proposals that reduce the overhead required [80], but they do so by giving up on the promise of TM: they require small transactions that are used rarely. Hence, using these STMs is as difficult as using fine-grain locks. As a result of these limitations, STMs have been largely constrained to the domain of research [21]. However, techniques developed in STM research have been successfully used for optimized parallel data structures [17].
HTM
At the opposite end of the spectrum from STM is hardware transactional memory
(HTM) [37, 23, 12, 65, 13, 51]. HTM systems eliminate the need for software barri-
ers by extending the processor or memory system to natively perform version man-
agement and conflict detection entirely in hardware, allowing them to demonstrate
impressive performance. Version management in an HTM is performed by either
buffering speculative state (typically in the cache or store buffer) or by maintaining
an undo log. Metadata that allows conflict detection is typically stored in Bloom
filters (signatures) or in bits added to each line of the cache. The close synergy of the
hardware with the processor core and cache allow these systems to provide very high
levels of performance; however this tight relationship causes the system to be inflexi-
ble and more costly. Recent advances in HTM design address both of these problems
by minimizing the coupling between the TM and the processor core [82, 71], but even
decoupled HTM designs introduce nontrivial design complexity and disturb the delicate control and data paths in the processor core. The first-level cache has the effect of hiding loads and stores from the outside world, making it impossible to construct
an out-of-core pure HTM system. Previous studies have not explored the possibil-
ity of adding transactional acceleration hardware without modifying a commodity
processor core.
In addition to the design complexity introduced by hardware-based TMs, there
still remains some uncertainty as to the optimal lightweight, forward-compatible se-
mantics appropriate for transactional memory. Several open questions are yet to
be resolved: strong versus weak isolation, methods of handling I/O, optimal contention management, virtualizing unbounded transactions, etc. These features have raised questions because of the difficulty of virtualizing them and their inability to elegantly handle unbounded transactions. When one also considers the latent skepticism regarding transactional memory as a viable general programming model, the hesitation
of hardware vendors to wholly adopt TM features may seem justified. Given these
barriers to adoption, it is not terribly surprising that the microprocessor industry has
yet to embrace HTM.
Hybrid TM
One way of limiting the complexity required by an HTM is to provide a limited
best-effort HTM that falls back to an STM if it is unable to proceed [30, 48, 79,
40, 24]. These systems are particularly well-suited for supporting lock-elision and
small transactions. However, applications that use large transactions (or cannot tune
their transactions to avoid capacity and associativity overflows) will find that they
derive no benefit. This approach is especially problematic as the research community
explores transactional memory as a programming model, since it prescribes a limit
on how transactions may be used efficiently.
Hardware Accelerated STM
Hardware accelerated STMs are a type of hybrid TM that use dedicated hardware to
improve the performance of an STM. This hardware typically consists of bits in the
cache or signatures that accelerate read set tracking and conflict detection. Existing
proposals extend the instruction set to control new data paths to the TM hardware.
Explicit read and write barriers then use the TM hardware to accelerate conflict
detection and version management [72, 67, 19].
TMACC Motivation
We observe that hardware acceleration of an STM’s barriers only requires that the
runtime be able to communicate with the hardware; the TM hardware need not be
part of the core or connected to the processor with a dedicated data path. Commodity
processors are already equipped with a network that provides high bandwidth, low
latency, and dedicated instructions for communication: the coherence fabric. This
leads to the unexplored design space of hardware accelerated TM systems that do
not modify the core, or Transactional Memory Acceleration using Commodity Cores
(TMACC). Early simulation results, presented in Figure 2.10, show the promising
potential of TMACC systems to perform within five to ten percent of an in-core
hybrid TM system. These results also suggest that much of that performance can
be realized despite a relatively large latency between the processing cores and the
TMACC hardware.
Keeping the hardware outside of the core maintains modularity, allowing archi-
tects to design and verify the TM hardware and processor core independently. This
significantly reduces the cost and risk of implementing TM hardware and allows de-
signers to migrate a core design from one generation to the next while continuing to
provide transactional memory acceleration.
There is therefore great benefit in exploring TM systems that can be feasibly
constructed using commodity processors. Such systems will allow researchers to:
1. better understand and fine-tune TM semantics using real hardware and large
applications
Figure 2.10: Average (mean) performance on the STAMP suite of two simulated TMACC systems, one two cycles away from the core (L1) and one two hundred cycles away (MEM). These are compared to TL2, a pure STM, and an in-core hybrid TM system much like SigTM. Simulated speedup is plotted against the number of processors (2, 4, 8, 16).
2. explore the extent of speedup and hardware acceleration possible without mod-
ifying the processor core
3. better understand the issues associated with tolerating the latency of out-of-core
hardware support for TM
To derive these benefits in this work, we describe the design and implementation
of a hardware accelerated TM system, implemented with commodity processor cores.
Like the accelerators presented in systems like FlexTM [71], BulkSC [22], LogTM-
SE [82], and SigTM [19], we use Bloom filters as signatures of a transaction’s read
and write sets. Unlike these previous proposals, our Bloom filters are located outside
of the processor and require no modifications to the core, caches, or coherence pro-
tocol. In this thesis we also address the non-trivial challenges encountered when the
acceleration hardware is moved out of the core.
2.4.2 Accelerating TM
In this section we present our system for Transactional Memory Acceleration using
Commodity Cores, or TMACC. We first give a high level overview of our design
decisions and describe our general use of Bloom filters. We follow with a more detailed
description of our Bloom filter hardware, which is general and flexible enough to be
placed anywhere in the system. We describe how we implement this hardware using
FARM, and using the two techniques described in Section 2.2.2, present two distinct
TM algorithms using this hardware.
In any TM system, the processor must have very low latency access to transaction-
ally written data while hiding that data from other executing threads. Performing
this version management in hardware and being able to promptly return specula-
tively written data would almost certainly require modification of the L1 data cache
or the data path to that cache. Previously proposed HTM systems use buffers next to
the L1, or the L1 itself, to store this speculative data until the transaction commits.
Imposing out-of-core latencies on these accesses would significantly degrade perfor-
mance. We therefore conclude that performing hardware-based or hardware-assisted
version management in a TMACC system is impractical.
To address this issue of version management, our software runtime uses a heavily optimized chaining hash table as a write buffer. A transactional write simply adds an address/data pair to this hash table. Each transactional read must first check for inclusion of the address in the write buffer. If it is present, the associated data is used; otherwise, a normal load is performed. The hash table is optimized to return quickly in the common case where the key (the address) is not in the table. Once the transaction has been validated and ordered (i.e. given permission to commit), the write buffer is walked and each entry applied directly to memory. The details of write buffer data structures are more thoroughly explored elsewhere [34, 66, 53, 29].
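A minimal sketch of such a write buffer, with a plain Python dict standing in for the heavily optimized chaining hash table (names are illustrative):

```python
class WriteBuffer:
    """Speculative write buffer: address -> data, applied to memory on commit."""

    def __init__(self):
        self._table = {}

    def write(self, addr, data):
        self._table[addr] = data          # transactional write: buffer only

    def read(self, addr, memory):
        # Fast path: most reads miss the write buffer and fall through
        # to a normal load from memory.
        if addr in self._table:
            return self._table[addr]
        return memory[addr]

    def apply(self, memory):
        # After validation and ordering, walk the buffer and write back.
        for addr, data in self._table.items():
            memory[addr] = data
        self._table.clear()

memory = {0x100: 1, 0x104: 2}
wb = WriteBuffer()
wb.write(0x100, 42)
assert wb.read(0x100, memory) == 42   # sees its own speculative write
assert wb.read(0x104, memory) == 2    # falls through to memory
wb.apply(memory)
assert memory[0x100] == 42
```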
Application of the write buffer could potentially be performed by the TMACC hardware, freeing the processor up to continue on to the next transaction. However, initial experiments showed that any benefit is outweighed by the impact of reloading the data into the processor's cache after application of the write buffer. This is an area of potential future work.
Like version management, checkpointing the architectural state at the beginning
of a transaction and restoring that state upon rollback would require significant mod-
ification to the processor core in order to be effectively and efficiently handled in
hardware. We thus perform this entirely within the software runtime using the stan-
dard sigsetjmp() and longjmp() routines.
Figure 2.11: Logical block diagram of the Bloom filters. A control block drives an array of filters (Filter 0 through Filter n) and their hash units through add (we), query, clear, copy_in, and copy_out operations, with per-filter tag (tag_in, tag_we, tag_hit, tag_gt), hit, and bit-copy (bits_in, copy_in_data, copy_out_data) signals; commands arrive over a data/addr/wren/req/ack interface.
This leaves conflict detection as the best target for out-of-core hardware accel-
eration. After all, the speculative nature of an optimistic TM system means that
the latency of the actual detection of conflicts is not on the critical path. Conflict
detection is a primary contributor to execution overhead in STM systems, and many
STM proposals have attempted to improve it.
In this work, we present two novel methods for performing conflict detection,
both of which use Bloom filters as signatures of a transaction’s read and write set.
Bloom filters [11] have been shown to be an effective data structure for holding sets of
keys with very low overhead and have been used for multiple applications, including
the acceleration of transactional memory [19, 82, 22, 75]. Like several other TM
proposals, TMACC uses Bloom filters to encode the read and write sets of running
transactions. When a transaction commits, each address that is written can be quickly
checked against the read and write sets of other concurrent transactions in order to
discover conflicts. Details of the TM algorithm can be found in Section 2.4.4. The
TMACC system presented in this work assumes a lazy optimistic STM. There are no
fundamental reasons, however, why TMACC could not be used to accelerate an eager
pessimistic system.
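A minimal software sketch of this use of Bloom filters follows. The hash family here (salted Python hashing) and the filter parameters are illustrative; the hardware uses H3-class hashes over 4 Kbit filters.

```python
class BloomFilter:
    """Set signature: false positives are possible, false negatives are not."""

    def __init__(self, nbits=4096, nhashes=4):
        self.nbits = nbits
        self.bits = 0
        # Illustrative hash family: salted Python hashing, not the
        # H3 class used by the hardware.
        self.salts = list(range(nhashes))

    def _positions(self, key):
        return [hash((salt, key)) % self.nbits for salt in self.salts]

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def query(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

# Conflict detection: check a committing write against another
# transaction's read set signature.
read_set = BloomFilter()
read_set.add(0xdeadbeef)                 # transaction T1 read this address
conflict = read_set.query(0xdeadbeef)    # T2 commits a write to the same address
# conflict is True: T1 must abort (modulo false positives for other addresses)
```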
2.4.3 Implementing TMACC on FARM
In order to fully qualify a TMACC design, we needed a platform that would allow for easy experimentation with real applications, so we developed FARM [61] (see Section 2.1). To implement TMACC, we use two of the logical interfaces for communication between the TMACC accelerator and the CPU: a) the coherent interface
which uses cache lines managed by the coherence protocol and b) the stream interface
which provides streaming (or “fire-and-forget”) non-coherent communication.
Bloom filters
Figure 2.11 presents a block diagram of a collection of Bloom filters. Note that while
logic symbols are used, Figure 2.11 does not represent a physical implementation,
but a logical diagram of the functionality provided. In addition to the normal add,
clear, and query operations, each individual Bloom filter provides functionality to
copy bits in from another filter or broadcast out its bits to other filters. Each Bloom
filter also has a tag associated with it, which can be used, for example, to associate
a Bloom filter with a particular thread of execution. Programmability of the module
is achieved in the control block, which can be programmed to translate high level
application-specific operations to the low level operations (add, query, clear, copy in,
and copy out) sent to each individual Bloom filter. These operations can potentially
be predicated by the tag hit and tag gt signals.
On FARM, the Bloom filters are placed in the placeholder marked “User Appli-
cation” in Figure 2.1. We use four randomly selected hash functions from the H3
class [20]. We considered using PBX hashing [83], which is optimized for space ef-
ficiency, but we were not constrained by logic resources on the FPGA. We perform
copying by stepping through the block RAM word by word. In order to reduce the
number of cycles needed to copy, filters requiring copy support use additional RAM
blocks to widen the interface, resulting in more logic cells for the datapaths. All filters
are logically 4 Kbits in size.
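An H3-class hash can be sketched in a few lines: each function is defined by a random bit matrix, and the hash is the XOR of the matrix rows selected by the set bits of the key. This is a software sketch; in hardware the same computation reduces to pure XOR trees.

```python
import random

def make_h3(key_bits, hash_bits, seed=0):
    """Build one H3-class hash: one random hash_bits-wide row per key bit;
    h(x) = XOR of the rows whose corresponding key bit is set."""
    rng = random.Random(seed)
    rows = [rng.getrandbits(hash_bits) for _ in range(key_bits)]

    def h(x):
        acc = 0
        for i in range(key_bits):
            if (x >> i) & 1:
                acc ^= rows[i]
        return acc
    return h

# Four independently seeded functions index a logically 4 Kbit
# (2^12-entry) filter.
hashes = [make_h3(key_bits=64, hash_bits=12, seed=s) for s in range(4)]
positions = [h(0xdeadbeef) for h in hashes]   # bit positions to set or test
```

A useful property visible in the sketch: H3 hashes are linear over XOR, i.e. h(a ^ b) == h(a) ^ h(b), which is what makes the hardware implementation cheap.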
Software communicates with the Bloom filters using the memory subsystem, which
is the fastest (both highest bandwidth and lowest latency) I/O path to and from a
commodity processor core. Uncached “fire-and-forget” stores can be used to send
asynchronous commands to the filters, such as a request to add an address to a
transaction’s read set. FARM’s data stream interface provides similar functionality;
however, its Barcelona processors are not able to perform true fire-and-forget stores.
Instead, “write-combining” memory is used to provide a way to stream data to the
FPGA with minimal impact on the running processor [61]. The Bloom filter hardware
performs commands serially in the order they are received by the FPGA. The imple-
mentation is pipelined, allowing the filters to easily process all incoming commands
even when the link is fully saturated.
For asynchronous responses, such as a filter match notification indicating a conflict
between transactions, the filters use FARM’s coherent interface to store a message in
a previously agreed upon memory location, or mailbox [59]. The application receives
notification of Bloom filter matches (i.e. conflicts) by periodically reading this mail-
box. In the common case of no conflicts, this check is very cheap as it consists of a
read that hits the processor’s L1 cache.
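The mailbox protocol can be sketched as follows (a software model with illustrative names; in the real system the FPGA's coherent store and the runtime's cached read replace the dictionary operations):

```python
# Agreed-upon memory location: the FPGA (producer) writes a conflict
# message here; the runtime (consumer) polls it periodically.
mailbox = {"conflict": False, "tid": None}

def fpga_post_conflict(tid):
    """Model of the FPGA storing a match notification via the coherent
    interface."""
    mailbox["tid"] = tid
    mailbox["conflict"] = True

def runtime_check_mailbox():
    """Poll the mailbox. Common case: no conflict, a single cheap read."""
    if not mailbox["conflict"]:
        return None
    mailbox["conflict"] = False
    return mailbox["tid"]
```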
Using out-of-core Bloom filters that communicate using the memory system allows
us to easily perform virtualization. The software runtime maintains the pool of Bloom
filters, explicitly managing the binding between software threads and hardware filters.
Issues such as interrupt handling, context switching, and thread migration are thus
transparent to the acceleration hardware. If the hardware were added to the processor
core, these issues would become much more complex and expensive, as the core would
be physically tied to a specific Bloom filter.
2.4.4 Algorithm Details
We propose two different transactional memory algorithms in this section: one using
global epochs (TMACC-GE) and one using local epochs (TMACC-LE). In both of
these schemes, a filter match represents a conflict that requires a transaction to abort,
and a pre-set mailbox is used to notify the STM runtime. Both schemes provide
privatization safety. Publication safety could be provided by constraining the commit
order as in an STM; we don’t expect TMACC to make this either easier or harder.
When using Bloom filters to perform conflict detection, an important decision is
what logical keys are put into the Bloom filter to designate a shared variable. This
decision determines the granularity at which conflicts are detected. In our systems, we
simply use the virtual address of the shared variable as the key (later referred to as a
reference). For structures and arrays, each unique word is a separate shared variable.
An object identifier or something similar could also be used as a reference.
To efficiently manage RAM resources on the FPGA, we use two slightly different instantiations of the Bloom filter design for TMACC-LE and TMACC-GE. TMACC-LE uses 24 filters: 8 for each of the read, write and missed sets. The 16 used for
the write and missed sets support copying. TMACC-GE uses a total of 40 filters: 8
for the read sets and 32 for the write sets, none of which support copying. An ASIC
implementation would not be constrained by the number of RAM blocks, and both
Algorithm 2 Pseudocode for the TMACC-GE runtime.
procedure WriteBarrier(tid, ptr, val)
    AddToWritebuffer(tid.wb, ptr, val)

procedure ReadBarrier(tid, ptr)
    HW AddToReadSet(tid, ptr, global epoch)
    if WritebufferContains(tid.wb, ptr) then
        return WritebufferLookup(tid.wb, ptr)
    WaitForFreeLock(ptr)
    return *ptr

procedure Commit(wb)
    AcquireLocksForWB(wb)
    epoch = global epoch
    if violation mailbox[wb.tid] == true then return failure
    for entry in wb do
        HW WriteNotification(wb.tid, entry.address, epoch)
    violated = HW AskToCommit(wb.tid)          ▷ Synchronous
    if violated then ReleaseLocks(); return failure
    for entry in wb do
        *(entry.address) = entry.specData
    AtomicIncrement(global epoch)
    ReleaseLocks()
    return success
TMACC-GE and TMACC-LE could use the same design [33].
Global Epochs
In the global epoch scheme, the Bloom filters are split into two banks. One bank
maintains the read set for each active transaction in the system. Each read set
holds the references read during the execution of the associated transaction. The
other bank contains filters which hold the write set for a given epoch; the write set
is composed of writes that were performed by any transaction during that epoch.
The Bloom filter tags are used to determine which Bloom filter in this bank corresponds to which epoch. When the filters receive a HW AddToReadSet, the reference is added to the transaction's read set and checked against the write set for the
given and all previous epochs. A conflict is signalled on any match, thus ensuring
HW AddToReadSet(tid, reference, epoch): Asynchronously adds reference to tid's read set and enables notification for any write that could possibly make this read inconsistent. Queries each write set that has an epoch number less than or equal to epoch for reference, triggering a conflict in tid if a match is found or if epoch is less than the epoch of the oldest write set.

HW WriteNotification(tid, reference, epoch): Asynchronously queries all read sets, except tid's, and triggers a conflict in any transaction whose read set includes reference. Adds reference to the write set for epoch epoch, clearing and replacing an old epoch's write set if necessary.

HW AskToCommit(tid): Synchronously processes all outstanding commands and returns the conflict status of tid.

Table 2.5: TMACC hardware functions used by TMACC-GE.
a match against any write that could have occurred prior to the read. When the
filters receive a HW WriteNotification, the reference is added to the given epoch’s
write set and checked against each transaction’s read set, ensuring that any read
that could possibly come after, or has come after, the associated write will signal a
conflict. In the case that there is not a filter currently associated with the epoch of
a HW WriteNotification, and the epoch is greater than the oldest epoch for which
a filter exists (i.e. this is a new epoch), the write set filter of the oldest epoch is
cleared and replaced with a new write set containing the address to be added (and
tagged with the new epoch number). If no filter exists for the epoch in either a
HW WriteNotification or a HW AddToReadSet, and the epoch is older than the oldest epoch for which a write set exists, then the command comes from an epoch that
is too old to have a filter and conservatively triggers a conflict. Since the ordering of
reads and writes within the same epoch cannot be determined, this scheme has the
effect of logically moving all reads to the end of the epoch in which they are performed
and all writes to the beginning. These operations are summarized in Table 2.5.
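The behavior summarized in Table 2.5 can be modeled in software as follows. Exact sets stand in for the Bloom filters, so this sketch has no false positives; it captures the structure of the operations, not the hardware.

```python
class TmaccGE:
    """Software model of the TMACC-GE filter banks (names follow Table 2.5)."""

    def __init__(self, num_threads, num_epoch_filters):
        self.read_sets = {t: set() for t in range(num_threads)}
        self.write_sets = {}                  # epoch -> set of references
        self.max_epochs = num_epoch_filters   # filters available for write sets
        self.conflict = {t: False for t in range(num_threads)}

    def add_to_read_set(self, tid, ref, epoch):
        self.read_sets[tid].add(ref)
        if self.write_sets and epoch < min(self.write_sets):
            self.conflict[tid] = True         # too old to check: conservative
            return
        # Check the given and all previous epochs' write sets.
        if any(ref in ws for e, ws in self.write_sets.items() if e <= epoch):
            self.conflict[tid] = True

    def write_notification(self, tid, ref, epoch):
        # Trigger a conflict in any other transaction that read ref.
        for t, rs in self.read_sets.items():
            if t != tid and ref in rs:
                self.conflict[t] = True
        if epoch not in self.write_sets:
            if self.write_sets and epoch < min(self.write_sets):
                self.conflict[tid] = True     # epoch too old to have a filter
                return
            # New epoch: recycle the oldest epoch's filter if necessary.
            if len(self.write_sets) >= self.max_epochs:
                del self.write_sets[min(self.write_sets)]
            self.write_sets[epoch] = set()
        self.write_sets[epoch].add(ref)

    def ask_to_commit(self, tid):
        return self.conflict[tid]
```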
Algorithm 2 gives high level pseudo-code for the algorithm used by the TMACC-
GE software runtime. Each read is instrumented to inform the Bloom filters of the
reference being read. Since the command is asynchronous, the only per-read-barrier cost of doing conflict detection is the cost of firing off the command to the FPGA. To commit the transaction, the runtime first acquires locks for each address in its write buffer, using a low-overhead striped locking technique similar to that of TL2 [34]. To ensure
that all of its writes are assigned to the same epoch, a local copy of the global epoch
counter is stored and used to inform the hardware of all the references that are about
to be committed. Locks are necessary to ensure that any readers of partially com-
mitted state perform the read in the same epoch as the commit. Without them, the
epoch could be incremented and a read of a partial commit performed in the following
epoch. This read would (incorrectly) not be flagged as a conflict. Once all of the locks
are obtained, the running transaction must synchronize with the filters to ensure that
it has not been violated up until the point the filters perform the HW AskToCommit
operation. If the transaction read a value that had been committed in the current
or any previous epoch, either the HW WriteNotification would have matched on the
read set and triggered a conflict, or the HW AddToReadSet would have matched
against one of the epoch’s write sets. Therefore, when the HW AskToCommit is
performed on the FPGA, the transaction’s read set is coherent and consistent if no
conflict has been seen by the FPGA. The transaction is then placed in the global or-
dering of transactions on the system and allowed to apply its write buffer to memory. Once the write buffer has been applied, the transaction atomically increments the
global epoch counter so that any thread that reads the newly committed value will
read it in the new epoch and not be violated. It then releases the locks and returns.
It is important to note that the locks used in TMACC-GE are simple mutex locks used only to ensure the atomicity of a commit, not the versioned locks used for conflict detection in TL2. TMACC-GE can thus use coarser-grain locking than TL2. We found that 2^16 locks is ideal for TMACC-GE, while TL2 performs best with 2^20.
Local Epochs
To perform conflict detection using local epochs, each transaction is assigned three
filters: a read set, a write set, and a missed set. As before, the read set main-
tains the references read during the transaction. The write set holds references that
CHAPTER 2. TIGHTLY COUPLED ACCELERATION 44
Algorithm 3 Pseudocode for the TMACC-LE runtime.

procedure WriteBarrier(tid, ptr, val)
    AddToWritebuffer(tid.wb, ptr, val)

procedure ReadBarrier(tid, ptr)
    HW AddToReadSet(tid, ptr)
    if WritebufferContains(tid.wb, ptr) then
        return WritebufferLookup(tid.wb, ptr)
    if TimeForNewLocalEpoch() then
        HW ClearMissedSet(tid); mfence
    return *ptr

procedure Commit(wb)
    for entry in wb do
        HW WriteNotification(wb.tid, entry.address)
    violated = HW AskToCommit(wb.tid)    ▷ Synchronous
    if violated then return failure
    for entry in wb do
        *(entry.address) = entry.specData
    HW ClearWriteSet(wb.tid)
    return success
are currently being committed by a transaction, and the missed set holds references
committed by any other transaction during the local epoch. When a filter receives
a HW AddToReadSet, the reference is checked against all other transactions’ write
sets and the reading transaction’s missed set, ensuring that any write that could have
occurred before the associated read (i.e. in the current local epoch) will trigger a con-
flict. A HW WriteNotification causes the reference to be added to the transaction’s
write set and checked against all other transactions’ read sets, ensuring a conflict
will be triggered for any read that could have potentially seen the result of the cor-
responding write. The written reference is also added to the transaction’s read set,
preventing write-write conflicts, which would cause a race during write buffer application.
Finally, HW ClearWriteSet first copies (merges) the write set into all other missed
sets and then clears the write set. This allows each transaction to independently
decide when it no longer needs to consider missed writes as potentially conflicting.
The transaction does this with HW ClearMissedSet which clears its own missed set,
Function                          Description
HW AddToReadSet(tid, reference)   Asynchronously adds reference to tid's read set, and
                                  enables notification for any write that could possibly
                                  make this read inconsistent. Queries tid's missed set
                                  and the write set of every other transaction for
                                  reference, triggering a conflict in tid on a match.
HW WriteNotification(tid,         Asynchronously queries all read sets except tid's,
  reference, epoch)               triggering a conflict in transactions whose read set
                                  includes reference. Adds reference to tid's read set
                                  and to epoch's write set.
HW ClearMissedSet(tid)            Asynchronously clears tid's missed set, moving this
                                  transaction to a new local epoch.
HW ClearWriteSet(tid)             Asynchronously copies the content of tid's write set
                                  into every other transaction's missed set, then clears
                                  the write set.
HW AskToCommit(tid)               Synchronously processes all outstanding commands and
                                  returns the conflict status of tid. Clears tid's read
                                  and missed sets in preparation for a new transaction.

Table 2.6: TMACC hardware functions used by TMACC-LE.
effectively moving it into a new local epoch. HW WriteNotification could add
references directly to the other transactions’ missed sets, but having the intermediate
step of using the local write set allows the transaction to abort a commit without
polluting the other missed sets. These operations are summarized in Table 2.6.
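These filter operations can be modeled in software. Below is a simplified, hypothetical C model of the operations in Table 2.6, specialized to local epochs (the epoch argument is dropped). The filter size, the single hash function, and the fixed transaction count are illustrative only; the FPGA processes the same commands asynchronously with multiple hash functions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative software model of the TMACC-LE filter operations (Table 2.6).
 * Sizes and the hash function are placeholders, not the FPGA design. */

#define FILTER_BITS 1024
#define NUM_TX 8

typedef struct { uint64_t bits[FILTER_BITS / 64]; } filter_t;

typedef struct {
    filter_t read_set, write_set, missed_set;
    bool conflicted;
} tx_t;

tx_t tx[NUM_TX];

static void f_set(filter_t *f, uintptr_t ref) {
    unsigned h = (unsigned)((ref * 0x9E3779B97F4A7C15ull) >> 54);  /* one hash for brevity */
    f->bits[h / 64] |= 1ull << (h % 64);
}
static bool f_test(const filter_t *f, uintptr_t ref) {
    unsigned h = (unsigned)((ref * 0x9E3779B97F4A7C15ull) >> 54);
    return (f->bits[h / 64] >> (h % 64)) & 1;
}
static void f_merge(filter_t *dst, const filter_t *src) {
    for (int i = 0; i < FILTER_BITS / 64; i++) dst->bits[i] |= src->bits[i];
}

void hw_add_to_read_set(int tid, uintptr_t ref) {
    f_set(&tx[tid].read_set, ref);
    if (f_test(&tx[tid].missed_set, ref)) tx[tid].conflicted = true;
    for (int t = 0; t < NUM_TX; t++)      /* check every other write set */
        if (t != tid && f_test(&tx[t].write_set, ref)) tx[tid].conflicted = true;
}

void hw_write_notification(int tid, uintptr_t ref) {
    f_set(&tx[tid].write_set, ref);
    f_set(&tx[tid].read_set, ref);        /* catches write-write conflicts */
    for (int t = 0; t < NUM_TX; t++)      /* violate any reader of ref */
        if (t != tid && f_test(&tx[t].read_set, ref)) tx[t].conflicted = true;
}

void hw_clear_write_set(int tid) {        /* merge into missed sets, then clear */
    for (int t = 0; t < NUM_TX; t++)
        if (t != tid) f_merge(&tx[t].missed_set, &tx[tid].write_set);
    memset(&tx[tid].write_set, 0, sizeof(filter_t));
}

void hw_clear_missed_set(int tid) {       /* start a new local epoch for tid */
    memset(&tx[tid].missed_set, 0, sizeof(filter_t));
}
```

For example, after transaction 0 commits a write to some address and merges its write set, a later `hw_add_to_read_set` of that address by transaction 1 hits transaction 1's missed set and flags a conflict, while a transaction that has since cleared its missed set reads fresh addresses without one.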
Algorithm 3 gives high level pseudo-code for the algorithm used by the TMACC-
LE software runtime. The main difference in this software runtime, as compared to
TMACC-GE, is the absence of locks during commit. Locks are not needed when
using local epochs because the missed sets cause all of the writes performed during a
commit to be logically moved to the beginning of an epoch defined locally for each
transaction, not globally. Therefore, each transaction individually ensures that any
of its own reads of a partial commit will signal a conflict, an effort that won’t be
frustrated by the update of a global epoch outside of the transaction’s control.
In the local epoch scheme, an epoch is implicitly defined by what writes are con-
tained in the transaction’s missed set filter; thus no explicit local epoch counter is
needed. In addition to firing a HW AddToReadSet and locating the correct version
of the datum, read barriers may choose to begin a new local epoch by sending a
HW ClearMissedSet command. A memory fence is then used to ensure that any
subsequent read (and its corresponding HW AddToReadSet) must wait until the
HW ClearMissedSet is complete and a new missed set has begun to collect writes
performed in the new epoch. This eliminates the possibility that a conflicting read is
performed during a local epoch update and the conflict lost. Periodically incrementing
the local epoch is not necessary for correct operation but reduces the number of false
conflicts and is especially important in applications using long-running transactions.
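The fence requirement above can be sketched in C11. This is a minimal, hypothetical read barrier in which the runtime helpers are trivial stubs (a real implementation would enqueue commands to the FPGA); only the ordering between the missed-set clear and subsequent reads is the point.

```c
#include <stdatomic.h>

/* Sketch of the read-barrier epoch bump described above.  The helper
 * functions are trivial stubs standing in for the TMACC runtime and its
 * FPGA command queue. */

static int epoch_countdown = 3;   /* start a new local epoch every 3 reads */
static int clears_issued = 0;

static void hw_clear_missed_set(int tid) { (void)tid; clears_issued++; }
static void hw_add_to_read_set(int tid, const int *ptr) { (void)tid; (void)ptr; }
static int  time_for_new_local_epoch(void) {
    if (--epoch_countdown == 0) { epoch_countdown = 3; return 1; }
    return 0;
}

int tm_read(int tid, const int *ptr)
{
    if (time_for_new_local_epoch()) {
        hw_clear_missed_set(tid);
        /* Full fence: no later read (and its HW AddToReadSet) may issue until
         * the cleared missed set is collecting writes from the new epoch;
         * otherwise a conflicting write could slip through unnoticed. */
        atomic_thread_fence(memory_order_seq_cst);
    }
    hw_add_to_read_set(tid, ptr);
    return *ptr;
}
```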
2.4.5 Performance Evaluation
In this section, we present the performance and analysis of the TMACC-GE and -
LE architectures implemented on FARM. We present the performance results in two
parts. First, we present results from a microbenchmark that is used to explore the
full range of TM application parameters. These results characterize the range of per-
formance we might expect from TM applications and can be used to understand the
performance results from complete applications. Second, we present results of full
applications from the STAMP benchmark suite [18]. We show where the STAMP
applications fit into the design space as characterized by the microbenchmark param-
eters and how these parameters explain the performance results. Finally, we project
the performance of an ASIC TMACC implementation.
Microbenchmark Analysis
In order to characterize the performance of TMACC-LE and TMACC-GE, we used
an early version of EigenBench [41] which is a simple synthetic microbenchmark spe-
cially devised for TM system evaluation. This microbenchmark has two major advan-
tages over a benchmark suite composed of complex applications. First, transactional
memory is a complex system whose performance is affected by several application
Algorithm 4 Pseudocode for microbenchmark.

static int gArray1[A1];
static int gArray2[A2];

procedure uBench(A1, A2, R, W, T, C, N, tid)
    probrd = R/(R+W);
    for t = 1 to T do
        TM BEGIN();
        for j = 1 to (R+W) do
            do read = (random(0,1) < probrd) ? true : false;
            addr1 = random(0, A1/N) + tid*A1/N;
                ▷ addr1 does not conflict with others
            if do read then
                TM READ(gArray1[addr1]);
            else
                TM WRITE(gArray1[addr1]);
            if C == true then
                addr2 = random(0, A2);
                    ▷ addr2 possibly conflicts with others
                if do read then
                    TM READ(gArray2[addr2]);
                else
                    TM WRITE(gArray2[addr2]);
        TM END();
parameters. The microbenchmark makes it simple to isolate the impact of each pa-
rameter, independently from the others. Second, a microbenchmark allows us to get
a theoretical upper bound on the best possible performance given a set of parameters.
We arrive at this bound by simply executing a multi-threaded trial run without the
protection of transactional memory or locking. Doing this with a real application
would almost certainly produce incorrect results. We call this unattainably good
performance the “unprotected” version.
Algorithm 4 shows the pseudocode for the microbenchmark. The algorithm, at the
core, is nothing more than multiple threads executing a random set of array accesses.
Several parameters are necessary: A1 and A2 are the sizes of two arrays, the first
a partitioned array for non-conflicting accesses, the second a smaller shared array
for conflicting accesses; R and W are, respectively, the average number of reads and
[Figure 2.12 appears here. Panels: (a) impact of working-set size (speedup and % of transactions violated vs. size of Array 1 in MB), (b) impact of transaction size (vs. number of reads), (c) impact of true conflicts (vs. size of Array 2 in KB, log scale), (d) impact of write-set size (vs. number of writes), and (e) impact of number of threads (medium and short transactions, 1–8 threads). Each panel plots Unprotected, TMACC-LE, TMACC-GE, and TL2, with violation rates for the latter three.]

Figure 2.12: Microbenchmark performance for various parameter sets. Speedup is shown for 8 threads (except in (e)).
label     working-set (a)  transaction (b)  true conflicts (c)  write-set (d)  threads med (e)  threads small (e)
A1 (MB)   0.5 ~ 64         64               64                  64             64               64
A2        -                -                256 ~ 16,384        -              -                -
R         80               10 ~ 400         40                  80             80               4
W         4                max(1, R*0.05)   2                   1 ~ 128        4                1
C         false            false            true                false          false            false
N         8                8                8                   8              1 ~ 8            1 ~ 8

Table 2.7: Parameter sets used in the microbenchmark evaluation. The labels here match those used in Figure 2.12.
writes, per transaction; T is the number of transactions executed per thread; N is the
number of threads; and C is a flag determining whether or not conflicting accesses
should be performed. Note that if C is unset, there should be no violations since
every thread only accesses its partition of the array. If C is set, then the shared A2
array is accessed in addition to the normal accesses to A1, decoupling the working
set size and the read/write ratio from the probability of violation.
We now use the microbenchmark to evaluate the performance of our two TMACC
systems across several different variables. Table 2.7 shows the parameter sets used
in the study, and the performance results are displayed in Figure 2.12. All graphs
in this section show both speedup relative to sequential execution with no locking or
transactional overhead (solid lines) and the percentage of started transactions that
were violated (dotted lines). In all graphs except for (e), speedup is shown for 8
threads.
Throughout our analysis, the baseline STM for comparison is TL2 [34], which is
generally regarded as a high-performing, modern STM implementation that is largely
immune to performance pathologies. We use the basic GV4 versioned locks in TL2,
the default in the STAMP distribution [76]. We use TL2 because its algorithms for
version management and conflict detection are the closest match to the TMACC
algorithms, allowing for the best indication of the speedup achieved using the hard-
ware. SwissTM [35] is the highest performing STM of which the authors are aware
and provides 1.1 to 1.3 times the performance of TL2 on the STAMP applications
presented here. We also present the best possible performance using the aforemen-
tioned “unprotected” method as an upper bound. Note that this is truly an upper
bound and usually unattainable because it will produce incorrect results in the face
of any conflicts. Throughout the analyses of results, TMACC-GE and TMACC-LE
represent the schemes described in Section 2.4.4.
Graph (a) shows the impact of working set size on TM systems. The prominent
knee in the performance of each system corresponds to the working set size outgrowing
the on-chip cache. Below the knee, where all user data and TM metadata fit on-chip,
TL2 is spared from off-chip accesses and outperforms the TMACC systems, which
must still pay the costly round-trip communication with the FPGA. This effect would
be heavily mitigated with faster (or closer) hardware, and it is certainly rare for the
working set of real parallel workloads to fit in the on-chip cache.
Above the knee, we observe that both TMACC-GE and TMACC-LE significantly
outperform TL2, around 1.35x and 1.75x respectively, approaching the upper bound
of 1.95x. In this region, TL2’s performance suffers because its extra metadata causes
significant cache pressure. Specifically, TL2 relies on its metadata for conflict detec-
tion, so its metadata grows proportionally to a transaction’s read set. This indicates
that much of the overhead imposed by TL2 is not in the addition of a few instructions
to the instruction stream, but in the cache misses related to the metadata. When
everything fits into the cache, TL2 doesn’t add much overhead. It is when there
is cache pressure that the overhead becomes significant. TMACC-GE, on the other
hand, uses metadata only for commit, so its metadata grows with a transaction’s
write set, which is almost always smaller than its read set.
Graph (b) explores the impact of transaction size on speedup and violation rate.
In this graph, we see a well-defined difference in speedup among the systems. In
the flat region in the middle, the speedup of each system is nearly identical to the
speedup of large working sets in graph (a). In this region, the speedup is bounded
by the available memory bandwidth, which explains why the unprotected execution
isn’t able to achieve a full 8x improvement. For small transactions, TMACC-GE’s
speedup diminishes because the relative cost of the FPGA round trip latency and
global epoch management grows as transaction size decreases. We will take a closer
look at short transactions in graph (e). For large transactions, the performance of
TMACC-LE drops because the lack of ordering information in local epochs causes
the missed sets to become polluted and emit more false positives. This is one case
where global epochs are preferred over local epochs.
Graph (c) depicts the impact of varying the probability of violations by turning
on C and varying the size of A2 in our microbenchmark. Note that the graph uses
semi-log axes. With a small A2, there are many violations and transactional retries
dominate performance, making the conflict detection overhead less important. As
A2 grows, contention decreases and the conflict detection overhead becomes more
important, explaining the expanding performance gap between TMACC-LE, with its
low-overhead conflict detection, and the others.
Graph (d) explores the impact of write set size, and again it is not surprising
that the false positive rate of TMACC-LE becomes non-trivial due to the inherent
pessimism in the local epoch scheme. However, these false positives are not enough
to outweigh the performance advantage of low-overhead conflict detection.
Interestingly, TMACC-GE also shows diminishing speedup as write-set size in-
creases. On closer inspection, we found that this degradation is due to the cache line
migration of locks between the two CPU sockets during commit. As explained in
Section 2.4.4, TL2 uses more locks than TMACC-GE so it is not as sensitive to this
issue. Increasing the number of locks used by TMACC-GE diminishes the effect, but
reduces overall performance. Having the FPGA participate in the coherence fabric
significantly increases the last level cache miss penalty for all processors. This is a
prominent factor in the TMACC-GE results, and experiments in Section 2.4.5 show
that moving to an ASIC implementation would largely eliminate the performance
degradation of TMACC-GE seen here.
Graph (e) examines the impact of number of threads using both medium-sized
transactions and small-sized transactions. Overall, the systems show worse perfor-
mance for small-sized transactions because they all pay a constant overhead per trans-
action, which is not easily amortized by short transactions. With the long commu-
nication delay to the FPGA, TMACC-GE and TMACC-LE are unable to achieve
better performance than TL2 for short transactions running on 2 or 4 threads. While
the FARM system limits us to 8 threads, scalability to many more threads can be
achieved using multiple FPGAs. This scheme would require communication between
Name           Input parameters
vacation-low   n2 q90 u98 r1048576 t4194304
vacation-high  n4 q60 u90 r1048576 t4194304
genome         g16384 s64 n16777216
kmeans-low     m256 n256 65536-d32-c16.txt
kmeans-high    m40 n40 65536-d32-c16.txt
ssca2          s20 i1.0 u1.0 l3 p3
labyrinth      x512-y512-z7-n512.txt

Table 2.8: STAMP benchmark input parameters.
Name           RD/tx   WR/tx  CPU cycles/tx  Memory usage (MB)  Conflicts
vacation-low   220.9   5.5    37740          573                very low
vacation-high  302.14  8.5    37642          573                low
genome         55.8    1.9    48836          1932               low
kmeans-low     25      25     690            16                 high
kmeans-high    25      25     680            16                 low
ssca2          1       2      2360           1320               very low
labyrinth      180     177    6.1 × 10^9     32                 high

Table 2.9: STAMP benchmark application characteristics.
the FPGAs and is left for future work.
The dramatic drop in TL2 performance for short transactions at 8 threads is the
result of moving from a single chip to two chips and the large miss penalty described
above. Taking the FPGA out of the system eliminates this drop in performance as
shown in Section 2.4.5. We note that this poor TL2 performance on FARM is only
present when transactions are very short.
To summarize, we see that TMACC provides significant acceleration of transac-
tional memory except when transactions are too short to amortize the extra overhead
imposed by communicating with the Bloom filters. We also find that in the case of
TM acceleration, global epochs only perform better than local epochs when a large
number of shared reads and writes are performed in a relatively short running trans-
action. In this case, the lack of ordering information is a larger factor in system
performance.
Performance Evaluation using STAMP
In this section, we evaluate the performance of TMACC on FARM using STAMP [18],
a transactional memory benchmark suite composed of several applications which vary
in data set size, memory access patterns, and size of transactions. Intruder, bayes,
and yada from the STAMP suite did not execute correctly in the 64-bit environment
of FARM (even using TL2) due to bugs in the STAMP code and have been omitted
from the study. Bayes’s and yada’s long transactions with a high violation rate are
similar to those in labyrinth, and intruder’s short transactions are similar to those
in kmeans-high. Thus, the absence of these apps does not significantly reduce the
coverage of the suite. Table 2.8 summarizes the input parameters and Table 2.9 the
key characteristics of each application. Cycles per transaction were measured during
single-threaded execution with no read and write barriers. We can roughly group
the applications into two sets by transaction size: vacation, genome, and labyrinth
have larger transactions while ssca2 and kmeans use smaller transactions. Kmeans
has large amounts of spatial locality in its data access and thus uses fewer cycles per
transaction despite having more shared reads and writes.
For this analysis, we include RingSTM [75]. This STM system uses a similar
[Figure 2.13 appears here: one panel per application (vacation-low, vacation-high, genome, kmeans-low, kmeans-high, ssca2, labyrinth), each plotting speedup and % of transactions violated vs. number of threads (1–8) for Unprotected, TMACC-LE, TMACC-GE, TL2, and RingSTM.]

Figure 2.13: STAMP performance on the FARM prototype.
approach to accelerating transactional barriers as TMACC, but the Bloom filters
are implemented in software rather than hardware. Like TMACC but unlike TL2,
RingSTM provides privatization safety. Our RingSTM implementation is based on
the latest open-source version [74] and uses the single-writer algorithm. To provide
a better comparison to TL2 and our TMACC variants, this implementation uses the
write buffer implementation from TL2 instead of the hash table typically used in
RingSTM. In our experiments, the ring is configured to have 1024 entries, where each
entry is a 1024-bit filter.
Figure 2.13 shows performance results from executing the STAMP applications on
the FARM prototype. In this graph, we present speedups for 1, 2, 4, and 8 threads and the
percentage of started transactions that were violated. At first glance, we see that the
general trends we saw in the microbenchmark are present in the STAMP applications;
TMACC performs well with large transactions but is unable to provide acceleration
to small transactions. We also provide the unprotected execution time, using the
same method we used in Section 2.4.5. As before, the result of such execution is
incorrect and serves as a strict upper bound. As expected, not all applications were
able to run unprotected; some would crash or fall into infinite loops.
For vacation-high, vacation-low, and genome, the common characteristics are a rel-
atively large number of reads per transaction, small number of writes per transaction,
and small number of conflicts (see Table 2.9 for exact values). Commit overhead is
low due to the small write set and minimal time wasted retrying transactions because
of the small number of conflicts. Also, constant overheads such as register check-
pointing are amortized over the long running length. Thus, in these large-transaction
applications, the numerous reads make the barrier overhead the dominant factor in-
fluencing performance of the TM system. We saw this effect in Figure 2.12.(b). This
graph uses a microbenchmark parameter set which corresponds to the characteristics
of these applications, and we see a very similar spread in performance results for the
large-transaction STAMP applications. Performance gain with respect to TL2 for
these applications averages 1.36x for TMACC-GE and 1.69x for TMACC-LE. Unpro-
tected execution provides an average speedup of 2.18x. Note that for vacation-high
running on TMACC-LE, while the number of reads is about 300, the drop shown in
Figure 2.12.(b) does not happen because vacation-high does not have as many writes
as the microbenchmark used in that graph.
The TMACC systems perform similarly to RingSTM at low thread counts but do
not su↵er from the drop in performance at higher thread counts like RingSTM. The
drop in performance at higher thread counts seen in RingSTM arises because it is
unable to quickly check individual reads against write set filters like TMACC is able
to do. It instead checks read set filters against write set filters, and this filter to filter
comparison has a much higher probability of false positives, leading to very high false
conflict rates and significantly degrading performance.
Kmeans-low features a relatively small number of reads, large number of writes,
and small number of conflicts. From Figure 2.12.(b), we can expect that a small num-
ber of reads will diminish the performance gap between TL2 and TMACC. We also see
in Figure 2.12.(d) that the large number of writes will further diminish TMACC-GE’s
performance. The combined e↵ect explains what we see for kmeans-low in Figure 2.13
where for 8 threads TMACC-LE shows a 9% acceleration over TL2 but TMACC-GE
is 5% slower. We also see in Table 2.9 that the kmeans application spends very little
time inside transactions, with few reads and writes per transaction. This explains
the superior scalability of kmeans-low and means that there is very little time spent
in the read and write barriers, leaving very little computation to be accelerated.
Even though kmeans-high has very similar characteristics to kmeans-low except
for the number of conflicts, the large number of violations in kmeans-high overshadows
any other e↵ects and limits the speedup of all three systems to a mere 1.3x with 8
threads. This situation is captured in Figure 2.12.(c) where the performance of the
three systems converges as the rate of violation increases. As in kmeans-low, the small
transactions make it difficult to amortize the communication overheads of TMACC
and it is not able to achieve any speedup over TL2. Both TMACC systems were
additionally undermined by an even larger number of violations than TL2, which is
interesting because Figure 2.12.(c) shows the TMACC systems having fewer violations
in the face of true conflicts. We suspect this is a result of TL2’s versioned locks giving
more importance to the lower bits of the address in performing conflict detection. This
causes TL2 to have fewer false positives when addresses are close together, as they
[Figure 2.14 appears here: a bar chart of execution time for TL2, RingSTM, TMACC-LE, and TMACC-GE on vacation-low, vacation-high, genome, kmeans-low, kmeans-high, ssca2, labyrinth, and their average.]

Figure 2.14: Single threaded execution time relative to sequential execution.
are in kmeans-high. The single-writer variant of RingSTM we use is not able to scale
because of the large number of writes in both kmeans-low and kmeans-high, even
though its violation rate is comparable to the other systems.
Like kmeans-low and kmeans-high, TMACC performance on ssca2 is bound by
communication latency. The characteristics of ssca2 are well captured by the mi-
crobenchmark parameter set used to produce the short transactions graph in Fig-
ure 2.12.(e) which mirrors the ssca2 speedup graph in Figure 2.13. Refer to the
discussion of graph (e) in Section 2.4.5 for an explanation of the results. RingSTM
violates 2.5% of transactions when running 8 threads while the others violate less
than 0.01%. ssca2 has such a large number of transactions that even a 2.5% violation
rate adds significant overhead.
Labyrinth is a special case. As seen in Table 2.9, this application has a very
large number of computational cycles inside each transaction. The execution time is
therefore decided by non-deterministic execution paths and the number of violated
transactions rather than TM overhead. In Figure 2.12.(c) we saw that, in general,
TMACC-GE has fewer false positives than the other systems. So in labyrinth with 8
threads, the TMACC-GE system minimized the number of violations and performed
well. For labyrinth’s long-running transactions, the periodic intra-transaction incre-
ment of the TMACC-LE local epoch was especially important.
Finally, Figure 2.14 highlights the single thread overhead of the systems using
the single threaded execution time relative to sequential execution time. We see that
TMACC and RingSTM have less overhead than TL2 running vacation because of
the frequent barriers. As transactions get smaller in applications like kmeans and
ssca2, commit time becomes more important and the TMACC systems suffer, while
RingSTM continues to do well. Note that TMACC-GE consistently has more over-
head than TMACC-LE because of the extra time required to (unnecessarily) obtain
the locks during commit. With few barriers and very long transactions, labyrinth has
almost no overhead in any of the systems.
Performance Projection for TMACC ASIC
In the previous sections, we have observed a few artificial effects caused by the large
cache miss penalty in the FARM system. Since both TMACC and TL2 witness
performance degradation due to these issues, an interesting question is whether the
conclusions drawn thus far would still be valid in a system free of these latency
anomalies, such as an off-chip ASIC or part of the uncore on a chip. The acceleration
hardware as presented does not require a high clock frequency and would occupy
a small silicon footprint in modern processes. Thus in this section, we modify our
system to project the performance of TMACC onto the design point of an off-chip
ASIC. This could be either a stand-alone chip, or part of the system’s north bridge
or memory controller, for example. The performance of an on-chip TM accelerator
would be even better, since it has a shorter round-trip latency. An ASIC or on-chip
implementation would also support larger Bloom filters, enabling larger transactions
without higher false violation rates.
To simulate the performance of an ASIC TMACC implementation, we first detach
the FPGA from the system, eliminating the FPGA-induced snoop latency witnessed
by all coherent nodes on every cache miss. Then, we replace FPGA-communication
software routines with idle loops in which we control the number of iterations to
simulate different desired communication latencies. In addition, we change the conflict
detection to report a conflict randomly with a given probability. We keep all the STM
overheads but simulate hardware latency. This modified system is a performance
simulator; like the unprotected version it does not provide serializable execution, but
WR
14 | 3  3  3  3  3  3  3
12 | 3  3  3  3  3  3  3      1 = TL2 performs better by more than 3%
10 | 3  3  3  3  3  3  3      2 = Two schemes show similar performance
 8 | 2  2  3  3  3  3  3      3 = TMACC-GE performs better by more than 3%
 6 | 1  2  2  3  3  3  3
 4 | 1  1  2  2  2  2  3
 2 | 1  1  1  1  2  3  3
   +---------------------
     2  4  6  8 10 12 14  RD

Figure 2.15: Performance comparison of TMACC-GE (ASIC) and TL2 for short transactions.
can serve as a good indicator of real performance.
In order to closely model the off-chip ASIC configuration, we had to determine a
value to use as the communication latency to the ASIC. We propose that last-level
cache miss latency is a good estimate for this number, the rationale being that the
ASIC is about as “far” away from the processor as DRAM. We therefore measured
the off-chip cache miss latency on this new system (without the FPGA attached) and
used this value as the communication latency. For each run, we used the measured
violation percentage from the equivalent run on FARM as the probability of violation
in the projected run.
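This methodology can be sketched as a pair of stubs: one replaces each FPGA round trip with a calibrated idle loop, and the other replaces conflict detection with a coin flip at the measured violation rate. The iteration count and helper names below are illustrative placeholders, not the actual calibration.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Sketch of the ASIC performance-projection stubs: FPGA communication is
 * replaced by a calibrated idle loop, and conflict detection by a coin flip
 * with the violation probability measured on FARM.  The calibration value
 * below is a placeholder, not a measured number. */

static volatile unsigned long spin_sink;     /* defeats loop elimination */
static unsigned long iters_per_miss = 1000;  /* calibrated so one call takes
                                                about one last-level miss */

/* Stand-in for a round trip to the (simulated) ASIC. */
void simulate_comm_latency(void) {
    for (unsigned long i = 0; i < iters_per_miss; i++)
        spin_sink += i;
}

/* Stand-in for HW AskToCommit: report a conflict with probability p_violate,
 * taken from the measured violation rate of the equivalent FARM run. */
bool simulated_ask_to_commit(double p_violate) {
    simulate_comm_latency();                 /* still pay the round trip */
    return (double)rand() / ((double)RAND_MAX + 1.0) < p_violate;
}
```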
For the projection study, we repeated the microbenchmark experiments performed
in Section 2.4.5 using these techniques. In general, we found the trends and
conclusions are the same as those presented in Section 2.4.5, except where we
explicitly noted otherwise in the discussions of graphs (d) and (e) of Figure 2.12.
The results for these experiments are shown in Figure 2.16.
A common trend seen in all the experiments is that the performance of TMACC-GE
now comes closer to the unprotected bound, since the ASIC design point significantly
reduces the cache-line migration latency, and thus the overhead of global epoch
management. As noted in the discussion of graph (d) in Section 2.4.5, the dramatic
performance degradation of TMACC-GE as the write set grows disappears with the
reduced cache miss penalty of an ASIC implementation.

[Figure 2.16 appears here: projected results for panels (d) impact of write-set size and (e) impact of number of threads, plotting speedup (0–8) and % violation against number of writes (0–150) and number of threads (1–8, medium and short transactions).]

Figure 2.16: Projected microbenchmark performance with TMACC ASIC.

Also, the performance of TL2
with small transactions no longer drops dramatically when moving to a dual socket
configuration. Both TMACC systems also performed better than before for short
transactions; TMACC-LE outperforms TL2 on 8 threads by 9% now, but TMACC-
GE still falls 5% short of TL2 performance.
To determine the point where TMACC-GE begins to outperform TL2, we repeated
the short transaction experiment from Figure 2.12.(e), sweeping the number of reads
and writes from 2 to 14; the result is presented as a schmoo plot in Figure 2.15. When
there are more than 8 reads or writes, TMACC-GE is able to match the performance
of TL2. When there are more than 12, there are enough accelerated barriers to
compensate for the extra cost of communication, and TMACC-GE outperforms TL2.
TMACC-LE outperformed TL2 for all of these points. The inability of TMACC to
accelerate very small transactions suggests that TMACC would compliment a system
that targets small transactions, such as a best-e↵ort HTM that uses a processor’s
write bu↵er to store speculative data and falls back to using TMACC for larger
transactions.
These results indicate that the TMACC hardware would best function as an ASIC,
located around the same "distance" from the processor as main memory. One
important advantage of this ASIC design point is that it requires little modification
in an SMP environment: multiple CPUs would simply share the same hardware. Another
interesting design point for a CMP would be to place the hardware out-of-core, but on
[Figure: speedup (0 to 8) of TL2, TMACC-GE, TMACC-LE, and Unprotected at 1, 2, 4, and 8 threads for Vacation-High, SSCA2, Kmeans-High, and Kmeans-Low.]
Figure 2.17: Projection of STAMP performance with TMACC ASIC
the same die, perhaps integrating the Bloom filters into an on-chip memory controller.
Such generic Bloom filters would not necessarily be dedicated to TM acceleration and
could be utilized by other, non-TM applications.
For the STAMP projection study, we chose four representative applications from
the suite. Vacation-high represents applications with large transactions, while ssca2
represents those with small transactions. Kmeans-high covers applications with a large
number of violations, while kmeans-low covers those with a large write set. Figure 2.17 shows the
results. As in the results from the microbenchmark projection, the absolute perfor-
mance improves across the board, while the performance gap between the TMACC
systems and TL2 is still as large as we saw in Figure 2.13.
The speedup results in vacation-high are very close to those of STAMP on the
Sirius platform. This confirms that the large coherence penalties imposed by the
FPGA on Sirius did not play a large role in determining the accelerator speedup with
respect to TL2. For ssca2, TL2 showed a decrease in speedup at 8 threads when
run on the Sirius platform. The ASIC projection alleviates the large cache migration
penalty, and we thus see TL2 scaling as expected. This mirrors the improvement
we saw in Figure 2.16.(e) as compared with Figure 2.12.(e). Note that even with an
ASIC, we are unable to amortize the overhead of short transactions, and absolute
speedup remains relatively poor for all systems.
Since true violations are the dominant factor in kmeans-high performance, TL2
and TMACC-GE show very similar performance. TMACC-LE's performance begins to
diminish at 8 threads because of the large number of violations. For kmeans-low, we saw in
Figure 2.13 that the advantage of the TMACC systems over TL2 was minimal. For
the ASIC projection, the lower latency allows the hardware accelerator to amortize
the overhead of smaller transactions. The transaction sizes in kmeans-low lie near
this boundary, so both TMACC systems now see much more speedup (up to 15%)
relative to TL2.
2.4.6 Comparison with Simulation
We now briefly contrast our experiences and results with hardware to our early
exploratory work done using software simulation. Considerable effort went into making
our simulations "cycle accurate", and our performance predictions for SigTM
and TL2, presented in Figure 2.10, roughly matched the results presented in the
corresponding papers. Initial results from the actual hardware, however, were quite
different from those the simulator had predicted. One main reason for the discrepancy
was the difference between the simulated and actual CPUs. The simplistic CPU
model used in simulation (in-order with one non-memory instruction per cycle) drastically
overstated the importance of reducing the instruction count in the transactional
read and write barriers. Modern processors, such as those in FARM, are much more
tolerant of extra instructions in barriers, reducing the benefit of eliminating those
instructions.
Another primary source of inaccuracy arose from the fact that our simulated
interconnect did not model variable latency and command reordering. The presence
of these in a real system led us to develop the global and local epoch schemes presented
in this thesis and thus significantly impacted the performance of the system.
In addition, our simulator assumed the processors were capable of performing true
"fire-and-forget" stores with weak consistency without affecting the execution of the
core. We therefore did not model the write combining buffer and its effect on system
performance. Finally, the smaller data sets used to run simulation in a reasonable
time frame affected the system performance very differently than a real workload, in
terms of bandwidth consumption, caching effects, and TLB pressure.
Even though we could have performed a more accurate simulation, and we eventually
approached our desired performance using a modified design, we believe our
experiences provide a strong example of the importance of building actual hardware
prototypes. Although developing and verifying hardware requires more time and
effort than using a simulator, hardware is essential to accurately gauge
the performance of proposed architectural improvements and to bring out the many
issues one might encounter in actually implementing the idea. Having a hardware
implementation is also strong evidence of the correctness and validity of a system.
2.5 Other Applications
We now briefly explore what other applications could be efficiently accelerated using
fine-grained acceleration. Applications, such as transactional memory, that require a
small amount of computation on each memory access are prime candidates for such
acceleration. Examples include bug detection, such as data race detection [84] or
array bounds checking, and runtime profiling [85]. Coherent access to the CPU's
cache can simplify the design of previous intelligent I/O devices [59]. A system such
as FARM could also be used to prototype intelligent memory systems for performance
[42] or for security [47]. Such a system would extend the memory controller described
in Section 2.1.3 with the required intelligence using additional information available
through the coherent interface.
In addition, coherent FPGAs can help prototype advanced coherence protocols. For
example, one could prototype directory structures like [4], or snoop filtering techniques
like [55]. Note that such extensions of the underlying broadcast coherence protocol
(cHT) were proposed in the original design [44], but actual implementations have
been rare.
Chapter 3
Loosely Coupled Acceleration
In this chapter, we turn to the other broad class of domain specific accelerators:
those that do not have a tight coupling with the rest of the system and operate fairly
independently. These accelerators are characterized by their infrequent communication
with the general purpose processors in the system. Infrequent communication
implies that coarse grained accelerators will work autonomously on a chunk of data
for a long period of time, performing a large amount of computation. The data being
processed may, but need not, be a large amount of data.
One decision that must be made is how the data is transferred from the general
purpose processor to the accelerator and how results are transferred back. This
decision is determined in large part by the placement of the accelerator in the system.
Much like the process of partitioning a workload between the different processors of
the system, designers must analyze the data flow on an application-by-application
basis to determine the best placement of the accelerator in their system. For example,
if an accelerator is going to process data that will be shared with other computational
processes in the system, it might make sense to place the accelerator directly in the
processor interconnect in a system such as FARM, avoiding the need to duplicate
the data in two places in the system. If, however, the data to be processed will be
discarded after processing, the accelerator can be attached to the peripheral bus, or
even exist in a completely separate appliance, connected via a rack-level interconnect
such as Ethernet or InfiniBand. Loosely coupled accelerators like those described
CHAPTER 3. LOOSELY COUPLED ACCELERATION 65
in this chapter are usually, by their nature, more dependent on the bandwidth of
their connection to the system than on its latency. The decision of how to connect the
accelerator is important for overall system performance; however, it is highly dependent
on the characteristics of the computation being accelerated and will not be
discussed further in this work.
Loosely coupled accelerators are often quite complex pieces of hardware that are
difficult and expensive to design due to the amount of computation performed and
the speed at which they must operate to outperform a high performance general
purpose processor. This is in contrast to tightly coupled accelerators, where it is
often not necessary for the hardware to be very complicated, since it can accelerate
an application by offloading even a small amount of computation. For example, the
Bloom filter module used to accelerate transactional memory in Section 2.4.2 is a
relatively simple hardware design. Extra care must therefore be taken when deciding
to pursue building a loosely coupled accelerator. One should always ask whether it would
be just as good, in terms of whatever metrics are important to the system, to add
an extra general purpose processor, or perhaps a domain-specific processor such as a
GPU, to the system to perform the task considered for acceleration. If this is the case,
or will probably become the case in the near future due to technology improvements, that
is almost certainly the approach to take, due to the substantially cheaper development
cost of software over complicated hardware.
For tasks that consume and/or produce a large amount of data, one key indicator
that the task is a good candidate for acceleration is the inability of a general
purpose processor to fully saturate the memory bandwidth. If the computation is
saturating the available memory bandwidth, then the performance is memory-bound
and no amount of special purpose hardware will speed it up (although if power is a
concern, an accelerator could potentially perform the computation at lower power;
we will not generally explore these cases in this work). Memory bandwidth utilization
then becomes a convenient measure of the accelerator's utility. If the accelerator
can achieve significantly better utilization of the memory bandwidth than a general
purpose processor could ever hope to achieve, the accelerator will probably be worth
the cost.
To provide insight into the types of issues that arise and the techniques that can be
used in accelerators that work on large amounts of data in an attempt to fully saturate
memory bandwidth, we again turn to a case study: accelerating database operations.
We propose hardware designs that accelerate three important primitive database
operations: selection, merge join, and sorting. These three operations can be combined
to perform one of the most fundamental database operations: the table join. Since the
primary goal in our designs is to build hardware that can fully utilize any amount of
memory bandwidth, we have designed the hardware to have as few limiters to scaling
as possible. The goal is that as logic density increases, more hardware can be added
to increase the throughput of the design with very little redesign of the architecture.
This chapter includes the following key contributions:
• We detail hardware to perform a selection on a column of data streamed at
peak memory bandwidth. (Section 3.2.2).
• We describe hardware to merge two sorted columns of data. (Section 3.2.3).
• We present hardware to sort a column of data using a merge sort algorithm.
(Section 3.2.4).
• We describe how to combine these hardware blocks to perform an equi-join
entirely in hardware. (Section 3.2.5).
• We prototype all three designs on an FPGA platform and discuss issues we
faced when building the prototype. (Section 3.3).
• We analyze the performance of our prototype and identify key bottlenecks in
performance. (Section 3.3).
• For each hardware design, we explore the hardware resources necessary and how
those resource requirements grow with bandwidth requirements. (Section 3.3).
3.1 Background
By the late 1970s, Database Machines became a popular topic in the database research
community and commercial products were being planned. In an attempt to improve
access time to very large databases, these machines placed special purpose proces-
sors between the processor and the disk containing the database. They first placed
a processor at each disk track, then at each disk head, and finally placed multiple
processors with a large disk cache between a conventional disk and the host proces-
sor. These systems initially looked very promising; however, processor performance
increased much more dramatically than I/O performance and database machines soon
no longer made sense [14]. Because of the gap between disk bandwidth and processor
performance, there was no performance being left on the table by general purpose
processors and commodity storage systems: it was easy for a processor to keep the
disk busy, so without a dramatic increase in disk performance, special purpose
processing was unnecessary.
The database machines of the 70s, with special purpose processing at the disk, thus
became obsolete. By the early 90s, however, with the widespread adoption of the
relational data model,
database-centric systems using commodity processors and storage systems [32]. These
database systems became a driving force in the development of highly parallel sys-
tems.
Massively parallel database systems have continued to evolve to the present day.
Their performance has grown steadily along with the performance of the system
components they are built on top of. With the advances of memory technology and
the subsequent increase in capacity of main memory in these systems, many large
database tables now reside entirely in main memory, further improving the database
performance. It has even been proposed that disks be replaced entirely with random
access memory and “relegated to a backup/archival role” [64].
With databases residing entirely within main memory, database performance is
no longer bound by the glacial performance of a rotating magnetic disk. Unlike
the systems in the 70s, however, single-threaded processor performance is leveling
off. Systems must now rely almost entirely on parallelization to achieve increases
in performance. While Moore's law continues to hold and the number of transistors
available to chip architects continues to increase, power constraints limit the number
of logic transistors that can be active at any given time on a chip [15]. It is unlikely
that general purpose processing elements will ever be able to fully utilize the amount
of memory bandwidth available to a chip while performing all but the most basic
database operations. As an example, recent studies have increased join performance
into the hundreds of millions of tuples per second [45, 43]; with 64-bit tuples this
corresponds to a data bandwidth of one to five gigabytes per second. Modern chips,
conversely, can achieve memory bandwidth over 100 GB/s [10]. Clearly, general purpose
compute is leaving performance on the table, and database operations are a prime
candidate for acceleration.
Another enabling change in database systems is the move to columnar data storage
as opposed to row-wise data storage. This move was sparked in the 1990s by
MonetDB [52]. Since then, other database systems using column-oriented storage,
such as C-Store [77], have appeared. The move to columnar storage is a result of
attempts to better utilize the increasingly limited amount of memory bandwidth
available to processing cores. This work provides methods for transforming row-wise
query operations into column-wise vector operations. Having database tables stored
in columnar format allows processors, and accelerators, to quickly stream through
relevant columns of data, fully utilizing any available memory bandwidth.
3.2 Hardware Design
3.2.1 Barrel shifting and multiplexing
Barrel shifters are used throughout our design, so we begin with a brief reminder of how
to build these components. In our designs, we often use shifters which take in an
array of words and an amount to shift them, word-wise, in one direction. So instead
of shifting by a certain number of individual bits, the bits are shifted by a certain
number of words. For example, a shifter that takes in four 32-bit words is 128 bits
wide and shifts by 0, 32, 64, or 96 bits. This is implemented by simply replicating
a traditional 4-bit barrel shifter, which is implemented using four 4:1 multiplexors.
Thus, a barrel shifter for four 32-bit words takes 4 × 32 = 128 4:1 multiplexors,
since each of the 128 bits of output is assigned to one of four input bits. More
generally, a barrel shifter for N b-bit words takes N × b N:1 multiplexors.

Figure 3.1: A pipelineable eight word barrel shifter.
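The word-wise rotation can be captured by a one-line reference model (our own Python sketch, not part of the original design; `word_barrel_shift` is a hypothetical name):

```python
def word_barrel_shift(words, shift):
    """Word-wise barrel shifter model: output position i takes input
    word (i + shift) mod N, i.e. one N:1 multiplexor per output word."""
    n = len(words)
    return [words[(i + shift) % n] for i in range(n)]
```

Each output position selecting among all N inputs is exactly why the hardware cost is N × b N:1 multiplexors.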
Large input multiplexors can be efficiently implemented using several stages of
smaller multiplexors. A 256:1 multiplexor, for example, can be implemented with just two
stages of 16:1 multiplexors. If you are able to multiplex M signals in a clock cycle on a
platform, the number of stages for an N-wide multiplexor is ⌈log_M(N)⌉. Figure 3.1
provides an implementation of an eight word barrel shifter using a 4:1 stage and a 2:1
stage. In a modern FPGA fabric, a 16-to-1 multiplexor can be implemented using
two logic blocks (i.e. CLB, ALM, etc.) [36][6].
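The stage count can be computed with integer arithmetic, which avoids floating-point log (a small helper of our own for illustration):

```python
def mux_stages(n_inputs, m_per_stage):
    """Number of stages needed to build an n:1 multiplexor out of m:1
    multiplexors: the smallest s with m**s >= n (ceil of log base m of n)."""
    stages, capacity = 0, 1
    while capacity < n_inputs:
        capacity *= m_per_stage  # each stage multiplies reachable inputs by m
        stages += 1
    return stages
```

For example, a 256:1 multiplexor from 16:1 stages needs two stages, and the eight word shifter of Figure 3.1 needs two stages when at most a 4:1 multiplexor fits in one cycle.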
3.2.2 Selection
In this chapter, we define the selection operation to take two inputs: a bit mask of
selected elements and a column of data stored as an array of equal-width machine
data types. The inputs can either come from arrays laid out linearly in memory,
or be produced by another operation which may be looking at a different column of
data. In some cases the bit mask may be RLE-compressed and must be decompressed
before being used by the selection unit. A common case would have the bit mask
coming from another operation and the data column being read from memory. The
output of the operation is the values from the input column that correspond to the true
bits in the bit mask, in the same order that they appear in the original column. Like
the input, the output data can be streamed to another processing unit or written
sequentially into memory.
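The operation's semantics can be stated as a one-line Python reference model (our own sketch, useful as a correctness oracle for the hardware described below):

```python
def select(mask, column):
    """Selection reference model: emit the column values whose mask bit
    is set, preserving their original order."""
    return [value for bit, value in zip(mask, column) if bit]
```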
There are many ways to implement selection in software. One efficient implementation
fills a SIMD register with the next values from the input column. A portion of
the bit mask is used as an index into a lookup table which contains indices for the
SIMD shuffle operation to shuffle the selected data to one end of the SIMD register.
The resulting SIMD register is written to the output array and the output pointer
is incremented by the number of valid data elements that were written. This store
is thus an unaligned SIMD memory access, which was added in SSE4 and has little
performance impact when writing to the L1 cache. These unaligned stores are used
to incrementally fill the output with compacted data. Parallel algorithms must first
scan through the bit mask counting bits to determine the proper offset to begin writing
each portion of the result. Once those offsets are calculated, the column can be
partitioned for multiple threads to work on in parallel.
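The lookup-table technique can be sketched as a scalar Python model (our own illustration; `LUT` and `simd_select` are hypothetical names, and real code would use SIMD shuffles and unaligned stores rather than list slicing):

```python
# For every 4-bit mask value, precompute the shuffle indices that pack
# the selected lanes to the low end of a 4-wide register.
LUT = {m: [i for i in range(4) if (m >> i) & 1] for m in range(16)}

def simd_select(mask_nibbles, groups, out):
    """Compact each 4-value group according to its mask nibble, writing
    the survivors contiguously into `out`; returns the output count."""
    ptr = 0
    for m, grp in zip(mask_nibbles, groups):
        idx = LUT[m]
        out[ptr:ptr + len(idx)] = [grp[i] for i in idx]  # "unaligned store"
        ptr += len(idx)                                  # advance by popcount
    return ptr
```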
Hardware to perform this selection is presented in Figure 3.2. We call the number
of elements consumed each pass through the hardware the "width" of the selection
block; the hardware in Figure 3.2 thus has a width of four. Assuming a fully pipelined
implementation, the bandwidth of the block is fully determined by the width of the
block and the clock speed. As mentioned in Section 3.2.1, a barrel shifter can be
efficiently implemented using multiple stages of multiplexors; however, such large
barrel shifters must be pipelined to achieve high clock frequencies, so the datapath
in Figure 3.2 was carefully designed to avoid feedback paths containing large barrel
shifters, which would necessitate pipeline stalls (or a very slow clock). As is, the only
feedback path in the design is a very small addition (of width log2(W)), allowing
for a deeply pipelined design that achieves a high clock rate.
The first step is to produce a word array in which all selected words from the input
are shuffled next to each other at one end of the array (in this case, the right side).
A combinational logic block takes in a segment of the mask stream and produces a
count of the number of selected elements in the segment, a bus of valid lines, and an
index vector which specifies which word should be selected for each position in the
shuffled word array.

Figure 3.2: Data and control paths for selection of four elements.

Figure 3.3: Control logic for the selection unit.
For small input widths, this combinational logic can simply be implemented as a
single ROM. Such a ROM would have depth 2^W, which is clearly not feasible for any
realistic input width. Using purely combinational logic, such as a cascade of leading-
1-detectors, would also not be feasible for larger input widths. We thus use smaller
sections of the mask as addresses into multiple smaller ROMs. For example, instead
of using all 16 bits of a mask segment to address a 64K-deep ROM, we can use each
4-bit nibble of the mask to address four 16-element ROMs. It is then necessary to shift
the output of each ROM into the correct position of the final index vector, based on
the accumulated count from the adjacent ROM. Figure 3.3 shows an implementation
of this for an input width of 16. This datapath has no feedback paths and can thus
be efficiently pipelined to achieve full throughput. Decreasing the size of the ROMs
and including more of them results in lower total ROM space but higher latency and
more adders, barrel shifters, and pipeline registers.
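The ROM-splitting scheme can be modeled in Python (our own sketch of the composition in Figure 3.3; `nibble_rom` and `control_for` are hypothetical names, and the adder chain and shifters are modeled as ordinary arithmetic):

```python
def nibble_rom(width=4):
    """Per-nibble ROM contents: mask value -> indices of its set bits."""
    return {m: [i for i in range(width) if (m >> i) & 1]
            for m in range(1 << width)}

def control_for(mask16, rom=nibble_rom()):
    """Compose four 4-bit ROM lookups into one 16-wide selection vector,
    placing each nibble's indices after the accumulated count from the
    nibbles below it (the adder/shifter chain of Figure 3.3)."""
    sel, count = [], 0
    for nib in range(4):
        idx = rom[(mask16 >> (4 * nib)) & 0xF]
        sel += [4 * nib + i for i in idx]  # rebase into the 16-wide space
        count += len(idx)
    return count, sel
```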
For a given input width W, the count is log2(W) bits wide, the valid array is
W bits wide, and the selection vector's width is (W/2) · log2(W) + (W/4) · (log2(W) − 1) + · · · + 1.
The number of control lines for this section thus grows quite rapidly, from 69 bits for
an input width of 16 (4 for count, 16 for valid, and 8 × 4 + 4 × 3 + 2 × 2 + 1 = 49
for selection) to 904 bits for an input width of 128. Unfortunately, there is no way
to reduce the number of control signals needed in this initial step. To consume W
values of data each cycle, the value at the edge of the output of the shuffle array could be
any of those W inputs. Thus, to achieve a bandwidth of W values per cycle requires
a W:1 multiplexor for that word. Including the multiplexors for the other values, any
implementation requires W − 1 word multiplexors, with sizes from W:1 down to 2:1,
to consume W values per cycle.
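The control-width arithmetic can be checked with a small script (a helper of our own, assuming W is a power of two):

```python
def control_bits(w):
    """Total control-signal width for a selection unit of input width w:
    count bits + valid bits + shuffle-select bits, where the shuffle uses
    w/2 muxes of log2(w) select bits, w/4 muxes of log2(w)-1 bits, etc."""
    count = w.bit_length() - 1  # log2(w) for a power of two
    valid = w
    select = sum((w >> j) * (count - j + 1) for j in range(1, count + 1))
    return count + valid + select
```

This reproduces the 69-bit and 904-bit figures quoted above for widths 16 and 128.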
Once the selected values are shuffled to the right side, they are rotated left to a
position indicated by the current number of saved values ready to be output. Values
in the input that complete a full output are sent directly to the output, and values
that will make up a partial output are saved in registers. For example, if two values
were previously saved in the registers, and three values are selected in the input, the
input will be rotated right by two, such that the lowest (furthest right) two values fill
the left two positions in the output, and the third input word is saved in the register
furthest to the right, ready to be added to selected values from the next input.
3.2.3 Merge Join
The merge join operation takes two sorted columns of fixed-width keys as input, each
with an associated payload column, and produces an output column which contains
all the keys that the two columns have in common, together with the associated
payload values. When there are duplicate matching keys, the cross product of all
payload values is produced. For example, if there are four entries of a key x in one
input column, and six entries of x in the other input, there will be 24 entries in the
output with key x.
This operation can be performed in software by sequentially moving through each
input column and advancing the pointer of the column with the lower value. When
two keys match, an output row is written to the output array and the output pointer
is incremented. Care must be taken to handle the case of multiple matching keys and
produce the correct cross-product output. The resulting code has a large number of
Figure 3.4: Hardware to perform the merge join operation. The green lines exiting diagonally from each comparator encompass the key, both values, and the result of the comparison.
unpredictable branches that result in a very low IPC; the code quickly becomes
processor-bound, unable to keep up with the memory bandwidth available to even a
single core.
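The pointer-advancing algorithm just described, including the duplicate-key cross product, can be sketched as a Python reference model (our own illustration, not the production code; its data-dependent branching is exactly what drives the IPC down in the software version):

```python
def merge_join(left, right):
    """Join two key-sorted lists of (key, payload) tuples, emitting the
    cross product of payloads for runs of duplicate keys."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # advance the side with the lower key
        elif lk > rk:
            j += 1
        else:
            # find the run of equal keys on each side
            i2 = i
            while i2 < len(left) and left[i2][0] == lk:
                i2 += 1
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                j2 += 1
            # emit the cross product of the two runs
            for a in range(i, i2):
                for b in range(j, j2):
                    out.append((lk, left[a][1], right[b][1]))
            i, j = i2, j2
    return out
```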
Our hardware design to perform this operation is laid out in Figure 3.4. The
basic design is rather straightforward: all combinations of a section of keys from one
input, the "right" input, and a section of keys from the other input, the "left" input,
are compared. An array of possible output combinations is produced, with a bit mask
indicating which should be used. This output can then be sent into the selection
unit from Section 3.2.2 to produce the actual output rows. The highest value from
each input is compared, and the input with the lower highest value is advanced, while the
same selection from the other input remains. This ensures that any combination of
input keys that could potentially match are compared.
Complications arise, however, when the highest value of each input selection is
equal. In this case it is necessary to buffer the keys from the left input and advance
through the left input until the highest keys no longer match. When that happens,
it is guaranteed that the highest right input is lower than the highest left input,
and the right input can be advanced. Any buffered values are then replayed and
compared against the new selection from the right. When the replay buffer is empty,
execution continues as normal. Our design uses two local buffers and control logic
that allows the buffer to spill into a pre-allocated buffer in DRAM. Once the top buffer
is filled, the bottom buffer is filled; when the bottom buffer is filled, it is drained into
DRAM, ready to be filled again. Using two buffers in this way assures that when the
replay starts, data is immediately available to be replayed (the data in the top buffer).
While that buffer is being replayed, any data that spilled over into DRAM can be
prefetched, hiding the DRAM latency. The bottom buffer is used to provide a large
burst of data to write to DRAM, instead of small individual writes, which decreases
the impact on overall DRAM performance.

Figure 3.5: Merge join optimization. Shaded blocks have potential matches in them; the line is the path the unoptimized design takes through the data, with the zig-zag at the end representing a replay. In this case, the optimized design looks at 4 cross sections and the unoptimized design looks at 8.
Because the number of comparators grows quadratically with the width of the input,
it is difficult to implement hardware with a wide input array. An optimization to
help increase the throughput of the design looks at a much wider selection of each
input than the actual comparator grid. The input is partitioned into sections that
fit into the comparator grid, and the highest and lowest values of each section are compared.
Using those comparisons, only those cross sections with potential matches are sent into the
comparator grid sequentially, while the others are skipped. Figure 3.5 is an example
where four chunks of data from each input are considered at once. The cross sections
shaded green correspond to blocks that have potential matches and must be examined;
the unshaded blocks do not have to be examined. The unoptimized design would
follow the path drawn through the data, examining all eight of the cross sections it
moves through. The optimized hardware examines only the four shaded blocks,
then advances one of the inputs as in the original design.
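The pruning step can be modeled by intersecting per-block key ranges (a Python sketch of our own, assuming each block's minimum and maximum keys are available from the highest/lowest comparisons):

```python
def blocks_to_examine(left_blocks, right_blocks):
    """Given (min_key, max_key) ranges for each block of the two sorted
    inputs, return the (left, right) block pairs whose key ranges
    overlap, i.e. the shaded cells of Figure 3.5; all other pairs can
    be skipped without entering the comparator grid."""
    pairs = []
    for li, (lmin, lmax) in enumerate(left_blocks):
        for ri, (rmin, rmax) in enumerate(right_blocks):
            if lmin <= rmax and rmin <= lmax:  # key ranges intersect
                pairs.append((li, ri))
    return pairs
```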
3.2.4 Sorting
Sorting an array, or column, of numbers has been and will continue to be a very active
area of research and is an essential primitive operation in many application domains,
including databases. Quicksort-based algorithms have traditionally been considered
to have the best average case performance among software sorting algorithms. However,
recent advances in both CPU and GPU architectures have brought merge sort
based algorithms, such as bitonic sort and Batcher odd-even sort, to the forefront of
performance, as they are able to exploit new architectures more effectively and better
utilize a limited amount of bandwidth [25, 68, 73, 50]. Satish et al. [69] provide a
comprehensive overview of state-of-the-art sorting algorithms, and their limitations and
trade-offs, on general purpose CPU and GPU processors.
We present here a dedicated hardware solution to perform a merge sort entirely
in hardware. The goal of this design is to sort an in-memory column of values while
streaming the column to and from memory at full memory bandwidth as few times as
possible. Figure 3.6 depicts the essence of a merge sort. We call the merge done at an
individual node a "sort merge", to distinguish it from the "merge join" presented
in Section 3.2.3. To accomplish this we implement a merge tree directly in hardware,
stream unsorted data from memory into the merge tree, and write out sorted portions
of the column. Those sorted portions then become the input to each input leaf of the
merge tree again, generating much larger sorted portions. This process is repeated
until the entire column is sorted. The number of passes required through the tree is
dependent on the width of the merge tree. If the tree has width W and the column
has N elements of data, N/W portions of length W are created on the first pass
through. On the second pass, those N/W portions are merged into N/W^2 portions
4 8 2 1 5 5 7 0
4 8 1 2 5 5 0 7
1 2 4 8 0 5 5 7
0 1 2 4 5 5 7 8
Figure 3.6: Sorting using a sort merge tree.
of length W^2. This continues until N < W^p, where p is the number of passes. The
number of passes required to sort a column of size N is then ⌈log_W(N)⌉. Thus, if
W is relatively large, the number of passes required grows extremely slowly with the
size of the input table, and very large tables can be sorted in just two or three passes
over the data.
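The pass count ⌈log_W(N)⌉ can be computed with integer arithmetic (a small helper of our own, not from the original design):

```python
def sort_passes(n, w):
    """Passes of a width-w merge tree needed to sort n elements: the
    smallest p with n <= w**p, i.e. ceil of log base w of n."""
    passes, run_length = 0, 1
    while run_length < n:
        run_length *= w  # each pass multiplies the sorted-run length by w
        passes += 1
    return passes
```

For instance, with a 1024-wide tree, a billion-element column sorts in three passes.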
Before we describe the design of the merge tree itself, we first look at an individual
node in the merge tree. The maximum throughput of data through the merge tree
is ultimately limited by the throughput of data through the final node at the
bottom of the tree. Depending on the data, other nodes of the tree can also become
a bottleneck. For example, if the far left input on a second pass contains all of the
lowest elements of the full column, then only the far left branches of the tree will be
used until that entire portion is consumed. It is thus not practical to move only the
lowest single value of the two inputs of a node to the output; this would result in
the throughput of the tree being only one element per cycle. It is therefore necessary to
consume multiple values every cycle.
Figure 3.7 gives a logical overview of how multiple values from the input are
merged at a time. The input to the unit is two buffers, each containing a sorted
list. To maintain high bandwidth, each input is multiple values wide (Figure 3.7
shows four, but in general it can be much wider). Each iteration, the lowest values
of the two inputs are compared and the entire width of the input, in this case four
values, is removed from the input buffer with the lower lowest value. These four values are
merged with the highest four values from the previous iteration. The four lowest
values resulting from that merge are guaranteed to be lower than any other value yet
to be considered since any values lower than the fourth would already have been pulled
in because both inputs are already sorted. The highest four values, however, may be
higher than values yet to be pulled in from the input not chosen at the beginning of
the iteration. They must therefore be fed back and merged with the next set of input
values. In this way, four values are produced and four values are consumed from one
of the inputs each iteration.
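The iteration described above can be modeled in software. This sketch is illustrative (the name merge_streams and the list-based queues are ours); it assumes both inputs are sorted and their lengths are multiples of the block width k:

```python
from heapq import merge as hmerge

def merge_streams(left, right, k):
    """Software model of the hardware merge node in Figure 3.7: each
    iteration consumes a k-wide block from whichever sorted input has the
    lower head value, merges it with the k feedback values held from the
    previous iteration, emits the k lowest, and feeds the k highest back.
    Assumes len(left) and len(right) are multiples of k; consumes inputs."""
    out = []
    # Prime the feedback register with the first block of the lower input.
    src = left if left and (not right or left[0] <= right[0]) else right
    feedback = src[:k]
    del src[:k]
    while left and right:
        src = left if left[0] <= right[0] else right
        block = src[:k]
        del src[:k]
        merged = list(hmerge(feedback, block))
        out.extend(merged[:k])   # the k lowest are final
        feedback = merged[k:]    # the k highest may still be beaten
    # One input is empty: drain the feedback and the remaining input.
    out.extend(hmerge(feedback, left or right))
    return out
```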
It is not necessary, however, to put a merge network like that in Figure 3.7 at each
node of the tree. Each level of the tree need only supply values as fast as the level
below it can consume values. Thus, each level need only match the throughput of the
final node of the tree, which need only match the write memory bandwidth to keep up
with memory. Figure 3.8 presents the hardware that encompasses a single level of a
merge tree, which we call a “sort merge unit”. There are four memories at each level.
A data memory buffers the input data to the level; it is only necessary to hold
a single value for each input leaf to the level. The data memory is partitioned into
“left” and “right” data so that both inputs to a particular node can be read at once,
but each can be written separately. Another memory holds the feedback data from
the previous merge of values for each node in the level. A valid memory holds a bit
for each input leaf to indicate that the data for that leaf is valid, and a bit for each
entry in the feedback memory. These valid bits are blocked in chunks, so a single read
or write works on multiple values at once. Finally, a “request sent” memory, which is
blocked like the valid memory, holds a single bit for each input leaf to indicate that
a request has been sent up the tree to fill the data for that leaf. Note that there are
no output buffers, as the outputs are buffered at the next level in the tree.
We now describe three operations performed on a sort merge unit: a push, a
request, and a pop. A push, whose data path is black in Figure 3.8, is performed
when previously requested input data arrives from above the unit in the tree. First,
the data is written to the data memory (the entry is known to be invalid because it
was previously requested), and the valid and request outstanding blocks are read. The
corresponding valid bit is set, the request outstanding bit is cleared, and the new
blocks are written back into the respective memories. The new block of valid bits is
also sent down to the lower level along with the index. If nothing is being pushed in
a particular cycle, a valid block (determined by an internal counter) is still read and
sent down to the lower level; this is not shown in the figure and prevents deadlock in
some cases.
When the valid block and associated index are sent to a sort merge unit, it initiates
a request operation, which follows the green data path in Figure 3.8. First, the level’s
own valid and request outstanding blocks corresponding to the valid bits received are
read. The incoming valid block, which represents data valid at nodes above, and the
local valid and request outstanding blocks are examined to find invalid elements
that have two valid parents and have not been requested. One such element is selected,
a bit for it is set in the request outstanding memory, and the request is sent up to
the parent.
Figure 3.8: Sort merge unit. Note that for simplicity, ports to the same memory are separated.

Finally, an incoming request from below results in a pop operation, which follows
the orange data path. Both data values, the feedback data, and the corresponding valid
block are read. The lowest values in each data buffer are compared. The block with
the lowest is sent to the merge network along with the feedback data (if valid), the
valid bit corresponding to the consumed leaf is cleared, and the valid bit for the
feedback data is set. The lower values from the merge network are sent to the next
level to be pushed and the higher values are written back into the feedback memory.
All three of these operations must be pipelined to ensure continuous flow of data
through the merge tree. Section 3.3.3 gives a brief description of how we pipelined
our implementation to achieve high throughput. Even with a fully pipelined design,
however, the throughput of the entire merge tree is limited by the throughput of the
final node in the tree. The design in Figures 3.7 and 3.8 can sustain a throughput
of multiple values every cycle as long as there are plenty of input nodes with data
and available output nodes. In this case merges of multiple nodes in the level are
happening simultaneously. However, the final node of the tree has only two inputs.
That means that an entire iteration must complete before the next merge can begin,
since the feedback data is required to pull another element from a parent node. We
can estimate the latency of a reasonably pipelined implementation of Figure 3.8 to
be the number of stages in the merge network, which is O(lg(width)), making the
throughput through the final node in the tree O(width/lg(width)). For the final node,
however, the ability to handle multiple merges at once is not necessary, and it should
more ideally have bandwidth that is O(width).
Figure 3.9 presents a higher bandwidth sort merge unit which only implements a
single node of the tree, not an entire level with multiple nodes like Figure 3.8. Instead
of consuming and merging a set number of values from one of the inputs, shift registers
are used to consume a variable number from each input and new values are shifted in
as space becomes available. Let W be the number of values to output each iteration.
Let L_i and R_i be the values in the left and right shift registers, respectively, with
i ranging from 0 to 2W − 1. To determine the four lowest values from across both
shift registers, each L_x is compared with R_{W−1−x} for x between 0 and W − 1. The
lower of the two in each case is advanced to the sort network while the higher remains
in the shift register. For example, if L_0 < R_3, then at least one from the left and
no more than three from the right are among the lowest, so L_0 is necessarily one of
the lowest and R_3 is necessarily not. Likewise for L_1 and R_2, L_2 and R_1, and L_3
and R_0. The number taken from each side is counted and each shift register is shifted
by that amount. If there is enough free space in the shift register, an input section
is consumed, shifted into the correct position, and stored. The four lowest values
are then sent into a full sort network. A merge network like that in Figure 3.7 is
insufficient here since the input is not necessarily split into two equally sized, already
sorted arrays. A simple merge network of twice the width could be used, with some
number of the inputs on each side disabled, but a merge network of size 2N takes
more resources than a full sort network of size N.

Figure 3.9: High bandwidth sort merge unit.

Figure 3.10: Full system block diagram and data paths.
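The claim that these W pairwise comparisons select exactly the W lowest values can be checked with a small model. This sketch is ours (Python's sorted stands in for the full sort network):

```python
def lowest_w(left_sr, right_sr, W):
    """Pick the W overall lowest values from two sorted shift registers by
    comparing L[x] against R[W-1-x], as in the high bandwidth merge unit.
    Returns the W winners (sorted) plus how many came from each side."""
    chosen, take_left = [], 0
    for x in range(W):
        if left_sr[x] < right_sr[W - 1 - x]:
            chosen.append(left_sr[x])
            take_left += 1
        else:
            chosen.append(right_sr[W - 1 - x])
    # The hardware feeds `chosen` through a full sort network; Python's
    # sort stands in for that network here.
    return sorted(chosen), take_left, W - take_left
```

The counts returned correspond to the shift amounts applied to each register.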
The datapath in Figure 3.9 still has feedback paths which prevent a pipelined
implementation from being fully utilized; the critical feedback path is a bit count,
barrel shifter, and 2:1 multiplexor. This path is much shorter and grows much less
quickly as the width increases than the feedback path of Figure 3.8, which includes a full
merge network. Since the number of stages in the barrel shifter is O(lg(width)) (See
Section 3.2.1), the bandwidth through this unit is still O(width/lg(width)); however,
the base of the logarithm is much higher, making the bandwidth much closer to the
ideal O(width).
Finally, Figure 3.10 shows the datapath for a full merge tree. A “tree filler” block
has the same interface as a sort merge unit, but fulfills requests by fetching from
DRAM. It continually sends blocks of “valid” bits which indicate that data is still
available for a particular input, turns requests from the top level of the merge tree
into DRAM requests, and turns replies from DRAM into pushes into the top sort
merge unit. During the initial pass through the memory, the data for an input can
come from anywhere, so the input column is read linearly and sent through a small
initial bootstrap sort network since the sort merge units expect blocks of sorted data
as input. To prevent very wide levels that make routing more difficult, the top levels
of the tree are split into four sub-trees, which operate independently of each other.
The final two levels of the tree use the high bandwidth sort merge units to maintain
the total throughput of the tree and merge the outputs of the four lower bandwidth
sub-trees to produce a single sorted output.
On passes after the initial pass through the data, the tree filler must obtain data from
the particular sorted portion that matches the tree input of the request. Depending
on the number of portions remaining to be merged, the tree filler maps some number
of inputs of the tree to each of the remaining portions. For example, if the full tree is
16k inputs wide and there are four portions remaining to be merged, the first portion
is mapped to the first 4k inputs, the second to the next 4k, etc. This means that
some values of a portion are re-merged, but it also has the effect of using sections of
the tree as an input buffer for each of the portions. The fewer portions that remain to
be merged, the larger the “input buffer” for each portion is and the larger the requests
to DRAM can be. When the number of portions remaining to be sorted is equal to
the number of inputs to the tree, only a single chunk of a portion can be requested
at a time, leading to inefficient use of the DRAM bandwidth. We see the results of
this in Section 3.3.3.
To support using portions of the merge tree as an input buffer in subsequent
passes, the tree filler keeps a bit mask of tree inputs that it has received a request
for. When enough of the inputs mapped to a particular portion have been requested,
a single large request for the next values in that portion is issued and all of the
outstanding requests are fulfilled in bulk.
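The bulk-request bookkeeping can be sketched as follows; the class and method names are illustrative, not from the actual implementation:

```python
from collections import defaultdict

class TreeFiller:
    """Illustrative model of the tree filler's bookkeeping: remember which
    tree inputs have outstanding requests and, once every input mapped to a
    sorted portion is waiting, serve them with one large DRAM read instead
    of many small ones."""
    def __init__(self, inputs_per_portion):
        self.n = inputs_per_portion
        self.waiting = defaultdict(set)   # portion -> waiting leaf indices

    def request(self, portion, leaf):
        """Record a request; return a bulk DRAM read once enough accrue."""
        self.waiting[portion].add(leaf)
        if len(self.waiting[portion]) == self.n:
            leaves = sorted(self.waiting.pop(portion))
            # One bulk DRAM request covering n chunks of this portion.
            return ("dram_read", portion, leaves)
        return None
```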
For a well distributed data set, the throughput of the merge tree is the same as
the throughput through the final node, which is close to O(width) and can easily grow
with available memory throughput by making the final nodes wider. However, for
some data sets the throughput will be limited by the bandwidth through one of the
lower bandwidth merge nodes. Consider the case of sorting a list that is already in
order. For the first pass, the data will come through all branches of the tree and
full bandwidth will be achieved. However, on the second pass, the far left input to
the tree will need to be drained before moving on to the next input. In this case,
all data is coming from just one branch of the tree, and the throughput through the
tree is the throughput through a low-bandwidth merge node with only one input, or
O(width/lg(width)), where width is the width of the low-bandwidth merge network
in Figure 3.7.
3.2.5 Sort Merge Join
A full join operation is the same operation as a merge join, described in Section 3.2.3,
but does not require the input columns to be sorted. Two main algorithms are most
often used to perform joins: a hash join and a sort merge join [45]. A hash join builds
a hash table of one of the two input columns, then looks each element of the other
column up in the hash table to find matches. Modern hash join implementations use
sophisticated partitioning schemes to parallelize the operation and utilize a processor's
cache hierarchy. A sort merge join simply sorts both input columns, then performs
a merge join on the sorted columns. Implementations leverage the massive body
of research on improving the performance of sorting. Typically the final merge step is
all but ignored because sorting the columns takes such a large percentage of the time
necessary for a sort merge join.
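In software terms, a sort merge join is a sort of each column followed by a linear merge that emits every matching combination. A minimal sketch over (key, value) tuples (our own illustration, not the hardware datapath):

```python
def sort_merge_join(left_col, right_col):
    """Sort both columns of (key, value) tuples, then merge-join them,
    emitting every matching (key, left_value, right_value) combination."""
    left, right = sorted(left_col), sorted(right_col)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Emit the cross product of the equal-key runs on both sides.
            i2 = i
            while i2 < len(left) and left[i2][0] == key:
                j2 = j
                while j2 < len(right) and right[j2][0] == key:
                    out.append((key, left[i2][1], right[j2][1]))
                    j2 += 1
                i2 += 1
            i, j = i2, j2
    return out
```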
Figure 3.10 shows how each of the three blocks previously described can be com-
bined to perform an entire sort merge join in hardware. Two independent sort trees
are used to sort each of the two input columns. On the final pass through each col-
umn, the sorted data is sent to the merge join block instead of back to DRAM. The
merge join output is sent to the select block as before and only the result of the join
operation is written back into DRAM. The design also includes data paths that allow
the sort, merge join, and select blocks to be used independently of each other.
3.3 Implementation and Results
To prototype the design we used a system from Maxeler Technologies described in
Figure 3.11. This system features four large Xilinx Virtex-6 FPGAs. Each FPGA
has 475k logic cells and 1,064 36 Kb RAM blocks for a total of 4.67 MB of block
memory. Each FPGA is connected to 24 GB of memory via a single 384 bit memory
channel capable of running at 400 MHz DDR, for a line speed of 307.2 Gbps, or 38.4
GB/s per FPGA. This gives a total line bandwidth between the FPGAs and memory
of 153.6 GB/s, comparable to modern GPUs. The FPGAs are connected in a line
with connections capable of 4 GB/s in each direction. For each design, we clocked

Figure 3.11: Block diagram of prototyping platform from Maxeler Technologies.

the FPGA fabric at 200 MHz. Finally, each FPGA is connected via PCIe x8 to a
host consisting of two 2.67 GHz Xeon 5650 processors, each containing six multi-
threaded cores. These processors each have a line memory bandwidth of 32 GB/s.
Our purpose in prototyping the design was not entirely to determine the perfor-
mance of the design, although we do provide performance numbers. As long as the
components are able to match or exceed the memory bandwidth, the performance
is largely determined by the memory system of the design, and thus many of the
performance results are as much a test of Maxeler’s memory system as they are of
the acceleration design. Our main purpose in building the prototype was to drive the
design using a real world implementation instead of what are often inaccurate sim-
ulation models, and to be able to determine the challenging issues that arise as the
hardware scales to higher bandwidths. Indeed, the final designs we have presented are
fairly different from the original designs we came up with based on early simulations.
We chose the Maxeler platform for the large amount of memory capacity and
bandwidth available to the FPGAs; we wanted to ensure that our prototype handled
a sufficient amount of bandwidth to prevent masking any scalability issues. The
largest performance bottleneck we faced using the platform is the relatively narrow
inter-FPGA links, which prevented us from effectively emulating a single chip with
a full 153.6 GB/s of memory bandwidth. Thus, for all but Section 3.3.4, we use a
single FPGA, since using the narrow inter-FPGA links skews the results in terms of
the memory bandwidth utilization.
Since many of the performance numbers are dominated by the performance of the
memory system on the Maxeler platform, we also present percentage of the maximum
memory throughput (by which we mean the line bandwidth of the memory interface)
as a metric of comparison. Since our hardware is designed to scale with available
bandwidth, these percentages give an idea of how the design would perform on different
platforms with different memory systems. They also provide a metric of comparison
with previous work, as it is difficult to make a true “apples-to-apples” comparison
when the hardware is so vastly different. We also give some intuition as to how the
resource requirements of each design will scale to platforms with different memory
bandwidths.
3.3.1 Selection
We implemented the software algorithm described in Section 3.2.2 and optimized it at
the assembly language level. On our system's host processor, this implementation
achieves its maximum throughput using 8 threads, ranging from 7.4 GB/s down to
6.0 GB/s as the selection cardinality moves from 0% to 100%. This
corresponds to 23.1% to 18.8% of the 32 GB/s maximum memory throughput of the
Xeon 5650. For reference, the STREAM benchmark [54] also achieves the maximum
bandwidth with 8 threads and is able to copy memory at a maximum speed of 11.8
GB/s¹, about 36.8% of the line rate memory bandwidth of the Xeon 5650. Results
reported on the STREAM benchmark website [1] indicate that this utilization of
maximum memory bandwidth is typical for modern processors, including the Sandy
Bridge based E5-4650.
Our implementation uses three SIMD registers, one to hold the data to be shuffled,
¹The STREAM benchmark reported 23.6 GB/s, but that counts bytes both read and written (the “STREAM” method); the number here is for the “bcopy” method, which counts total bytes moved and is more aligned with our use of bandwidth in this work.
Figure 3.12: Measured throughput of the select block prototype.
one to hold the bit mask, and one to hold the shuffle indices loaded from memory.
Thus, the lack of available SIMD registers accounts for the inability of the processor
to fully pipeline the selection process and achieve the throughput of STREAM. The
Xeons in our test system support 16 byte wide SIMD instructions; using the 32
byte wide AVX2 integer instructions in the upcoming Haswell processors, we would
expect better performance. We conclude that it is reasonable to expect a highly tuned
software selection algorithm to match the throughput of STREAM. However, doing
so would require most, if not all, of the chip's capacity. In contrast, our customized
hardware is more than able to keep up with high memory bandwidth using far
fewer resources.
The design in Section 3.2.2 maps almost directly to the FPGA platform and
we built a block that processes 72 64-bit values per clock cycle, for a maximum
throughput of 14.4 billion values per second, or 115.2 GB/s. This is much more than
the memory bandwidth available to a single chip; we will see in Section 3.3.2 why we
made it that wide. We note that the non-power-of-two number comes from the width
of the memory interface, which is 384 bits DDR and is run at twice the frequency of
the main clock domain, resulting in a 1536 bit, or 192 byte, bus. It is resource intensive
to convert this to a nice power of two, but not too difficult to convert to a multiple
of 96 bytes; thus all of our datapaths work in multiples of 96 bytes.
Figure 3.12 shows the measured throughput of the prototype. Throughout Sec-
tion 3.3, bandwidth numbers are measured as the number of input bytes processed
per second.² We could alternatively use the total number of bytes read and written.
This is pertinent here because a selection with cardinality of 0% transfers half the
amount of data as one with cardinality of 100%. With a constant amount of memory
bandwidth that can be used for either reading or writing data, the 100% case will
take longer to execute, but would have higher throughput if bytes both read and
written were counted. Counting only bytes read, the cardinality of 100% case shows
lower bandwidth since it takes longer to process the same amount of input data. This
explains the nearly linear drop from 24.7 GB/s down to 17.8 GB/s as the cardinality
moves from 40% to 100%. Below 40% the limits of a single port of the DRAM con-
troller are reached and the full line rate of the memory interface is not realized. At
100% cardinality, the memory controller is more efficient with two streams of data
(in and out) and is able to utilize 93% of the 38.4 GB/s of line bandwidth. This high
utilization is achieved because of the very linear nature of the data access pattern (i.e.,
every column is accessed in a row before moving on to the next row) and by putting
the source and destination columns in different ranks of the DRAM, preventing them
from interfering with one another.
At low cardinalities, the 24.7 GB/s achieved is 64.3% of the 38.4 GB/s maximum
memory throughput of the FPGA. This represents a 2.8x increase in the memory
bandwidth utilization over the 23.1% utilization of the software, and a 1.7x increase
over the STREAM benchmark, which is as high as any software implementation could
possibly achieve.
Note that the results of Figure 3.12 are per selection block and measured using
only a single FPGA. Benefits from attempting to use the memory bandwidth of the
other FPGAs for a single selection block would be thwarted by the narrow inter-
FPGA links. Using all four FPGAs to emulate a design with four selection blocks
would result in 4x the throughput but four separate output columns.
Since Figure 3.12 is simply a measure of the memory system on the Maxeler
platform, we now look at the number of resources required to scale the design.
Figure 3.13 shows the resources used by the implementation as the width, and thus
²Also note that “GB” here really is gigabyte, not gibibyte, so the percentage of line bandwidth, which is also in GB, not GiB, is consistent.
Figure 3.13: Amount of resources needed as the desired throughput of the select block increases.
bandwidth, of the block increases (note the different scale for registers and the other
components). We present throughput as bytes per clock to decouple the results from
any particular frequency, but also present GB/s at 400 MHz for reference. The range
in throughput represents the range in width from 8 to 144 64-bit words. In choosing
the number of stages used in the initial shuffle control (see Section 3.2.2), we exper-
imentally found a good number of stages to use is W/4, where W is the width in
words of the selection block.
Note that the numbers in Figure 3.13 present resources at the bit level. So a
multiplexor that selects between 4 64-bit words requires 64 4:1 multiplexors. For
convenience, we lump 2:1 multiplexors in with 4:1 multiplexors and 8:1 multiplexors
in with 16:1 multiplexors. Any multiplexor wider than 16 inputs is split into multiple
stages to ease routing congestion and maintain clock speed. The jump that occurs
at 496 bytes/clock (or 62 to 68 words) results from the second stage of a 68:1
multiplexor requiring 16:1 multiplexors instead of the 4:1 second stage of smaller
widths (W/16 > 4 when W > 64).
The most dramatic increase in resources as throughput increases comes from the
number of registers. This results from the additional pipeline stages needed as the
width increases. In addition to additional stages in the shuffle multiplexor and barrel
shifter, we added duplicate registers to reduce fanout for every 16 inputs to help with
the routing on the FPGA.
3.3.2 Merge Join
We prototyped the design presented in Section 3.2.3. The prototype is designed to
merge two streams of elements composed of 32-bit keys and 16-bit values. Because of
the high demand for routing resources, the structure did not map well to the FPGA
fabric and we were only able to achieve a block with a width of eight words for each
input. The output combinations, which are a 32-bit key and two 16-bit values, and the
equality bit vector are sent into a selection block, which is wide enough to accept all
64 64-bit inputs.
The throughput of the prototype for varying amounts of output vs the input table
size is presented in Figure 3.14. The line labeled “m=1” is the raw comparison grid
without the optimization of not examining unnecessary cross sections. The other line,
“m=8” shows the throughput for looking at 8 chunks of each input and only actually
comparing chunks with potential matches. The output ratio is the size of the output
compared to the input table size (which is two equally sized tables). The keys are
uniformly distributed within a range that is changed to vary the output ratio.
At low output ratios, the throughput is constrained by the throughput of the hard-
ware block itself (eight six-byte values per cycle at 200 MHz is 9.6 GB/s). As the output
ratio increases, it is necessary to “replay” portions of the input more often (see Sec-
tion 3.2.3) and the throughput decreases. Above a ratio of 1.5 (i.e. the output is
1.5 times the size of the input), the throughput is entirely limited by the write mem-
ory bandwidth. We looked at non-uniform distributions, but saw no variance in the
throughput for any given output ratio. Most skewed data sets, such as data with the
Zipf distribution used in the literature, produced a very large amount of output and
were all limited by the write memory bandwidth.
The optimization to look at 8 chunks of input and only compare possible cross
sections resulted in a 13% speedup when the data was distributed enough to produce
Figure 3.14: Throughput of the merge join prototype.
a very small output. As the keys become more dense and the output ratio goes to
3.0, fewer cross sections can be eliminated and the speedup is reduced to 11%.
We do not plot the required resources for the merge join block because it is dom-
inated entirely by the comparators and routing resources and is simply a quadratic
function of the bandwidth required. To consume N values from either input every
cycle requires N² comparisons. Higher bandwidth could be obtained by replicating
the merge block and partitioning the data, but doing so is left for future work.
3.3.3 Sorting
Our implementation of the design outlined in Section 3.2.4 is designed to handle 12
64-bit values every other 200 MHz cycle, providing a maximum throughput of 19.2
GB/s, which is able to keep up with the memory bandwidth of an individual FPGA
(assuming a column is being read and written). One of the major challenges faced in
implementing the low bandwidth merge sort unit was the number of memory ports
needed. In particular, it was necessary to access five different addresses of the valid
memory in any given cycle. The local memories on the FPGA have two full RW
ports. To solve the issue, we duplicated each valid memory and time multiplexed
the ports, alternating between reading and writing (thus handling a new input every
other cycle). Table 3.1 details how each port was used to achieve a virtual 5-port
Memory               Port  Read Cycle         Write Cycle
valid copy 1         A     Read for push      Write for push
                     B     Read for pop       Write for pop
valid copy 2         A     Read for request   Write for push
                     B     Idle               Write for pop
request outstanding  A     Read for push      Write for push
                     B     Read for request   Write for request

Table 3.1: Memory port usage in sort merge unit.
memory. Note that each port must perform the same operation on the write cycle to
maintain coherent duplication.
All the other structures mapped directly to the FPGA logic. To maintain 19.2
GB/s through the entire tree, the three high bandwidth sort merge units at the
bottom of the tree were built to accept 24 values every four cycles to accommodate
the feedback path. The most challenging aspect was getting the control for the fine
grained communication between levels correct. As an example, the pop operation is
pipelined to take six cycles: 1) start the read of data and valid blocks; 2) decode
the index; 3) start the read of the feedback data; 4) the reads complete, compare
the data; 5) multiplex the data based on the comparison result; 6) merge decoded
index with read valid blocks, update the valid block, and send the feedback data and
selected data to the merge network. At every other pipeline stage the index being
pushed is compared with the incoming index and if the two fall within the same block,
the decoded index, which indicates the valid bit to set, is updated and the incoming
push is considered complete. The pipelines for the request and push operations are
similar.
The memories on the FPGA provided enough space for 12 levels in the merge
tree, with a top level 8k inputs wide. The data bu↵ering alone for the merge tree
(including the feedback data) occupied 18.6 Mbits, or 50%, of the 37.4 Mbits of block
RAM available on the device.
Figure 3.15 shows the throughput of the prototype as the size of the input column
grows. Note that when performing two passes over the entire data set, the theoretical
maximum throughput is one quarter of the maximum memory throughput (each value
Figure 3.15: Throughput of the sort tree prototype.
needs to be both read and written twice), or 9.7 GB/s in our case. At small input
sizes, we achieve 8.7 GB/s, which is 22.7% of the maximum memory bandwidth, or
89% of the theoretical maximum with two passes. This high utilization is possible
because there are fewer partially sorted portions to merge in the second pass and as
a result each portion has a large virtual input bu↵er and the requests to memory
can be large (see Section 3.2.4). For reference, recent work on sorting values on both
CPUs and GPUs achieved rates as high as 268 million 32-bit values per second [69].
This corresponds to 1 GB/s of throughput, which is 3.9% of the 25.6 GB/s available
to the Core i7 used (GPU performance was worse). We thus see a 5.7x improvement
in terms of memory bandwidth utilization.
As the size of the input increases, the number of portions that must be merged
on the second pass increases and the size of the requests to memory decrease. At
an input size of 25M values, the memory requests are too small to fully utilize the
memory bandwidth and performance begins to degrade. When the input size reaches
400M values, there are enough portions in the second pass that it is advantageous to
perform a third pass. In this case, the portions from the first pass are partitioned into
groups small enough that large memory requests can be used and each partition is
sequentially merged into portions ready to be merged in the third pass. Above 800M
values, there was insufficient memory to hold both the input and output columns; we
Figure 3.16: Memory bits required to achieve optimal sort throughput for a given input size. Note the log/log scale.
therefore projected the performance for larger columns, using the throughput seen on the second pass of smaller columns to predict the memory bandwidth for a given table size.
Unlike the previous sections, the interesting resource metric is not how the resource
usage grows with desired bandwidth, but how the resource usage grows with input
size, keeping bandwidth constant. A very small merge tree could maximize bandwidth
for small inputs, but performance would rapidly decrease as input size grows. For
example, our prototype was able to use the maximum amount of memory bandwidth
until the input was over 12.5 million values. To see where this limit comes from,
let N be the size of the input, in bytes, and let W be the width of the top level of the tree in bytes (in our prototype W = 8K × 12 records × 8 bytes/record = 786,432 bytes). The number of portions left after the first pass through the data is L = N/W and the maximum size of each read on the second pass is W/L, or W²/N. If the minimum read size for optimal memory throughput is M, the maximum input size that achieves optimal memory performance is W²/M. For the Maxeler platform, M is measured to be 6144 bytes, which gives a maximum size of 100 MB, or 12.5M 64-bit values. Likewise, W must be √(M × N) for a table of size N to fully utilize the memory bandwidth on the second pass. Figure 3.16 provides the number of memory bits needed to achieve maximum memory bandwidth efficiency for given input sizes, provided a minimum read size of 6144 bytes.
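The relationships above can be checked numerically; a short sketch using the prototype's W and the measured M:

```python
import math

# Prototype constants from the text.
W = 8 * 1024 * 12 * 8   # top-level width: 8K x 12 records x 8 B = 786,432 bytes
M = 6144                # minimum read size for full memory throughput (bytes)

def second_pass_read_size(n_bytes):
    """Maximum read size on the second pass: W/L, with L = n/W portions."""
    return W * W / n_bytes          # = W^2 / n

# Largest input that keeps second-pass reads at or above M: N = W^2 / M.
n_max = W * W // M
print(n_max, n_max // 8)            # 100663296 bytes (~100 MB), 12582912 values

def required_width(n_bytes):
    """Tree width needed so that second-pass reads stay >= M bytes."""
    return math.sqrt(M * n_bytes)   # W = sqrt(M * N)
```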
To obtain the highest throughput possible using our platform, we tested a pro-
totype where one quarter of the input column was split onto FPGAs 0 and 2, while
the remaining three quarters were put on FPGA 1. With this configuration, the two
smaller portions were individually sorted then streamed over the intra-FPGA link to
the FPGA with the bulk of the data. These streams were simply treated as extra
inputs to the top of the tree on the final merge pass and essentially augmented the
memory bandwidth. Note that the sort tree hardware did not change; only the source of the data did. With this configuration, we achieved a throughput of 1.4 billion
values per second, or 11.2 GB/s. With the narrow intra-FPGA links in play, this
is a much lower percentage of the memory bandwidth available to the three chips
used (9.7%). We mention it here to demonstrate that the throughput of the sort tree
hardware is purely constrained by the memory bandwidth.
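Functionally, the extra streams change nothing about the merge itself: the final pass is still one k-way merge of sorted runs, as a small software analogue (with hypothetical data) illustrates:

```python
import heapq

# Sorted runs produced by the first pass in the local FPGA's memory...
local_runs = [[1, 5, 9], [2, 6, 10]]
# ...plus the pre-sorted streams arriving over the intra-FPGA links are
# simply additional inputs to the top of the merge tree (hypothetical data).
remote_streams = [[0, 4, 8], [3, 7, 11]]

merged = list(heapq.merge(*(local_runs + remote_streams)))
print(merged)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```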
3.3.4 Sort Merge Join
Finally, we combine the selection, merge join, and sorting blocks to prototype the full
design in Figure 3.10. The resources of a single FPGA were too constrained to fit all
three blocks on a single FPGA, so we put the merge join and selection blocks on one
FPGA and sort trees on the two adjacent FPGAs. Figure 3.17 outlines the process
used to perform a full join. Each of the columns to be joined is held entirely on a separate FPGA. Each table is individually sorted, except that the output of the sort tree on the final pass is sent across the intra-FPGA links to the merge join block described in Section 3.3.2. These blocks are sufficiently wide to keep up with the bandwidth of
the intra-FPGA links. Since the first sorting pass through the table has a constant
throughput limited by the memory bandwidth, and the second and final pass through
the data is limited by the intra-FPGA link, the end-to-end throughput of the whole
design is a consistent 6.45 GB/s across all table sizes and output cardinality, or just
over 800 million key/value pairs a second. This is slightly under the aggregate intra-
FPGA bandwidth of 8 GB/s due to the initial pass through the data for sorting. The
achieved 6.45 GB/s is 5.6% of the 115.2 GB/s of memory bandwidth available to the
three chips. This lower utilization is due to the narrow intra-FPGA links.
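The operation the pipeline implements is an ordinary sort-merge equi-join on (key, value) columns; a minimal single-threaded software analogue, with hypothetical data and illustrative only (not the hardware design), looks like:

```python
# Minimal sort-merge equi-join sketch: sort both columns of (key, value)
# pairs, then advance two cursors, emitting matches on equal keys.
def sort_merge_join(left, right):
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of the runs of equal keys.
            j0 = j
            while j < len(right) and right[j][0] == lk:
                out.append((lk, left[i][1], right[j][1]))
                j += 1
            i += 1
            if i < len(left) and left[i][0] == lk:
                j = j0  # rescan the right-hand run for the next equal left key
    return out

pairs = sort_merge_join([(2, 'a'), (1, 'b')], [(2, 'x'), (3, 'y'), (2, 'z')])
print(pairs)   # [(2, 'a', 'x'), (2, 'a', 'z')]
```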
System               Clock Freq   Throughput / Mem BW (GB/s)   % of BW
Multi FPGA           200 MHz      6.45 / 115.2                  5.6%
Single FPGA          200 MHz      6.25 / 38.4                  16.3%
Kim [45] (CPU)       3.2 GHz      1 / 25.6                      3.8%
Kaldewey [43] (GPU)  1.5 GHz      4.6 / 192.4                   2.3%
Table 3.2: Summary of sort merge join results.
Figure 3.17: Full multi-FPGA join process. Each table is first sorted separately on the respective FPGA. Finally, both tables are sent to the FPGA containing the merge join block to be merged.
If all three blocks were able to fit on a single chip, the second pass through the
data would be constrained by the throughput of the merge-join block. In this case,
the end-to-end throughput would be 6.25 GB/s, which is lower absolute throughput
than the multi-FPGA design due to using only one FPGA’s memory bandwidth, but
is 16.3% of that FPGA’s maximum memory throughput.
Table 3.2 summarizes our results and compares with other recent work on join
processing. Kim et al. [45] used a Core i7 965 with 25.6 GB/s of memory bandwidth to achieve a join throughput of 128 million 64-bit tuples per second, or 1 GB/s and 3.9% of memory bandwidth. Our multi-FPGA design achieved a 40% increase over
this utilization, and a single-chip design would provide a 4.1x increase in utilization.
More recent work by Kaldewey et al. [43] uses a GTX 580 GPU with 192.4 GB/s of
memory bandwidth to achieve 4.6 GB/s of aggregate throughput. These results used
UVA memory access over a PCIe link since their experiments showed that the computational throughput of the GPU was less than the PCIe data transfer throughput. Thus, even if the tables were contained in device memory, the join throughput would remain at 4.6 GB/s, or 2.3% of the memory bandwidth of the device.
3.4 Related Work
There has been growing interest in using dedicated logic to accelerate database operations, with FPGAs serving as an excellent platform for exploring custom hardware options. Mueller et al. proposed an FPGA co-processor that performs a streaming median operator which utilizes a sorting network [56]. This work
performs a di↵erent operation and is directed at much smaller data sets and lower
bandwidths than our work. In their design, it was only necessary to have a single merge unit that data flowed through, sorting small eight-word blocks in a sliding window independently of each other. Our design incorporates a full sorting tree that
has many merge units coordinating the sorting of the entire memory stream. This
same team has also proposed Glacier, a system which compiles queries directly to a
hardware description [58, 57]. This is complementary to our work as it looks at ways
to incorporate accelerators into an overall database system.
Koch and Torresen also propose an architecture for sorting numbers using FPGAs [46]. The design in this work has similarities to the sorting implementation
presented here; however, they were constrained to a system with much lower mem-
ory bandwidth and capacity and thus achieve results on the order of 1 to 2 GB/s of
throughput. They do not discuss scaling their results to higher bandwidths, which
requires fundamental design changes as illustrated in our work. Our work builds on
top of this work by presenting new designs that make use of a modern prototyping
system with a large amount of memory capacity and bandwidth.
More recently, researchers at IBM proposed an architecture to accelerate database
operations in analytical queries using FPGAs [78]. Their work focuses on row decom-
pression and predicate evaluation and concentrates on row based storage systems.
Netezza, now part of IBM, provides systems that use FPGA based query evaluators
that sit between disks and the processor [2]. Like Glacier, this work is complementary
and shows the possibilities of incorporating accelerators like those presented here into
real database systems.
Chapter 4
Conclusions
Building accelerators that actually accelerate computation is hard. In this thesis,
we have discussed in detail accelerators that have succeeded, in order to provide insight for
developers of future domain specific accelerators. We presented FARM, a hardware
prototyping system based on an FPGA coherently connected to multiple processors.
In addition, we revealed practical issues inherent in using such an accelerator system
and described methods of addressing these issues. We also used FARM to successfully
prototype an STM accelerator that relies on low-latency fine-grained communication.
FARM provides tools that enable researchers to prototype a broad range of interesting
applications that would otherwise be expensive and difficult to implement in hardware. The conclusion of this work is that communicating coherently with a processor
requires careful design and employment of techniques such as the use of epochs to
reason about the timing of events in an asynchronous system.
We have presented an architecture, TMACC, for accelerating STM without mod-
ifying the processor cores. We constructed a complete hardware implementation of
TMACC using a commodity SMP system and FPGA logic. In addition, two novel
algorithms which use the TMACC hardware for conflict detection were presented and
analyzed. Using the STAMP benchmark suite and a microbenchmark to quantify and
analyze the performance of a TMACC accelerated STM, we showed that TMACC
provides significant performance benefits. TMACC outperforms a plain STM (TL2)
by an average of 69% in applications using moderate-length transactions, showing
maximum speedup within 8% of an upper bound on TM acceleration. TMACC
provides this performance improvement even in the face of the high communication
latency between TMACC and the CPU cores. Overall we conclude, and this thesis
demonstrates, that it is possible to accelerate TM with an out-of-core accelerator and
mitigate the impact of fine-grained communication with the techniques presented.
We have presented three new hardware designs to perform important primitive
database operations: selection, merge join, and sorting. We have shown how these
hardware primitives can be combined to perform an equi-join of two database ta-
bles entirely in hardware. We described an FPGA based prototype of the designs
and discussed challenges faced. We showed that our hardware designs were able to
obtain close to ideal utilization of available memory bandwidth, resulting in a 2.8x,
5.7x, and 1.4x improvement in utilization over software for selection, sorting, and
joining, respectively. We also presented the hardware resources necessary to implement each hardware block and showed how those hardware resources grow as the bandwidth
increases.
Thus, while actually accelerating computation using hardware accelerators is almost never a straightforward mapping of algorithms to hardware, it is still possible and practical to achieve significant improvements in computation speed and efficiency using custom-designed but flexible and programmable hardware components. As computer systems evolve to overcome various “walls”, domain specific accelerators will provide important and irreplaceable building blocks that enable new capabilities. Their importance will only continue to grow as general purpose computation reaches fundamental limits to its effectiveness. This work has aimed to add significant insight
and knowledge to the field of designing and building these accelerators.
Bibliography
[1] STREAM: Sustainable memory bandwidth in high performance computers.
[2] The Netezza FAST engines framework, 2008.
[3] A & D Technology, Inc. Procyon, the ultra-high-performance simulation and
control platform.
[4] Manuel E. Acacio, Jose Gonzalez, Jose M. García, and Jose Duato. A new
scalable directory architecture for large-scale multiprocessors. In HPCA ’01:
Proceedings of the 7th International Symposium on High-Performance Computer
Architecture, 2001.
[5] Ali-Reza Adl-Tabatabai, Brian Lewis, Vijay Menon, Brian R. Murphy, Bratin
Saha, and Tatiana Shpeisman. Compiler and runtime support for efficient software transactional memory. In PLDI '06: ACM SIGPLAN Conference on Pro-
gramming Language Design and Implementation, 2006.
[6] Altera. Advanced Synthesis Cookbook, July 2009.
[7] AMD, Inc. Maintaining cache coherency with AMD Opteron processors using FPGAs.
[8] Woongki Baek, Chi Cao Minh, Martin Trautmann, Christos Kozyrakis, and
Kunle Olukotun. The OpenTM transactional application programming inter-
face. In PACT '07: 16th International Conference on Parallel Architecture and
Compilation Techniques, 2007.
[9] L.A. Barroso, S. Iman, and J. Jeong. RPM: A rapid prototyping engine for
multiprocessor systems. IEEE Computer, 1995.
[10] Michael Bauer, Henry Cook, and Brucek Khailany. CudaDMA: optimizing GPU
memory bandwidth via warp specialization. In Proceedings of 2011 International
Conference for High Performance Computing, Networking, Storage and Analysis,
SC ’11, pages 12:1–12:11, New York, NY, USA, 2011. ACM.
[11] B. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 1970.
[12] Colin Blundell, Joe Devietti, E. Christopher Lewis, and Milo M. K. Martin.
Making the fast case common and the uncommon case simple in unbounded
transactional memory. In ISCA ’07: 34th International Symposium on Computer
Architecture, 2007.
[13] J. Bobba, N. Goyal, M.D. Hill, M.M. Swift, and D.A. Wood. TokenTM: Efficient
execution of large transactions with hardware transactional memory. In ISCA
’08: 35th International Symposium on Computer Architecture, 2008.
[14] Haran Boral and David J. DeWitt. Database machines: An idea whose time
passed? a critique of the future of database machines. In IWDM’83.
[15] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De. Pa-
rameter variations and impact on circuits and microarchitecture. In Design Au-
tomation Conference, 2003. Proceedings, June 2003.
[16] Shekhar Borkar and Andrew A. Chien. The future of microprocessors. Commun.
ACM, 54(5):67–77, May 2011.
[17] Nathan G. Bronson, Jared Casper, Hassan Chafi, and Kunle Olukotun. A prac-
tical concurrent binary search tree. In Proceedings of the 15th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, PPoPP ’10,
pages 257–268, New York, NY, USA, 2010. ACM.
[18] Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, and Kunle Olukotun.
STAMP: Stanford transactional applications for multi-processing. In IISWC
’08: Proc. The IEEE International Symposium on Workload Characterization,
2008.
[19] Chi Cao Minh, Martin Trautmann, JaeWoong Chung, Austen McDonald,
Nathan Bronson, Jared Casper, Christos Kozyrakis, and Kunle Olukotun. An
effective hybrid transactional memory system with strong isolation guarantees.
In ISCA ’07: 34th International Symposium on Computer Architecture, 2007.
[20] J. Lawrence Carter and Mark N. Wegman. Universal classes of hash functions.
Journal of Computer and System Sciences, 18(2), 1979.
[21] Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu,
Stefanie Chiras, and Siddhartha Chatterjee. Software transactional memory:
Why is it only a research toy? Queue, 6(5), 2008.
[22] Luis Ceze, James Tuck, Pablo Montesinos, and Josep Torrellas. BulkSC: bulk en-
forcement of sequential consistency. In ISCA ’07: 34th International Symposium
on Computer architecture, 2007.
[23] Hassan Chafi, Jared Casper, Brian D. Carlstrom, Austen McDonald, Chi
Cao Minh, Woongki Baek, Christos Kozyrakis, and Kunle Olukotun. A scalable,
non-blocking approach to transactional memory. In HPCA ’07: 13th Interna-
tional Symposium on High Performance Computer Architecture, 2007.
[24] Shailender Chaudhry, Robert Cypher, Magnus Ekman, Martin Karlsson, Anders
Landin, Sherman Yip, Hakan Zeffer, and Marc Tremblay. Simultaneous speculative threading: a novel pipeline architecture implemented in Sun's Rock processor.
In ISCA ’09: 36th Intl. Symposium on Computer Architecture, 2009.
[25] Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, William Macy, Mostafa
Hagog, Yen-Kuang Chen, Akram Baransi, Sanjeev Kumar, and Pradeep Dubey.
Efficient implementation of sorting on multi-core SIMD CPU architecture. Proc.
VLDB Endow., 1(2):1313–1324, August 2008.
[26] Andrew A. Chien, Allan Snavely, and Mark Gahagan. 10x10: A general-purpose
architectural approach to heterogeneity and energy efficiency. Procedia Computer
Science, 4(0):1987 – 1996, 2011.
[27] P. Chow. Why put fpgas in your cpu socket? In Field-Programmable Technology
(FPT), 2013 International Conference on, pages 3–3, Dec 2013.
[28] Convey Computer Corp. Instruction set innovations for Convey's HC-1 computer.
[29] Luke Dalessandro, Michael F. Spear, and Michael L. Scott. NOrec: streamlining
STM by abolishing ownership records. In PPoPP ’10: 15th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, PPoPP ’10,
2010.
[30] Peter Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir,
and Dan Nussbaum. Hybrid transactional memory. In ASPLOS '06: 12th International Conference on Architectural Support for Programming Languages and Operating Systems, October 2006.
[31] R.H. Dennard, F.H. Gaensslen, V.L. Rideout, E. Bassous, and A.R. LeBlanc.
Design of ion-implanted MOSFET’s with very small physical dimensions. Solid-
State Circuits, IEEE Journal of, 9(5):256–268, October 1974.
[32] David DeWitt and Jim Gray. Parallel database systems: the future of high
performance database systems. Commun. ACM, 35(6):85–98, June 1992.
[33] Sarang Dharmapurikar, Praveen Krishnamurthy, T.S. Sproull, and J.W. Lock-
wood. Deep packet inspection using parallel bloom filters. Micro, IEEE, 24(1),
Jan.-Feb. 2004.
[34] Dave Dice, Ori Shalev, and Nir Shavit. Transactional locking II. In DISC ’06:
20th International Symposium on Distributed Computing, 2006.
[35] Aleksandar Dragojevic, Rachid Guerraoui, and Michal Kapalka. Stretching
transactional memory. In PLDI ’09: ACM SIGPLAN Conference on Program-
ming Language Design and Implementation, 2009.
[36] Paul Gigliotti. XAPP195: Implementing Barrel Shifters Using Multipliers. Xil-
inx, August 2004.
[37] Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis,
Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and
Kunle Olukotun. Transactional memory coherence and consistency. In ISCA
’04: 31st International Symposium on Computer Architecture, 2004.
[38] Tim Harris and Keir Fraser. Language support for lightweight transactions. In
OOPSLA ’03: 18th ACM SIGPLAN Conference on Object-oriented Programing,
Systems, Languages, and Applications, 2003.
[39] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural
support for lock-free data structures. In ISCA ’93: 20th International Symposium
on Computer Architecture, 1993.
[40] Owen S. Hofmann, Christopher J. Rossbach, and Emmett Witchel. Maximum
benefit from a minimal HTM. In ASPLOS ’09: 14th International Conference on
Architectural Support for Programming Languages and Operating Systems, 2009.
[41] Sungpack Hong, Tayo Oguntebi, Jared Casper, Nathan Bronson, Christos
Kozyrakis, and Kunle Olukotun. Eigenbench: A simple exploration tool for or-
thogonal tm characteristics. In IISWC ’10: International Symposium on Work-
load Characterization, 2010.
[42] Christopher J. Hughes and Sarita V. Adve. Memory-side prefetching for linked
data structures for processor-in-memory systems. J. Parallel Distrib. Comput.,
65(4), 2005.
[43] Tim Kaldewey, Guy Lohman, Rene Mueller, and Peter Volk. GPU join pro-
cessing revisited. In Proceedings of the Eighth International Workshop on Data
Management on New Hardware, DaMoN ’12.
[44] Chetana N. Keltcher, Kevin J. McGrath, Ardsher Ahmed, and Pat Conway. The
amd opteron processor for multiprocessor servers. IEEE Micro, 23(2), 2003.
[45] Changkyu Kim, Tim Kaldewey, Victor W. Lee, Eric Sedlar, Anthony D. Nguyen,
Nadathur Satish, Jatin Chhugani, Andrea Di Blas, and Pradeep Dubey. Sort vs.
hash revisited: fast join implementation on modern multi-core CPUs. Proc.
VLDB Endow., 2:1378–1389, August 2009.
[46] Dirk Koch and Jim Torresen. FPGASort: a high performance sorting archi-
tecture exploiting run-time reconfiguration on fpgas for large problem sorting.
In Proceedings of the 19th ACM/SIGDA international symposium on Field pro-
grammable gate arrays, FPGA ’11.
[47] P. Kocher, R. Lee, G. McGraw, A. Raghunathan, and S. Ravi. Security as a new
dimension in embedded system design. In Design Automation Conference, 2004.
Proceedings. 41st, 2004.
[48] Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kundu, and An-
thony Nguyen. Hybrid transactional memory. In PPoPP ’06: 11th ACM SIG-
PLAN Symposium on Principles and Practice of Parallel Programming, 2006.
[49] Jim Larus and Ravi Rajwar. Transactional Memory. Morgan Claypool Synthesis
Series, 2006.
[50] N. Leischner, V. Osipov, and P. Sanders. GPU sample sort. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010.
[51] Marc Lupon, Grigorios Magklis, and Antonio Gonzalez. FASTM: A log-based
hardware transactional memory with fast abort recovery. In PACT ’09: 18th
International Conference on Parallel Architecture and Compilation Techniques,
2009.
[52] Stefan Manegold, Peter A. Boncz, and Martin L. Kersten. Optimizing database
architecture for the new bottleneck: memory access. The VLDB Journal,
9(3):231–246, December 2000.
[53] Virendra J. Marathe, William N. Scherer III, and Michael L. Scott. Adaptive
Software Transactional Memory. In DISC ’05: 19th International Symposium
on Distributed Computing, 2005.
[54] John D. McCalpin. Memory bandwidth and machine balance in current high
performance computers. IEEE Computer Society Technical Committee on Com-
puter Architecture Newsletter, pages 19–25, December 1995.
[55] Andreas Moshovos. Regionscout: Exploiting coarse grain sharing in snoop-based
coherence. In ISCA ’05: Proceedings of the 32nd annual international symposium
on Computer Architecture, 2005.
[56] Rene Mueller, Jens Teubner, and Gustavo Alonso. Data processing on FPGAs.
Proc. VLDB Endow., 2(1):910–921, August 2009.
[57] Rene Mueller, Jens Teubner, and Gustavo Alonso. Streams on wires: a query
compiler for FPGAs. Proc. VLDB Endow., 2(1):229–240, August 2009.
[58] Rene Mueller, Jens Teubner, and Gustavo Alonso. Glacier: a query-to-hardware
compiler. In Proceedings of the 2010 ACM SIGMOD International Conference
on Management of data, SIGMOD ’10, pages 1159–1162, New York, NY, USA,
2010. ACM.
[59] S. S. Mukherjee, B. Falsafi, M. D. Hill, and D. A. Wood. Coherent network inter-
faces for fine-grain communication. In ISCA ’96: 23rd International Symposium
on Computer Architecture, 1996.
[60] University of Heidelberg (Germany). UoH cHT-Core (coherent HT Cave Core).
[61] Tayo Oguntebi, Sungpack Hong, Jared Casper, Nathan Bronson, Christos
Kozyrakis, and Kunle Olukotun. FARM: A prototyping environment for tightly-
coupled, heterogeneous architectures. In FCCM ’10: 18th Symposium on Field-
Programmable Custom Computing Machines, 2010.
[62] Marek Olszewski, Jeremy Cutler, and J. Gregory Steffan. JudoSTM: A dynamic
binary-rewriting approach to software transactional memory. In PACT ’07: 16th
International Conference on Parallel Architecture and Compilation Techniques.
[63] Kunle Olukotun and Lance Hammond. The future of microprocessors. Queue,
3(7):26–29, September 2005.
[64] John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob
Leverich, David Mazieres, Subhasish Mitra, Aravind Narayanan, Diego Ongaro,
Guru Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and
Ryan Stutsman. The case for RAMCloud. Commun. ACM, 54(7):121–130, July
2011.
[65] Hany E. Ramadan, Christopher J. Rossbach, Donald E. Porter, Owen S. Hof-
mann, Aditya Bhandari, and Emmett Witchel. Metatm/txlinux: transactional
memory for an operating system. SIGARCH Computer Architecture News, 35(2),
2007.
[66] Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and
Ben Hertzberg. McRT–STM: A high performance software transactional memory
system for a multi-core runtime. In PPoPP ’06: 11th ACM SIGPLAN Sympo-
sium on Principles and Practice of Parallel Programming, 2006.
[67] Bratin Saha, Ali-Reza Adl-Tabatabai, and Quinn Jacobson. Architectural sup-
port for software transactional memory. In MICRO ’06: International Sympo-
sium on Microarchitecture, 2006.
[68] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for
manycore GPUs. In Parallel Distributed Processing, 2009. IPDPS 2009. IEEE
International Symposium on.
[69] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D. Nguyen, Vic-
tor W. Lee, Daehyun Kim, and Pradeep Dubey. Fast sort on CPUs and GPUs:
a case for bandwidth oblivious SIMD sort. In Proceedings of the 2010 ACM SIG-
MOD International Conference on Management of data, SIGMOD ’10, pages
351–362, New York, NY, USA, 2010. ACM.
[70] Tatiana Shpeisman, Vijay Menon, Ali-Reza Adl-Tabatabai, Steven Balensiefer,
Dan Grossman, Richard L. Hudson, Kate Moore, and Bratin Saha. Enforcing iso-
lation and ordering in stm. In PLDI ’07: Conference on Programming Language
Design and Implementation, 2007.
[71] Arrvindh Shriraman, Sandhya Dwarkadas, and Michael L. Scott. Flexible decou-
pled transactional memory support. In ISCA ’08: 35th International Symposium
on Computer Architecture, 2008.
[72] Arrvindh Shriraman, Michael F. Spear, Hemayet Hossain, Virendra J. Marathe,
Sandhya Dwarkadas, and Michael L. Scott. An integrated hardware-software
approach to flexible transactional memory. SIGARCH Computer Architecture
News, 35, June 2007.
[73] Erik Sintorn and Ulf Assarsson. Fast parallel GPU-sorting using a hybrid al-
gorithm. Journal of Parallel and Distributed Computing, 68(10):1381 – 1388,
2008.
[74] Michael F. Spear. Lightweight, robust adaptivity for software transactional mem-
ory. In SPAA ’10: 22nd ACM Symposium on Parallelism in Algorithms and
Architectures, 2010.
[75] Michael F. Spear, Maged M. Michael, and Christoph von Praun. RingSTM: scal-
able transactions with a single atomic instruction. In SPAA ’08: 20th Symposium
on Parallelism in Algorithms and Architectures, 2008.
[76] STAMP: Stanford transactional applications for multi-processing. http://stamp.stanford.edu.
[77] Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cher-
niack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth
O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, and Stan Zdonik. C-store: a column-
oriented DBMS. In Proceedings of the 31st international conference on Very large
data bases, VLDB ’05, pages 553–564. VLDB Endowment, 2005.
[78] Bharat Sukhwani, Hong Min, Mathew Thoennes, Parijat Dube, Balakrishna Iyer,
Bernard Brezzo, Donna Dillenberger, and Sameh Asaad. Database analytics
acceleration using FPGAs. In Proceedings of the 21st international conference
on Parallel architectures and compilation techniques, PACT ’12.
[79] Fuad Tabba, Mark Moir, James R. Goodman, Andrew Hay, and Cong Wang.
NZTM: Nonblocking zero-indirection transactional memory. In SPAA ’09: 21st
Symposium on Parallelism in Algorithms and Architectures, 2009.
[80] Cheng Wang, Wei-Yu Chen, Youfeng Wu, Bratin Saha, and Ali-Reza Adl-
Tabatabai. Code generation and optimization for transactional memory con-
structs in an unmanaged language. In CGO ’07: International Symposium on
Code Generation and Optimization, 2007.
[81] John Wawrzynek, David Patterson, Mark Oskin, Shih-Lien Lu, Christoforos
Kozyrakis, James C. Hoe, Derek Chiou, and Krste Asanovic. Ramp: Research
accelerator for multiple processors. IEEE Micro, 27(2), 2007.
[82] Luke Yen, Jayaram Bobba, Michael R. Marty, Kevin E. Moore, Haris Volos,
Mark D. Hill, Michael M. Swift, and David A. Wood. LogTM-SE: Decoupling
Hardware Transactional Memory from Caches. In HPCA ’07: 13th International
Symposium on High Performance Computer Architecture, 2007.
[83] Luke Yen, S.C. Draper, and M.D. Hill. Notary: Hardware techniques to enhance
signatures. In MICRO ’08: 41st International Symposium on Microarchitecture,
2008.
[84] Pin Zhou, R. Teodorescu, and Yuanyuan Zhou. Hard: Hardware-assisted lockset-
based race detection. In HPCA ’07: Proceedings of the 13th International Sym-
posium on High-Performance Computer Architecture, 2007.
[85] Craig B. Zilles and Gurindar S. Sohi. A programmable co-processor for profil-
ing. In HPCA ’01: Proceedings of the 7th International Symposium on High-
Performance Computer Architecture, 2001.