DOMAIN SPECIFIC HARDWARE ACCELERATION
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Jared Casper
January 2015
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/pw135js0060
© 2015 by Jared Arthur Casper. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Oyekunle Olukotun, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Mark Horowitz
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Christos Kozyrakis
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
The performance of microprocessors has grown by three orders of magnitude since
their beginnings in the 1970s; however, this exponential growth in performance has
not been achieved without overcoming substantial obstacles. These obstacles were over-
come due in large part to the exponential increase in the number of transistors
available to architects as transistor technology scaled. Many today call the largest of
the hurdles impeding performance gain “walls”. Such walls include the Memory Wall,
which is memory bandwidth and latency not scaling with processor performance; the
Power Wall, which is the processor generating too much heat to be feasibly cooled; and
the ILP wall, which is the diminishing return seen when making processor pipelines
deeper due to the lack of available instruction level parallelism.
Today, computer architects continually overcome new walls to extend this ex-
ponential growth in performance. Many of these walls have been circumvented by
moving from large monolithic architectures to multi-core architectures. Instead of
using more transistors on bigger, more complicated single processors, transistors are
partitioned into separate processing cores. These multi-core processors require less
power and are better able to exploit data level parallelism, leading to increased per-
formance for a wide range of applications. However, as the number of transistors
available continues to increase, the current trend of increasing the number of ho-
mogeneous cores will soon run into a “Capability Wall” where increasing the core
count will not increase the capability of a processor as much as it has in the past.
Amdahl’s law limits the scalability of many applications and power constraints will
make it unfeasible to power all the transistors available at the same time. Thus, the
capability of a single processor chip to compute more things in a given time slot will
stop improving unless new techniques are developed.
In this work, we study how to build hardware components that provide new ca-
pabilities by performing specific tasks more quickly and with less power than general
purpose processors. We explore two broad classes of such domain specific hardware
accelerators: those that require fine-grained communication and tight coupling with
the general purpose computation and those that require a much looser coupling with
the rest of the computation. To drive the study, we examine a representative example
in each class.
For fine-grained accelerators, we present a transactional memory accelerator. We
see that dealing with the latency and lack of ordering in the communication chan-
nel between the processor and accelerator presents significant challenges to efficiently
accelerating transactional memory. We then present multiple techniques that over-
come these problems, resulting in an accelerator that improves the performance of
transactional memory applications by an average of 69%.
For coarse-grained, loosely coupled accelerators, we turn to accelerating database
operations. We observe that, because these accelerators often deal with large
amounts of data, one of the key attributes of a useful database accelerator is the
ability to fully saturate the system's available memory bandwidth. We provide
insight into how to design an accelerator that does so by looking at designs to perform
selection, sorting, and joining of database tables and how they are able to make the
most efficient use of memory bandwidth.
Acknowledgements
In the last few years I've learned that the proverb is true: it takes a village to raise a
child. I have also learned that it takes a village to get a Ph.D. I sincerely appreciate
the help and encouragement of all those, too many to name, that I have interacted
with along the way.
In particular, my loving and incredible wife Colleen has been my rock throughout
the entire process and has never faltered in her support. She and my two daughters,
Elliot and Amelia, have been there to share the joys of accomplishment and buoy my
spirits during the depths of the lows. They have made it all worth it.
My principal advisor, Kunle Olukotun, has been the epitome of the patient and
wise master to see me through the maze of academia, for which I will be eternally
grateful. Many of the other incredible scholars on the Stanford CS faculty, Christos
Kozyrakis especially, have provided insight and advice that considerably advanced
my work and saved me many hours of frustration. My fellow graduate students have
likewise been an incredible source of inspiration.
Finally, my parents Art and Luana and their unconditional love of me and my
family (and oft-needed financial support) have provided the foundation upon which
I have built my life. Without them being who they are, none of this would have ever
been possible.
Contents
Abstract iv
Acknowledgements vi
1 Introduction 1
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 The Free Lunch . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Multi-Core Processors . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 The Capability Wall . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Tightly Coupled Acceleration 9
2.1 FARM: Flexible Architecture Research Machine . . . . . . . . . . . . 11
2.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 FARM System Architecture . . . . . . . . . . . . . . . . . . . 14
2.1.3 Module Implementation . . . . . . . . . . . . . . . . . . . . . 18
2.2 Techniques for fine-grain acceleration . . . . . . . . . . . . . . . . . . 21
2.2.1 Communication Mechanisms . . . . . . . . . . . . . . . . . . . 22
2.2.2 Tolerating latency and reordering . . . . . . . . . . . . . . . . 26
2.3 Microbenchmark Analysis . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Transactional Memory Case Study . . . . . . . . . . . . . . . . . . . 31
2.4.1 TM Design Alternatives and Related Work . . . . . . . . . . . 32
2.4.2 Accelerating TM . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.3 Implementing TMACC on FARM . . . . . . . . . . . . . . . . 38
2.4.4 Algorithm Details . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 46
2.4.6 Comparison with Simulation . . . . . . . . . . . . . . . . . . . 62
2.5 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3 Loosely Coupled Acceleration 64
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2 Hardware Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.2.1 Barrel shifting and multiplexing . . . . . . . . . . . . . . . . . 68
3.2.2 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.2.3 Merge Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2.4 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.2.5 Sort Merge Join . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.3 Implementation and Results . . . . . . . . . . . . . . . . . . . . . . . 85
3.3.1 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.3.2 Merge Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.3.3 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.3.4 Sort Merge Join . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4 Conclusions 100
Bibliography 102
List of Tables
2.1 Hardware specifications of the FARM system. . . . . . . . . . . . . . 15
2.2 Summary of FPGA resource usage. . . . . . . . . . . . . . . . . . . . 21
2.3 Comparison of Cache Miss latency . . . . . . . . . . . . . . . . . . . 24
2.4 Summary of communication mechanisms. . . . . . . . . . . . . . . . . 24
2.5 TMACC hardware functions used by TMACC-GE. . . . . . . . . . . 42
2.6 TMACC hardware functions used by TMACC-LE. . . . . . . . . . . . 45
2.7 TMACC Microbenchmark Parameter Sets . . . . . . . . . . . . . . . 49
2.8 STAMP benchmark input parameters. . . . . . . . . . . . . . . . . . 52
2.9 STAMP benchmark application characteristics. . . . . . . . . . . . . 52
3.1 Memory port usage in sort merge unit. . . . . . . . . . . . . . . . . . 93
3.2 Summary of sort merge join results. . . . . . . . . . . . . . . . . . . . 97
List of Figures
2.1 Diagram of the Procyon system with the FARM hardware on the FPGA. 14
2.2 Photo of the Procyon system . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 FARM Data Transfer Engine . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 FARM Coherent Cache . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Comparison of DMA schemes. . . . . . . . . . . . . . . . . . . . . . . 24
2.6 Comparison of non-coherent and coherent polling. . . . . . . . . . . . 25
2.7 Local and Global Epochs . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.8 FARM Communication Mechanisms . . . . . . . . . . . . . . . . . . . 29
2.9 FARM Experiment Visualization . . . . . . . . . . . . . . . . . . . 30
2.10 TMACC Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.11 Logical block diagram of Bloom filters. . . . . . . . . . . . . . . . . . 37
2.12 TMACC Microbenchmark Results . . . . . . . . . . . . . . . . . . . . 48
2.13 STAMP performance on the FARM prototype. . . . . . . . . . . . . . 54
2.14 Single threaded execution time relative to sequential execution. . . . . 57
2.15 TMACC ASIC Comparison - Short Transactions . . . . . . . . . . . . 59
2.16 Projected microbenchmark performance with TMACC ASIC. . . . . . 60
2.17 Projection of STAMP performance with TMACC ASIC . . . . . . . . 61
3.1 A pipelineable eight word barrel shifter. . . . . . . . . . . . . . . . . . 69
3.2 Data and control paths for selection of four elements. . . . . . . . . . 71
3.3 Control logic for the selection unit. . . . . . . . . . . . . . . . . . . . 72
3.4 Merge Join Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.5 Merge Join Optimization . . . . . . . . . . . . . . . . . . . . . . . 75
3.6 Sorting using a sort merge tree. . . . . . . . . . . . . . . . . . . . . . 77
3.7 Multi-Way Merge Unit . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.8 Sort Merge Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.9 High bandwidth sort merge unit. . . . . . . . . . . . . . . . . . . . . 82
3.10 Full system block diagram and data paths. . . . . . . . . . . . . . . . 82
3.11 Block diagram of prototyping platform from Maxeler Technologies. . 86
3.12 Measured throughput of the select block prototype. . . . . . . . . . . 88
3.13 Select Hardware Resources . . . . . . . . . . . . . . . . . . . . . . . . 90
3.14 Throughput of the merge join prototype. . . . . . . . . . . . . . . . . 92
3.15 Throughput of the sort tree prototype. . . . . . . . . . . . . . . . . . 94
3.16 Sort Hardware Memory Usage . . . . . . . . . . . . . . . . . . . . . . 95
3.17 Full Multi-FPGA Join Process . . . . . . . . . . . . . . . . . . . . . . 97
Chapter 1
Introduction
1.1 Background and Motivation
To understand why domain specific accelerators are necessary, we must first under-
stand the problems facing computer architects today and why traditional approaches
fall short. This section takes a brief look back at how performance gains have histor-
ically been achieved and discusses how these techniques are not able to cope with the
challenges that computer architects face today.
1.1.1 The Free Lunch
The performance of general purpose processors, beginning with the introduction of
the Intel 4004 in 1971, grew exponentially until a few years into the 21st century.
This increase in performance was due largely to two contributing factors: improve-
ments in the underlying technologies, i.e. transistor scaling, and improvements in the
microarchitecture techniques used by chip designers, including pipelining and cache
hierarchies. We call this period “The Free Lunch” because software developers did
not have to do anything to realize performance gains in their application. Software
companies could simply wait for the next generation of processors to be released and
their product would automatically become faster, allowing them to add new features
and new capabilities without improving the performance of the existing code.
This exponential gain in processor performance was due in large part to the scaling
of the MOS transistor, both in terms of speed and size. The scaling is typically known
as Dennard Scaling, as it was predicted by Robert Dennard in the early 1970s [31].
Dennard stated that the power density of transistors would remain constant as they
decreased in size. Thus, as transistors got smaller, more of them could be put into
a chip without substantially increasing the power consumption. During the
Free Lunch period, dimensions of transistors were reduced by 30% every two years,
or every generation, while the electric fields required to maintain reliability were
held constant. Reducing transistor dimensions by 30% results in a 50% reduction
in the area needed for a given number of transistors. Thus, in the same die size,
developers had twice the number of transistors to use (i.e. Moore’s Law). Reducing
transistor dimensions also results in an increase in performance, as it takes fewer
electrons to achieve the same electric field required to switch the transistor. The 30%
reduction in size typically resulted in a 40% increase in performance. Finally, these
processors were able to stay within a power budget because the supply voltage scaled
down with the size. Thus, a given number of new transistors consumed the same
amount of energy as half that number in the previous generation.
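The scaling arithmetic described above can be checked directly. The following sketch is purely illustrative; the 30% linear-shrink figure is the one quoted in the text, and the derived quantities follow from simple geometry:

```python
# Illustrative check of the Dennard-scaling arithmetic described above.
# A 30% reduction in linear dimensions per generation is the figure from the text.
shrink = 0.70  # each linear dimension scales to 70% of its previous size

# Area scales with the square of the linear dimension, so the same transistor
# count needs about half the area (i.e. roughly 2x transistors per die).
area_factor = shrink ** 2
print(f"area per transistor: {area_factor:.2f}x (~50% reduction)")

# Gate delay scales roughly with the linear dimension, so switching speed
# improves by about 1/0.7, i.e. roughly 40%.
speed_factor = 1 / shrink
print(f"switching speed: {speed_factor:.2f}x (~40% faster)")
```

Running this confirms that a 30% dimension shrink yields the ~50% area reduction and ~40% speed gain cited above.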
With a significant increase in the number of transistors available, processor de-
signers were also able to incorporate new architectural techniques to increase the
performance of single processors. Techniques such as branch prediction; superscalar,
out-of-order, and speculative execution; deep pipelining; and vector processing all
significantly contributed to fuel the free lunch. These performance increases were
quantified in Pollack’s Rule which states that performance increases as the square
root of the number of transistors in a processor. Thus, with twice the number of
transistors, performance will increase by 40%. More transistors also allowed design-
ers to include larger caches with the processor, which improved overall memory access
times. All of this combined with the performance increase of the transistors them-
selves to allow Moore’s Law to continue uninhibited through much of the past 30
years.
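Pollack's Rule, as stated above, can be expressed as a one-line function; the sketch below simply evaluates the square-root relationship quoted in the text:

```python
import math

# Pollack's Rule: single-core performance grows roughly with the square
# root of the transistor count devoted to the core.
def pollack_speedup(transistor_ratio):
    return math.sqrt(transistor_ratio)

# Doubling the transistor budget yields ~40% more performance, as stated above.
print(f"2x transistors -> {pollack_speedup(2):.2f}x performance")
# Even quadrupling the budget only doubles single-core performance.
print(f"4x transistors -> {pollack_speedup(4):.2f}x performance")
```

The diminishing return is the point: each doubling of transistors buys progressively less single-core performance.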
1.1.2 Multi-Core Processors
Two main factors combined to spell the demise of the free lunch: the ILP wall and
the power wall. Most of the architectural techniques discussed in the previous section
were driving towards the goal of executing as many instructions as possible per clock
cycle: increasing the “instructions per clock” or IPC. Processors contain large complex
structures to analyze the instruction stream to determine what instructions can safely
be executed and multiple pipelines to execute more than one instruction at a time
if they are available. When it is unclear which instructions can or will execute, the
processor will even predict which will run, speculatively execute those, and roll back
that execution if it determines that its prediction was incorrect. The complexity and
design cost of developing these complicated structures made it increasingly expensive
and difficult to increase the IPC. In addition, typical instruction streams have enough
dependencies between instructions that there is a limit to how many can actually
execute in parallel, typically four instructions per cycle [63]. Thus, the high effort
to make processors capable of executing more instructions in parallel resulted in
very little actual performance increase in the vast majority of applications. This lack
of more inherent parallelism in the instruction stream is often referred to as the ILP
wall.
The other main factor that impeded the progress of single processor development
was power. While smaller transistors do require less power to switch, the number of
transistors on a die and the higher frequencies they were running at caused the overall
power needed by the chip to grow exponentially. Eventually, chips were consuming
so much power in such a small space that it became impossible to keep them cool
enough to function. This led to the clock rate of processors levelling o↵ around 4
GHz and limited the complexity of the power-hungry structures that enabled deep
pipelining and super scalar execution; processor had run into what is known as the
power wall.
While single-thread performance still sees marginal gains, it is no longer enough to
enable entirely new capabilities the way the exponential growth previously seen did. Thus
designers turned to using the still increasing number of transistors available to add
more processing cores to a single chip rather than improving upon a single core.
Now, instead of executing a single thread of execution faster, new processors execute
more and more threads of execution at about the same speed as they did before.
Adding new capabilities to software is now a bit more difficult than it was, since the
computation must be partitioned into multiple threads, but it is possible. While the
marketing focus had previously been the core clock rate of the processor, it is now the
number of cores the processor has. See the 2005 ACM QUEUE article “The Future of
Microprocessors” by Kunle Olukotun and Lance Hammond [63] for a full treatment
of the move to multi-core architectures.
1.1.3 The Capability Wall
The switch to multiple cores per processor avoided the power wall in a few ways.
Multiple cores are now able to share certain components such as large caches and
power-hungry high-speed communication circuits that communicate with the rest of
the system. In addition, with more cores to share the workload, the performance of
a single core is not as important as it was before (as long as the workload can be
sufficiently parallelized) and each core can run at a lower frequency while maintaining
overall system performance. On a workload that has two completely independent but
equal tasks to perform, two cores running at half the clock rate can complete that
workload in the same time as a single core running at the full clock rate.
However, this does not solve the fundamental problem that a chip can only con-
sume so much power before it becomes impossible to keep cool. Processors are still
seeing an exponential increase in the number of transistors per chip, which leads to
more cores and bigger caches on a single chip. In addition, as transistors continue
to get smaller, the power consumed even when they are turned off and not
switching, known as leakage power, becomes more dominant. Thus, we are quickly approach-
ing a point where power cannot be supplied to all the transistors that can fit in a
chip [16]. Some of the transistors will have to be left off the chip entirely or completely
powered off, and turning them on means powering off some other part of the chip.
In addition to the power wall still looming, the ILP wall will return with a different
face. Just as there is a limit to the amount of parallelism in a typical instruction
stream, there is a limit to the amount of inherent parallelism in many workloads.
Many computational tasks are serial by nature. The result of one step must be
obtained before the next step can be started. These inherently serial tasks see little
benefit in additional processing cores. While the throughput of performing many of
these tasks can be improved, the latency of completing a single task cannot. In
addition, most workloads that have mostly independent tasks that can be executed in
parallel still have some portion that must be executed serially. Amdahl’s Law provides
an upper bound on the increase in performance given the amount of serial execution
in a workload. For example, if just 5% of a workload is serial, the maximum speedup
from parallelization is just 1/0.05 = 20x, no matter how many cores a system has.
Thus, even if we could power more cores, it won’t help improve the performance of
most applications past a certain point.
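The Amdahl's Law bound discussed above is easy to compute; the sketch below uses the 5%-serial example from the text to show how quickly added cores stop helping:

```python
# Amdahl's Law: speedup on n cores for a workload with serial fraction s.
def amdahl_speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

s = 0.05  # 5% serial, the example from the text
for n in (4, 16, 64, 1024):
    print(f"{n:5d} cores: {amdahl_speedup(s, n):6.2f}x")

# The asymptotic limit is 1/s = 20x, no matter how many cores are added.
print(f"limit: {1.0 / s:.0f}x")
```

Note how the curve flattens well before the 20x limit: 64 cores already deliver over 15x, so the next 960 cores buy less than a 5x additional gain.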
We call this limit on the benefit of adding more cores to a system the “capability
wall”. We can no longer rely on general purpose hardware improvements to enable
more capabilities. To overcome the capability wall, many researchers are turning to
heterogeneous computing. Instead of adding more cores that are all the same, proces-
sors can have cores that each excel at a different type of computation. Heterogeneity
means that processors are specialized for a particular set of workloads, or a domain.
In other words, they are domain specific. These domain specific blocks can be powered
off when not in use to make more power available to other devices. This argument for
dark silicon holds at the chip level, which is power limited by the package's thermal
envelope, all the way to the data center, which is limited by its power and
cooling provisioning. Such acceleration blocks thus make sense at the chip level as
part of an SoC, at the system level such as an external accelerator connected to the
system’s main memory or peripheral bus, or at the rack level, as a separate appliance
in a compute cluster.
In a recent article, Andrew Chien suggested one way to look at this move towards
heterogeneity is to see it as a move from 90/10 optimization, where effort is spent
optimizing the common case, to “10x10 optimization” where “the goal is to attack
performance as a set of 10% optimization opportunities” [26]. If 10 new ideas (or 8,
or 12, etc.) can each improve the performance of 10% of the tasks in a workload,
then the overall performance on that entire workload will improve dramatically. In
this way, heterogeneity can break through the capability wall. Andrew Chien and
Shekhar Borkar published an article in Communications of the ACM in 2011, also
titled “The Future of Microprocessors” [16], that gives full treatment of the case for
heterogeneous processors.
1.2 Contributions
In this thesis, we explore the extreme case of the domain specific processor: the
domain specific accelerator. We differentiate a domain specific accelerator from a
domain specific processor by looking at its generality. A domain specific processor
is a general-purpose processor that is specialized for a broad domain of applications.
For example, a GPU is general-purpose in that it could conceivably perform almost
any computation; however, it is specifically built to perform computation dealing with
graphics processing. It thus excels at workloads that have characteristics similar
to those of rendering a picture and performs poorly on other workloads (though it
can do them). In contrast, a domain specific accelerator is not general-purpose at
all. It is very limited in what type of computation it is able to perform, but what it
does, it does extremely well in terms of speed and efficiency. We will focus entirely
on domain specific accelerators for the remainder of this work; although many of the
concepts presented could apply equally well to domain specific processors.
The primary contribution of this work is to provide insight into the design and
development of domain specific accelerators. To do so, we first classify the different
types of interesting accelerators into two fundamental categories by looking at how
tightly coupled with the rest of the system the accelerator is. We then examine each
of these accelerators in turn to describe the various problems unique to each class
and provide techniques that can be used to mitigate those problems.
We first look at accelerators that are very tightly coupled with the rest of the
system in Chapter 2. Such accelerators are often difficult to prototype and design
due to the rigid interfaces through which general purpose processors communicate
over their lowest latency, highest bandwidth links (i.e. with other processors in the
system). We thus begin our look at tightly coupled accelerators by detailing a system
that allows rapid prototyping of hardware connected directly and coherently to the
other processors in the system (Section 2.1). We then describe useful mechanisms for
processor-accelerator communication in this regime (Section 2.2) and provide bench-
marks to characterize a system’s communication performance (Section 2.3). This
is important because the communication between the accelerator and the other pro-
cessing elements often becomes the dominating characteristic that determines the
performance of the accelerator (Section 2.2). Finally, we put the prototyping system
and communication techniques into practice in an accelerator for Transactional
Memory (Section 2.4).
We then turn to accelerators that are loosely coupled with the rest of the system
in Chapter 3. In these systems, the accelerator typically has a large task to perform
asynchronously with the rest of the system. We will see that the dominating charac-
teristic is often the accelerator’s ability to quickly access large amounts of memory and
make the most efficient use of the supplied memory bandwidth. We thus spend the
majority of the chapter detailing a case study of a database operation accelerator. In
examining the detailed design of each component of the accelerator, we provide useful
examples and patterns that can be emulated to design other accelerators that make
efficient use of memory bandwidth (Section 3.2). We then discuss implementation
details and performance analysis of the accelerator to provide practical knowledge
about gleaning the most out of a particular platform (Section 3.3).
Chapter 2
Tightly Coupled Acceleration
We first look at the class of domain specific accelerators that are tightly coupled
with the computation being performed in the rest of the system. The dominating
characteristic of tightly coupled accelerators is frequent communication with the rest
of the system. This is opposed to loosely coupled accelerators, such as those we will
look at in Chapter 3, where the accelerator works for a large amount of time on a
large amount of data without any synchronization or communication with the rest of
the system. To characterize the space of accelerators that are tightly coupled with
the rest of the system, we look at one application that, as we will show, requires
frequent communication between the general purpose processor and the accelerator:
Transactional Memory (TM). TM provides an ideal proving ground for exploring
issues that arise when designing and implementing such an accelerator.
In tightly coupled accelerators, the frequent communication means that the char-
acteristics of the communication, for example the amount of data and whether the
communication is synchronous or asynchronous, will be a dominant factor in the ac-
celerator’s ability to improve performance. Another dominant factor is where and
how the accelerator connects with the rest of the system. For example, an accelerator
that requires frequent synchronous communication over a high latency link will not
perform well. By reducing the amount of synchronous communication, an accelerator
can be made more resilient to its placement in the overall system. To this end, in this
chapter we present techniques that can be generally employed to deal with asynchronous
communication and thus tolerate a significant amount of latency between the host
system and the accelerator. The TM case study solidifies these techniques by detail-
ing how they are used to make the majority of communication with the accelerator
asynchronous.
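Why asynchrony tolerates link latency can be seen with a toy cost model. The numbers below are purely hypothetical (not FARM measurements), and the model ignores bandwidth limits and queuing; it only contrasts waiting for every round trip against firing operations without waiting:

```python
# Toy model (hypothetical numbers, not FARM measurements): total time to
# issue n fine-grained operations to an accelerator over a link with a
# given one-way-plus-return (round-trip) latency.
def synchronous_time(n, issue_cost, link_latency):
    # Each operation waits for a full round trip before the next can issue.
    return n * (issue_cost + link_latency)

def asynchronous_time(n, issue_cost, link_latency):
    # Operations are fired without waiting; only one round trip, for the
    # final result, is exposed on the critical path.
    return n * issue_cost + link_latency

n, issue, latency = 1000, 10, 500  # cycles, purely illustrative
print("synchronous: ", synchronous_time(n, issue, latency), "cycles")
print("asynchronous:", asynchronous_time(n, issue, latency), "cycles")
```

Under these assumed numbers the synchronous scheme pays the 500-cycle latency a thousand times, while the asynchronous scheme pays it once, which is why converting communication to be asynchronous makes an accelerator far less sensitive to where it sits in the system.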
Even with the techniques presented to tolerate latency, tightly coupled accelerators
will generally perform better when they are connected to the rest of the system with
high-bandwidth low-latency links. Modern processors have eliminated the traditional
front-side bus architecture and have instead moved the memory controller into the
processor and connected multiple processors with a point-to-point mesh network.
AMD processors use HyperTransport links while Intel processors use the QuickPath
Interconnect (QPI). This has led to the emergence of a new method of attaching
custom hardware to the system: directly into the processor interconnect. Several
companies have produced boards with FPGAs on them that plug directly into a
standard processor socket [27]. The main advantage of such accelerators is the ability
to participate in the system's cache coherence protocol and the ability to own a
portion of the system's physical memory space. These systems also provide the custom
logic with the advantage of a high-bandwidth, low-latency link to the rest of the system.
While PCI Express 3.x offers similar bandwidth and latency characteristics [27], we
will see that using the cache coherency in the system allows an accelerator to make
better use of the high performance link than it would be able to on a peripheral bus
such as PCIe.
As this technology is relatively new and has not been heavily utilized to date, it
is interesting to explore the performance characteristics of such systems. We start
this chapter by presenting and analyzing FARM, a framework built on top of the
Procyon system from A&D Technology [3]. FARM not only allows us to measure key
performance characteristics, but serves as the platform upon which we can build our
TM accelerator.
The major contributions of this chapter are thus:
• We present FARM, a novel prototyping environment based on the use of a co-
herent FPGA. We detail its design, implementation, and characteristics. (Sec-
tion 2.1)
• We describe useful mechanisms for processor-accelerator communication, in-
cluding techniques for tolerating the latency of fine-grained asynchronous com-
munication with an out-of-core accelerator. (Section 2.2).
• We provide a thorough study of the performance characteristics of FARM, pro-
viding insight for designers considering the use of coherent FPGA solutions for
their own problems. (Section 2.3)
• We present a system (both software and hardware) for Transactional Memory
Acceleration using Commodity Cores (TMACC). We detail two novel algorithms
for transactional conflict detection, both of which employ general purpose out-
of-core Bloom filters. (Section 2.4).
• We demonstrate the potential of TMACC by evaluating our implementation
using a custom microbenchmark and the STAMP benchmark suite. We show
that, for all but short transactions, it is not necessary to modify the processor
to obtain substantial improvement in TM performance. TMACC outperforms
an STM by an average of 69%, showing maximum speedup within 8% of an
upper bound on TM acceleration (Section 2.4.5).
2.1 FARM: Flexible Architecture Research Machine
Heterogeneous architectures that incorporate domain-specific accelerators are fun-
damentally different from existing hardware and difficult to accurately model using
traditional simulators. In particular, traditional simulation techniques fall short when
domain-specific accelerators are tightly coupled to a general-purpose computer through
the system interconnect (see Section 2.4.6). New hardware prototypes are therefore
extremely useful, being faster and more accurate than simulators. In addition to pro-
viding better insight into the system and being able to run larger and more realistic
pieces of code (such as an OS), prototyping allows researchers to find bugs and design
holes earlier in the development cycle.
FARM is based on an FPGA that is coherently tied to a multiprocessor system.
Effectively, this means that the FPGA contains a cache and participates in coherence
activities with the processors via the system’s coherence protocol. Throughout this
chapter we refer to an FPGA connected coherently as a “coherent FPGA.” Coherent
FPGAs allow for prototyping of some interesting segments of the architectural de-
sign space. For example, architectures requiring rapid, fine-grained communication
between different elements can be easily represented using FARM. Ideas involving
modifications to memory traffic, coherence protocols, and related pursuits can also
be implemented and observed at the hardware level, since the FPGA is part of the
coherence fabric. The close coupling also obviates the need for soft cores or other
processors on the FPGA in many cases, since general computation can be done on
the (nearby) processors. Section 2.1.2 provides details about the system architecture
and implementation of FARM. In addition to prototyping, FARM's architecture
is naturally well-suited to the domain-specific architectures this thesis explores,
with the FPGA functioning as the accelerator.
Using a tightly coupled coherent FPGA, whether as an accelerator or for prototyping,
presents communication and sharing challenges. One must provide efficient
and low-latency methods of communication to and from the FPGA. When functioning
in the capacity of an accelerator, in particular, it is essential to understand
the behavior of the communication mechanisms offered by FARM. These mechanisms
include traditional memory-mapped registers (MMRs), a streaming interface, and a
coherent cached interface. Section 2.2.1 details these methods of communication and
suggests how one important application characteristic, the frequency of synchronization,
could affect the choice of communication mechanism.
System designers must understand the tradeoffs and overheads that accompany
each communication type when using it to accelerate applications with various char-
acteristics, especially differing levels of synchronization between the FPGA and the
processors. In particular, knowledge of the execution overhead introduced by using
a dedicated remote accelerator would suggest a minimum for the speedup benefits
gained when using that accelerator. Furthermore, this overhead is not constant, but
rather a function of the type of communication chosen as well as other characteristics,
such as latency and synchronization. Section 2.3 explores these issues by presenting
the performance of a synthetic benchmark on FARM for all communication mech-
anisms and various other factors. Such data should influence users of FARM-like
systems when deciding on implementations of heterogeneous prototypes or coproces-
sors.
2.1.1 Related Work
The FARM prototyping environment follows in the tradition of previous FPGA-based
hardware emulation systems such as the Rapid Prototyping engine for Multiproces-
sors (RPM) [9]. RPM focused on prototyping multiprocessor architectures where
FPGAs are used primarily for gluing together symmetric cores, but not much for
computation. RAMP White [81] is a similar approach, prototyping an entire SMP
system with an FPGA, including CPU cores and a coherency controller. We differ in
that our approach is more directed at evaluating heterogeneous architectures, where
the FPGA prototypes a special-purpose module (e.g., an energy-efficient accelera-
tor) attached to high-performance CPUs. Convey Computer Corporation's HC-1 is a
high-performance computing node that features a coprocessor with multiple FPGAs
and a coherent cache [28]. Convey's machines are different in that they optimize
for memory bandwidth in high-performance, data-parallel applications. The copro-
cessor’s cache is usually only used for things like synchronizing the start sequence.
Recently, AMD researchers have also implemented a coherent FPGA [7]. AMD's
system and ours use different versions of the University of Heidelberg's cHT core to
handle link-level details of the protocol¹, but AMD does not give a thorough analysis of
system overheads for various configurations and usages.
Indeed, there has not been much discussion on how these coherent FPGA systems
can be well-utilized, and what kinds of applications can benefit from them. In this
section we discuss issues such as system utilization and present some key considera-
tions to account for when building with these systems. We also provide the detailed
¹The cHT core was provided by the University of Heidelberg [60] under an AMD NDA. We made modifications and extensions to the core to improve functionality, increase performance, and integrate with the FARM platform.
[Figure 2.1 content: two quad-core 1.8 GHz AMD Barcelona CPUs (64K L1 per core, 512KB L2 per core, 2MB shared L3 per chip) connected by 32 Gbps HyperTransport links (~60 ns); an Altera Stratix II FPGA (132k logic gates) attached over a 6.4 Gbps coherent HyperTransport link (~380 ns), hosting the cHT core, data transfer engine, configurable coherent cache, cache and data stream interfaces, MMRs, and the user application.]
Figure 2.1: Diagram of the Procyon system with the FARM hardware on the FPGA.
design and implementation of our system.
2.1.2 FARM System Architecture
This section presents the design details of FARM. We begin with a description of
the system architecture and the hardware specifications of our particular implemen-
tation. We then describe the usage of the FPGA in FARM and detail the design and
structure of some of our key units. We also reveal our implementation of the coher-
ent HyperTransport protocol layer and describe methods and strategies for e�ciently
communicating coherently with CPUs.
FARM is implemented as an FPGA coherently connected to two commodity
CPUs. The three chips are logically connected using point-to-point coherent Hy-
perTransport (HT) links. Figure 2.1 shows a diagram of the system topology, along
with bandwidth and latency measurements, as well as the high level design of the
FARM hardware. Memory is attached to each CPU node (not shown). Latency mea-
surements in the figure represent one-way trip time for a packet from transmission to
reception, including de-serialization and bu↵ering logic.
We used the Procyon system, developed by A&D Technology Inc. [3], as a baseline
in the construction of the FARM prototype. Procyon is organized as a set of three
daughter boards inter-connected by a common backplane via HyperTransport. Figure
2.2 shows a photograph of the Procyon system. The first board is a full system board
featuring an AMD Opteron CPU, some memory, and standard system interfaces
Figure 2.2: Photo of the Procyon system with a main board, CPU board, and FPGA board.
CPU Type             AMD Barcelona 4-core (2 CPUs)
Clock Freq           1.8 GHz
L1 Cache             Private: 64KB Data, 64KB Instr
L2 Cache             Private: 512KB Unified
L3 Cache             Shared: 2MB
DRAM                 3GB (2GB on main system board)
HT Link Type         HyperTransport: 16-bit links
CPU-CPU HT Freq      HT1000 (1000 MT/s)
CPU-FPGA HT Freq     HT400 (400 MT/s)
FPGA Device          Stratix II EP2S130
Physical Topology    3 boards connected via backplane
Logical Topology     Single chain of point-to-point links
Table 2.1: Hardware specifications of the FARM system.
such as USB and GigE NIC. The second board houses another Opteron CPU and
additional memory. The third board is an FPGA board with an Altera Stratix II
EP2S130 and support components used for programming and debugging the FPGA.
The photograph shows the FPGA board, secondary CPU board, and full system
board from left to right, respectively. The system runs on both Linux and Solaris;
our experiments were run on Arch Linux with Linux kernel 2.6.31. Table 2.1 gives a
detailed listing of FARM’s hardware specifications.
Our FARM device driver is somewhat unique in that it is the driver for a coherent
device, which looks quite different to the OS than a normal non-coherent device. To
allow for flexibility in communication with the FPGA, the driver reconfigures the
system’s DRAM address map (in the MTRRs and PCI configuration space) to map
a section of the physical address space above actual physical memory to “DRAM” on
the FPGA. We must keep this memory hidden from the OS to prevent it from being
used for normal purposes. Using the mmap mechanism, these addresses are mapped
directly into the user program’s virtual address space. The FPGA then acts as the
memory controller for this address space, allowing the user program to read and write
directly to the FPGA (see Section 2.2.1).
The FARM device driver is also used to pin memory pages and return their phys-
ical address in order to facilitate coherent communication from the FPGA to the
processor. An alternative, albeit more complicated, solution would be to maintain a
coherent TLB on the FPGA.
Reconfigurability in a prototype built with FARM is provided via the attached
FPGA. The FPGA houses modules that allow for general coherent connectivity to
the processors as well as a means by which the coprocessor or accelerator can use
these modules. As shown in Figure 2.1, the FARM platform implements a version of
AMD’s proprietary coherence protocol, called coherent HyperTransport (cHT). With
some exceptions, the cHT definition is a superset of HyperTransport that allows for
the interconnection of CPUs, memory controllers, and other coherent actors. Coher-
ent HyperTransport implements a MOESI coherence protocol. The cHT core, also
described in the introduction, handles only link-level details of the protocol such as
flow control, CRC generation, CRC checking, and link clock management. Primarily,
the core interfaces between the serialized incoming HT data (in LVDS format) and
the standard cHT packets which are exchanged with the logic behind the core. We
designed and implemented the custom transport layer logic, the Data Transfer Engine
(DTE), to process these packets. The DTE handles: enforcement of protocol-level
correctness; piecing together and unpacking HT commands; packing up and sending
HT commands; and HT tag management. The DTE also handles all the details of
being a coherent node in the system, such as responding to snoop requests. In ad-
dition, the FARM platform includes a parameterized set-associative coherent cache.
We will provide design and implementation details for the DTE and the cache later
in this section. Finally, there is also a small memory mapped register (MMR) file for
status checking and other small-scale communication with the processors.
The FARM platform provides three communication interfaces for the hardware
being prototyped by the user on the FPGA, or the user application. One is a co-
herent interface. Having a coherent cache, the FPGA can communicate with a CPU
using the normal coherence protocol. In the current implementation, we circumvent
the need for a coherent TLB on the FPGA by using only physical addresses of a
pinned contiguous memory region. Another interface is a stream interface where we
support streaming (or “fire-and-forget”) non-coherent communication. To implement
this interface, the FPGA is assigned a specific range of the physical address space.
This memory region can be marked as uncacheable, write-combining, write-through,
or write-back. Our original design marked this “FARM memory” as uncacheable
to allow for communication with FARM that bypassed the cache. However, the
Barcelona CPUs impose very strict consistency guarantees on uncacheable memory,
so we instead mark this section as write-combining in FARM. This marks the region as
non-coherent and bypasses the Opteron's store-ordering requirements without imposing
the strict consistency of "uncacheable" memory, which would impede streaming
data. The final interface is standard memory-mapped registers (MMRs). A detailed
comparison of these interfaces can be found in Section 2.2.1.
We use dual-clock buffers and (de-)serialization blocks to partition the FPGA into
three different clock domains: the HyperTransport links, the cHT core, and the rest of
the acceleration logic (everything “above” the cHT core). In our base configuration:
[Figure 2.3 content: the DTE sits between the cHT bus/cHT core and the user application, comprising a snoop handler, data requester, data handler, and stream-in traffic handler, and connecting to the coherent cache and MMRs.]
Figure 2.3: Block diagram of data transfer engine (DTE) components. Arrows represent requests and data buses.
the user application and cHT core run at 100 MHz and the HyperTransport links at
200 MHz.
2.1.3 Module Implementation
The DTE and the cache are two vital units allowing the accelerator to communicate
with the processors, process snoops, and store coherent data. In this section, we
briefly describe the design and structure of these modules as implemented on our
FPGA.
Data Transfer Engine
The DTE’s primary responsibility is ensuring protocol-level correctness in Hyper-
Transport transactions. Figure 2.3 shows a block diagram of the components of the
DTE. A typical transaction is the following: If the data requester on the FPGA
requests data from remote memory (owned by one of the Opteron CPUs), snoops
and responses must be sent among all coherent nodes of the system (assuming no
directory) to ensure that any dirty cached data is accounted for. In this example,
because the FPGA is the requester, the DTE’s data handler is responsible for count-
ing the responses from all caches as well as the data’s home memory controller and
selecting the correct version of the data. Evictions from the FPGA’s cache to remote
memory are also fed to the cHT core via the data requester. In addition, snoops
incoming to the FPGA are processed by the snoop handler in the DTE. The DTE
also handles incoming traffic for stream and MMR interfaces. In doing so, the DTE
acts as a pseudo-memory controller for memory requests belonging to the FPGA’s
memory range. Coherent HyperTransport supports up to 32 simultaneously active
transactions by assigning tags to each transaction, so the design must be robust to
transaction responses and requests arriving out of order. The DTE handles this by
using tag-indexed data structures and tracking tags of incoming and outgoing packets
in the data stream interface.
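The tag-indexed bookkeeping this requires can be sketched as below. The entry fields and retirement rule are illustrative, not FARM's actual DTE structures; the point is that a tag can retire out of order as soon as all of its responses have arrived:

```c
/* Sketch of tag-indexed transaction tracking: cHT allows up to 32
 * simultaneously active transactions, each identified by a tag, and
 * responses may arrive in any order. */
#include <stdint.h>

#define NUM_TAGS 32

typedef struct {
    uint8_t  busy;        /* tag currently owns an outstanding request */
    uint64_t addr;        /* cache-line address of the request         */
    uint8_t  resp_count;  /* snoop/memory responses seen so far        */
    uint8_t  resp_needed; /* responses required before data is final   */
} txn_entry;

static txn_entry txn_table[NUM_TAGS];

/* Claim a free tag for a new outgoing request; -1 if all 32 in flight. */
int txn_alloc(uint64_t addr, uint8_t resp_needed)
{
    for (int t = 0; t < NUM_TAGS; t++) {
        if (!txn_table[t].busy) {
            txn_table[t] = (txn_entry){1, addr, 0, resp_needed};
            return t;
        }
    }
    return -1;
}

/* Record one response for `tag`; returns 1 when every expected
 * response has arrived and the transaction can retire, else 0. */
int txn_response(int tag)
{
    txn_entry *e = &txn_table[tag];
    if (++e->resp_count < e->resp_needed)
        return 0;
    e->busy = 0;   /* all responses in: free the tag for reuse */
    return 1;
}
```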
Configurable Coherent Cache
In general, FARM’s configurable coherent cache behaves like an ordinary data cache;
it coherently keeps data in the vicinity of the computation by initiating data transfers
and responding to snoop requests. However, we have made a few different choices in the
design and implementation of the cache to best serve our target applications. For
example, in the current implementation, we do not have a coherent TLB on the FPGA,
but instead use only physical addresses of a pinned contiguous memory region.
Figure 2.4 shows the block diagram of our coherent cache module. The cache is
composed of three major subblocks. The core is where the traditional set-associative
memory lookup happens; the write buffer keeps track of evicted cache lines until they
are completely written back to memory; and the prefetch buffer is an extended fill
buffer to increase data fetch bandwidth. There are three distinct data paths from the
cache to the DTE: fetching data, writing data back, and snooping. All data transfers
happen at cache line granularity. The user application can request that the cache
prefetch a line and read or write to memory using a normal cache interface.
Our normal cache interface supports simple in-order reads and writes at word
granularity.2 This is a valid compromise of design complexity (and power, area, and
2In actuality, our cache is not strictly in-order but supports hit-under-miss. That is, the interface
[Figure 2.4 content: the coherent cache comprises a configurable cache core, prefetch buffer, and write buffer, with fetch-data, write-back, and snoop paths between the user application and the DTE.]
Figure 2.4: Block diagram of coherent cache components. Arrows represent the direction of data flows, rather than that of requests.
verification) against application performance since we seldom expect complex out-
of-order computation behind our cache. However, the user application can initiate
multiple data fetch transfers through the prefetch interface. Unlike the normal inter-
face, the prefetch interface is non-blocking as long as there is an empty slot in the
buffer. This design is based on the observation that in many cases the user application
can pre-compute a set of addresses to be accessed.
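The intended usage pattern can be sketched as follows; the buffer depth and function names are illustrative stand-ins for the prefetch interface, with the non-blocking behavior modeled in software:

```c
/* Usage sketch: the user logic pre-computes a set of line addresses
 * and issues them through a non-blocking prefetch interface, which
 * accepts requests only while the prefetch buffer has a free slot. */
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_SLOTS 4                 /* hypothetical buffer depth */

static uint64_t slots[PREFETCH_SLOTS];
static size_t   used;

/* Returns 1 if the line fetch was accepted, 0 if the buffer is full
 * (the caller retries later rather than stalling). */
int cache_prefetch(uint64_t line_addr)
{
    if (used == PREFETCH_SLOTS)
        return 0;
    slots[used++] = line_addr;
    return 1;
}

/* Issue as many pre-computed addresses as fit; returns how many were
 * accepted. */
size_t prefetch_batch(const uint64_t *addrs, size_t n)
{
    size_t issued = 0;
    while (issued < n && cache_prefetch(addrs[issued]))
        issued++;
    return issued;
}
```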
The cache module is responsible for maintaining the coherence of the data it has
cached. First, the cache answers incoming snoop requests by searching for the line
in all three subblocks simultaneously. Snoop requests have the highest priority since
their response time is critical to system-wide cache miss latency. Second, the module
must maintain the coherence status of each cached line. For simplicity, our current
implementation assumes that cache lines are either modified or invalid; exclusive
access is requested for each line brought in to the cache. This simplification is based
on the observation that for our current set of target applications, the cache is most
often used for producer-consumer style communication where non-exclusive access to
the line is not beneficial.
stalls at the second miss, not the first.
                     FARM modules
4Kbit Block RAMs     144 (24%)
Logic Registers      16K (15%)
LUTs                 20K
FPGA Device          Stratix II EP2S130
FPGA Speed Grade     -3 (Fastest)
Table 2.2: Summary of FPGA resource usage.
The cache uses physical addresses, not virtual addresses. This saves us from
implementing address translation logic, a TLB, and a page-table walker in hardware
and from modifying the OS to correctly manage the FPGA’s TLB. Instead we rely
on the software to use pinned pages provided by our device driver for shared data.
FPGA Resource Usage
Table 2.2 shows an overview of the resource usage on the FPGA. We made an effort
to minimize the usage of FPGA resources by FARM modules in order to maximize
free resources for the user application. Note that the cache module has several con-
figuration parameters, including total size and associativity of the cache, size of each
cache line, and others. These parameters are configured at synthesis time to meet
area, frequency, and performance constraints for the application. The numbers for FARM
modules in the table reflect a 4KB, 2-way set associative cache.
2.2 Techniques for fine-grain acceleration
When an accelerator requires frequent communication with the computation done
on the general purpose processors, two fundamental design decisions must be made:
how to communicate with the accelerator at the lowest level, and how to tolerate
the adverse characteristics of the underlying interconnect, such as large, variable
latency and out-of-order delivery of data. In this section we first describe various
communication mechanisms and when they should be used, noting how they have
been implemented in FARM when applicable. We then look at methods of tolerating
latency and reordering in the underlying interconnect.
2.2.1 Communication Mechanisms
Fundamentally, communication between an accelerator and a processor can be per-
formed either synchronously or asynchronously. As we will see, breaking down the
communication into these two methods and reducing the synchronous communica-
tion as much as possible is a critical step in the process of designing an acceleration
system.
A single method of communicating with an accelerator will not be sufficient for
all situations. For example, nearly all accelerators will need an asynchronous method
of moving data from the processor to the accelerator, but will also need to occa-
sionally perform synchronous communication just as most parallel algorithms require
synchronization to coordinate the computation across the nodes.
FARM supports multiple communication mechanisms tailored for different situa-
tions. Applications may use traditional memory-mapped registers (MMRs), a stream-
ing interface for pushing large amounts of data to the FPGA with low overhead, or a
coherent cache for communicating with the FPGA as if it were another processor in
a shared memory system.
MMRs are traditionally used for infrequent short communication, such as config-
uration, because of the time required to read and write to them. FARM allows for
much faster access to the MMRs because of the FPGA’s location as a point-to-point
neighbor of the processors. Specifically, we measured the total time to access an
MMR on FARM to be approximately 672 ns, nearly half the measured 1240 ns to read
a register on an Ethernet controller directly connected to the south bridge via second-
generation PCIe x4 on our system, and in line with the latency of PCIe 3.x devices.
This lower latency allows MMRs in FARM to be used for more frequent communica-
tion patterns like polling. More detailed measurements show that most of the 672 ns
is spent handling the access inside the FPGA, indicating that this latency could be
further reduced by upgrading to a faster FPGA.
FARM's MMRs use uncached memory, which provides strong consistency guar-
antees. However, this means that accesses to multiple MMRs will not overlap and the
total access time will grow linearly with the number of register accesses, just like
those to normal PCI registers. With FARM it is just as simple to put the MMRs in
the write-combining space, which has weaker consistency guarantees but would allow
multiple outstanding accesses (although still disallow caching) and thus provide much
faster multi-register access. Section 2.4.5 uses uncached memory for the MMRs, as
the uncached semantics are closer to the expected use of MMRs.
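From software, an MMR access is simply a volatile load or store at a fixed offset in the mapped register region. The register layout below is hypothetical; the pattern is what matters, and each such access to uncached FARM memory pays the full round trip measured above:

```c
/* Minimal sketch of MMR access through a mapped register window.
 * Register offsets are invented for illustration; `volatile` keeps the
 * compiler from merging, reordering, or eliding the accesses. */
#include <stdint.h>

enum {                         /* illustrative register layout */
    MMR_CONFIG = 0,
    MMR_STATUS = 1,
};

static inline void mmr_write(volatile uint64_t *mmr, int reg, uint64_t v)
{
    mmr[reg] = v;              /* volatile store: issued as written */
}

static inline uint64_t mmr_read(volatile uint64_t *mmr, int reg)
{
    return mmr[reg];           /* volatile load: always performed */
}
```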
FARM's streaming interface is an efficient way for the CPU to push data to
the FPGA. To facilitate streaming data, a physical address range marked as write-
combining is mapped to the FPGA. Writes to this address range are immediately
acknowledged and piped directly to the user application module. The internal pipeline
passes 64 bits of data and 40 bits of address to the user application per clock.
On the CPU, write requests to the streaming interface are queued in the core's
write-combining buffer and execution continues without waiting for the request to
be completed. Consecutive accesses to the same cache line are merged in the write-
combining buffer, reducing off-chip bandwidth overhead. Thus, to avoid losing writes,
every streamed write must be to a different, ideally sequential, address. The CPU
periodically sends requests from the buffer to the FPGA, or an explicit flush can be
performed to ensure that all outstanding requests are sent to the FPGA.
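The CPU side of a streamed push can therefore be sketched as below. The sfence-based flush is one way to drain the write-combining buffer on x86; `stream_base` stands in for the mapped write-combining FARM region (an ordinary buffer in the test):

```c
/* Sketch of streaming data to the FPGA: every write targets a
 * different, sequential 64-bit slot so that no two writes to the same
 * location can merge, then the write-combining buffer is flushed. */
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <emmintrin.h>
#define wc_flush() _mm_sfence()          /* drains WC buffers on x86 */
#else
#define wc_flush() __sync_synchronize()  /* portable stand-in */
#endif

void stream_push(volatile uint64_t *stream_base,
                 const uint64_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        stream_base[i] = data[i];  /* sequential addresses: no lost writes */
    wc_flush();  /* force any partially filled WC buffer out to the FPGA */
}
```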
Finally, the coherent cache allows for shared memory communication between the
CPUs and FPGA. Since the cache on the FPGA is kept coherent, the FPGA can
transparently read data either directly from a CPU’s cache or from DRAM, and vice
versa. The communication latency is simply the off-chip cache miss latency, which is
summarized in Table 2.3. In the table, the column labelled FARM shows the cache
miss latency measured on the current FARM system. Except when the requesting
CPU is two hops away from the FPGA, this latency is fairly constant because the
FPGA’s response to the snoop dominates any other latency. For comparison we also
provide measurements using the same system with the FPGA removed. This increase
in latency would be intolerable for an end product, but is reasonable for a prototype
platform and would be mitigated by using a faster FPGA.
The coherent communication mechanism is especially beneficial when performing a
pull-type data transfer (i.e., DMA), or when polling for an infrequent event. Figure 2.5
illustrates two di↵erent ways of performing a DMA from the CPU to the FPGA.
Figure 2.5.(a) is the conventional DRAM-based method, where (1) a CPU first creates
Service location of cache miss   FARM     FARM w/o FPGA
Memory                           495 ns   189 ns
Other cache (on-chip)            495 ns   145 ns
Other cache (off-chip)           500 ns   195 ns
FPGA cache (1-hop)               491 ns   N/A
FPGA cache (2-hop)               685 ns   N/A
Table 2.3: Comparison of cache miss latency.
[Figure 2.5 content: (a) conventional DMA through DRAM, in three steps; (b) DMA through the coherent cache, in two steps.]
Figure 2.5: Comparison of DMA schemes.
Interface   Description                                       Approx. Bandwidth   Proposed Usage
MMR         CPU writes to FPGA's MMR                          25 MB/s             Initialization or change of configuration
MMR         CPU reads from FPGA's MMR                         25 MB/s             Polling (likely to hit)
Stream      CPU writes into FPGA's address space              630 MB/s            Data push
Coherent    CPU reads from FPGA's cache                       630 MB/s            Data pull or polling (likely to miss)
Coherent    FPGA reads from CPU's cache (i.e. coherent DMA)   160 MB/s            Data pull or polling (likely to miss)
Table 2.4: Summary of communication mechanisms.
[Figure 2.6 content: (a) non-coherent polling via MMR reads; (b) coherent polling on a shared address, with a check (1) before and a check (2) after the event.]
Figure 2.6: Comparison of non-coherent and coherent polling.
data in its own cache, (2) the CPU moves the data to DRAM, and (3) the FPGA
reads the data from DRAM. Note that during the data preparation steps, (1) and (2),
the CPU is kept busy. FARM’s coherence allows the method shown in Figure 2.5.(b),
where (1) the CPU leaves the data and proceeds while (2) the FPGA reads the data
directly from the CPU’s cache.
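The CPU side of the coherent handoff in Figure 2.5(b) amounts to a publish-and-proceed pattern, sketched below. The descriptor layout and flag protocol are illustrative assumptions, with the FPGA's pull modeled as a software consumer:

```c
/* Sketch of pull-type (coherent DMA) handoff: the CPU fills a pinned
 * shared buffer, publishes a flag, and proceeds; the FPGA then reads
 * the data directly out of the CPU's cache via coherence. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t    data[8];   /* one cache line of payload (pinned) */
    atomic_uint ready;     /* polled coherently by the FPGA      */
} dma_desc;

void cpu_publish(dma_desc *d, const uint64_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d->data[i] = src[i];
    /* release ordering: payload writes become visible before the flag */
    atomic_store_explicit(&d->ready, 1, memory_order_release);
    /* the CPU continues immediately: no copy to DRAM, no stall */
}

/* What the FPGA-side pull amounts to, modeled in software. */
int device_try_pull(dma_desc *d, uint64_t *dst, size_t n)
{
    if (!atomic_load_explicit(&d->ready, memory_order_acquire))
        return 0;
    for (size_t i = 0; i < n; i++)
        dst[i] = d->data[i];
    return 1;
}
```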
The coherent interface is also beneficial when polling infrequent events [59]. Fig-
ure 2.6 illustrates this by comparing (a) non-coherent polling through MMR reading
and (b) coherent polling through a shared address. In both cases, the event to be
polled is represented as a star, and the CPU polls it before and after the event, de-
noted as (1) and (2) respectively. In Figure 2.6.(a), (1) and (2) have the same MMR
reading latency, while in (b), (1) has the negligible latency of a cache hit and (2) has
up to twice the cache miss latency. Thus, when the event is infrequent, the majority
of checks performed by the CPU are simply a cache hit and do not stall the CPU at
all.
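The coherent polling loop of Figure 2.6(b) reduces, in software, to spinning on a shared cached flag; every failed check is an L1 hit, and only the check that observes the event pays a miss. The flag and its update path are modeled in plain memory here:

```c
/* Sketch of coherent polling on a shared flag.  In FARM the flag would
 * be a shared location the FPGA writes; each failed check below costs
 * only a cache hit, not an off-chip MMR read. */
#include <stdatomic.h>

/* Poll up to `max_checks` times; returns the index of the check that
 * observed the event, or -1 if the event never occurred. */
int coherent_poll(const atomic_int *flag, int max_checks)
{
    for (int i = 0; i < max_checks; i++)
        if (atomic_load_explicit(flag, memory_order_acquire))
            return i;
    return -1;
}
```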
Table 2.4 summarizes communication mechanisms based on FARM’s three inter-
faces and their proposed usages. The MMR bandwidth numbers are for MMRs in
uncached memory. The round-trip latency to the FPGA is the limiting factor for
the MMR bandwidth. The bandwidth of the FPGA reading from the CPU’s cache
is limited by the bandwidth of the cHT core because the data read pathway has not
been optimized. Measurements indicate that optimizing this pathway could bring
this number up to at least 320 MB/s.
2.2.2 Tolerating latency and reordering
For many applications, like TM, that require fine-grained (frequent) communication
between the processor and an accelerator, asynchronous communication is essential
for performance. When using fully asynchronous communication to out-of-core de-
vices, however, it is incorrect to assume that commands are received by the accelerator
in the same order they were dispatched from the processors. Consider the following
example: One processor sends a command to add an address to a transaction’s read
set; this command stalls in the processor’s write-combining bu↵er. Later, a commit-
ting transaction on another processor sends notification that it is writing to that same
address. This notification arrives immediately (before the preceding add to read set
by the first processor) and thus the conflict is missed because the FPGA sees the com-
mit notification and the add to the read set command in reverse order. To avoid the
performance penalty of a more synchronous communication scheme (e.g. an mfence
after each command), accelerators such as those in TMACC must therefore reason
about possible command reorderings.
To address this serious issue, we present epoch-based reasoning and apply the
technique to our Bloom filter accelerators. In this scheme, we split time into variable
sized epochs, either locally determined (local epochs) or globally agreed upon (global
epochs). Global epochs can be implemented using a single shared counter variable
that is atomically incremented when a thread wants to move the system into a new
epoch. To inform the accelerator of the epoch in which a command is executed, the
epoch counter, which will usually be in the L1 cache, is read and included in the com-
mand. The accelerator then compares the epochs of commands to determine a coarse
ordering, with the atomic increment providing the necessary happens-before relation-
ship between threads. The accelerator cannot determine the ordering of commands
with the same epoch number, since it may only assume the commands were fired at
some point during the epoch (see Figure 2.7). Thus, the granularity of epoch changes
[Figure 2.7 content: timelines for threads A, B, and C across epochs N-1, N, and N+1, under both global and local epoch schemes.]
Figure 2.7: To determine the ordering of events, time is divided into epochs, either globally or locally. In the global epochs example, it is known that A comes before B and C, but not the relative ordering of B and C. In the local case, it is known that C comes before B, but not the ordering of A and B or A and C, because their epochs overlap.
determines the granularity at which the accelerator is able to determine ordering.
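The global-epoch scheme can be sketched as below. The command format is illustrative; the essential points are that the counter is bumped atomically (providing the happens-before edge) and that reading it to tag a command is normally an L1 hit:

```c
/* Sketch of global epochs: one shared counter, sampled into every
 * command sent to the accelerator and atomically incremented when a
 * thread needs an ordering point. */
#include <stdatomic.h>
#include <stdint.h>

static atomic_uint_fast64_t global_epoch = 1;

typedef struct {
    uint64_t epoch;   /* epoch in which the command was issued          */
    uint64_t addr;    /* payload, e.g. an address added to a read set   */
} acc_cmd;

/* Tag a command with the current epoch (usually a cheap cached load). */
acc_cmd make_cmd(uint64_t addr)
{
    return (acc_cmd){ atomic_load(&global_epoch), addr };
}

/* Move the system into a new epoch before an ordering-critical action
 * such as a commit; the atomic increment is the happens-before edge. */
uint64_t advance_epoch(void)
{
    return atomic_fetch_add(&global_epoch, 1) + 1;
}

/* The accelerator can order two commands only across distinct epochs;
 * commands sharing an epoch are unordered. */
int ordered_before(acc_cmd a, acc_cmd b)
{
    return a.epoch < b.epoch;
}
```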
The potentially high overhead of maintaining a single global counter can be elim-
inated by using epochs local to each thread. When a thread wants to move into a
new local epoch, it sends a command to the accelerator to inform it of an epoch
change and performs a memory fence to ensure any command tagged with the new
epoch number happens after the accelerator sees the epoch change. The epoch change
command can often be included in an existing synchronous command with low cost.
While this scheme has less overhead, it leaves the accelerator with less information
about the ordering of events. Like the global scheme, the accelerator may only as-
sume the command was fired at some point during the epoch; therefore the relative
ordering of commands from different threads can only be determined if their epochs
do not overlap, as illustrated in Figure 2.7.
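With local epochs, the accelerator knows only an interval of time during which each command may have fired, so two commands can be ordered only when their intervals are disjoint. A sketch of that test, representing each command's epoch as a (start, end) interval on a common timeline (an illustrative representation, not the hardware's):

```python
def can_order(interval_a, interval_b):
    """Order two commands by their epoch intervals.

    Returns 'a_before_b' or 'b_before_a' when the intervals are
    disjoint, and 'unknown' when they overlap (as for A and B or
    A and C in the local epochs example of Figure 2.7).
    """
    a_start, a_end = interval_a
    b_start, b_end = interval_b
    if a_end < b_start:
        return "a_before_b"
    if b_end < a_start:
        return "b_before_a"
    return "unknown"
```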
2.3 Microbenchmark Analysis
Designers of memory-system-based accelerators, such as FARM, would benefit from understanding how key application characteristics affect the overhead introduced by the system. For example, it is clear that one would avoid the fully synchronous MMR write for frequent communication with the accelerator. Less obvious, however, is the choice between using streaming versus DMA for moving data to the accelerator. Side effects such as CPU involvement, which would be considerably greater in the streaming
case, complicate matters further.

Algorithm 1 Microbenchmark to characterize communication mechanisms.
procedure MainLoop(numIter, commType, N, M, K)
    for i = 1 to numIter do
        for j = 1 to K do
            InitCommunication(commType, M)
            DoComputation(N)
        Synchronize(commType)

procedure InitCommunication(commType, M)
    switch commType do
        case MMR: doMMRWrite(M)
        case STREAM: doStreamWrite(M)
        case DMA: InitiateDMA(M)

procedure DoComputation(N)
    for j = 1 to N do
        nop()

procedure Synchronize(commType)
    switch commType do
        case MMR: doNothing()                      ▷ MMR is always synchronous
        case STREAM: flushWriteCombiningBuffer()
        case DMA: waitForDMADone()
To adequately address questions such as these, we constructed a microbenchmark
that allows for variation of key parameters affecting communication overhead. Algorithm 1 displays its pseudocode. Three parameters control the behavior of the
communication:
• N controls the frequency of communication. That is, communication happens
every N CPU operations.
• M controls the granularity of communication by specifying how much data (in
bytes) is transferred per communication.
• K controls the frequency of synchronization. Synchronization occurs after every K sets of communication/computation segments. If K is ∞, we assume synchronization happens only once: at the end of the application.

Figure 2.8: Analysis of communication mechanisms using the microbenchmark in Algorithm 1. Both panels plot measured communication overhead (cycles/B, lower is better) against communication granularity M (bytes) for the STREAM and DMA interfaces: (a) the effect of communication granularity (M) and frequency (N); (b) the effect of synchronization frequency (K). The detailed meaning of parameters M, N, and K can be found in Algorithm 1.
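The loop structure driven by these three parameters can be sketched in Python; the hardware-specific communication and synchronization primitives are stand-ins supplied by the caller, and all names are illustrative:

```python
def main_loop(num_iter, comm_type, n, m, k,
              init_communication, do_computation, synchronize):
    """Structure of the microbenchmark: each iteration performs K
    communication/computation segments, then one synchronization."""
    for _ in range(num_iter):
        for _ in range(k):
            init_communication(comm_type, m)   # MMR, STREAM, or DMA write of M bytes
            do_computation(n)                  # N no-op CPU operations
        synchronize(comm_type)                 # flush or wait, depending on comm_type

# Example instrumentation: count communications and synchronizations.
counts = {"comm": 0, "sync": 0}

def fake_init(comm_type, m):
    counts["comm"] += 1

def fake_compute(n):
    pass

def fake_sync(comm_type):
    counts["sync"] += 1

main_loop(10, "STREAM", n=256, m=1024, k=5,
          init_communication=fake_init,
          do_computation=fake_compute,
          synchronize=fake_sync)
# counts == {"comm": 50, "sync": 10}
```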
Figure 2.8 explores the effects of communication granularity, communication frequency, and synchronization on communication overhead. The vertical axis is communication overhead measured in cycles per byte received by the FPGA (lower is better). We first examine the case of asynchronous communication (i.e. K is ∞) in graph (a).
For the streaming interface (solid lines), the results for all communication frequencies are asymptotic, with the overhead approaching 2.8 cycles/B for large M. After taking into account the CPU clock frequency (1.8 GHz), this value is close to the 630 MB/s bandwidth limit reported in Table 2.4. As we decrease M, however, we see
Figure 2.9: Visualized explanation of Figure 2.8. (a) Stream interface: when the granularity (M) is large, the communication overhead is determined solely by the bandwidth limit, while the CPU's instruction reordering can hide the overhead for small M. (b) DMA interface: a similar explanation applies, and the communication overhead can be completely hidden depending on the choice of M and N.
the overhead decrease and even drop below the bandwidth limit. This is because for smaller amounts of data, the overhead can be hidden by the CPU's out-of-order window. Figure 2.9(a) provides a visualized explanation of this effect. For frequent communication (N=256), there is not enough computation to hide the communication latency, which explains the increased overhead for this data point compared to the other three.
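The streaming asymptote can be checked with direct arithmetic: a 1.8 GHz clock divided by a 630 MB/s link gives roughly 2.86 cycles per byte, consistent with the observed 2.8 cycles/B plateau.

```python
cpu_hz = 1.8e9            # CPU clock frequency (1.8 GHz)
link_bw = 630e6           # streaming bandwidth limit from Table 2.4, bytes/s
cycles_per_byte = cpu_hz / link_bw
# cycles_per_byte ≈ 2.857, matching the ~2.8 cycles/B asymptote
```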
For DMA communication of data from the CPU's cache to the FPGA (dashed lines), we immediately see that the overhead is increased. Note, however, that the general behavior of the curves is similar to that of the streaming case. Figure 2.9(b) provides further insight into DMA behavior. The figure on the left depicts
the case where N=16384 and M=1024. In this scenario, the actual DMA transfer
time is fully overlapped with the subsequent computation. When this is the case, the
overhead is simply the time taken to setup the DMA. For very small M , the small
amount of computation per DMA is not enough to amortize this setup time. As the
amount of data per communication goes up, the setup time is amortized and the
overhead per byte goes down. If we increase M to the point that data transfer time
becomes longer than computation time (seen on the right of Figure 2.9(b)), we see a
dramatic increase in the overhead. As in the streaming case, the overhead converges
to the bandwidth of the DMA transfer (See Table 2.4).
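This DMA behavior can be captured by a simple analytic model, sketched below. The setup cost and bandwidth values are illustrative parameters, not measured constants; the model only assumes, as described above, that setup is always exposed while the transfer overlaps with computation.

```python
def dma_overhead_per_byte(m, n, setup_cycles, cycles_per_byte_bw, cycles_per_op=1.0):
    """Overhead per byte for one DMA of m bytes overlapped with n CPU ops.

    The DMA setup is always on the critical path; the transfer itself
    adds overhead only once it exceeds the overlapping computation time.
    """
    compute = n * cycles_per_op
    transfer = m * cycles_per_byte_bw
    exposed = setup_cycles + max(0.0, transfer - compute)
    return exposed / m

# Small m: setup is poorly amortized. Large m: converges to the bandwidth.
small = dma_overhead_per_byte(m=100, n=16384, setup_cycles=1000, cycles_per_byte_bw=2.0)
large = dma_overhead_per_byte(m=1_000_000, n=16384, setup_cycles=1000, cycles_per_byte_bw=2.0)
# small = 10.0 cycles/B; large ≈ 2.0 cycles/B
```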
Figure 2.8(b) explores the effect of synchronization frequency. Smaller K means more frequent synchronization. We take two data points from graph (a) for both streaming (N=256) and DMA (N=16384), and we vary K. For the streaming interface, synchronization means flushing the write-combining buffer. For coherent DMA, synchronization requires waiting (busy waiting) until all queued DMA operations have finished. For very large communication granularity (M), the overhead is bounded by the bandwidth in both cases and synchronization does not matter. For smaller M, however, both communication methods exhibit an increase in overhead. For the streaming interface, flushing the write buffer cripples the CPU's out-of-order latency-hiding effect, hence the increased overhead for K=1. For DMA, synchronization adds the fixed overhead of setting up the DMA.
2.4 Transactional Memory Case Study
Transactional memory (TM) [39, 49] is a potential way to simplify parallel program-
ming. Ideally, TM would allow programmers to make frequent use of large transac-
tions and have them perform as well as highly optimized fine-grain locks. However,
this ideal cannot be realized until there are real systems capable of executing large
transactions with low overhead. Our aim in this section is to describe a TM system
that strikes a reasonable balance between performance, cost and system implementa-
tion complexity.
Researchers have proposed a wide variety of TM systems. There are systems im-
plemented completely in hardware (HTMs), completely in software (STMs), and more
recently, systems with both hardware and software components (hybrid TMs). To put
our contributions in context, we now briefly review the strengths and weaknesses of
the various TM design alternatives.
2.4.1 TM Design Alternatives and Related Work
STM
Software transactional memory (STM) systems [70, 34, 38, 66, 53, 75] replace the
normal loads and stores of a program with short functions (“barriers”) that pro-
vide versioning and conflict detection. These transactional read and write barriers
must themselves be implemented using the low-level synchronization operations pro-
vided by commodity processors. The barriers can be inserted automatically by a
transaction-aware compiler [8, 80, 5] or managed runtime [66], added by a dynamic
instrumentation system [62], or invoked manually by the programmer. STMs increase
the number of executed instructions, perform extra loads and stores, and require meta-
data that takes up cache space and needs to be synchronized. The resulting inherent
performance penalty means that despite providing good scalability, most STMs fall
far short of the performance offered by hardware-based approaches to TM. There have been proposals that reduce the overhead required [80], but they do so by giving up on the promise of TM: they require small transactions that are used rarely. Hence, using these STMs is as difficult as using fine-grain locks. As a result of these limitations, STMs have been largely constrained to the domain of research [21]. However, techniques developed in STM research have been successfully used for optimized parallel data structures [17].
HTM
At the opposite end of the spectrum from STM is hardware transactional memory
(HTM) [37, 23, 12, 65, 13, 51]. HTM systems eliminate the need for software barri-
ers by extending the processor or memory system to natively perform version man-
agement and conflict detection entirely in hardware, allowing them to demonstrate
impressive performance. Version management in an HTM is performed by either
buffering speculative state (typically in the cache or store buffer) or by maintaining
an undo log. Metadata that allows conflict detection is typically stored in Bloom
filters (signatures) or in bits added to each line of the cache. The close synergy of the
hardware with the processor core and cache allow these systems to provide very high
levels of performance; however this tight relationship causes the system to be inflexi-
ble and more costly. Recent advances in HTM design address both of these problems
by minimizing the coupling between the TM and the processor core [82, 71], but even
decoupled HTM designs introduce nontrivial design complexity and disturb the delicate control and data paths in the processor core. The first-level cache has the effect of hiding loads and stores from the outside world, making it impossible to construct
an out-of-core pure HTM system. Previous studies have not explored the possibil-
ity of adding transactional acceleration hardware without modifying a commodity
processor core.
In addition to the design complexity introduced by hardware-based TMs, there
still remains some uncertainty as to the optimal lightweight, forward-compatible se-
mantics appropriate for transactional memory. Several open questions are yet to
be resolved: strong versus weak isolation, methods of handling I/O, optimal contention management, virtualizing unbounded transactions, etc. These features have raised questions because of the difficulty of virtualizing them and their inability to elegantly handle unbounded transactions. When one also considers the latent skepticism regarding transactional memory as a viable general programming model, the hesitation
of hardware vendors to wholly adopt TM features may seem justified. Given these
barriers to adoption, it is not terribly surprising that the microprocessor industry has
yet to embrace HTM.
Hybrid TM
One way of limiting the complexity required by an HTM is to provide a limited
best-effort HTM that falls back to an STM if it is unable to proceed [30, 48, 79,
40, 24]. These systems are particularly well-suited for supporting lock-elision and
small transactions. However, applications that use large transactions (or cannot tune
their transactions to avoid capacity and associativity overflows) will find that they
derive no benefit. This approach is especially problematic as the research community
explores transactional memory as a programming model, since it prescribes a limit
on how transactions may be used efficiently.
Hardware Accelerated STM
Hardware accelerated STMs are a type of hybrid TM that use dedicated hardware to
improve the performance of an STM. This hardware typically consists of bits in the
cache or signatures that accelerate read set tracking and conflict detection. Existing
proposals extend the instruction set to control new data paths to the TM hardware.
Explicit read and write barriers then use the TM hardware to accelerate conflict
detection and version management [72, 67, 19].
TMACC Motivation
We observe that hardware acceleration of an STM’s barriers only requires that the
runtime be able to communicate with the hardware; the TM hardware need not be
part of the core or connected to the processor with a dedicated data path. Commodity
processors are already equipped with a network that provides high bandwidth, low
latency, and dedicated instructions for communication: the coherence fabric. This
leads to the unexplored design space of hardware accelerated TM systems that do
not modify the core, or Transactional Memory Acceleration using Commodity Cores
(TMACC). Early simulation results, presented in Figure 2.10, show the promising
potential of TMACC systems to perform within five to ten percent of an in-core
hybrid TM system. These results also suggest that much of that performance can
be realized despite a relatively large latency between the processing cores and the
TMACC hardware.
Keeping the hardware outside of the core maintains modularity, allowing archi-
tects to design and verify the TM hardware and processor core independently. This
significantly reduces the cost and risk of implementing TM hardware and allows de-
signers to migrate a core design from one generation to the next while continuing to
provide transactional memory acceleration.
There is therefore great benefit in exploring TM systems that can be feasibly
constructed using commodity processors. Such systems will allow researchers to:
1. better understand and fine-tune TM semantics using real hardware and large
applications
Figure 2.10: Average (mean) performance on the STAMP suite of two simulated TMACC systems, one two cycles away from the core (L1) and one two hundred cycles away (MEM). These are compared to TL2, a pure STM, and an in-core hybrid TM system much like SigTM. Simulated speedup is plotted against the number of processors (2, 4, 8, 16).
2. explore the extent of speedup and hardware acceleration possible without mod-
ifying the processor core
3. better understand the issues associated with tolerating the latency of out-of-core
hardware support for TM
To derive these benefits in this work, we describe the design and implementation
of a hardware accelerated TM system, implemented with commodity processor cores.
Like the accelerators presented in systems like FlexTM [71], BulkSC [22], LogTM-
SE [82], and SigTM [19], we use Bloom filters as signatures of a transaction’s read
and write sets. Unlike these previous proposals, our Bloom filters are located outside
of the processor and require no modifications to the core, caches, or coherence pro-
tocol. In this thesis we also address the non-trivial challenges encountered when the
acceleration hardware is moved out of the core.
2.4.2 Accelerating TM
In this section we present our system for Transactional Memory Acceleration using
Commodity Cores, or TMACC. We first give a high level overview of our design
decisions and describe our general use of Bloom filters. We follow with a more detailed
description of our Bloom filter hardware, which is general and flexible enough to be
placed anywhere in the system. We describe how we implement this hardware using
FARM, and using the two techniques described in Section 2.2.2, present two distinct
TM algorithms using this hardware.
In any TM system, the processor must have very low latency access to transaction-
ally written data while hiding that data from other executing threads. Performing
this version management in hardware and being able to promptly return specula-
tively written data would almost certainly require modification of the L1 data cache
or the data path to that cache. Previously proposed HTM systems use buffers next to
the L1, or the L1 itself, to store this speculative data until the transaction commits.
Imposing out-of-core latencies on these accesses would significantly degrade perfor-
mance. We therefore conclude that performing hardware-based or hardware-assisted
version management in a TMACC system is impractical.
To address this issue of version management, our software runtime uses a heavily optimized chaining hash table as a write buffer. A transactional write simply adds an address/data pair to this hash table. Each transactional read must first check for inclusion of the address in the write buffer. If it is present, the associated data is used; otherwise, a normal load is performed. The hash table is optimized to return quickly in the common case where the key (the address) is not in the table. Once the transaction has been validated and ordered (i.e. given permission to commit), the write buffer is walked and each entry applied directly to memory. The details of write buffer data structures are more thoroughly explored elsewhere [34, 66, 53, 29].
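A minimal sketch of such a write buffer, with a plain Python dict standing in for the heavily optimized chaining hash table (names are illustrative):

```python
class WriteBuffer:
    """Speculative write buffer: address -> data, applied to memory on commit."""

    def __init__(self):
        self._table = {}

    def write(self, addr, data):
        self._table[addr] = data          # transactional write: buffer only

    def read(self, addr, memory):
        # Fast path: most reads miss the write buffer and fall through
        # to a normal load from memory.
        if addr in self._table:
            return self._table[addr]
        return memory[addr]

    def apply(self, memory):
        # After validation and ordering, walk the buffer and write back.
        for addr, data in self._table.items():
            memory[addr] = data
        self._table.clear()

memory = {0x100: 1, 0x104: 2}
wb = WriteBuffer()
wb.write(0x100, 42)
assert wb.read(0x100, memory) == 42   # sees its own speculative write
assert wb.read(0x104, memory) == 2    # falls through to memory
wb.apply(memory)
assert memory[0x100] == 42
```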
Application of the write buffer could potentially be performed by the TMACC hardware, freeing the processor up to continue on to the next transaction. However, initial experiments showed that any benefit is outweighed by the impact of reloading the data into the processor's cache after application of the write buffer. This is an area of potential future work.
Like version management, checkpointing the architectural state at the beginning
of a transaction and restoring that state upon rollback would require significant mod-
ification to the processor core in order to be effectively and efficiently handled in
hardware. We thus perform this entirely within the software runtime using the stan-
dard sigsetjmp() and longjmp() routines.
Figure 2.11: Logical block diagram of the Bloom filters. A control block drives an array of filters (Filter 0 through Filter n) and their hash units through add (we), query, clear, copy_in, and copy_out operations, with per-filter tag (tag_in, tag_we, tag_hit, tag_gt), hit, and bit-copy (bits_in, copy_in_data, copy_out_data) signals; commands arrive over a data/addr/wren/req/ack interface.
This leaves conflict detection as the best target for out-of-core hardware accel-
eration. After all, the speculative nature of an optimistic TM system means that
the latency of the actual detection of conflicts is not on the critical path. Conflict
detection is a primary contributor to execution overhead in STM systems, and many
STM proposals have attempted to improve it.
In this work, we present two novel methods for performing conflict detection,
both of which use Bloom filters as signatures of a transaction’s read and write set.
Bloom filters [11] have been shown to be an effective data structure for holding sets of
keys with very low overhead and have been used for multiple applications, including
the acceleration of transactional memory [19, 82, 22, 75]. Like several other TM
proposals, TMACC uses Bloom filters to encode the read and write sets of running
transactions. When a transaction commits, each address that is written can be quickly
checked against the read and write sets of other concurrent transactions in order to
discover conflicts. Details of the TM algorithm can be found in Section 2.4.4. The
TMACC system presented in this work assumes a lazy optimistic STM. There are no
fundamental reasons, however, why TMACC could not be used to accelerate an eager
pessimistic system.
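A minimal software sketch of this use of Bloom filters follows. The hash family here (salted Python hashing) and the filter parameters are illustrative; the hardware uses H3-class hashes over 4 Kbit filters.

```python
class BloomFilter:
    """Set signature: false positives are possible, false negatives are not."""

    def __init__(self, nbits=4096, nhashes=4):
        self.nbits = nbits
        self.bits = 0
        # Illustrative hash family: salted Python hashing, not the
        # H3 class used by the hardware.
        self.salts = list(range(nhashes))

    def _positions(self, key):
        return [hash((salt, key)) % self.nbits for salt in self.salts]

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def query(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

# Conflict detection: check a committing write against another
# transaction's read set signature.
read_set = BloomFilter()
read_set.add(0xdeadbeef)                 # transaction T1 read this address
conflict = read_set.query(0xdeadbeef)    # T2 commits a write to the same address
# conflict is True: T1 must abort (modulo false positives for other addresses)
```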
2.4.3 Implementing TMACC on FARM
In order to fully qualify a TMACC design, we needed a platform that would allow for easy experimentation with real applications, so we developed FARM [61] (see Section 2.1). To implement TMACC, we use two of the logical interfaces for communication between the TMACC accelerator and the CPU: a) the coherent interface
which uses cache lines managed by the coherence protocol and b) the stream interface
which provides streaming (or “fire-and-forget”) non-coherent communication.
Bloom filters
Figure 2.11 presents a block diagram of a collection of Bloom filters. Note that while
logic symbols are used, Figure 2.11 does not represent a physical implementation,
but a logical diagram of the functionality provided. In addition to the normal add,
clear, and query operations, each individual Bloom filter provides functionality to
copy bits in from another filter or broadcast out its bits to other filters. Each Bloom
filter also has a tag associated with it, which can be used, for example, to associate
a Bloom filter with a particular thread of execution. Programmability of the module
is achieved in the control block, which can be programmed to translate high level
application-specific operations to the low level operations (add, query, clear, copy in,
and copy out) sent to each individual Bloom filter. These operations can potentially
be predicated by the tag hit and tag gt signals.
On FARM, the Bloom filters are placed in the placeholder marked “User Appli-
cation” in Figure 2.1. We use four randomly selected hash functions from the H3
class [20]. We considered using PBX hashing [83], which is optimized for space ef-
ficiency, but we were not constrained by logic resources on the FPGA. We perform
copying by stepping through the block RAM word by word. In order to reduce the
number of cycles needed to copy, filters requiring copy support use additional RAM
blocks to widen the interface, resulting in more logic cells for the datapaths. All filters
are logically 4 Kbits in size.
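An H3-class hash can be sketched in a few lines: each function is defined by a random bit matrix, and the hash is the XOR of the matrix rows selected by the set bits of the key. This is a software sketch; in hardware the same computation reduces to pure XOR trees.

```python
import random

def make_h3(key_bits, hash_bits, seed=0):
    """Build one H3-class hash: one random hash_bits-wide row per key bit;
    h(x) = XOR of the rows whose corresponding key bit is set."""
    rng = random.Random(seed)
    rows = [rng.getrandbits(hash_bits) for _ in range(key_bits)]

    def h(x):
        acc = 0
        for i in range(key_bits):
            if (x >> i) & 1:
                acc ^= rows[i]
        return acc
    return h

# Four independently seeded functions index a logically 4 Kbit
# (2^12-entry) filter.
hashes = [make_h3(key_bits=64, hash_bits=12, seed=s) for s in range(4)]
positions = [h(0xdeadbeef) for h in hashes]   # bit positions to set or test
```

A useful property visible in the sketch: H3 hashes are linear over XOR, i.e. h(a ^ b) == h(a) ^ h(b), which is what makes the hardware implementation cheap.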
Software communicates with the Bloom filters using the memory subsystem, which
is the fastest (both highest bandwidth and lowest latency) I/O path to and from a
commodity processor core. Uncached “fire-and-forget” stores can be used to send
asynchronous commands to the filters, such as a request to add an address to a
transaction’s read set. FARM’s data stream interface provides similar functionality;
however, its Barcelona processors are not able to perform true fire-and-forget stores.
Instead, “write-combining” memory is used to provide a way to stream data to the
FPGA with minimal impact on the running processor [61]. The Bloom filter hardware
performs commands serially in the order they are received by the FPGA. The imple-
mentation is pipelined, allowing the filters to easily process all incoming commands
even when the link is fully saturated.
For asynchronous responses, such as a filter match notification indicating a conflict
between transactions, the filters use FARM’s coherent interface to store a message in
a previously agreed upon memory location, or mailbox [59]. The application receives
notification of Bloom filter matches (i.e. conflicts) by periodically reading this mail-
box. In the common case of no conflicts, this check is very cheap as it consists of a
read that hits the processor’s L1 cache.
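The mailbox protocol can be sketched as follows (a software model with illustrative names; in the real system the FPGA's coherent store and the runtime's cached read replace the dictionary operations):

```python
# Agreed-upon memory location: the FPGA (producer) writes a conflict
# message here; the runtime (consumer) polls it periodically.
mailbox = {"conflict": False, "tid": None}

def fpga_post_conflict(tid):
    """Model of the FPGA storing a match notification via the coherent
    interface."""
    mailbox["tid"] = tid
    mailbox["conflict"] = True

def runtime_check_mailbox():
    """Poll the mailbox. Common case: no conflict, a single cheap read."""
    if not mailbox["conflict"]:
        return None
    mailbox["conflict"] = False
    return mailbox["tid"]
```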
Using out-of-core Bloom filters that communicate using the memory system allows
us to easily perform virtualization. The software runtime maintains the pool of Bloom
filters, explicitly managing the binding between software threads and hardware filters.
Issues such as interrupt handling, context switching, and thread migration are thus
transparent to the acceleration hardware. If the hardware were added to the processor
core, these issues would become much more complex and expensive, as the core would
be physically tied to a specific Bloom filter.
2.4.4 Algorithm Details
We propose two different transactional memory algorithms in this section: one using
global epochs (TMACC-GE) and one using local epochs (TMACC-LE). In both of
these schemes, a filter match represents a conflict that requires a transaction to abort,
and a pre-set mailbox is used to notify the STM runtime. Both schemes provide
privatization safety. Publication safety could be provided by constraining the commit
order as in an STM; we don’t expect TMACC to make this either easier or harder.
When using Bloom filters to perform conflict detection, an important decision is
what logical keys are put into the Bloom filter to designate a shared variable. This
decision determines the granularity at which conflicts are detected. In our systems, we
simply use the virtual address of the shared variable as the key (later referred to as a
reference). For structures and arrays, each unique word is a separate shared variable.
An object identifier or something similar could also be used as a reference.
To efficiently manage RAM resources on the FPGA, we use two slightly different instantiations of the Bloom filter design for TMACC-LE and TMACC-GE. TMACC-LE uses 24 filters: 8 for each of the read, write and missed sets. The 16 used for
the write and missed sets support copying. TMACC-GE uses a total of 40 filters: 8
for the read sets and 32 for the write sets, none of which support copying. An ASIC
implementation would not be constrained by the number of RAM blocks, and both
Algorithm 2 Pseudocode for the TMACC-GE runtime.
procedure WriteBarrier(tid, ptr, val)
    AddToWritebuffer(tid.wb, ptr, val)

procedure ReadBarrier(tid, ptr)
    HW AddToReadSet(tid, ptr, global epoch)
    if WritebufferContains(tid.wb, ptr) then
        return WritebufferLookup(tid.wb, ptr)
    WaitForFreeLock(ptr)
    return *ptr

procedure Commit(wb)
    AcquireLocksForWB(wb)
    epoch = global epoch
    if violation mailbox[wb.tid] == true then return failure
    for entry in wb do
        HW WriteNotification(wb.tid, entry.address, epoch)
    violated = HW AskToCommit(wb.tid)          ▷ Synchronous
    if violated then ReleaseLocks(); return failure
    for entry in wb do
        *(entry.address) = entry.specData
    AtomicIncrement(global epoch)
    ReleaseLocks()
    return success
TMACC-GE and TMACC-LE could use the same design [33].
Global Epochs
In the global epoch scheme, the Bloom filters are split into two banks. One bank
maintains the read set for each active transaction in the system. Each read set
holds the references read during the execution of the associated transaction. The
other bank contains filters which hold the write set for a given epoch; the write set
is composed of writes that were performed by any transaction during that epoch.
The Bloom filter tags are used to determine which Bloom filter in this bank corresponds to which epoch. When the filters receive a HW AddToReadSet, the reference is added to the transaction's read set and checked against the write set for the
given and all previous epochs. A conflict is signalled on any match, thus ensuring
HW AddToReadSet(tid, reference, epoch): Asynchronously adds reference to tid's read set and enables notification for any write that could possibly make this read inconsistent. Queries each write set that has an epoch number less than or equal to epoch for reference, triggering a conflict in tid if a match is found or if epoch is less than the epoch of the oldest write set.

HW WriteNotification(tid, reference, epoch): Asynchronously queries all read sets, except tid's, and triggers a conflict in any transaction whose read set includes reference. Adds reference to the write set for epoch epoch, clearing and replacing an old epoch's write set if necessary.

HW AskToCommit(tid): Synchronously processes all outstanding commands and returns the conflict status of tid.

Table 2.5: TMACC hardware functions used by TMACC-GE.
a match against any write that could have occurred prior to the read. When the
filters receive a HW WriteNotification, the reference is added to the given epoch’s
write set and checked against each transaction’s read set, ensuring that any read
that could possibly come after, or has come after, the associated write will signal a
conflict. In the case that there is not a filter currently associated with the epoch of
a HW WriteNotification, and the epoch is greater than the oldest epoch for which
a filter exists (i.e. this is a new epoch), the write set filter of the oldest epoch is
cleared and replaced with a new write set containing the address to be added (and
tagged with the new epoch number). If no filter exists for the epoch in either a
HW WriteNotification or a HW AddToReadSet, and the epoch is older than the oldest epoch for which a write set exists, then the command comes from an epoch that
is too old to have a filter and conservatively triggers a conflict. Since the ordering of
reads and writes within the same epoch cannot be determined, this scheme has the
effect of logically moving all reads to the end of the epoch in which they are performed
and all writes to the beginning. These operations are summarized in Table 2.5.
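The behavior summarized in Table 2.5 can be modeled in software as follows. Exact sets stand in for the Bloom filters, so this sketch has no false positives; it captures the structure of the operations, not the hardware.

```python
class TmaccGE:
    """Software model of the TMACC-GE filter banks (names follow Table 2.5)."""

    def __init__(self, num_threads, num_epoch_filters):
        self.read_sets = {t: set() for t in range(num_threads)}
        self.write_sets = {}                  # epoch -> set of references
        self.max_epochs = num_epoch_filters   # filters available for write sets
        self.conflict = {t: False for t in range(num_threads)}

    def add_to_read_set(self, tid, ref, epoch):
        self.read_sets[tid].add(ref)
        if self.write_sets and epoch < min(self.write_sets):
            self.conflict[tid] = True         # too old to check: conservative
            return
        # Check the given and all previous epochs' write sets.
        if any(ref in ws for e, ws in self.write_sets.items() if e <= epoch):
            self.conflict[tid] = True

    def write_notification(self, tid, ref, epoch):
        # Trigger a conflict in any other transaction that read ref.
        for t, rs in self.read_sets.items():
            if t != tid and ref in rs:
                self.conflict[t] = True
        if epoch not in self.write_sets:
            if self.write_sets and epoch < min(self.write_sets):
                self.conflict[tid] = True     # epoch too old to have a filter
                return
            # New epoch: recycle the oldest epoch's filter if necessary.
            if len(self.write_sets) >= self.max_epochs:
                del self.write_sets[min(self.write_sets)]
            self.write_sets[epoch] = set()
        self.write_sets[epoch].add(ref)

    def ask_to_commit(self, tid):
        return self.conflict[tid]
```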
Algorithm 2 gives high level pseudo-code for the algorithm used by the TMACC-
GE software runtime. Each read is instrumented to inform the Bloom filters of the
reference being read. Since the command is asynchronous, the only per-read-barrier cost of doing conflict detection is the cost of firing off the command to the FPGA. To commit the transaction, the runtime first acquires locks for each address in its write buffer, using a low-overhead striped locking technique similar to that of TL2 [34]. To ensure
that all of its writes are assigned to the same epoch, a local copy of the global epoch
counter is stored and used to inform the hardware of all the references that are about
to be committed. Locks are necessary to ensure that any readers of partially com-
mitted state perform the read in the same epoch as the commit. Without them, the
epoch could be incremented and a read of a partial commit performed in the following
epoch. This read would (incorrectly) not be flagged as a conflict. Once all of the locks
are obtained, the running transaction must synchronize with the filters to ensure that
it has not been violated up until the point the filters perform the HW AskToCommit
operation. If the transaction read a value that had been committed in the current
or any previous epoch, either the HW WriteNotification would have matched on the
read set and triggered a conflict, or the HW AddToReadSet would have matched
against one of the epoch’s write sets. Therefore, when the HW AskToCommit is
performed on the FPGA, the transaction’s read set is coherent and consistent if no
conflict has been seen by the FPGA. The transaction is then placed in the global or-
dering of transactions on the system and allowed to apply its write buffer to memory. Once the write buffer has been applied, the transaction atomically increments the
global epoch counter so that any thread that reads the newly committed value will
read it in the new epoch and not be violated. It then releases the locks and returns.
It is important to note that the locks used in TMACC-GE are simple mutex locks used only to ensure the atomicity of a commit, not the versioned locks used for conflict detection in TL2. TMACC-GE can thus use coarser-grain locking than TL2. We found that 2^16 locks is ideal for TMACC-GE, while TL2 performs best with 2^20.
Local Epochs
To perform conflict detection using local epochs, each transaction is assigned three
filters: a read set, a write set, and a missed set. As before, the read set main-
tains the references read during the transaction. The write set holds references that
CHAPTER 2. TIGHTLY COUPLED ACCELERATION 44
Algorithm 3 Pseudocode for the TMACC-LE runtime.

procedure WriteBarrier(tid, ptr, val)
    AddToWritebuffer(tid.wb, ptr, val)

procedure ReadBarrier(tid, ptr)
    HW AddToReadSet(tid, ptr)
    if WritebufferContains(tid.wb, ptr) then
        return WritebufferLookup(tid.wb, ptr)
    if TimeForNewLocalEpoch() then
        HW ClearMissedSet(tid); mfence
    return *ptr

procedure Commit(wb)
    for entry in wb do
        HW WriteNotification(wb.tid, entry.address)
    violated = HW AskToCommit(wb.tid)    ▷ Synchronous
    if violated then return failure
    for entry in wb do
        *(entry.address) = entry.specData
    HW ClearWriteSet(wb.tid)
    return success
are currently being committed by a transaction, and the missed set holds references
committed by any other transaction during the local epoch. When a filter receives
a HW AddToReadSet, the reference is checked against all other transactions’ write
sets and the reading transaction’s missed set, ensuring that any write that could have
occurred before the associated read (i.e. in the current local epoch) will trigger a con-
flict. A HW WriteNotification causes the reference to be added to the transaction’s
write set and checked against all other transactions’ read sets, ensuring a conflict
will be triggered for any read that could have potentially seen the result of the cor-
responding write. The written reference is also added to the transaction’s read set,
preventing write-write conflicts, which would cause a race during write buffer application.
Finally, HW ClearWriteSet first copies (merges) the write set into all other missed
sets and then clears the write set. This allows each transaction to independently
decide when it no longer needs to consider missed writes as potentially conflicting.
The transaction does this with HW ClearMissedSet which clears its own missed set,
Function                          Description
HW AddToReadSet(tid, reference)   Asynchronously adds reference to tid's read set, and
                                  enables notification for any write that could possibly
                                  make this read inconsistent. Queries tid's missed set
                                  and the write set of every other transaction for
                                  reference, triggering a conflict in tid on a match.
HW WriteNotification(tid,         Asynchronously queries all read sets except tid's,
  reference, epoch)               triggering a conflict in transactions whose read set
                                  includes reference. Adds reference to tid's read set
                                  and to epoch's write set.
HW ClearMissedSet(tid)            Asynchronously clears tid's missed set, moving this
                                  transaction to a new local epoch.
HW ClearWriteSet(tid)             Asynchronously copies the content of tid's write set
                                  into every other transaction's missed set, then clears
                                  the write set.
HW AskToCommit(tid)               Synchronously processes all outstanding commands and
                                  returns the conflict status of tid. Clears tid's read
                                  and missed sets in preparation for a new transaction.

Table 2.6: TMACC hardware functions used by TMACC-LE.
effectively moving it into a new local epoch. HW WriteNotification could add
references directly to the other transactions’ missed sets, but having the intermediate
step of using the local write set allows the transaction to abort a commit without
polluting the other missed sets. These operations are summarized in Table 2.6.
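These filter operations can be modeled in software. Below is a simplified, hypothetical C model of the operations in Table 2.6, specialized to local epochs (the epoch argument is dropped). The filter size, the single hash function, and the fixed transaction count are illustrative only; the FPGA processes the same commands asynchronously with multiple hash functions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative software model of the TMACC-LE filter operations (Table 2.6).
 * Sizes and the hash function are placeholders, not the FPGA design. */

#define FILTER_BITS 1024
#define NUM_TX 8

typedef struct { uint64_t bits[FILTER_BITS / 64]; } filter_t;

typedef struct {
    filter_t read_set, write_set, missed_set;
    bool conflicted;
} tx_t;

tx_t tx[NUM_TX];

static void f_set(filter_t *f, uintptr_t ref) {
    unsigned h = (unsigned)((ref * 0x9E3779B97F4A7C15ull) >> 54);  /* one hash for brevity */
    f->bits[h / 64] |= 1ull << (h % 64);
}
static bool f_test(const filter_t *f, uintptr_t ref) {
    unsigned h = (unsigned)((ref * 0x9E3779B97F4A7C15ull) >> 54);
    return (f->bits[h / 64] >> (h % 64)) & 1;
}
static void f_merge(filter_t *dst, const filter_t *src) {
    for (int i = 0; i < FILTER_BITS / 64; i++) dst->bits[i] |= src->bits[i];
}

void hw_add_to_read_set(int tid, uintptr_t ref) {
    f_set(&tx[tid].read_set, ref);
    if (f_test(&tx[tid].missed_set, ref)) tx[tid].conflicted = true;
    for (int t = 0; t < NUM_TX; t++)      /* check every other write set */
        if (t != tid && f_test(&tx[t].write_set, ref)) tx[tid].conflicted = true;
}

void hw_write_notification(int tid, uintptr_t ref) {
    f_set(&tx[tid].write_set, ref);
    f_set(&tx[tid].read_set, ref);        /* catches write-write conflicts */
    for (int t = 0; t < NUM_TX; t++)      /* violate any reader of ref */
        if (t != tid && f_test(&tx[t].read_set, ref)) tx[t].conflicted = true;
}

void hw_clear_write_set(int tid) {        /* merge into missed sets, then clear */
    for (int t = 0; t < NUM_TX; t++)
        if (t != tid) f_merge(&tx[t].missed_set, &tx[tid].write_set);
    memset(&tx[tid].write_set, 0, sizeof(filter_t));
}

void hw_clear_missed_set(int tid) {       /* start a new local epoch for tid */
    memset(&tx[tid].missed_set, 0, sizeof(filter_t));
}
```

For example, after transaction 0 commits a write to some address and merges its write set, a later `hw_add_to_read_set` of that address by transaction 1 hits transaction 1's missed set and flags a conflict, while a transaction that has since cleared its missed set reads fresh addresses without one.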
Algorithm 3 gives high level pseudo-code for the algorithm used by the TMACC-
LE software runtime. The main difference in this software runtime, as compared to
TMACC-GE, is the absence of locks during commit. Locks are not needed when
using local epochs because the missed sets cause all of the writes performed during a
commit to be logically moved to the beginning of an epoch defined locally for each
transaction, not globally. Therefore, each transaction individually ensures that any
of its own reads of a partial commit will signal a conflict, an effort that won’t be
frustrated by the update of a global epoch outside of the transaction’s control.
In the local epoch scheme, an epoch is implicitly defined by what writes are con-
tained in the transaction’s missed set filter; thus no explicit local epoch counter is
needed. In addition to firing a HW AddToReadSet and locating the correct version
of the datum, read barriers may choose to begin a new local epoch by sending a
HW ClearMissedSet command. A memory fence is then used to ensure that any
subsequent read (and its corresponding HW AddToReadSet) must wait until the
HW ClearMissedSet is complete and a new missed set has begun to collect writes
performed in the new epoch. This eliminates the possibility that a conflicting read is
performed during a local epoch update and the conflict lost. Periodically incrementing
the local epoch is not necessary for correct operation but reduces the number of false
conflicts and is especially important in applications using long-running transactions.
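The fence requirement above can be sketched in C11. This is a minimal, hypothetical read barrier in which the runtime helpers are trivial stubs (a real implementation would enqueue commands to the FPGA); only the ordering between the missed-set clear and subsequent reads is the point.

```c
#include <stdatomic.h>

/* Sketch of the read-barrier epoch bump described above.  The helper
 * functions are trivial stubs standing in for the TMACC runtime and its
 * FPGA command queue. */

static int epoch_countdown = 3;   /* start a new local epoch every 3 reads */
static int clears_issued = 0;

static void hw_clear_missed_set(int tid) { (void)tid; clears_issued++; }
static void hw_add_to_read_set(int tid, const int *ptr) { (void)tid; (void)ptr; }
static int  time_for_new_local_epoch(void) {
    if (--epoch_countdown == 0) { epoch_countdown = 3; return 1; }
    return 0;
}

int tm_read(int tid, const int *ptr)
{
    if (time_for_new_local_epoch()) {
        hw_clear_missed_set(tid);
        /* Full fence: no later read (and its HW AddToReadSet) may issue until
         * the cleared missed set is collecting writes from the new epoch;
         * otherwise a conflicting write could slip through unnoticed. */
        atomic_thread_fence(memory_order_seq_cst);
    }
    hw_add_to_read_set(tid, ptr);
    return *ptr;
}
```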
2.4.5 Performance Evaluation
In this section, we present the performance and analysis of the TMACC-GE and -
LE architectures implemented on FARM. We present the performance results in two
parts. First, we present results from a microbenchmark that is used to explore the
full range of TM application parameters. These results characterize the range of per-
formance we might expect from TM applications and can be used to understand the
performance results from complete applications. Second, we present results of full
applications from the STAMP benchmark suite [18]. We show where the STAMP
applications fit into the design space as characterized by the microbenchmark param-
eters and how these parameters explain the performance results. Finally, we project
the performance of an ASIC TMACC implementation.
Microbenchmark Analysis
In order to characterize the performance of TMACC-LE and TMACC-GE, we used
an early version of EigenBench [41] which is a simple synthetic microbenchmark spe-
cially devised for TM system evaluation. This microbenchmark has two major advan-
tages over a benchmark suite composed of complex applications. First, transactional
memory is a complex system whose performance is affected by several application
Algorithm 4 Pseudocode for microbenchmark.

static int gArray1[A1];
static int gArray2[A2];

procedure uBench(A1, A2, R, W, T, C, N, tid)
    probrd = R/(R+W);
    for t = 1 to T do
        TM BEGIN();
        for j = 1 to (R+W) do
            do read = (random(0,1) < probrd) ? true : false;
            addr1 = random(0, A1/N) + tid*A1/N;
                ▷ addr1 does not conflict with others
            if do read then
                TM READ(gArray1[addr1]);
            else
                TM WRITE(gArray1[addr1]);
            if C == true then
                addr2 = random(0, A2);
                    ▷ addr2 possibly conflicts with others
                if do read then
                    TM READ(gArray2[addr2]);
                else
                    TM WRITE(gArray2[addr2]);
        TM END();
parameters. The microbenchmark makes it simple to isolate the impact of each pa-
rameter, independently from the others. Second, a microbenchmark allows us to get
a theoretical upper bound on the best possible performance given a set of parameters.
We arrive at this bound by simply executing a multi-threaded trial run without the
protection of transactional memory or locking. Doing this with a real application
would almost certainly produce incorrect results. We call this unattainably good
performance the “unprotected” version.
Algorithm 4 shows the pseudocode for the microbenchmark. The algorithm, at the
core, is nothing more than multiple threads executing a random set of array accesses.
Several parameters are necessary: A1 and A2 are the sizes of two arrays, the first
a partitioned array for non-conflicting accesses, the second a smaller shared array
for conflicting accesses; R and W are, respectively, the average number of reads and
[Figure 2.12 appears here. Panels: (a) impact of working-set size (speedup and % of transactions violated vs. size of Array 1 in MB), (b) impact of transaction size (vs. number of reads), (c) impact of true conflicts (vs. size of Array 2 in KB, log scale), (d) impact of write-set size (vs. number of writes), and (e) impact of number of threads (medium and short transactions, 1–8 threads). Each panel plots Unprotected, TMACC-LE, TMACC-GE, and TL2, with violation rates for the latter three.]

Figure 2.12: Microbenchmark performance for various parameter sets. Speedup is shown for 8 threads (except in (e)).
label     working-set (a)  transaction (b)  true conflicts (c)  write-set (d)  threads med (e)  threads small (e)
A1 (MB)   0.5 ~ 64         64               64                  64             64               64
A2        -                -                256 ~ 16,384        -              -                -
R         80               10 ~ 400         40                  80             80               4
W         4                max(1, R*0.05)   2                   1 ~ 128        4                1
C         false            false            true                false          false            false
N         8                8                8                   8              1 ~ 8            1 ~ 8

Table 2.7: Parameter sets used in the microbenchmark evaluation. The labels here match those used in Figure 2.12.
writes, per transaction; T is the number of transactions executed per thread; N is the
number of threads; and C is a flag determining whether or not conflicting accesses
should be performed. Note that if C is unset, there should be no violations since
every thread only accesses its partition of the array. If C is set, then the shared A2
array is accessed in addition to the normal accesses to A1, decoupling the working
set size and the read/write ratio from the probability of violation.
We now use the microbenchmark to evaluate the performance of our two TMACC
systems across several different variables. Table 2.7 shows the parameter sets used
in the study, and the performance results are displayed in Figure 2.12. All graphs
in this section show both speedup relative to sequential execution with no locking or
transactional overhead (solid lines) and the percentage of started transactions that
were violated (dotted lines). In all graphs except for (e), speedup is shown for 8
threads.
Throughout our analysis, the baseline STM for comparison is TL2 [34], which is
generally regarded as a high-performing, modern STM implementation that is largely
immune to performance pathologies. We use the basic GV4 versioned locks in TL2,
the default in the STAMP distribution [76]. We use TL2 because its algorithms for
version management and conflict detection are the closest match to the TMACC
algorithms, allowing for the best indication of the speedup achieved using the hard-
ware. SwissTM [35] is the highest performing STM of which the authors are aware
and provides 1.1 to 1.3 times the performance of TL2 on the STAMP applications
presented here. We also present the best possible performance using the aforemen-
tioned “unprotected” method as an upper bound. Note that this is truly an upper
bound and usually unattainable because it will produce incorrect results in the face
of any conflicts. Throughout the analyses of results, TMACC-GE and TMACC-LE
represent the schemes described in Section 2.4.4.
Graph (a) shows the impact of working set size on TM systems. The prominent
knee in the performance of each system corresponds to the working set size outgrowing
the on-chip cache. Below the knee, where all user data and TM metadata fit on-chip,
TL2 is spared from off-chip accesses and outperforms the TMACC systems, which
must still pay the costly round-trip communication with the FPGA. This effect would
be heavily mitigated with faster (or closer) hardware, and it is certainly rare for the
working set of real parallel workloads to fit in the on-chip cache.
Above the knee, we observe that both TMACC-GE and TMACC-LE significantly
outperform TL2, around 1.35x and 1.75x respectively, approaching the upper bound
of 1.95x. In this region, TL2’s performance suffers because its extra metadata causes
significant cache pressure. Specifically, TL2 relies on its metadata for conflict detec-
tion, so its metadata grows proportionally to a transaction’s read set. This indicates
that much of the overhead imposed by TL2 is not in the addition of a few instructions
to the instruction stream, but in the cache misses related to the metadata. When
everything fits into the cache, TL2 doesn’t add much overhead. It is when there
is cache pressure that the overhead becomes significant. TMACC-GE, on the other
hand, uses metadata only for commit, so its metadata grows with a transaction’s
write set, which is almost always smaller than its read set.
Graph (b) explores the impact of transaction size on speedup and violation rate.
In this graph, we see a well-defined difference in speedup among the systems. In
the flat region in the middle, the speedup of each system is nearly identical to the
speedup of large working sets in graph (a). In this region, the speedup is bounded
by the available memory bandwidth, which explains why the unprotected execution
isn’t able to achieve a full 8x improvement. For small transactions, TMACC-GE’s
speedup diminishes because the relative cost of the FPGA round trip latency and
global epoch management grows as transaction size decreases. We will take a closer
look at short transactions in graph (e). For large transactions, the performance of
TMACC-LE drops because the lack of ordering information in local epochs causes
the missed sets to become polluted and emit more false positives. This is one case
where global epochs are preferred over local epochs.
Graph (c) depicts the impact of varying the probability of violations by turning
on C and varying the size of A2 in our microbenchmark. Note that the graph uses
semi-log axes. With a small A2, there are many violations and transactional retries
dominate performance, making the conflict detection overhead less important. As
A2 grows, contention decreases and the conflict detection overhead becomes more
important, explaining the expanding performance gap between TMACC-LE, with its
low-overhead conflict detection, and the others.
Graph (d) explores the impact of write set size, and again it is not surprising
that the false positive rate of TMACC-LE becomes non-trivial due to the inherent
pessimism in the local epoch scheme. However, these false positives are not enough
to outweigh the performance advantage of low-overhead conflict detection.
Interestingly, TMACC-GE also shows diminishing speedup as write-set size in-
creases. On closer inspection, we found that this degradation is due to the cache line
migration of locks between the two CPU sockets during commit. As explained in
Section 2.4.4, TL2 uses more locks than TMACC-GE so it is not as sensitive to this
issue. Increasing the number of locks used by TMACC-GE diminishes the effect, but
reduces overall performance. Having the FPGA participate in the coherence fabric
significantly increases the last level cache miss penalty for all processors. This is a
prominent factor in the TMACC-GE results, and experiments in Section 2.4.5 show
that moving to an ASIC implementation would largely eliminate the performance
degradation of TMACC-GE seen here.
Graph (e) examines the impact of number of threads using both medium-sized
transactions and small-sized transactions. Overall, the systems show worse perfor-
mance for small-sized transactions because they all pay a constant overhead per trans-
action, which is not easily amortized by short transactions. With the long commu-
nication delay to the FPGA, TMACC-GE and TMACC-LE are unable to achieve
better performance than TL2 for short transactions running on 2 or 4 threads. While
the FARM system limits us to 8 threads, scalability to many more threads can be
achieved using multiple FPGAs. This scheme would require communication between
Name           Input parameters
vacation-low   n2 q90 u98 r1048576 t4194304
vacation-high  n4 q60 u90 r1048576 t4194304
genome         g16384 s64 n16777216
kmeans-low     m256 n256 65536-d32-c16.txt
kmeans-high    m40 n40 65536-d32-c16.txt
ssca2          s20 i1.0 u1.0 l3 p3
labyrinth      x512-y512-z7-n512.txt

Table 2.8: STAMP benchmark input parameters.
Name           RD/tx   WR/tx  CPU cycles/tx  Memory usage (MB)  Conflicts
vacation-low   220.9   5.5    37740          573                very low
vacation-high  302.14  8.5    37642          573                low
genome         55.8    1.9    48836          1932               low
kmeans-low     25      25     690            16                 high
kmeans-high    25      25     680            16                 low
ssca2          1       2      2360           1320               very low
labyrinth      180     177    6.1 × 10^9     32                 high

Table 2.9: STAMP benchmark application characteristics.
the FPGAs and is left for future work.
The dramatic drop in TL2 performance for short transactions at 8 threads is the
result of moving from a single chip to two chips and the large miss penalty described
above. Taking the FPGA out of the system eliminates this drop in performance as
shown in Section 2.4.5. We note that this poor TL2 performance on FARM is only
present when transactions are very short.
To summarize, we see that TMACC provides significant acceleration of transac-
tional memory except when transactions are too short to amortize the extra overhead
imposed by communicating with the Bloom filters. We also find that in the case of
TM acceleration, global epochs only perform better than local epochs when a large
number of shared reads and writes are performed in a relatively short running trans-
action. In this case, the lack of ordering information is a larger factor in system
performance.
Performance Evaluation using STAMP
In this section, we evaluate the performance of TMACC on FARM using STAMP [18],
a transactional memory benchmark suite composed of several applications which vary
in data set size, memory access patterns, and size of transactions. Intruder, bayes,
and yada from the STAMP suite did not execute correctly in the 64-bit environment
of FARM (even using TL2) due to bugs in the STAMP code and have been omitted
from the study. Bayes’s and yada’s long transactions with a high violation rate are
similar to those in labyrinth, and intruder’s short transactions are similar to those
in kmeans-high. Thus, the absence of these apps does not significantly reduce the
coverage of the suite. Table 2.8 summarizes the input parameters and Table 2.9 the
key characteristics of each application. Cycles per transaction were measured during
single-threaded execution with no read and write barriers. We can roughly group
the applications into two sets by transaction size: vacation, genome, and labyrinth
have larger transactions while ssca2 and kmeans use smaller transactions. Kmeans
has large amounts of spatial locality in its data access and thus uses fewer cycles per
transaction despite having more shared reads and writes.
For this analysis, we include RingSTM [75]. This STM system uses a similar
[Figure 2.13 appears here: one panel per application (vacation-low, vacation-high, genome, kmeans-low, kmeans-high, ssca2, labyrinth), each plotting speedup and % of transactions violated vs. number of threads (1–8) for Unprotected, TMACC-LE, TMACC-GE, TL2, and RingSTM.]

Figure 2.13: STAMP performance on the FARM prototype.
approach to accelerating transactional barriers as TMACC, but the Bloom filters
are implemented in software rather than hardware. Like TMACC but unlike TL2,
RingSTM provides privatization safety. Our RingSTM implementation is based on
the latest open-source version [74] and uses the single-writer algorithm. To provide
a better comparison to TL2 and our TMACC variants, this implementation uses the
write buffer implementation from TL2 instead of the hash table typically used in
RingSTM. In our experiments, the ring is configured to have 1024 entries, where each
entry is a 1024-bit filter.
Figure 2.13 shows performance results from executing the STAMP applications on
the FARM prototype. In this graph, we present speedups for 1, 2, 4, and 8 threads and the
percentage of started transactions that were violated. At first glance, we see that the
general trends we saw in the microbenchmark are present in the STAMP applications;
TMACC performs well with large transactions but is unable to provide acceleration
to small transactions. We also provide the unprotected execution time, using the
same method we used in Section 2.4.5. As before, the result of such execution is
incorrect and serves as a strict upper bound. As expected, not all applications were
able to run unprotected; some would crash or fall into infinite loops.
For vacation-high, vacation-low, and genome, the common characteristics are a rel-
atively large number of reads per transaction, small number of writes per transaction,
and small number of conflicts (see Table 2.9 for exact values). Commit overhead is
low due to the small write set and minimal time wasted retrying transactions because
of the small number of conflicts. Also, constant overheads such as register check-
pointing are amortized over the long running length. Thus, in these large-transaction
applications, the numerous reads make the barrier overhead the dominant factor in-
fluencing performance of the TM system. We saw this effect in Figure 2.12.(b). This
graph uses a microbenchmark parameter set which corresponds to the characteristics
of these applications, and we see a very similar spread in performance results for the
large-transaction STAMP applications. Performance gain with respect to TL2 for
these applications averages 1.36x for TMACC-GE and 1.69x for TMACC-LE. Unpro-
tected execution provides an average speedup of 2.18x. Note that for vacation-high
running on TMACC-LE, while the number of reads is about 300, the drop shown in
Figure 2.12.(b) does not happen because vacation-high does not have as many writes
as the microbenchmark used in that graph.
The TMACC systems perform similarly to RingSTM at low thread counts but do
not su↵er from the drop in performance at higher thread counts like RingSTM. The
drop in performance at higher thread counts seen in RingSTM arises because it is
unable to quickly check individual reads against write set filters like TMACC is able
to do. It instead checks read set filters against write set filters, and this filter to filter
comparison has a much higher probability of false positives, leading to very high false
conflict rates and significantly degrading performance.
Kmeans-low features a relatively small number of reads, large number of writes,
and small number of conflicts. From Figure 2.12.(b), we can expect that a small num-
ber of reads will diminish the performance gap between TL2 and TMACC. We also see
in Figure 2.12.(d) that the large number of writes will further diminish TMACC-GE’s
performance. The combined e↵ect explains what we see for kmeans-low in Figure 2.13
where for 8 threads TMACC-LE shows a 9% acceleration over TL2 but TMACC-GE
is 5% slower. We also see in Table 2.9 that the kmeans application spends very little
time inside transactions, with few reads and writes per transaction. This explains
the superior scalability of kmeans-low and means that there is very little time spent
in the read and write barriers, leaving very little computation to be accelerated.
Even though kmeans-high has very similar characteristics to kmeans-low except
for the number of conflicts, the large number of violations in kmeans-high overshadows
any other e↵ects and limits the speedup of all three systems to a mere 1.3x with 8
threads. This situation is captured in Figure 2.12.(c) where the performance of the
three systems converges as the rate of violation increases. As in kmeans-low, the small
transactions make it difficult to amortize the communication overheads of TMACC
and it is not able to achieve any speedup over TL2. Both TMACC systems were
additionally undermined by an even larger number of violations than TL2, which is
interesting because Figure 2.12.(c) shows the TMACC systems having fewer violations
in the face of true conflicts. We suspect this is a result of TL2’s versioned locks giving
more importance to the lower bits of the address in performing conflict detection. This
causes TL2 to have fewer false positives when addresses are close together, as they
[Figure 2.14 appears here: a bar chart of execution time for TL2, RingSTM, TMACC-LE, and TMACC-GE on vacation-low, vacation-high, genome, kmeans-low, kmeans-high, ssca2, labyrinth, and their average.]

Figure 2.14: Single threaded execution time relative to sequential execution.
are in kmeans-high. The single-writer variant of RingSTM we use is not able to scale
because of the large number of writes in both kmeans-low and kmeans-high, even
though its violation rate is comparable to the other systems.
Like kmeans-low and kmeans-high, TMACC performance on ssca2 is bound by
communication latency. The characteristics of ssca2 are well captured by the mi-
crobenchmark parameter set used to produce the short transactions graph in Fig-
ure 2.12.(e) which mirrors the ssca2 speedup graph in Figure 2.13. Refer to the
discussion of graph (e) in Section 2.4.5 for an explanation of the results. RingSTM
violates 2.5% of transactions when running 8 threads while the others violate less
than 0.01%. ssca2 has such a large number of transactions that even a 2.5% violation
rate adds significant overhead.
Labyrinth is a special case. As seen in Table 2.9, this application has a very
large number of computational cycles inside each transaction. The execution time is
therefore decided by non-deterministic execution paths and the number of violated
transactions rather than TM overhead. In Figure 2.12.(c) we saw that, in general,
TMACC-GE has fewer false positives than the other systems. So in labyrinth with 8
threads, the TMACC-GE system minimized the number of violations and performed
well. For labyrinth’s long-running transactions, the periodic intra-transaction incre-
ment of the TMACC-LE local epoch was especially important.
Finally, Figure 2.14 highlights the single thread overhead of the systems using
the single threaded execution time relative to sequential execution time. We see that
TMACC and RingSTM have less overhead than TL2 running vacation because of
the frequent barriers. As transactions get smaller in applications like kmeans and
ssca2, commit time becomes more important and the TMACC systems suffer, while
RingSTM continues to do well. Note that TMACC-GE consistently has more over-
head than TMACC-LE because of the extra time required to (unnecessarily) obtain
the locks during commit. With few barriers and very long transactions, labyrinth has
almost no overhead in any of the systems.
Performance Projection for TMACC ASIC
In the previous sections, we have observed a few artificial effects caused by the large
cache miss penalty in the FARM system. Since both TMACC and TL2 witness
performance degradation due to these issues, an interesting question is whether the
conclusions drawn thus far would still be valid in a system free of these latency
anomalies, such as an off-chip ASIC or part of the uncore on a chip. The acceleration
hardware as presented does not require a high clock frequency and would occupy
a small silicon footprint in modern processes. Thus in this section, we modify our
system to project the performance of TMACC onto the design point of an off-chip
ASIC. This could be either a stand-alone chip, or part of the system’s north bridge
or memory controller, for example. The performance of an on-chip TM accelerator
would be even better, since it has a shorter round-trip latency. An ASIC or on-chip
implementation would also support larger Bloom filters, enabling larger transactions
without higher false violation rates.
To simulate the performance of an ASIC TMACC implementation, we first detach
the FPGA from the system, eliminating the FPGA-induced snoop latency witnessed
by all coherent nodes on every cache miss. Then, we replace FPGA-communication
software routines with idle loops in which we control the number of iterations to
simulate different desired communication latencies. In addition, we change the conflict
detection to report a conflict randomly with a given probability. We keep all the STM
overheads but simulate hardware latency. This modified system is a performance
simulator; like the unprotected version it does not provide serializable execution, but
WR
14 | 3  3  3  3  3  3  3
12 | 3  3  3  3  3  3  3      1 = TL2 performs better by more than 3%
10 | 3  3  3  3  3  3  3      2 = Two schemes show similar performance
 8 | 2  2  3  3  3  3  3      3 = TMACC-GE performs better by more than 3%
 6 | 1  2  2  3  3  3  3
 4 | 1  1  2  2  2  2  3
 2 | 1  1  1  1  2  3  3
   +---------------------
     2  4  6  8 10 12 14  RD

Figure 2.15: Performance comparison of TMACC-GE (ASIC) and TL2 for short transactions.
can serve as a good indicator of real performance.
In order to closely model the off-chip ASIC configuration, we had to determine a
value to use as the communication latency to the ASIC. We propose that last-level
cache miss latency is a good estimate for this number, the rationale being that the
ASIC is about as “far” away from the processor as DRAM. We therefore measured
the off-chip cache miss latency on this new system (without the FPGA attached) and
used this value as the communication latency. For each run, we used the measured
violation percentage from the equivalent run on FARM as the probability of violation
in the projected run.
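This methodology can be sketched as a pair of stubs: one replaces each FPGA round trip with a calibrated idle loop, and the other replaces conflict detection with a coin flip at the measured violation rate. The iteration count and helper names below are illustrative placeholders, not the actual calibration.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Sketch of the ASIC performance-projection stubs: FPGA communication is
 * replaced by a calibrated idle loop, and conflict detection by a coin flip
 * with the violation probability measured on FARM.  The calibration value
 * below is a placeholder, not a measured number. */

static volatile unsigned long spin_sink;     /* defeats loop elimination */
static unsigned long iters_per_miss = 1000;  /* calibrated so one call takes
                                                about one last-level miss */

/* Stand-in for a round trip to the (simulated) ASIC. */
void simulate_comm_latency(void) {
    for (unsigned long i = 0; i < iters_per_miss; i++)
        spin_sink += i;
}

/* Stand-in for HW AskToCommit: report a conflict with probability p_violate,
 * taken from the measured violation rate of the equivalent FARM run. */
bool simulated_ask_to_commit(double p_violate) {
    simulate_comm_latency();                 /* still pay the round trip */
    return (double)rand() / ((double)RAND_MAX + 1.0) < p_violate;
}
```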
For the projection study, we repeated the microbenchmark experiments performed
in Section 2.4.5 using these techniques. In general, we found the trends and
conclusions are the same as those presented in Section 2.4.5, except where we
explicitly noted otherwise in the discussions of graphs (d) and (e) of Figure 2.12.
The results for these experiments are shown in Figure 2.16.
A common trend seen in all the experiments is that the performance of TMACC-GE
now comes closer to the unprotected bound, since the ASIC design point significantly
reduces the cache-line migration latency, and thus the overhead of global epoch
management. As noted in the discussion of graph (d) in Section 2.4.5, the dramatic
performance degradation of TMACC-GE as the write set grows disappears with the
reduced cache miss penalty of an ASIC implementation.

[Figure 2.16 appears here: projected results for panels (d) impact of write-set size and (e) impact of number of threads, plotting speedup (0–8) and % violation against number of writes (0–150) and number of threads (1–8, medium and short transactions).]

Figure 2.16: Projected microbenchmark performance with TMACC ASIC.

Also, the performance of TL2
with small transactions no longer drops dramatically when moving to a dual socket
configuration. Both TMACC systems also performed better than before for short
transactions; TMACC-LE outperforms TL2 on 8 threads by 9% now, but TMACC-
GE still falls 5% short of TL2 performance.
To determine the point where TMACC-GE begins to outperform TL2, we repeated
the short transaction experiment from Figure 2.12.(e), sweeping the number of reads
and writes from 2 to 14; the result is presented as a schmoo plot in Figure 2.15. When
there are more than 8 reads or writes, TMACC-GE is able to match the performance
of TL2. When there are more than 12, there are enough accelerated barriers to
compensate for the extra cost of communication, and TMACC-GE outperforms TL2.
TMACC-LE outperformed TL2 for all of these points. The inability of TMACC to
accelerate very small transactions suggests that TMACC would compliment a system
that targets small transactions, such as a best-e↵ort HTM that uses a processor’s
write bu↵er to store speculative data and falls back to using TMACC for larger
transactions.
These results indicate that the TMACC hardware would best function as an ASIC,
located around the same "distance" from the processor as main memory. One
important advantage of this ASIC design point is that it requires little modification
in an SMP environment: multiple CPUs would simply share the same hardware. Another
interesting design point for a CMP would be to place the hardware out-of-core, but on
[Figure: speedup (0 to 8) of TL2, TMACC-GE, TMACC-LE, and Unprotected at 1, 2, 4, and 8 threads for Vacation-High, SSCA2, Kmeans-High, and Kmeans-Low.]
Figure 2.17: Projection of STAMP performance with TMACC ASIC
the same die, perhaps integrating the Bloom filters into an on-chip memory controller.
Such generic Bloom filters would not necessarily be dedicated to TM acceleration and
could be utilized by other, non-TM applications.
For the STAMP projection study, we chose four representative applications from
the suite. Vacation-high represents applications with large transactions, while ssca2
represents those with small transactions. Kmeans-high covers applications with a large
number of violations, while kmeans-low covers those with a large write set. Figure 2.17 shows the
results. As in the results from the microbenchmark projection, the absolute perfor-
mance improves across the board, while the performance gap between the TMACC
systems and TL2 is still as large as we saw in Figure 2.13.
The speedup results in vacation-high are very close to those of STAMP on the
Sirius platform. This confirms that the large coherence penalties imposed by the
FPGA on Sirius did not play a large role in determining the accelerator speedup with
respect to TL2. For ssca2, TL2 showed a decrease in speedup at 8 threads when
run on the Sirius platform. The ASIC projection alleviates the large cache migration
penalty, and we thus see TL2 scaling as expected. This mirrors the improvement
we saw in Figure 2.16.(e) as compared with Figure 2.12.(e). Note that even with an
ASIC, we are unable to amortize the overhead of short transactions, and absolute
speedup remains relatively poor for all systems.
Since true violations are the dominant factor in kmeans-high performance, TL2
and TMACC-GE show very similar performance. TMACC-LE's performance begins to
diminish at 8 threads because of the large number of violations. For kmeans-low, we saw in
Figure 2.13 that the advantage of the TMACC systems over TL2 was minimal. For
the ASIC projection, the lower latency allows the hardware accelerator to amortize
the overhead of smaller transactions. The transaction sizes in kmeans-low lie near
this boundary, so both TMACC systems now see much more speedup (up to 15%)
relative to TL2.
2.4.6 Comparison with Simulation
We now briefly contrast our experiences and results with hardware to our early
exploratory work done using software simulation. Considerable effort went into making
our simulations "cycle accurate", and our performance predictions for SigTM
and TL2, presented in Figure 2.10, roughly matched the results presented in the
corresponding papers. Initial results from the actual hardware, however, were quite
different from those the simulator had predicted. One main reason for the discrepancy
was the difference between the simulated and actual CPUs. The simplistic CPU
model used in simulation (in-order with one non-memory instruction per cycle) drastically
overstated the importance of reducing the instruction count in the transactional
read and write barriers. Modern processors, such as those in FARM, are much more
tolerant of extra instructions in barriers, reducing the benefit of eliminating those
instructions.
Another primary source of inaccuracy arose from the fact that our simulated
interconnect did not model variable latency and command reordering. The presence
of these in a real system led us to develop the global and local epoch schemes presented
in this thesis and thus significantly impacted the performance of the system.
In addition, our simulator assumed the processors were capable of performing true
"fire-and-forget" stores with weak consistency without affecting the execution of the
core. We therefore did not model the write combining buffer and its effect on system
performance. Finally, the smaller data sets used to run simulation in a reasonable
time frame affected the system performance very differently than a real workload, in
terms of bandwidth consumption, caching effects, and TLB pressure.
Even though we could have performed a more accurate simulation, and we eventually
approached our desired performance using a modified design, we believe our
experiences provide a strong example of the importance of building actual hardware
prototypes. Although developing and verifying hardware requires more time and
effort than using a simulator, hardware is essential to accurately gauge
the performance of proposed architectural improvements and to bring out the many
issues one might encounter in actually implementing the idea. Having a hardware
implementation is also strong evidence of the correctness and validity of a system.
2.5 Other Applications
We now briefly explore what other applications could be efficiently accelerated using
fine-grained acceleration. Applications, such as transactional memory, that require a
small amount of computation on each memory access are prime candidates for such
acceleration. Examples include bug detection, such as data race detection [84] or
array bounds checking, and runtime profiling [85]. Coherent access to the CPU's
cache can simplify the design of previous intelligent I/O devices [59]. A system such
as FARM could also be used to prototype intelligent memory systems for performance
[42] or for security [47]. Such a system would extend the memory controller described
in Section 2.1.3 with the required intelligence using additional information available
through the coherent interface.
In addition, coherent FPGAs can help prototype advanced coherence protocols. For
example, one could prototype directory structures like [4], or snoop filtering techniques
like [55]. Note that such extensions of the underlying broadcast coherence protocol
(cHT) were proposed in the original design [44], but actual implementations have
been rare.
Chapter 3
Loosely Coupled Acceleration
In this chapter, we turn to the other broad class of domain specific accelerators:
those that do not have a tight coupling with the rest of the system and operate fairly
independently. These accelerators are characterized by their infrequent communication
with the general purpose processors in the system. Infrequent communication
implies that coarse grained accelerators will work autonomously on a chunk of data
for a long period of time, performing a large amount of computation. The data being
processed may, but need not, be a large amount of data.
One decision that must be made is how the data is transferred from the general
purpose processor to the accelerator and how results are transferred back. This
decision is determined in large part by the placement of the accelerator in the system.
Much like the process of partitioning a workload between the different processors of
the system, designers must analyze the data flow on an application-by-application
basis to determine the best placement of the accelerator in their system. For example,
if an accelerator is going to process data that will be shared with other computational
processes in the system, it might make sense to place the accelerator directly in the
processor interconnect in a system such as FARM, avoiding the need to duplicate
the data in two places in the system. If, however, the data to be processed will be
discarded after processing, the accelerator can be attached to the peripheral bus, or
even exist in a completely separate appliance, connected via a rack-level interconnect
such as Ethernet or InfiniBand. Loosely coupled accelerators like those described
CHAPTER 3. LOOSELY COUPLED ACCELERATION 65
in this chapter are usually, by their nature, more dependent on the bandwidth of
their connection to the system than on its latency. The decision of how to connect the
accelerator is important for overall system performance; however, it is highly dependent
on the characteristics of the computation being accelerated and will not be
discussed further in this work.
Loosely coupled accelerators are often quite complex pieces of hardware that are
difficult and expensive to design due to the amount of computation performed and
the speed at which they must operate to outperform a high performance general
purpose processor. This is in contrast to tightly coupled accelerators, where it is
often not necessary for the hardware to be very complicated, since it can accelerate
an application by offloading even a small amount of computation. For example, the
Bloom filter module used to accelerate transactional memory in Section 2.4.2 is a
relatively simple hardware design. Extra care must therefore be taken when deciding
to pursue building a loosely coupled accelerator. One should always ask whether it would
be just as good, in terms of whatever metrics are important to the system, to add
an extra general purpose processor, or perhaps a domain-specific processor such as a
GPU, to the system to perform the task considered for acceleration. If this is the case,
or will probably become the case in the near future due to technology improvements, that
is almost certainly the approach to take, due to the substantially cheaper development
cost of software over complicated hardware.
For tasks that consume and/or produce a large amount of data, one key indicator
that the task is a good candidate for acceleration is the inability of a general
purpose processor to fully saturate the memory bandwidth. If the computation is
saturating the available memory bandwidth, then the performance is memory-bound
and no amount of special purpose hardware will speed it up (although if power is a
concern, an accelerator could potentially perform the computation at lower power;
we will not generally explore these cases in this work). Memory bandwidth utilization
then becomes a convenient measure of the accelerator's utility. If the accelerator
can achieve significantly better utilization of the memory bandwidth than a general
purpose processor could ever hope to achieve, the accelerator will probably be worth
the cost.
To provide insight into the types of issues that arise and the techniques that can be
used in accelerators that work on large amounts of data in an attempt to fully saturate
memory bandwidth, we again turn to a case study: accelerating database operations.
We propose hardware designs that accelerate three important primitive database
operations: selection, merge join, and sorting. These three operations can be combined
to perform one of the most fundamental database operations: the table join. Since the
primary goal in our designs is to build hardware that can fully utilize any amount of
memory bandwidth, we have designed the hardware to have as few limiters to scaling
as possible. The goal is that as logic density increases, more hardware can be added
to increase the throughput of the design with very little redesign of the architecture.
This chapter includes the following key contributions:
• We detail hardware to perform a selection on a column of data streamed at
peak memory bandwidth. (Section 3.2.2).
• We describe hardware to merge two sorted columns of data. (Section 3.2.3).
• We present hardware to sort a column of data using a merge sort algorithm.
(Section 3.2.4).
• We describe how to combine these hardware blocks to perform an equi-join
entirely in hardware. (Section 3.2.5).
• We prototype all three designs on an FPGA platform and discuss issues we
faced when building the prototype. (Section 3.3).
• We analyze the performance of our prototype and identify key bottlenecks in
performance. (Section 3.3).
• For each hardware design, we explore the hardware resources necessary and how
those resource requirements grow with bandwidth requirements. (Section 3.3).
3.1 Background
By the late 1970s, Database Machines became a popular topic in the database research
community and commercial products were being planned. In an attempt to improve
access time to very large databases, these machines placed special purpose proces-
sors between the processor and the disk containing the database. They first placed
a processor at each disk track, then at each disk head, and finally placed multiple
processors with a large disk cache between a conventional disk and the host proces-
sor. These systems initially looked very promising; however, processor performance
increased much more dramatically than I/O performance and database machines soon
no longer made sense [14]. Because of the gap between disk bandwidth and processor
performance, there was no performance being left on the table by general purpose
processors and commodity storage systems: it was easy for a processor to keep the
disk busy, so without a dramatic increase in disk performance, special purpose
processing was unnecessary.
The database machines of the 70s, with special purpose processing at the disk, thus
became obsolete. By the early 90s, however, with the widespread adoption of the
relational data model,
database-centric systems using commodity processors and storage systems [32]. These
database systems became a driving force in the development of highly parallel sys-
tems.
Massively parallel database systems have continued to evolve to the present day.
Their performance has grown steadily along with the performance of the system
components they are built on top of. With the advances of memory technology and
the subsequent increase in capacity of main memory in these systems, many large
database tables now reside entirely in main memory, further improving the database
performance. It has even been proposed that disks be replaced entirely with random
access memory and “relegated to a backup/archival role” [64].
With databases residing entirely within main memory, database performance is
no longer bound by the glacial performance of a rotating magnetic disk. Unlike
the systems in the 70s, however, single-threaded processor performance is leveling
off. Systems must now rely almost entirely on parallelization to achieve increases
in performance. While Moore's law continues to hold and the number of transistors
available to chip architects continues to increase, power constraints limit the number
of logic transistors that can be active at any given time on a chip [15]. It is unlikely
that general purpose processing elements will ever be able to fully utilize the amount
of memory bandwidth available to a chip while performing all but the most basic
database operations. As an example, recent studies have increased join performance
into the hundreds of millions of tuples per second [45, 43]; with 64-bit tuples this
corresponds to a data bandwidth of one to five gigabytes per second. Modern chips,
conversely, can achieve memory bandwidth over 100 GB/s [10]. Clearly, general purpose
compute is leaving performance on the table, and database operations are a prime
candidate for acceleration.
Another enabling change in database systems is the move to columnar data storage
as opposed to row-wise data storage. This move was sparked in the 1990s by
MonetDB [52]. Since then, other database systems using column-oriented storage,
such as C-Store [77], have appeared. The move to columnar storage is a result of
attempts to better utilize the increasingly limited amount of memory bandwidth
available to processing cores. This work provides methods for transforming row-wise
query operations into column-wise vector operations. Having database tables stored
in columnar format allows processors, and accelerators, to quickly stream through
relevant columns of data, fully utilizing any available memory bandwidth.
3.2 Hardware Design
3.2.1 Barrel shifting and multiplexing
Barrel shifters are used throughout our design, so we begin with a brief reminder of how
to build these components. In our designs, we often use shifters which take in an
array of words and an amount to shift them, word-wise, in one direction. So instead
of shifting by a certain number of individual bits, the bits are shifted by a certain
number of words. For example, a shifter that takes in four 32-bit words is 128 bits
wide and shifts by 0, 32, 64, or 96 bits. This is implemented by simply replicating
a traditional 4-bit barrel shifter, which is implemented using four 4:1 multiplexors.
Thus, a barrel shifter for four 32-bit words takes 4 × 32 = 128 4:1 multiplexors,
since each of the 128 bits of output is assigned to one of four input bits. More
generally, a barrel shifter for N b-bit words takes N × b N:1 multiplexors.

Figure 3.1: A pipelineable eight word barrel shifter.
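The word-wise rotation can be captured by a one-line reference model (our own Python sketch, not part of the original design; `word_barrel_shift` is a hypothetical name):

```python
def word_barrel_shift(words, shift):
    """Word-wise barrel shifter model: output position i takes input
    word (i + shift) mod N, i.e. one N:1 multiplexor per output word."""
    n = len(words)
    return [words[(i + shift) % n] for i in range(n)]
```

Each output position selecting among all N inputs is exactly why the hardware cost is N × b N:1 multiplexors.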
Large input multiplexors can be efficiently implemented using several stages of
smaller multiplexors. A 256:1 multiplexor, for example, can be implemented with just two
stages of 16:1 multiplexors. If you are able to multiplex M signals in a clock cycle on a
platform, the number of stages for an N-wide multiplexor is ⌈log_M(N)⌉. Figure 3.1
provides an implementation of an eight word barrel shifter using a 4:1 stage and a 2:1
stage. In a modern FPGA fabric, a 16-to-1 multiplexor can be implemented using
two logic blocks (i.e. CLB, ALM, etc.) [36][6].
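The stage count can be computed with integer arithmetic, which avoids floating-point log (a small helper of our own for illustration):

```python
def mux_stages(n_inputs, m_per_stage):
    """Number of stages needed to build an n:1 multiplexor out of m:1
    multiplexors: the smallest s with m**s >= n (ceil of log base m of n)."""
    stages, capacity = 0, 1
    while capacity < n_inputs:
        capacity *= m_per_stage  # each stage multiplies reachable inputs by m
        stages += 1
    return stages
```

For example, a 256:1 multiplexor from 16:1 stages needs two stages, and the eight word shifter of Figure 3.1 needs two stages when at most a 4:1 multiplexor fits in one cycle.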
3.2.2 Selection
In this chapter, we define the selection operation to take two inputs: a bit mask of
selected elements and a column of data stored as an array of equal-width machine
data types. The inputs can either come from arrays laid out linearly in memory,
or be produced by another operation which may be looking at a different column of
data. In some cases the bit mask may be RLE-compressed and must be decompressed
before being used by the selection unit. A common case would have the bit mask
coming from another operation and the data column being read from memory. The
output of the operation is the values from the input column that correspond to the true
bits in the bit mask, in the same order that they appear in the original column. Like
the input, the output data can be streamed to another processing unit or written
sequentially into memory.
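The operation's semantics can be stated as a one-line Python reference model (our own sketch, useful as a correctness oracle for the hardware described below):

```python
def select(mask, column):
    """Selection reference model: emit the column values whose mask bit
    is set, preserving their original order."""
    return [value for bit, value in zip(mask, column) if bit]
```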
There are many ways to implement selection in software. One efficient implementation
fills a SIMD register with the next values from the input column. A portion of
the bit mask is used as an index into a lookup table which contains indices for the
SIMD shuffle operation to shuffle the selected data to one end of the SIMD register.
The resulting SIMD register is written to the output array and the output pointer
is incremented by the number of valid data elements that were written. This store
is thus an unaligned SIMD memory access, which was added in SSE4 and has little
performance impact when writing to the L1 cache. These unaligned stores are used
to incrementally fill the output with compacted data. Parallel algorithms must first
scan through the bit mask counting bits to determine the proper offset to begin writing
each portion of the result. Once those offsets are calculated, the column can be
partitioned for multiple threads to work on in parallel.
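The lookup-table technique can be sketched as a scalar Python model (our own illustration; `LUT` and `simd_select` are hypothetical names, and real code would use SIMD shuffles and unaligned stores rather than list slicing):

```python
# For every 4-bit mask value, precompute the shuffle indices that pack
# the selected lanes to the low end of a 4-wide register.
LUT = {m: [i for i in range(4) if (m >> i) & 1] for m in range(16)}

def simd_select(mask_nibbles, groups, out):
    """Compact each 4-value group according to its mask nibble, writing
    the survivors contiguously into `out`; returns the output count."""
    ptr = 0
    for m, grp in zip(mask_nibbles, groups):
        idx = LUT[m]
        out[ptr:ptr + len(idx)] = [grp[i] for i in idx]  # "unaligned store"
        ptr += len(idx)                                  # advance by popcount
    return ptr
```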
Hardware to perform this selection is presented in Figure 3.2. We call the number
of elements consumed each pass through the hardware the "width" of the selection
block; the hardware in Figure 3.2 thus has a width of four. Assuming a fully pipelined
implementation, the bandwidth of the block is fully determined by the width of the
block and the clock speed. As mentioned in Section 3.2.1, a barrel shifter can be
efficiently implemented using multiple stages of multiplexors; however, such large
barrel shifters must be pipelined to achieve high clock frequencies, so the datapath
in Figure 3.2 was carefully designed to avoid feedback paths containing large barrel
shifters, which would necessitate pipeline stalls (or a very slow clock). As is, the only
feedback path in the design is a very small addition (of width log2(W)), allowing
for a deeply pipelined design that achieves a high clock rate.
The first step is to produce a word array in which all selected words from the input
are shuffled next to each other at one end of the array (in this case, the right side).
A combinational logic block takes in a segment of the mask stream and produces a
count of the number of selected elements in the segment, a bus of valid lines, and an
index vector which specifies which word should be selected for each position in the
shuffled word array.

Figure 3.2: Data and control paths for selection of four elements.

Figure 3.3: Control logic for the selection unit.
For small input widths, this combinational logic can simply be implemented as a
single ROM. Such a ROM would have depth 2^W, which is clearly not feasible for any
realistic input width. Using purely combinational logic, such as a cascade of leading-
1-detectors, would also not be feasible for larger input widths. We thus use smaller
sections of the mask as addresses into multiple smaller ROMs. For example, instead
of using all 16 bits of a mask segment to address a 64K-deep ROM, we can use each
4-bit nibble of the mask to address four 16-element ROMs. It is then necessary to shift
the output of each ROM into the correct position of the final index vector, based on
the accumulated count from the adjacent ROM. Figure 3.3 shows an implementation
of this for an input width of 16. This datapath has no feedback paths and can thus
be efficiently pipelined to achieve full throughput. Decreasing the size of the ROMs
and including more of them results in lower total ROM space but higher latency and
more adders, barrel shifters, and pipeline registers.
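The ROM-splitting scheme can be modeled in Python (our own sketch of the composition in Figure 3.3; `nibble_rom` and `control_for` are hypothetical names, and the adder chain and shifters are modeled as ordinary arithmetic):

```python
def nibble_rom(width=4):
    """Per-nibble ROM contents: mask value -> indices of its set bits."""
    return {m: [i for i in range(width) if (m >> i) & 1]
            for m in range(1 << width)}

def control_for(mask16, rom=nibble_rom()):
    """Compose four 4-bit ROM lookups into one 16-wide selection vector,
    placing each nibble's indices after the accumulated count from the
    nibbles below it (the adder/shifter chain of Figure 3.3)."""
    sel, count = [], 0
    for nib in range(4):
        idx = rom[(mask16 >> (4 * nib)) & 0xF]
        sel += [4 * nib + i for i in idx]  # rebase into the 16-wide space
        count += len(idx)
    return count, sel
```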
For a given input width W, the count is log2(W) bits wide, the valid array is
W bits wide, and the selection vector's width is (W/2) · log2(W) + (W/4) · (log2(W) − 1) + · · · + 1.
The number of control lines for this section thus grows quite rapidly, from 69 bits for
an input width of 16 (4 for count, 16 for valid, and 8 × 4 + 4 × 3 + 2 × 2 + 1 = 49
for selection) to 904 bits for an input width of 128. Unfortunately, there is no way
to reduce the number of control signals needed in this initial step. To consume W
values of data each cycle, the value at the edge of the output of the shuffle array could be
any of those W inputs. Thus, to achieve a bandwidth of W values per cycle requires
a W:1 multiplexor for that word. Including the multiplexors for the other values, any
implementation requires W − 1 word multiplexors, with sizes from W:1 down to 2:1,
to consume W values per cycle.
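The control-width arithmetic can be checked with a small script (a helper of our own, assuming W is a power of two):

```python
def control_bits(w):
    """Total control-signal width for a selection unit of input width w:
    count bits + valid bits + shuffle-select bits, where the shuffle uses
    w/2 muxes of log2(w) select bits, w/4 muxes of log2(w)-1 bits, etc."""
    count = w.bit_length() - 1  # log2(w) for a power of two
    valid = w
    select = sum((w >> j) * (count - j + 1) for j in range(1, count + 1))
    return count + valid + select
```

This reproduces the 69-bit and 904-bit figures quoted above for widths 16 and 128.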
Once the selected values are shuffled to the right side, they are rotated left to a
position indicated by the current number of saved values ready to be output. Values
in the input that complete a full output are sent directly to the output, and values
that will make up a partial output are saved in registers. For example, if two values
were previously saved in the registers, and three values are selected in the input, the
input will be rotated right by two, such that the lowest (furthest right) two values fill
the left two positions in the output, and the third input word is saved in the register
furthest to the right, ready to be added to selected values from the next input.
3.2.3 Merge Join
The merge join operation takes two sorted columns of fixed-width keys as input, each
with an associated payload column, and produces an output column which contains
all the keys that the two columns have in common, together with the associated
payload values. When there are duplicate matching keys, the cross product of all
payload values is produced. For example, if there are four entries of a key x in one
input column, and six entries of x in the other input, there will be 24 entries in the
output with key x.
This operation can be performed in software by sequentially moving through each
input column and advancing the pointer of the column with the lower value. When
two keys match, an output row is written to the output array and the output pointer
is incremented. Care must be taken to handle the case of multiple matching keys and
produce the correct cross-product output. The resulting code has a large number of
Figure 3.4: Hardware to perform the merge join operation. The green lines exiting diagonally from each comparator encompass the key, both values, and the result of the comparison.
unpredictable branches that result in a very low IPC; the code quickly becomes
processor-bound, unable to keep up with the memory bandwidth available to even a
single core.
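The pointer-advancing algorithm just described, including the duplicate-key cross product, can be sketched as a Python reference model (our own illustration, not the production code; its data-dependent branching is exactly what drives the IPC down in the software version):

```python
def merge_join(left, right):
    """Join two key-sorted lists of (key, payload) tuples, emitting the
    cross product of payloads for runs of duplicate keys."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # advance the side with the lower key
        elif lk > rk:
            j += 1
        else:
            # find the run of equal keys on each side
            i2 = i
            while i2 < len(left) and left[i2][0] == lk:
                i2 += 1
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                j2 += 1
            # emit the cross product of the two runs
            for a in range(i, i2):
                for b in range(j, j2):
                    out.append((lk, left[a][1], right[b][1]))
            i, j = i2, j2
    return out
```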
Our hardware design to perform this operation is laid out in Figure 3.4. The
basic design is rather straightforward: all combinations of a section of keys from one
input, the "right" input, and a section of keys from the other input, the "left" input,
are compared. An array of possible output combinations is produced, with a bit mask
indicating which should be used. This output can then be sent into the selection
unit from Section 3.2.2 to produce the actual output rows. The highest value from
each input is compared, and the input with the lower highest value is advanced, while the
same selection from the other input remains. This ensures that any combination of
input keys that could potentially match are compared.
Complications arise, however, when the highest value of each input selection is
equal. In this case it is necessary to buffer the keys from the left input and advance
through the left input until the highest keys no longer match. When that happens,
it is guaranteed that the highest right input is lower than the highest left input,
and the right input can be advanced. Any buffered values are then replayed and
compared against the new selection from the right. When the replay buffer is empty,
execution continues as normal. Our design uses two local buffers and control logic
that allows the buffer to spill into a pre-allocated buffer in DRAM. Once the top buffer
is filled, the bottom buffer is filled; when the bottom buffer is filled, it is drained into
DRAM, ready to be filled again. Using two buffers in this way assures that when the
replay starts, data is immediately available to be replayed (the data in the top buffer).
While that buffer is being replayed, any data that spilled over into DRAM can be
prefetched, hiding the DRAM latency. The bottom buffer is used to provide a large
burst of data to write to DRAM, instead of small individual writes, which decreases
the impact on overall DRAM performance.

Figure 3.5: Merge join optimization. Shaded blocks have potential matches in them; the line is the path the unoptimized design takes through the data, with the zig-zag at the end representing a replay. In this case, the optimized design looks at 4 cross sections and the unoptimized design looks at 8.
Because the number of comparators grows quadratically with the width of the input,
it is difficult to implement hardware with a wide input array. An optimization to
help increase the throughput of the design looks at a much wider selection of each
input than the actual comparator grid. The input is partitioned into sections that
fit into the comparator grid, and the highest and lowest values of each section are compared.
Using those comparisons, only those cross sections with potential matches are sent into the
comparator grid sequentially, while the others are skipped. Figure 3.5 is an example
where four chunks of data from each input are considered at once. The cross sections
shaded green correspond to blocks that have potential matches and must be examined;
the unshaded blocks do not have to be examined. The unoptimized design would
follow the path drawn through the data, examining all eight of the cross sections it
moves through. The optimized hardware examines only the four shaded blocks,
then advances one of the inputs as in the original design.
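The pruning step can be modeled by intersecting per-block key ranges (a Python sketch of our own, assuming each block's minimum and maximum keys are available from the highest/lowest comparisons):

```python
def blocks_to_examine(left_blocks, right_blocks):
    """Given (min_key, max_key) ranges for each block of the two sorted
    inputs, return the (left, right) block pairs whose key ranges
    overlap, i.e. the shaded cells of Figure 3.5; all other pairs can
    be skipped without entering the comparator grid."""
    pairs = []
    for li, (lmin, lmax) in enumerate(left_blocks):
        for ri, (rmin, rmax) in enumerate(right_blocks):
            if lmin <= rmax and rmin <= lmax:  # key ranges intersect
                pairs.append((li, ri))
    return pairs
```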
3.2.4 Sorting
Sorting an array, or column, of numbers has been and will continue to be a very active
area of research and is an essential primitive operation in many application domains,
including databases. Quicksort-based algorithms have traditionally been considered
to have the best average case performance among software sorting algorithms. However,
recent advances in both CPU and GPU architectures have brought merge sort
based algorithms, such as bitonic sort and Batcher odd-even sort, to the forefront of
performance, as they are able to exploit new architectures more effectively and better
utilize a limited amount of bandwidth [25, 68, 73, 50]. Satish et al. [69] provide a
comprehensive overview of state-of-the-art sorting algorithms, and their limitations and
trade-offs, on general purpose CPU and GPU processors.
We present here a dedicated hardware solution to perform a merge sort entirely
in hardware. The goal of this design is to sort an in-memory column of values while
streaming the column to and from memory at full memory bandwidth as few times as
possible. Figure 3.6 depicts the essence of a merge sort. We call the merge done at an
individual node a "sort merge", to distinguish it from the "merge join" presented
in Section 3.2.3. To accomplish this we implement a merge tree directly in hardware,
stream unsorted data from memory into the merge tree, and write out sorted portions
of the column. Those sorted portions then become the input to each input leaf of the
merge tree again, generating much larger sorted portions. This process is repeated
until the entire column is sorted. The number of passes required through the tree is
dependent on the width of the merge tree. If the tree has width W and the column
has N elements of data, N/W portions of length W are created on the first pass
through. On the second pass, those N/W portions are merged into N/W^2 portions
4 8 2 1 5 5 7 0
4 8 1 2 5 5 0 7
1 2 4 8 0 5 5 7
0 1 2 4 5 5 7 8
Figure 3.6: Sorting using a sort merge tree.
of length W^2. This continues until N < W^p, where p is the number of passes. The
number of passes required to sort a column of size N is then ⌈log_W(N)⌉. Thus, if
W is relatively large, the number of passes required grows extremely slowly with the
size of the input table, and very large tables can be sorted in just two or three passes
over the data.
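The pass count ⌈log_W(N)⌉ can be computed with integer arithmetic (a small helper of our own, not from the original design):

```python
def sort_passes(n, w):
    """Passes of a width-w merge tree needed to sort n elements: the
    smallest p with n <= w**p, i.e. ceil of log base w of n."""
    passes, run_length = 0, 1
    while run_length < n:
        run_length *= w  # each pass multiplies the sorted-run length by w
        passes += 1
    return passes
```

For instance, with a 1024-wide tree, a billion-element column sorts in three passes.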
Before we describe the design of the merge tree itself, we first look at an individual
node in the merge tree. The maximum throughput of data through the merge tree
is ultimately limited by the throughput of data through the final node at the
bottom of the tree. Depending on the data, other nodes of the tree can also become
a bottleneck. For example, if the far left input on a second pass contains all of the
lowest elements of the full column, then only the far left branches of the tree will be
used until that entire portion is consumed. It is thus not practical to move only the
lowest single value of the two inputs of a node to the output; this would result in
the throughput of the tree being only one element per cycle. It is therefore necessary to
consume multiple values every cycle.
Figure 3.7 gives a logical overview of how multiple values from the input are
merged at a time. The input to the unit is two buffers, each containing a sorted
list. To maintain high bandwidth, each input is multiple values wide (Figure 3.7
shows four, but in general it can be much wider). Each iteration, the lowest values
of the two inputs are compared and the entire width of the input, in this case four
values, is removed from the input buffer with the lower lowest value. These four values are
merged with the highest four values from the previous iteration. The four lowest
values resulting from that merge are guaranteed to be lower than any other value yet
to be considered since any values lower than the fourth would already have been pulled
in because both inputs are already sorted. The highest four values, however, may be
higher than values yet to be pulled in from the input not chosen at the beginning of
the iteration. They must therefore be fed back and merged with the next set of input
values. In this way, four values are produced and four values are consumed from one
of the inputs each iteration.
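The iteration described above can be modeled in software. This sketch is illustrative (the name merge_streams and the list-based queues are ours); it assumes both inputs are sorted and their lengths are multiples of the block width k:

```python
from heapq import merge as hmerge

def merge_streams(left, right, k):
    """Software model of the hardware merge node in Figure 3.7: each
    iteration consumes a k-wide block from whichever sorted input has the
    lower head value, merges it with the k feedback values held from the
    previous iteration, emits the k lowest, and feeds the k highest back.
    Assumes len(left) and len(right) are multiples of k; consumes inputs."""
    out = []
    # Prime the feedback register with the first block of the lower input.
    src = left if left and (not right or left[0] <= right[0]) else right
    feedback = src[:k]
    del src[:k]
    while left and right:
        src = left if left[0] <= right[0] else right
        block = src[:k]
        del src[:k]
        merged = list(hmerge(feedback, block))
        out.extend(merged[:k])   # the k lowest are final
        feedback = merged[k:]    # the k highest may still be beaten
    # One input is empty: drain the feedback and the remaining input.
    out.extend(hmerge(feedback, left or right))
    return out
```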
It is not necessary, however, to put a merge network like that in Figure 3.7 at each
node of the tree. Each level of the tree need only supply values as fast as the level
below it can consume values. Thus, each level need only match the throughput of the
final node of the tree, which need only match the write memory bandwidth to keep up
with memory. Figure 3.8 presents the hardware that encompasses a single level of a
merge tree, which we call a “sort merge unit”. There are four memories at each level.
A data memory buffers the input data to the level; it is only necessary to hold
a single value for each input leaf to the level. The data memory is partitioned into
“left” and “right” data so that both inputs to a particular node can be read at once,
but each can be written separately. Another memory holds the feedback data from
the previous merge of values for each node in the level. A valid memory holds a bit
for each input leaf to indicate that the data for that leaf is valid, and a bit for each
entry in the feedback memory. These valid bits are blocked in chunks, so a single read
or write works on multiple values at once. Finally, a “request sent” memory, which is
blocked like the valid memory, holds a single bit for each input leaf to indicate that
a request has been sent up the tree to fill the data for that leaf. Note that there are
no output buffers, as the outputs are buffered at the next level in the tree.
We now describe three operations performed on a sort merge unit: a push, a
request, and a pop. A push, whose data path is black in Figure 3.8, is performed
when previously requested input data arrives from above the unit in the tree. First,
the data is written to the data memory (the entry is known to be invalid because it
was previously requested), and the valid and request outstanding blocks are read. The
corresponding valid bit is set, the request outstanding bit is cleared, and the new
blocks are written back into the respective memories. The new block of valid bits is
also sent down to the lower level along with the index. If nothing is being pushed in
a particular cycle, a valid block (determined by an internal counter) is still read and
sent down to the lower level; this is not shown in the figure and prevents deadlock in
some cases.
When the valid block and associated index are sent to a sort merge unit, it initiates
a request operation, which follows the green data path in Figure 3.8. First, the level’s
own valid and request outstanding blocks corresponding to the valid bits received are
read. The incoming valid block, which represents data valid at nodes above, and the
local valid and request outstanding blocks are examined to find invalid elements
that have two valid parents and have not been requested. One such element is selected,
a bit for it is set in the request outstanding memory, and the request is sent up to
the parent.
Figure 3.8: Sort merge unit. Note that for simplicity, ports to the same memory are separated.

Finally, an incoming request from below results in a pop operation, which follows
the orange data path. Both data values, the feedback data, and the corresponding valid
block are read. The lowest values in each data buffer are compared. The block with
the lowest is sent to the merge network along with the feedback data (if valid), the
valid bit corresponding to the consumed leaf is cleared, and the valid bit for the
feedback data is set. The lower values from the merge network are sent to the next
level to be pushed and the higher values are written back into the feedback memory.
All three of these operations must be pipelined to ensure continuous flow of data
through the merge tree. Section 3.3.3 gives a brief description of how we pipelined
our implementation to achieve high throughput. Even with a fully pipelined design,
however, the throughput of the entire merge tree is limited by the throughput of the
final node in the tree. The design in Figures 3.7 and 3.8 can sustain a throughput
of multiple values every cycle as long as there are plenty of input nodes with data
and available output nodes. In this case merges of multiple nodes in the level are
happening simultaneously. However, the final node of the tree has only two inputs.
That means that an entire iteration must complete before the next merge can begin,
since the feedback data is required to pull another element from a parent node. We
can estimate the latency of a reasonably pipelined implementation of Figure 3.8 to
be the number of stages in the merge network, which is O(lg(width)), making the
throughput through the final node in the tree O(width/lg(width)). For the final node,
however, the ability to handle multiple merges at once is not necessary, and it should
more ideally have bandwidth that is O(width).
Figure 3.9 presents a higher bandwidth sort merge unit which only implements a
single node of the tree, not an entire level with multiple nodes like Figure 3.8. Instead
of consuming and merging a set number of values from one of the inputs, shift registers
are used to consume a variable number from each input and new values are shifted in
as space becomes available. Let W be the number of values to output each iteration.
Let L_i and R_i be the values in the left and right shift registers, respectively, with
i ranging from 0 to 2W − 1. To determine the four lowest values from across both
shift registers, each L_x is compared with R_{W−1−x} for x between 0 and W − 1. The
lower of the two in each case is advanced to the sort network while the higher remains
in the shift register. For example, if L_0 < R_3, then at least one from the left and
no more than three from the right are among the lowest, so L_0 is necessarily one of
the lowest and R_3 is necessarily not. Likewise for L_1 and R_2, L_2 and R_1, and L_3
and R_0. The number taken from each side is counted and each shift register is shifted
by that amount. If there is enough free space in the shift register, an input section
is consumed, shifted into the correct position, and stored. The four lowest values
are then sent into a full sort network. A merge network like that in Figure 3.7 is
insufficient here since the input is not necessarily split into two equally sized, already
sorted arrays. A simple merge network of twice the width could be used, with some
number of the inputs on each side disabled, but a merge network of size 2N takes
more resources than a full sort network of size N.

Figure 3.9: High bandwidth sort merge unit.

Figure 3.10: Full system block diagram and data paths.
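The claim that these W pairwise comparisons select exactly the W lowest values can be checked with a small model. This sketch is ours (Python's sorted stands in for the full sort network):

```python
def lowest_w(left_sr, right_sr, W):
    """Pick the W overall lowest values from two sorted shift registers by
    comparing L[x] against R[W-1-x], as in the high bandwidth merge unit.
    Returns the W winners (sorted) plus how many came from each side."""
    chosen, take_left = [], 0
    for x in range(W):
        if left_sr[x] < right_sr[W - 1 - x]:
            chosen.append(left_sr[x])
            take_left += 1
        else:
            chosen.append(right_sr[W - 1 - x])
    # The hardware feeds `chosen` through a full sort network; Python's
    # sort stands in for that network here.
    return sorted(chosen), take_left, W - take_left
```

The counts returned correspond to the shift amounts applied to each register.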
The datapath in Figure 3.9 still has feedback paths which prevent a pipelined
implementation from being fully utilized; the critical feedback path is a bit count,
barrel shifter, and 2:1 multiplexor. This path is much shorter and grows much less
quickly as the width increases than the feedback path of Figure 3.8, which includes a full
merge network. Since the number of stages in the barrel shifter is O(lg(width)) (See
Section 3.2.1), the bandwidth through this unit is still O(width/lg(width)); however,
the base of the logarithm is much higher, making the bandwidth much closer to the
ideal O(width).
Finally, Figure 3.10 shows the datapath for a full merge tree. A “tree filler” block
has the same interface as a sort merge unit, but fulfills requests by fetching from
DRAM. It continually sends blocks of “valid” bits which indicate that data is still
available for a particular input, turns requests from the top level of the merge tree
into DRAM requests, and turns replies from DRAM into pushes into the top sort
merge unit. During the initial pass through the memory, the data for an input can
come from anywhere, so the input column is read linearly and sent through a small
initial bootstrap sort network since the sort merge units expect blocks of sorted data
as input. To prevent very wide levels that make routing more difficult, the top levels
of the tree are split into four sub-trees, which operate independently of each other.
The final two levels of the tree use the high bandwidth sort merge units to maintain
the total throughput of the tree and merge the outputs of the four lower bandwidth
sub-trees to produce a single sorted output.
On passes after the initial pass through the data, the tree filler must obtain data from
the particular sorted portion that matches the tree input of the request. Depending
on the number of portions remaining to be merged, the tree filler maps some number
of inputs of the tree to each of the remaining portions. For example, if the full tree is
16k inputs wide and there are four portions remaining to be merged, the first portion
is mapped to the first 4k inputs, the second to the next 4k, etc. This means that
some values of a portion are re-merged, but it also has the effect of using sections of
the tree as an input buffer for each of the portions. The fewer portions that remain to
be merged, the larger the “input buffer” for each portion is and the larger the requests
to DRAM can be. When the number of portions remaining to be sorted is equal to
the number of inputs to the tree, only a single chunk of a portion can be requested
at a time, leading to inefficient use of the DRAM bandwidth. We see the results of
this in Section 3.3.3.
To support using portions of the merge tree as an input buffer in subsequent
passes, the tree filler keeps a bit mask of tree inputs that it has received a request
for. When enough of the inputs mapped to a particular portion have been requested,
a single large request for the next values in that portion is issued and all of the
outstanding requests are fulfilled in bulk.
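The bulk-request bookkeeping can be sketched as follows; the class and method names are illustrative, not from the actual implementation:

```python
from collections import defaultdict

class TreeFiller:
    """Illustrative model of the tree filler's bookkeeping: remember which
    tree inputs have outstanding requests and, once every input mapped to a
    sorted portion is waiting, serve them with one large DRAM read instead
    of many small ones."""
    def __init__(self, inputs_per_portion):
        self.n = inputs_per_portion
        self.waiting = defaultdict(set)   # portion -> waiting leaf indices

    def request(self, portion, leaf):
        """Record a request; return a bulk DRAM read once enough accrue."""
        self.waiting[portion].add(leaf)
        if len(self.waiting[portion]) == self.n:
            leaves = sorted(self.waiting.pop(portion))
            # One bulk DRAM request covering n chunks of this portion.
            return ("dram_read", portion, leaves)
        return None
```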
For a well distributed data set, the throughput of the merge tree is the same as
the throughput through the final node, which is close to O(width) and can easily grow
with available memory throughput by making the final nodes wider. However, for
some data sets the throughput will be limited by the bandwidth through one of the
lower bandwidth merge nodes. Consider the case of sorting a list that is already in
order. For the first pass, the data will come through all branches of the tree and
full bandwidth will be achieved. However, on the second pass, the far left input to
the tree will need to be drained before moving on to the next input. In this case,
all data is coming from just one branch of the tree, and the throughput through the
tree is the throughput through a low-bandwidth merge node with only one input, or
O(width/lg(width)), where width is the width of the low-bandwidth merge network
in Figure 3.7.
3.2.5 Sort Merge Join
A full join operation is the same operation as a merge join, described in Section 3.2.3,
but does not require the input columns to be sorted. Two main algorithms are most
often used to perform joins: a hash join and a sort merge join [45]. A hash join builds
a hash table of one of the two input columns, then looks each element of the other
column up in the hash table to find matches. Modern hash join implementations use
sophisticated partitioning schemes to parallelize the operation and utilize a processor's
cache hierarchy. A sort merge join simply sorts both input columns, then performs
a merge join on the sorted columns. Implementations leverage the massive body
of research on improving the performance of sorting. Typically the final merge step is
all but ignored because sorting the columns takes such a large percentage of the time
necessary for a sort merge join.
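In software terms, a sort merge join is a sort of each column followed by a linear merge that emits every matching combination. A minimal sketch over (key, value) tuples (our own illustration, not the hardware datapath):

```python
def sort_merge_join(left_col, right_col):
    """Sort both columns of (key, value) tuples, then merge-join them,
    emitting every matching (key, left_value, right_value) combination."""
    left, right = sorted(left_col), sorted(right_col)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Emit the cross product of the equal-key runs on both sides.
            i2 = i
            while i2 < len(left) and left[i2][0] == key:
                j2 = j
                while j2 < len(right) and right[j2][0] == key:
                    out.append((key, left[i2][1], right[j2][1]))
                    j2 += 1
                i2 += 1
            i, j = i2, j2
    return out
```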
Figure 3.10 shows how each of the three blocks previously described can be com-
bined to perform an entire sort merge join in hardware. Two independent sort trees
are used to sort each of the two input columns. On the final pass through each col-
umn, the sorted data is sent to the merge join block instead of back to DRAM. The
merge join output is sent to the select block as before and only the result of the join
operation is written back into DRAM. The design also includes data paths that allow
the sort, merge join, and select blocks to be used independently of each other.
3.3 Implementation and Results
To prototype the design we used a system from Maxeler Technologies described in
Figure 3.11. This system features four large Xilinx Virtex-6 FPGAs. Each FPGA
has 475k logic cells and 1,064 36 Kb RAM blocks for a total of 4.67 MB of block
memory. Each FPGA is connected to 24 GB of memory via a single 384 bit memory
channel capable of running at 400 MHz DDR, for a line speed of 307.2 Gbps, or 38.4
GB/s per FPGA. This gives a total line bandwidth between the FPGAs and memory
of 153.6 GB/s, comparable to modern GPUs. The FPGAs are connected in a line
with connections capable of 4 GB/s in each direction. For each design, we clocked

Figure 3.11: Block diagram of prototyping platform from Maxeler Technologies.

the FPGA fabric at 200 MHz. Finally, each FPGA is connected via PCIe x8 to a
host consisting of two 2.67 GHz Xeon 5650 processors, each containing six multi-
threaded cores. These processors each have a line memory bandwidth of 32 GB/s.
Our purpose in prototyping the design was not entirely to determine the perfor-
mance of the design, although we do provide performance numbers. As long as the
components are able to match or exceed the memory bandwidth, the performance
is largely determined by the memory system of the design, and thus many of the
performance results are as much a test of Maxeler’s memory system as they are of
the acceleration design. Our main purpose in building the prototype was to drive the
design using a real world implementation instead of what are often inaccurate sim-
ulation models, and to be able to determine the challenging issues that arise as the
hardware scales to higher bandwidths. Indeed, the final designs we have presented are
fairly different from the original designs we came up with based on early simulations.
We chose the Maxeler platform for the large amount of memory capacity and
bandwidth available to the FPGAs; we wanted to ensure that our prototype handled
a sufficient amount of bandwidth to prevent masking any scalability issues. The
largest performance bottleneck we faced using the platform is the relatively narrow
inter-FPGA links, which prevented us from effectively emulating a single chip with
a full 153.6 GB/s of memory bandwidth. Thus, for all but Section 3.3.4, we use a
single FPGA, since using the narrow inter-FPGA links skews the results in terms of
the memory bandwidth utilization.
Since many of the performance numbers are dominated by the performance of the
memory system on the Maxeler platform, we also present percentage of the maximum
memory throughput (by which we mean the line bandwidth of the memory interface)
as a metric of comparison. Since our hardware is designed to scale with available
bandwidth, these percentages give an idea of how the design would perform on different
platforms with different memory systems. They also provide a metric of comparison
with previous work, as it is difficult to make a true “apples-to-apples” comparison
when the hardware is so vastly different. We also give some intuition as to how the
resource requirements of each design will scale to platforms with different memory
bandwidths.
3.3.1 Selection
We implemented the software algorithm described in Section 3.2.2 and optimized it at
the assembly language level. On our system's host processor, this implementation
achieves its maximum throughput using 8 threads, ranging from 7.4 GB/s down to
6.0 GB/s as the selection cardinality moves from 0% to 100%. This
corresponds to 23.1% to 18.8% of the 32 GB/s maximum memory throughput of the
Xeon 5650. For reference, the STREAM benchmark [54] also achieves the maximum
bandwidth with 8 threads and is able to copy memory at a maximum speed of 11.8
GB/s¹, about 36.8% of the line rate memory bandwidth of the Xeon 5650. Results
reported on the STREAM benchmark website [1] indicate that this utilization of
maximum memory bandwidth is typical for modern processors, including the Sandy
Bridge based E5-4650.
Our implementation uses three SIMD registers, one to hold the data to be shuffled,
¹The STREAM benchmark reported 23.6 GB/s, but that counts bytes both read and written (the “STREAM” method); the number here is for the “bcopy” method, which counts total bytes moved and is more aligned with our use of bandwidth in this work.
Figure 3.12: Measured throughput of the select block prototype.
one to hold the bit mask, and one to hold the shuffle indices loaded from memory.
Thus, the lack of available SIMD registers accounts for the inability of the processor
to fully pipeline the selection process and achieve the throughput of STREAM. The
Xeons in our test system support 16 byte wide SIMD instructions; using the 32
byte wide AVX2 integer instructions in the upcoming Haswell processors, we would
expect better performance. We conclude that it is reasonable to expect a highly tuned
software selection algorithm to match the throughput of STREAM. However, doing
so would require most, if not all, of the chip's capacity. In contrast, our customized
hardware is more than able to keep up with high memory bandwidth using far
fewer resources.
The design in Section 3.2.2 maps almost directly to the FPGA platform and
we built a block that processes 72 64-bit values per clock cycle, for a maximum
throughput of 14.4 billion values per second, or 115.2 GB/s. This is much more than
the memory bandwidth available to a single chip; we will see in Section 3.3.2 why we
made it that wide. We note that the non-power-of-two number comes from the width
of the memory interface, which is 384 bits DDR and is run at twice the frequency of
the main clock domain, resulting in a 1536 bit, or 192 byte, bus. It is resource intensive
to convert this to a nice power of two, but not too difficult to convert to a multiple
of 96 bytes; thus all of our datapaths work in multiples of 96 bytes.
Figure 3.12 shows the measured throughput of the prototype. Throughout Sec-
tion 3.3, bandwidth numbers are measured as the number of input bytes processed
per second.² We could alternatively use the total number of bytes read and written.
This is pertinent here because a selection with cardinality of 0% transfers half the
amount of data as one with cardinality of 100%. With a constant amount of memory
bandwidth that can be used for either reading or writing data, the 100% case will
take longer to execute, but would have higher throughput if bytes both read and
written were counted. Counting only bytes read, the cardinality of 100% case shows
lower bandwidth since it takes longer to process the same amount of input data. This
explains the nearly linear drop from 24.7 GB/s down to 17.8 GB/s as the cardinality
moves from 40% to 100%. Below 40% the limits of a single port of the DRAM con-
troller are reached and the full line rate of the memory interface is not realized. At
100% cardinality, the memory controller is more efficient with two streams of data
(in and out) and is able to utilize 93% of the 38.4 GB/s of line bandwidth. This high
utilization is achieved because of the very linear nature of the data access pattern (i.e.,
every column is accessed in a row before moving on to the next row) and by putting
the source and destination columns in different ranks of the DRAM, preventing them
from interfering with one another.
At low cardinalities, the 24.7 GB/s achieved is 64.3% of the 38.4 GB/s maximum
memory throughput of the FPGA. This represents a 2.8x increase in the memory
bandwidth utilization over the 23.1% utilization of the software, and a 1.7x increase
over the STREAM benchmark, which is as high as any software implementation could
possibly achieve.
Note that the results of Figure 3.12 are per selection block and measured using
only a single FPGA. Benefits from attempting to use the memory bandwidth of the
other FPGAs for a single selection block would be thwarted by the narrow inter-
FPGA links. Using all four FPGAs to emulate a design with four selection blocks
would result in 4x the throughput but four separate output columns.
Since Figure 3.12 is simply a measure of the memory system on the Maxeler
platform, we now look at the number of resources required to scale the design.
Figure 3.13 shows the resources used by the implementation as the width, and thus
²Also note that “GB” here really is gigabyte, not gibibyte, so the percentage of line bandwidth, which is also in GB, not GiB, is consistent.
Figure 3.13: Amount of resources needed as the desired throughput of the select block increases.
bandwidth, of the block increases (note the different scale for registers and the other
components). We present throughput as bytes per clock to decouple the results from
any particular frequency, but also present GB/s at 400 MHz for reference. The range
in throughput represents the range in width from 8 to 144 64-bit words. In choosing
the number of stages used in the initial shuffle control (see Section 3.2.2), we exper-
imentally found a good number of stages to use is W/4, where W is the width in
words of the selection block.
Note that the numbers in Figure 3.13 present resources at the bit level. So a
multiplexor that selects between 4 64-bit words requires 64 4:1 multiplexors. For
convenience, we lump 2:1 multiplexors in with 4:1 multiplexors and 8:1 multiplexors
in with 16:1 multiplexors. Any multiplexor wider than 16 inputs is split into multiple
stages to ease routing congestion and maintain clock speed. The jump that occurs
at 496 bytes/clock (or 62 to 68 words) results from the second stage of a 68:1
multiplexor requiring 16:1 multiplexors instead of the 4:1 second stage of smaller
widths (W/16 > 4 when W > 64).
The most dramatic increase in resources as throughput increases comes from the
number of registers. This results from the additional pipeline stages needed as the
width increases. In addition to additional stages in the shuffle multiplexor and barrel
shifter, we added duplicate registers to reduce fanout for every 16 inputs to help with
the routing on the FPGA.
3.3.2 Merge Join
We prototyped the design presented in Section 3.2.3. The prototype is designed to
merge two streams of elements composed of 32-bit keys and 16-bit values. Because of
the high demand for routing resources, the structure did not map well to the FPGA
fabric and we were only able to achieve a block with a width of eight words for each
input. The output combinations, which are a 32-bit key and two 16-bit values, and the
equality bit vector are sent into a selection block, which is wide enough to accept all
64 64-bit inputs.
The throughput of the prototype for varying amounts of output vs the input table
size is presented in Figure 3.14. The line labeled “m=1” is the raw comparison grid
without the optimization of not examining unnecessary cross sections. The other line,
“m=8” shows the throughput for looking at 8 chunks of each input and only actually
comparing chunks with potential matches. The output ratio is the size of the output
compared to the input table size (which is two equally sized tables). The keys are
uniformly distributed within a range that is changed to vary the output ratio.
At low output ratios, the throughput is constrained by the throughput of the hard-
ware block itself (eight six-byte values per cycle at 200 MHz is 9.6 GB/s). As the output
ratio increases, it is necessary to “replay” portions of the input more often (see Sec-
tion 3.2.3) and the throughput decreases. Above a ratio of 1.5 (i.e. the output is
1.5 times the size of the input), the throughput is entirely limited by the write mem-
ory bandwidth. We looked at non-uniform distributions, but saw no variance in the
throughput for any given output ratio. Most skewed data sets, such as data with the
Zipf distribution used in the literature, produced a very large amount of output and
were all limited by the write memory bandwidth.
The optimization to look at 8 chunks of input and only compare possible cross
sections resulted in a 13% speedup when the data was distributed enough to produce
Figure 3.14: Throughput of the merge join prototype.
a very small output. As the keys become more dense and the output ratio goes to
3.0, fewer cross sections can be eliminated and the speedup is reduced to 11%.
We do not plot the required resources for the merge join block because it is dom-
inated entirely by the comparators and routing resources and is simply a quadratic
function of the bandwidth required. To consume N values from either input every
cycle requires N² comparisons. Higher bandwidth could be obtained by replicating
the merge block and partitioning the data, but doing so is left for future work.
3.3.3 Sorting
Our implementation of the design outlined in Section 3.2.4 is designed to handle 12
64-bit values every other 200 MHz cycle, providing a maximum throughput of 19.2
GB/s, which is able to keep up with the memory bandwidth of an individual FPGA
(assuming a column is being read and written). One of the major challenges faced in
implementing the low bandwidth merge sort unit was the number of memory ports
needed. In particular, it was necessary to access five different addresses of the valid
memory in any given cycle. The local memories on the FPGA have two full RW
ports. To solve the issue, we duplicated each valid memory and time multiplexed
the ports, alternating between reading and writing (thus handling a new input every
other cycle). Table 3.1 details how each port was used to achieve a virtual 5-port
Memory               Port  Read Cycle         Write Cycle
valid copy 1         A     Read for push      Write for push
                     B     Read for pop       Write for pop
valid copy 2         A     Read for request   Write for push
                     B     Idle               Write for pop
request outstanding  A     Read for push      Write for push
                     B     Read for request   Write for request

Table 3.1: Memory port usage in sort merge unit.
memory. Note that each port must perform the same operation on the write cycle to
maintain coherent duplication.
All the other structures mapped directly to the FPGA logic. To maintain 19.2
GB/s through the entire tree, the three high bandwidth sort merge units at the
bottom of the tree were built to accept 24 values every four cycles to accommodate
the feedback path. The most challenging aspect was getting the control for the fine
grained communication between levels correct. As an example, the pop operation is
pipelined to take six cycles: 1) start the read of data and valid blocks; 2) decode
the index; 3) start the read of the feedback data; 4) the reads complete, compare
the data; 5) multiplex the data based on the comparison result; 6) merge decoded
index with read valid blocks, update the valid block, and send the feedback data and
selected data to the merge network. At every other pipeline stage the index being
pushed is compared with the incoming index and if the two fall within the same block,
the decoded index, which indicates the valid bit to set, is updated and the incoming
push is considered complete. The pipelines for the request and push operations are
similar.
The memories on the FPGA provided enough space for 12 levels in the merge
tree, with a top level 8k inputs wide. The data bu↵ering alone for the merge tree
(including the feedback data) occupied 18.6 Mbits, or 50%, of the 37.4 Mbits of block
RAM available on the device.
Figure 3.15 shows the throughput of the prototype as the size of the input column
grows. Note that when performing two passes over the entire data set, the theoretical
maximum throughput is one quarter of the maximum memory throughput (each value
Figure 3.15: Throughput of the sort tree prototype.
needs to be both read and written twice), or 9.7 GB/s in our case. At small input
sizes, we achieve 8.7 GB/s, which is 22.7% of the maximum memory bandwidth, or
89% of the theoretical maximum with two passes. This high utilization is possible
because there are fewer partially sorted portions to merge in the second pass and as
a result each portion has a large virtual input bu↵er and the requests to memory
can be large (see Section 3.2.4). For reference, recent work on sorting values on both
CPUs and GPUs achieved rates as high as 268 million 32-bit values per second [69].
This corresponds to 1 GB/s of throughput, which is 3.9% of the 25.6 GB/s available
to the Core i7 used (GPU performance was worse). We thus see a 5.7x improvement
in terms of memory bandwidth utilization.
As the size of the input increases, the number of portions that must be merged
on the second pass increases and the size of the requests to memory decrease. At
an input size of 25M values, the memory requests are too small to fully utilize the
memory bandwidth and performance begins to degrade. When the input size reaches
400M values, there are enough portions in the second pass that it is advantageous to
perform a third pass. In this case, the portions from the first pass are partitioned into
groups small enough that large memory requests can be used and each partition is
sequentially merged into portions ready to be merged in the third pass. Above 800M
values, there was insufficient memory to hold both the input and output columns; we
Figure 3.16: Memory bits required to achieve optimal sort throughput for a given input size. Note the log/log scale.
therefore projected the performance for larger columns, using the throughput seen on the second pass of smaller columns to predict the memory bandwidth for a given table size.
Unlike the previous sections, the interesting resource metric is not how the resource
usage grows with desired bandwidth, but how the resource usage grows with input
size, keeping bandwidth constant. A very small merge tree could maximize bandwidth
for small inputs, but performance would rapidly decrease as input size grows. For
example, our prototype was able to use the maximum amount of memory bandwidth
until the input was over 12.5 million values. To see where this limit comes from,
let N be the size of the input, in bytes, and let W be the width of the top level of the tree in bytes (in our prototype W = 8K × 12 records × 8 bytes/record = 786,432 bytes). The number of portions left after the first pass through the data is L = N/W and the maximum size of each read on the second pass is W/L, or W²/N. If the minimum read size for optimal memory throughput is M, the maximum input size that achieves optimal memory performance is W²/M. For the Maxeler platform, M is measured to be 6144 bytes, which gives a maximum size of 100 MB, or 12.5M 64-bit values. Likewise, W must be √(M × N) for a table of size N to fully utilize the memory bandwidth on the second pass. Figure 3.16 provides the number of memory bits needed to achieve maximum memory bandwidth efficiency for given input sizes, provided a minimum read size of 6144 bytes.
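The relationships above can be checked numerically; a short sketch using the prototype's W and the measured M:

```python
import math

# Prototype constants from the text.
W = 8 * 1024 * 12 * 8   # top-level width: 8K x 12 records x 8 B = 786,432 bytes
M = 6144                # minimum read size for full memory throughput (bytes)

def second_pass_read_size(n_bytes):
    """Maximum read size on the second pass: W/L, with L = n/W portions."""
    return W * W / n_bytes          # = W^2 / n

# Largest input that keeps second-pass reads at or above M: N = W^2 / M.
n_max = W * W // M
print(n_max, n_max // 8)            # 100663296 bytes (~100 MB), 12582912 values

def required_width(n_bytes):
    """Tree width needed so that second-pass reads stay >= M bytes."""
    return math.sqrt(M * n_bytes)   # W = sqrt(M * N)
```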
To obtain the highest throughput possible using our platform, we tested a pro-
totype where one quarter of the input column was split onto FPGAs 0 and 2, while
the remaining three quarters were put on FPGA 1. With this configuration, the two
smaller portions were individually sorted then streamed over the intra-FPGA link to
the FPGA with the bulk of the data. These streams were simply treated as extra
inputs to the top of the tree on the final merge pass and essentially augmented the
memory bandwidth. Note that the sort tree hardware did not change; only the source of the data did. With this configuration, we achieved a throughput of 1.4 billion
values per second, or 11.2 GB/s. With the narrow intra-FPGA links in play, this
is a much lower percentage of the memory bandwidth available to the three chips
used (9.7%). We mention it here to demonstrate that the throughput of the sort tree
hardware is purely constrained by the memory bandwidth.
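Functionally, the extra streams change nothing about the merge itself: the final pass is still one k-way merge of sorted runs, as a small software analogue (with hypothetical data) illustrates:

```python
import heapq

# Sorted runs produced by the first pass in the local FPGA's memory...
local_runs = [[1, 5, 9], [2, 6, 10]]
# ...plus the pre-sorted streams arriving over the intra-FPGA links are
# simply additional inputs to the top of the merge tree (hypothetical data).
remote_streams = [[0, 4, 8], [3, 7, 11]]

merged = list(heapq.merge(*(local_runs + remote_streams)))
print(merged)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```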
3.3.4 Sort Merge Join
Finally, we combine the selection, merge join, and sorting blocks to prototype the full
design in Figure 3.10. The resources of a single FPGA were too constrained to fit all
three blocks on a single FPGA, so we put the merge join and selection blocks on one
FPGA and sort trees on the two adjacent FPGAs. Figure 3.17 outlines the process
used to perform a full join. Each of the columns to be joined is held entirely on a separate FPGA. Each table is individually sorted, except that the output of the sort tree on the final pass is sent across the intra-FPGA links to the merge join block described in Section 3.3.2. These blocks are sufficiently wide to keep up with the bandwidth of
the intra-FPGA links. Since the first sorting pass through the table has a constant
throughput limited by the memory bandwidth, and the second and final pass through
the data is limited by the intra-FPGA link, the end-to-end throughput of the whole
design is a consistent 6.45 GB/s across all table sizes and output cardinality, or just
over 800 million key/value pairs a second. This is slightly under the aggregate intra-
FPGA bandwidth of 8 GB/s due to the initial pass through the data for sorting. The
achieved 6.45 GB/s is 5.6% of the 115.2 GB/s of memory bandwidth available to the
three chips. This lower utilization is due to the narrow intra-FPGA links.
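The operation the pipeline implements is an ordinary sort-merge equi-join on (key, value) columns; a minimal single-threaded software analogue, with hypothetical data and illustrative only (not the hardware design), looks like:

```python
# Minimal sort-merge equi-join sketch: sort both columns of (key, value)
# pairs, then advance two cursors, emitting matches on equal keys.
def sort_merge_join(left, right):
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of the runs of equal keys.
            j0 = j
            while j < len(right) and right[j][0] == lk:
                out.append((lk, left[i][1], right[j][1]))
                j += 1
            i += 1
            if i < len(left) and left[i][0] == lk:
                j = j0  # rescan the right-hand run for the next equal left key
    return out

pairs = sort_merge_join([(2, 'a'), (1, 'b')], [(2, 'x'), (3, 'y'), (2, 'z')])
print(pairs)   # [(2, 'a', 'x'), (2, 'a', 'z')]
```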
System               Clock Freq   Throughput / Mem BW (GB/s)   % of BW
Multi FPGA           200 MHz      6.45 / 115.2                  5.6%
Single FPGA          200 MHz      6.25 / 38.4                  16.3%
Kim [45] (CPU)       3.2 GHz      1 / 25.6                      3.8%
Kaldewey [43] (GPU)  1.5 GHz      4.6 / 192.4                   2.3%
Table 3.2: Summary of sort merge join results.
Figure 3.17: Full multi-FPGA join process. Each table is first sorted separately on the respective FPGA. Finally, both tables are sent to the FPGA containing the merge join block to be merged.
If all three blocks were able to fit on a single chip, the second pass through the
data would be constrained by the throughput of the merge-join block. In this case,
the end-to-end throughput would be 6.25 GB/s, which is lower absolute throughput
than the multi-FPGA design due to using only one FPGA’s memory bandwidth, but
is 16.3% of that FPGA’s maximum memory throughput.
Table 3.2 summarizes our results and compares with other recent work on join
processing. Kim et al. [45] used a Core i7 965 with 25.6 GB/s of memory bandwidth to achieve a join throughput of 128 million 64-bit tuples per second, or 1 GB/s and 3.9% of memory bandwidth. Our multi-FPGA design achieved a 40% increase over
this utilization, and a single-chip design would provide a 4.1x increase in utilization.
More recent work by Kaldewey et al. [43] uses a GTX 580 GPU with 192.4 GB/s of
memory bandwidth to achieve 4.6 GB/s of aggregate throughput. These results used
UVA memory access over a PCIe link since their experiments showed that the computational throughput of the GPU was less than the PCIe data transfer throughput. Thus, even if the tables were contained in device memory, the join throughput would remain at 4.6 GB/s, or 2.3% of the memory bandwidth of the device.
3.4 Related Work
There has been growing interest in using dedicated logic to accelerate database operations, with FPGAs serving as an excellent platform for exploring custom hardware options. Mueller et al. proposed an FPGA co-processor that performs a streaming median operator which utilizes a sorting network [56]. This work
performs a di↵erent operation and is directed at much smaller data sets and lower
bandwidths than our work. In their design, it was only necessary to have a single merge unit that data flowed through, sorting small eight-word blocks in a sliding window independently of each other. Our design incorporates a full sorting tree that
has many merge units coordinating the sorting of the entire memory stream. This
same team has also proposed Glacier, a system which compiles queries directly to a
hardware description [58, 57]. This is complementary to our work as it looks at ways
to incorporate accelerators into an overall database system.
Koch and Torresen also propose an architecture for sorting numbers using FPGAs [46]. The design in this work has similarities to the sorting implementation
presented here; however, they were constrained to a system with much lower mem-
ory bandwidth and capacity and thus achieve results on the order of 1 to 2 GB/s of
throughput. They do not discuss scaling their results to higher bandwidths, which
requires fundamental design changes as illustrated in our work. Our work builds on
top of this work by presenting new designs that make use of a modern prototyping
system with a large amount of memory capacity and bandwidth.
More recently, researchers at IBM proposed an architecture to accelerate database
operations in analytical queries using FPGAs [78]. Their work focuses on row decom-
pression and predicate evaluation and concentrates on row based storage systems.
Netezza, now part of IBM, provides systems that use FPGA based query evaluators
that sit between disks and the processor [2]. Like Glacier, this work is complementary
and shows the possibilities of incorporating accelerators like those presented here into
real database systems.
Chapter 4
Conclusions
Building accelerators that actually accelerate computation is hard. In this thesis,
we have discussed in detail accelerators that have succeeded, in order to provide insight for
developers of future domain specific accelerators. We presented FARM, a hardware
prototyping system based on an FPGA coherently connected to multiple processors.
In addition, we revealed practical issues inherent in using such an accelerator system
and described methods of addressing these issues. We also used FARM to successfully
prototype an STM accelerator that relies on low-latency fine-grained communication.
FARM provides tools that enable researchers to prototype a broad range of interesting
applications that would otherwise be expensive and difficult to implement in hardware. The conclusion of this work is that communicating coherently with a processor
requires careful design and employment of techniques such as the use of epochs to
reason about the timing of events in an asynchronous system.
We have presented an architecture, TMACC, for accelerating STM without mod-
ifying the processor cores. We constructed a complete hardware implementation of
TMACC using a commodity SMP system and FPGA logic. In addition, two novel
algorithms which use the TMACC hardware for conflict detection were presented and
analyzed. Using the STAMP benchmark suite and a microbenchmark to quantify and
analyze the performance of a TMACC accelerated STM, we showed that TMACC
provides significant performance benefits. TMACC outperforms a plain STM (TL2)
by an average of 69% in applications using moderate-length transactions, showing
maximum speedup within 8% of an upper bound on TM acceleration. TMACC
provides this performance improvement even in the face of the high communication
latency between TMACC and the CPU cores. Overall we conclude, and this thesis
demonstrates, that it is possible to accelerate TM with an out-of-core accelerator and
mitigate the impact of fine-grained communication with the techniques presented.
We have presented three new hardware designs to perform important primitive
database operations: selection, merge join, and sorting. We have shown how these
hardware primitives can be combined to perform an equi-join of two database ta-
bles entirely in hardware. We described an FPGA based prototype of the designs
and discussed challenges faced. We showed that our hardware designs were able to
obtain close to ideal utilization of available memory bandwidth, resulting in a 2.8x,
5.7x, and 1.4x improvement in utilization over software for selection, sorting, and
joining, respectively. We also presented the hardware resources necessary to implement each hardware block and showed how those hardware resources grow as the bandwidth
increases.
Thus, while actually accelerating computation using hardware accelerators is almost never a straightforward mapping of algorithms to hardware, it is still possible and practical to achieve significant improvements in computation speed and efficiency using custom-designed but flexible and programmable hardware components. As computer systems evolve to overcome various “walls”, domain specific accelerators will provide important and irreplaceable building blocks that enable new capabilities. Their importance will only continue to grow as general purpose computation reaches fundamental limits to its effectiveness. This work has aimed to add significant insight
and knowledge to the field of designing and building these accelerators.
Bibliography
[1] STREAM: Sustainable memory bandwidth in high performance computers.
[2] The Netezza FAST engines framework, 2008.
[3] A & D Technology, Inc. Procyon, the ultra-high-performance simulation and
control platform.
[4] Manuel E. Acacio, Jose Gonzalez, Jose M. García, and Jose Duato. A new
scalable directory architecture for large-scale multiprocessors. In HPCA ’01:
Proceedings of the 7th International Symposium on High-Performance Computer
Architecture, 2001.
[5] Ali-Reza Adl-Tabatabai, Brian Lewis, Vijay Menon, Brian R. Murphy, Bratin
Saha, and Tatiana Shpeisman. Compiler and runtime support for efficient software transactional memory. In PLDI '06: ACM SIGPLAN Conference on Pro-
gramming Language Design and Implementation, 2006.
[6] Altera. Advanced Synthesis Cookbook, July 2009.
[7] AMD, Inc. Maintaining cache coherency with AMD Opteron processors using FPGAs.
[8] Woongki Baek, Chi Cao Minh, Martin Trautmann, Christos Kozyrakis, and
Kunle Olukotun. The OpenTM transactional application programming inter-
face. In PACT '07: 16th International Conference on Parallel Architecture and
Compilation Techniques, 2007.
[9] L.A. Barroso, S. Iman, and J. Jeong. RPM: A rapid prototyping engine for
multiprocessor systems. IEEE Computer, 1995.
[10] Michael Bauer, Henry Cook, and Brucek Khailany. CudaDMA: optimizing GPU
memory bandwidth via warp specialization. In Proceedings of 2011 International
Conference for High Performance Computing, Networking, Storage and Analysis,
SC ’11, pages 12:1–12:11, New York, NY, USA, 2011. ACM.
[11] B. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 1970.
[12] Colin Blundell, Joe Devietti, E. Christopher Lewis, and Milo M. K. Martin.
Making the fast case common and the uncommon case simple in unbounded
transactional memory. In ISCA ’07: 34th International Symposium on Computer
Architecture, 2007.
[13] J. Bobba, N. Goyal, M.D. Hill, M.M. Swift, and D.A. Wood. TokenTM: Efficient
execution of large transactions with hardware transactional memory. In ISCA
’08: 35th International Symposium on Computer Architecture, 2008.
[14] Haran Boral and David J. DeWitt. Database machines: An idea whose time
passed? a critique of the future of database machines. In IWDM’83.
[15] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De. Pa-
rameter variations and impact on circuits and microarchitecture. In Design Au-
tomation Conference, 2003. Proceedings, June 2003.
[16] Shekhar Borkar and Andrew A. Chien. The future of microprocessors. Commun.
ACM, 54(5):67–77, May 2011.
[17] Nathan G. Bronson, Jared Casper, Hassan Chafi, and Kunle Olukotun. A prac-
tical concurrent binary search tree. In Proceedings of the 15th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, PPoPP ’10,
pages 257–268, New York, NY, USA, 2010. ACM.
[18] Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, and Kunle Olukotun.
STAMP: Stanford transactional applications for multi-processing. In IISWC
’08: Proc. The IEEE International Symposium on Workload Characterization,
2008.
[19] Chi Cao Minh, Martin Trautmann, JaeWoong Chung, Austen McDonald,
Nathan Bronson, Jared Casper, Christos Kozyrakis, and Kunle Olukotun. An
effective hybrid transactional memory system with strong isolation guarantees.
In ISCA ’07: 34th International Symposium on Computer Architecture, 2007.
[20] J. Lawrence Carter and Mark N. Wegman. Universal classes of hash functions.
Journal of Computer and System Sciences, 18(2), 1979.
[21] Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu,
Stefanie Chiras, and Siddhartha Chatterjee. Software transactional memory:
Why is it only a research toy? Queue, 6(5), 2008.
[22] Luis Ceze, James Tuck, Pablo Montesinos, and Josep Torrellas. BulkSC: bulk en-
forcement of sequential consistency. In ISCA ’07: 34th International Symposium
on Computer architecture, 2007.
[23] Hassan Chafi, Jared Casper, Brian D. Carlstrom, Austen McDonald, Chi
Cao Minh, Woongki Baek, Christos Kozyrakis, and Kunle Olukotun. A scalable,
non-blocking approach to transactional memory. In HPCA ’07: 13th Interna-
tional Symposium on High Performance Computer Architecture, 2007.
[24] Shailender Chaudhry, Robert Cypher, Magnus Ekman, Martin Karlsson, Anders
Landin, Sherman Yip, Hakan Zeffer, and Marc Tremblay. Simultaneous speculative threading: a novel pipeline architecture implemented in Sun's Rock processor.
In ISCA ’09: 36th Intl. Symposium on Computer Architecture, 2009.
[25] Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, William Macy, Mostafa
Hagog, Yen-Kuang Chen, Akram Baransi, Sanjeev Kumar, and Pradeep Dubey.
Efficient implementation of sorting on multi-core SIMD CPU architecture. Proc.
VLDB Endow., 1(2):1313–1324, August 2008.
[26] Andrew A. Chien, Allan Snavely, and Mark Gahagan. 10x10: A general-purpose
architectural approach to heterogeneity and energy efficiency. Procedia Computer
Science, 4(0):1987 – 1996, 2011.
[27] P. Chow. Why put fpgas in your cpu socket? In Field-Programmable Technology
(FPT), 2013 International Conference on, pages 3–3, Dec 2013.
[28] Convey Computer Corp. Instruction set innovations for Convey's HC-1 computer.
[29] Luke Dalessandro, Michael F. Spear, and Michael L. Scott. NOrec: streamlining
STM by abolishing ownership records. In PPoPP ’10: 15th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, PPoPP ’10,
2010.
[30] Peter Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir,
and Dan Nussbaum. Hybrid transactional memory. In ASPLOS '06: 12th International Conference on Architectural Support for Programming Languages and Operating Systems, October 2006.
[31] R.H. Dennard, F.H. Gaensslen, V.L. Rideout, E. Bassous, and A.R. LeBlanc.
Design of ion-implanted MOSFET’s with very small physical dimensions. Solid-
State Circuits, IEEE Journal of, 9(5):256–268, October 1974.
[32] David DeWitt and Jim Gray. Parallel database systems: the future of high
performance database systems. Commun. ACM, 35(6):85–98, June 1992.
[33] Sarang Dharmapurikar, Praveen Krishnamurthy, T.S. Sproull, and J.W. Lock-
wood. Deep packet inspection using parallel bloom filters. Micro, IEEE, 24(1),
Jan.-Feb. 2004.
[34] Dave Dice, Ori Shalev, and Nir Shavit. Transactional locking II. In DISC ’06:
20th International Symposium on Distributed Computing, 2006.
[35] Aleksandar Dragojevic, Rachid Guerraoui, and Michal Kapalka. Stretching
transactional memory. In PLDI ’09: ACM SIGPLAN Conference on Program-
ming Language Design and Implementation, 2009.
[36] Paul Gigliotti. XAPP195: Implementing Barrel Shifters Using Multipliers. Xil-
inx, August 2004.
[37] Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis,
Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and
Kunle Olukotun. Transactional memory coherence and consistency. In ISCA
’04: 31st International Symposium on Computer Architecture, 2004.
[38] Tim Harris and Keir Fraser. Language support for lightweight transactions. In
OOPSLA ’03: 18th ACM SIGPLAN Conference on Object-oriented Programing,
Systems, Languages, and Applications, 2003.
[39] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural
support for lock-free data structures. In ISCA ’93: 20th International Symposium
on Computer Architecture, 1993.
[40] Owen S. Hofmann, Christopher J. Rossbach, and Emmett Witchel. Maximum
benefit from a minimal HTM. In ASPLOS ’09: 14th International Conference on
Architectural Support for Programming Languages and Operating Systems, 2009.
[41] Sungpack Hong, Tayo Oguntebi, Jared Casper, Nathan Bronson, Christos
Kozyrakis, and Kunle Olukotun. Eigenbench: A simple exploration tool for or-
thogonal tm characteristics. In IISWC ’10: International Symposium on Work-
load Characterization, 2010.
[42] Christopher J. Hughes and Sarita V. Adve. Memory-side prefetching for linked
data structures for processor-in-memory systems. J. Parallel Distrib. Comput.,
65(4), 2005.
[43] Tim Kaldewey, Guy Lohman, Rene Mueller, and Peter Volk. GPU join pro-
cessing revisited. In Proceedings of the Eighth International Workshop on Data
Management on New Hardware, DaMoN ’12.
[44] Chetana N. Keltcher, Kevin J. McGrath, Ardsher Ahmed, and Pat Conway. The
amd opteron processor for multiprocessor servers. IEEE Micro, 23(2), 2003.
[45] Changkyu Kim, Tim Kaldewey, Victor W. Lee, Eric Sedlar, Anthony D. Nguyen,
Nadathur Satish, Jatin Chhugani, Andrea Di Blas, and Pradeep Dubey. Sort vs.
hash revisited: fast join implementation on modern multi-core CPUs. Proc.
VLDB Endow., 2:1378–1389, August 2009.
[46] Dirk Koch and Jim Torresen. FPGASort: a high performance sorting archi-
tecture exploiting run-time reconfiguration on fpgas for large problem sorting.
In Proceedings of the 19th ACM/SIGDA international symposium on Field pro-
grammable gate arrays, FPGA ’11.
[47] P. Kocher, R. Lee, G. McGraw, A. Raghunathan, and S. Ravi. Security as a new
dimension in embedded system design. In Design Automation Conference, 2004.
Proceedings. 41st, 2004.
[48] Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kundu, and An-
thony Nguyen. Hybrid transactional memory. In PPoPP ’06: 11th ACM SIG-
PLAN Symposium on Principles and Practice of Parallel Programming, 2006.
[49] Jim Larus and Ravi Rajwar. Transactional Memory. Morgan Claypool Synthesis
Series, 2006.
[50] N. Leischner, V. Osipov, and P. Sanders. GPU sample sort. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010.
[51] Marc Lupon, Grigorios Magklis, and Antonio Gonzalez. FASTM: A log-based
hardware transactional memory with fast abort recovery. In PACT ’09: 18th
International Conference on Parallel Architecture and Compilation Techniques,
2009.
[52] Stefan Manegold, Peter A. Boncz, and Martin L. Kersten. Optimizing database
architecture for the new bottleneck: memory access. The VLDB Journal,
9(3):231–246, December 2000.
[53] Virendra J. Marathe, William N. Scherer III, and Michael L. Scott. Adaptive
Software Transactional Memory. In DISC ’05: 19th International Symposium
on Distributed Computing, 2005.
[54] John D. McCalpin. Memory bandwidth and machine balance in current high
performance computers. IEEE Computer Society Technical Committee on Com-
puter Architecture Newsletter, pages 19–25, December 1995.
[55] Andreas Moshovos. Regionscout: Exploiting coarse grain sharing in snoop-based
coherence. In ISCA ’05: Proceedings of the 32nd annual international symposium
on Computer Architecture, 2005.
[56] Rene Mueller, Jens Teubner, and Gustavo Alonso. Data processing on FPGAs.
Proc. VLDB Endow., 2(1):910–921, August 2009.
[57] Rene Mueller, Jens Teubner, and Gustavo Alonso. Streams on wires: a query
compiler for FPGAs. Proc. VLDB Endow., 2(1):229–240, August 2009.
[58] Rene Mueller, Jens Teubner, and Gustavo Alonso. Glacier: a query-to-hardware
compiler. In Proceedings of the 2010 ACM SIGMOD International Conference
on Management of data, SIGMOD ’10, pages 1159–1162, New York, NY, USA,
2010. ACM.
[59] S. S. Mukherjee, B. Falsafi, M. D. Hill, and D. A. Wood. Coherent network inter-
faces for fine-grain communication. In ISCA ’96: 23rd International Symposium
on Computer Architecture, 1996.
[60] University of Heidelberg (Germany). UoH cHT-Core (coherent HT Cave Core).
[61] Tayo Oguntebi, Sungpack Hong, Jared Casper, Nathan Bronson, Christos
Kozyrakis, and Kunle Olukotun. FARM: A prototyping environment for tightly-
coupled, heterogeneous architectures. In FCCM ’10: 18th Symposium on Field-
Programmable Custom Computing Machines, 2010.
[62] Marek Olszewski, Jeremy Cutler, and J. Gregory Steffan. JudoSTM: A dynamic
binary-rewriting approach to software transactional memory. In PACT ’07: 16th
International Conference on Parallel Architecture and Compilation Techniques.
[63] Kunle Olukotun and Lance Hammond. The future of microprocessors. Queue,
3(7):26–29, September 2005.
[64] John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob
Leverich, David Mazieres, Subhasish Mitra, Aravind Narayanan, Diego Ongaro,
Guru Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and
Ryan Stutsman. The case for RAMCloud. Commun. ACM, 54(7):121–130, July
2011.
[65] Hany E. Ramadan, Christopher J. Rossbach, Donald E. Porter, Owen S. Hof-
mann, Aditya Bhandari, and Emmett Witchel. Metatm/txlinux: transactional
memory for an operating system. SIGARCH Computer Architecture News, 35(2),
2007.
[66] Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and
Ben Hertzberg. McRT–STM: A high performance software transactional memory
system for a multi-core runtime. In PPoPP ’06: 11th ACM SIGPLAN Sympo-
sium on Principles and Practice of Parallel Programming, 2006.
[67] Bratin Saha, Ali-Reza Adl-Tabatabai, and Quinn Jacobson. Architectural sup-
port for software transactional memory. In MICRO ’06: International Sympo-
sium on Microarchitecture, 2006.
[68] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for
manycore GPUs. In Parallel Distributed Processing, 2009. IPDPS 2009. IEEE
International Symposium on.
[69] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D. Nguyen, Vic-
tor W. Lee, Daehyun Kim, and Pradeep Dubey. Fast sort on CPUs and GPUs:
a case for bandwidth oblivious SIMD sort. In Proceedings of the 2010 ACM SIG-
MOD International Conference on Management of data, SIGMOD ’10, pages
351–362, New York, NY, USA, 2010. ACM.
[70] Tatiana Shpeisman, Vijay Menon, Ali-Reza Adl-Tabatabai, Steven Balensiefer,
Dan Grossman, Richard L. Hudson, Kate Moore, and Bratin Saha. Enforcing iso-
lation and ordering in stm. In PLDI ’07: Conference on Programming Language
Design and Implementation, 2007.
[71] Arrvindh Shriraman, Sandhya Dwarkadas, and Michael L. Scott. Flexible decou-
pled transactional memory support. In ISCA ’08: 35th International Symposium
on Computer Architecture, 2008.
[72] Arrvindh Shriraman, Michael F. Spear, Hemayet Hossain, Virendra J. Marathe,
Sandhya Dwarkadas, and Michael L. Scott. An integrated hardware-software
approach to flexible transactional memory. SIGARCH Computer Architecture
News, 35, June 2007.
[73] Erik Sintorn and Ulf Assarsson. Fast parallel GPU-sorting using a hybrid al-
gorithm. Journal of Parallel and Distributed Computing, 68(10):1381 – 1388,
2008.
[74] Michael F. Spear. Lightweight, robust adaptivity for software transactional mem-
ory. In SPAA ’10: 22nd ACM Symposium on Parallelism in Algorithms and
Architectures, 2010.
[75] Michael F. Spear, Maged M. Michael, and Christoph von Praun. RingSTM: scal-
able transactions with a single atomic instruction. In SPAA ’08: 20th Symposium
on Parallelism in Algorithms and Architectures, 2008.
[76] STAMP: Stanford transactional applications for multi-processing. http://stamp.stanford.edu.
[77] Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cher-
niack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth
O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, and Stan Zdonik. C-store: a column-
oriented DBMS. In Proceedings of the 31st international conference on Very large
data bases, VLDB ’05, pages 553–564. VLDB Endowment, 2005.
[78] Bharat Sukhwani, Hong Min, Mathew Thoennes, Parijat Dube, Balakrishna Iyer,
Bernard Brezzo, Donna Dillenberger, and Sameh Asaad. Database analytics
acceleration using FPGAs. In Proceedings of the 21st international conference
on Parallel architectures and compilation techniques, PACT ’12.
[79] Fuad Tabba, Mark Moir, James R. Goodman, Andrew Hay, and Cong Wang.
NZTM: Nonblocking zero-indirection transactional memory. In SPAA ’09: 21st
Symposium on Parallelism in Algorithms and Architectures, 2009.
[80] Cheng Wang, Wei-Yu Chen, Youfeng Wu, Bratin Saha, and Ali-Reza Adl-
Tabatabai. Code generation and optimization for transactional memory con-
structs in an unmanaged language. In CGO ’07: International Symposium on
Code Generation and Optimization, 2007.
[81] John Wawrzynek, David Patterson, Mark Oskin, Shih-Lien Lu, Christoforos
Kozyrakis, James C. Hoe, Derek Chiou, and Krste Asanovic. Ramp: Research
accelerator for multiple processors. IEEE Micro, 27(2), 2007.
[82] Luke Yen, Jayaram Bobba, Michael R. Marty, Kevin E. Moore, Haris Volos,
Mark D. Hill, Michael M. Swift, and David A. Wood. LogTM-SE: Decoupling
Hardware Transactional Memory from Caches. In HPCA ’07: 13th International
Symposium on High Performance Computer Architecture, 2007.
[83] Luke Yen, S.C. Draper, and M.D. Hill. Notary: Hardware techniques to enhance
signatures. In MICRO ’08: 41st International Symposium on Microarchitecture,
2008.
[84] Pin Zhou, R. Teodorescu, and Yuanyuan Zhou. Hard: Hardware-assisted lockset-
based race detection. In HPCA ’07: Proceedings of the 13th International Sym-
posium on High-Performance Computer Architecture, 2007.
[85] Craig B. Zilles and Gurindar S. Sohi. A programmable co-processor for profil-
ing. In HPCA ’01: Proceedings of the 7th International Symposium on High-
Performance Computer Architecture, 2001.